Using e.g. the Experiment Service, which exposes a declaratively defined interface, developers could publish an Experiment definition to the cluster: a test set (list of <question, expected answer> pairs), runs (list of app configs with different LLM provider, model, prompt, image version), an index, etc., and get a report back as the result -> this way the development team doesn't need to know e.g. how we evaluate LLM outputs, and the complexity stays hidden. A rough sketch of such a definition is below.
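A minimal sketch of what such a declarative Experiment definition could look like; all names here (TestCase, Run, Experiment, publish_experiment) are hypothetical illustrations, not the actual service schema or API:

```python
from dataclasses import dataclass

# Hypothetical types mirroring the fields described above:
# test set, runs, index.

@dataclass
class TestCase:
    question: str
    expected_answer: str

@dataclass
class Run:
    provider: str       # LLM provider, e.g. "provider-a"
    model: str          # model name at that provider
    prompt: str         # prompt (or prompt version) under test
    image_version: str  # app image the run executes against

@dataclass
class Experiment:
    name: str
    test_set: list[TestCase]
    runs: list[Run]
    index: str  # e.g. which index/knowledge base the app queries

experiment = Experiment(
    name="faq-bot-regression",
    test_set=[
        TestCase("What is the refund window?", "30 days"),
    ],
    runs=[
        Run(provider="provider-a", model="model-x", prompt="v2", image_version="1.4.0"),
        Run(provider="provider-b", model="model-y", prompt="v2", image_version="1.4.0"),
    ],
    index="kb-prod",
)

# publish_experiment(experiment) would be the hypothetical client call:
# the service executes every run against the test set, scores the outputs
# (how is hidden from the developer), and returns a report.
```

The point of the declarative shape is exactly what the text says: the developer only declares *what* to compare, while the evaluation logic lives behind the service boundary.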
another case is of course model monitoring -> the same business case that worked 3 months ago on a model from provider x won't necessarily work the same now ^^ (one sketch of this below)
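One way such a service could cover monitoring: re-submit the same experiment on a schedule and alert when the aggregate score drifts from a stored baseline. A minimal sketch, assuming hypothetical names and numbers (BASELINE_SCORE, check_for_drift, and the scores are all made up for illustration):

```python
# Hypothetical drift check: the experiment and prompts stay fixed,
# so a score drop points at a change on the provider/model side.
BASELINE_SCORE = 0.92   # score recorded when the business case last passed
DRIFT_THRESHOLD = 0.05  # acceptable drop before alerting

def check_for_drift(current_score: float) -> None:
    if BASELINE_SCORE - current_score > DRIFT_THRESHOLD:
        print(f"ALERT: score dropped {BASELINE_SCORE} -> {current_score}")

check_for_drift(0.84)  # e.g. the score from this week's scheduled run
```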