I have a task that consists of four prompts, so each piece of text gets evaluated four times, once for each of these questions. Currently, my Weave model returns a list with the results of all four prompts (so a list with four items), and my scorer function checks whether every item in that list matches the corresponding observed value (the first item, i.e. the first prediction, against the first observed value, the second item against the second observed value, and so on). This is helpful for me because I can see which examples are evaluated perfectly overall.
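Simplified, my current setup looks roughly like this (just a sketch, not my real code: `FourPromptModel`, `call_llm`, and the `targets` column are placeholder names, and the scorer's `output` parameter may be called `model_output` depending on the Weave version):

```python
import weave


def call_llm(prompt: str, text: str) -> str:
    # placeholder for the actual LLM call
    return f"answer to {prompt!r} given {text!r}"


class FourPromptModel(weave.Model):
    # the four prompt templates, in a fixed order
    prompts: list[str]

    @weave.op()
    def predict(self, text: str) -> list[str]:
        # run the same text through all four prompts and return the
        # four answers in the same order as the prompts
        return [call_llm(p, text) for p in self.prompts]


@weave.op()
def all_match_scorer(targets: list[str], output: list[str]) -> dict:
    # compare each prediction to its observed value, position by position
    matches = [pred == tgt for pred, tgt in zip(output, targets)]
    return {"per_prompt_match": matches, "all_match": all(matches)}


# wired together with something like:
# evaluation = weave.Evaluation(dataset=examples, scorers=[all_match_scorer])
# asyncio.run(evaluation.evaluate(FourPromptModel(prompts=[...])))
```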
But it is also very clunky, because I never evaluate each prompt separately, only all four together, and that causes a lot of overhead. Can I do both somehow? That is, have a model for each prompt, and then also a model that combines all of them? Or is that combination an antipattern anyway? I'd be interested in learning more about best practices here. Thanks!
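For reference, the split-up version I have in mind would look roughly like this (again just a sketch, reusing the `call_llm` stub from above; `per_prompt_datasets` is hypothetical). The part I don't know how to do cleanly is getting the combined "all four correct" view back out of four separate evaluations:

```python
class SinglePromptModel(weave.Model):
    # one prompt template per model
    prompt: str

    @weave.op()
    def predict(self, text: str) -> str:
        # one prompt, one prediction
        return call_llm(self.prompt, text)


@weave.op()
def exact_match_scorer(target: str, output: str) -> dict:
    # score a single prompt's prediction against its observed value
    return {"match": output == target}


# one evaluation per prompt, e.g.
# for prompt, examples in per_prompt_datasets.items():  # hypothetical mapping
#     evaluation = weave.Evaluation(dataset=examples, scorers=[exact_match_scorer])
#     asyncio.run(evaluation.evaluate(SinglePromptModel(prompt=prompt)))
#
# ...plus, somehow, a combined view that still tells me whether a given
# text was answered correctly for all four prompts.
```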