How to replay prompts and evaluate output quality across multiple models using past runs?

Hi, I have used the wandb Tracer to log basic prompt inputs and outputs; each query is "a run" for now. I'd like to replay the prompts with a different model and group the outputs into the same run.

Is this straightforward to do? How should I update a run with a new Tracer result?

Thank you.

Hi
Thanks for your question. This is a use case we’re working hard to support and have some really exciting things in the works to make this easy.

Today, this is a bit involved because your traces are logged within runs. We'll make it easier in the future to save a set of prompts and run them against LLMs to get traces. This will be powered by a new toolkit we're building called Weave.

You can use the wandb.Api to fetch your runs:

import wandb
import json

api = wandb.Api()
runs = api.runs("yudixue/<project name>")
for run in runs:
    # The trace is stored in the run summary as a JSON string
    root_spans = json.loads(run.summary['langchain_trace']['root_span_dumps'])
    # Pull the original prompt text back out of the trace
    prompt_input = root_spans['results'][0]['inputs']['input']

To resume a run, you can call

wandb.init(id=run.id, resume='must')

using the run object from the code above, then log to it as normal.
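
As a rough sketch, assuming you want to store the new model's outputs in a wandb.Table (call_new_model, the table columns, and "<model_name>" are placeholders, not part of the wandb API), replaying each stored prompt and logging the result back to its original run could look like:

import json
import wandb

def call_new_model(prompt):
    # Placeholder: replace with your new model's inference call
    ...

api = wandb.Api()
for past_run in api.runs("yudixue/<project name>"):
    # Recover the original prompt from the logged trace
    root_spans = json.loads(past_run.summary['langchain_trace']['root_span_dumps'])
    prompt = root_spans['results'][0]['inputs']['input']
    output = call_new_model(prompt)

    # Reopen the original run and attach the new model's output to it
    run = wandb.init(project="<project name>", id=past_run.id, resume="must")
    table = wandb.Table(columns=["model", "prompt", "output"],
                        data=[["<model_name>", prompt, output]])
    run.log({"replayed_outputs": table})
    run.finish()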

Although I don't see a problem with resuming, you should also be fine creating a new run and logging your table there instead.

Hope this helps.


Hi @yudixue, I wanted to follow up and see if you had a chance to try out Scott’s suggestion or if there was anything else we could help answer?

Thanks Scott, Weave sounds exciting!

Also, if I want to make the current API work and log a new “root_span_dumps” (let’s call it <model_name>_span_dumps), can I just add it to the run summary, i.e. deserialize and replace it?

You should be able to use the Tracer as normal to log more traces for the resumed run.
You might be better off just logging new runs rather than resuming, though.
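
If you do want to write the extra summary key yourself, a minimal sketch using the public API could look like the following (the <model_name>_span_dumps key and new_trace_dict are placeholders for your own trace data):

import json
import wandb

# Placeholder: the trace produced by replaying the prompt with your new model
new_trace_dict = {"results": []}

api = wandb.Api()
run = api.runs("yudixue/<project name>")[0]
run.summary["<model_name>_span_dumps"] = json.dumps(new_trace_dict)
run.summary.update()  # persist the new summary key to the W&B backend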
