Prompt engineering: add extra metrics to OpenAI GPT autolog?

Hi all!

I want to use WandB for prompt engineering of calls to OpenAI’s chat completion. Each job is an attempt at a new prompt. That prompt is then looped over multiple documents, calling OpenAI’s ChatCompletion to analyze each document, and at the end I compute a summary metric over the per-document summaries for the validation set.

How can I both track all the calls to ChatGPT (ideally by using autolog) and store extra info regularly?

I would like to:

  • store all calls to the OpenAI API (hence autolog would seem perfect)
  • but also log, for each document, a set of information that is not passed to the OpenAI API, such as metadata on the document and parsed answers from ChatGPT (e.g. with manual calls to log). There might also be multiple OpenAI API calls per document.

I’m worried that manual calls to wandb.run.log will interfere with the calls to log from autolog.

Thanks for any help either solving this, or showing me why I don’t need it :slight_smile: !

Ideally, being able to log to two separate tables (one for the autolog, one for the other logs) as I loop through the items would be great.
Even better would be to also be able to add custom fields to the autolog (so that I could then join the two tables).
Any idea?

To clarify, after discussing with Thanos from WandB:

Basically, pseudo-python would be like this:

autolog()
for url, text in documents.items():
    # We would like all the calls to OpenAI ChatCompletion to be logged.
    # Note that the call_gpt function might actually call ChatCompletion
    # several times if the text is too long to fit into one single context.
    # Therefore its internal use of `wandb.log` might increase the step by more
    # than one. That's OK, as we do not have a "training step" per se, so we can be
    # flexible with the meaning of "step".
    answer = call_gpt(basic_system_prompt, text)
    parsed_answers = parse(answer)
    accuracy = compare_to_ground_truth(url, parsed_answers)
    # ... and now we also log all the metadata and the results
    wandb.log(dict(url=url, text=text, parsed_answers=parsed_answers, accuracy=accuracy))

It could work if there were either:

  • a way to log to two distinct tables (one for the autolog, one for the manual log; see the rough sketch after this list),
  • or to log everything to one big table that is filterable,
  • or to add custom fields to the autolog (although since multiple ChatCompletion calls might be autologged for one url, that might be tricky),
  • or any other solution I haven’t thought of :slight_smile:
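
To make the first bullet concrete, here is a rough sketch of what I mean by the second (per-document) table, reusing the names from the pseudo-python above; doc_results and the column names are just placeholders I made up:

import wandb
from wandb.integration.openai import autolog  # assuming this is the autolog entry point

wandb.init(project="prompt-engineering")
autolog()  # so that every ChatCompletion call is still captured

# a manual per-document table, one row per url
doc_table = wandb.Table(columns=["url", "text", "parsed_answers", "accuracy"])

for url, text in documents.items():
    answer = call_gpt(basic_system_prompt, text)
    parsed_answers = parse(answer)
    accuracy = compare_to_ground_truth(url, parsed_answers)
    doc_table.add_data(url, text, str(parsed_answers), accuracy)

# log the manual table once at the end; if the autologged rows also carried
# a url field, the two tables could then be joined on url
wandb.log({"doc_results": doc_table})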

Thanks for any help!


You can log custom columns to your Trace table by calling

wandb.log({'<column_name1>': '<some_metadata>',
           '<column_name2>': '<some_more_metadata>'}, commit=False)

before you call the chat completion API (or before logging the Trace object if you’re using that).
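
For example, inside your loop from the pseudo-python above it would look roughly like this (just a sketch; url and doc_length are example column names, and call_gpt is your own helper):

for url, text in documents.items():
    # queue the custom columns for this document; they get committed together
    # with the row created by the next autologged ChatCompletion call
    wandb.log({"url": url, "doc_length": len(text)}, commit=False)
    answer = call_gpt(basic_system_prompt, text)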

Thanks Scott. This works for the URL metadata, to an extent. There are still two things I don’t know how to do:

1/ I think it breaks when call_gpt() needs to make two or more calls to ChatCompletion, e.g. when the prompt + text to analyze doesn’t fit into a single context. In that case, the metadata would likely need to be duplicated across the two completion calls. I could do that in a slightly hacky way by passing the metadata to call_gpt() and adding a wandb.log call in its inner loop before each call to ChatCompletion, but ideally it would be more elegant not to touch that function.

2/ More problematic: how can I use it to log info that comes after the call to ChatCompletion, please? I am thinking of the parsed answers in particular. Is there a “commit=False” option in the autolog that could defer the actual commit until it is triggered manually, for example?

For 2, to wait and add metadata after calling the chat completion API, you’d need to log to wandb yourself rather than using autolog.

To do this, import the Trace class and create the object with your chat data after you call the API.

from wandb.sdk.data_types.trace_tree import Trace

root_span = Trace(
    name="root_span",
    kind="llm",  # kind can be "llm", "chain", "agent" or "tool"
    status_code=status,
    status_message=status_message,
    metadata={"temperature": temperature,
              "token_usage": token_usage,
              "model_name": model_name},
    start_time_ms=start_time_ms,
    end_time_ms=end_time_ms,
    inputs={"system_prompt": system_message, "query": query},
    outputs={"response": response_text},
)

# log the span to wandb
root_span.log(name="openai_trace")

Here’s a more complete example:

To solve 1:
You can choose to chain multiple call_gpt calls and add them as child spans of one parent span. This is how we log chains in LangChain, for instance. It shows up in W&B as one row in the Trace Table, with each of the calls to call_gpt visible as a span within the trace.

import datetime
from wandb.sdk.data_types.trace_tree import Trace
import openai
import wandb

wandb.init(project="custom-chained-trace-example")

# boilerplate for openai

model_name='gpt-3.5-turbo'
temperature=0.7
system_message="You are a helpful assistant that always parses the user's query and replies in 3 concise bullet points using markdown."
docs = ['This is a document about cats. Cats are furry animals that like to eat mice. Cats are also very independent animals.']

def call_gpt(model_name, temperature, system_message, query):
    messages=[
      {"role": "system", "content": system_message},
      {"role": "user", "content": query}
    ]
    response = openai.ChatCompletion.create(model=model_name,
                                        messages=messages,
                                        temperature=temperature)   

    llm_end_time_ms = round(datetime.datetime.now().timestamp() * 1000)
    response_text = response["choices"][0]["message"]["content"]
    token_usage = response["usage"].to_dict()
    return llm_end_time_ms, response_text, token_usage

def chunk_text(text, chunk_size=100):
    # yield successive chunk_size-character slices of the text
    for i in range(0, len(text), chunk_size):
        yield text[i:i + chunk_size]

# logic to create a trace for each doc

for doc in docs:
    start_time_ms = round(datetime.datetime.now().timestamp() * 1000)
    # Create a root span to represent the entire trace
    root_span = Trace(
      name="LLMChain",
      kind="chain",
      start_time_ms=start_time_ms)
    for text_chunk in chunk_text(doc, chunk_size=100):
        doc_query = "Parse this doc: " + text_chunk
        start_time_ms = round(datetime.datetime.now().timestamp() * 1000)
        llm_end_time_ms, response_text, token_usage = call_gpt(model_name, temperature, system_message, doc_query)
        # Create a span to represent each LLM call
        llm_span = Trace(
            name="OpenAI",
            kind="llm",
            status_code="success",
            metadata={"temperature": temperature,
                      "token_usage": token_usage,
                      "model_name": model_name},
            start_time_ms=start_time_ms,
            end_time_ms=llm_end_time_ms,
            inputs={"system_prompt": system_message, "query": doc_query},
            outputs={"response": response_text},
        )
        root_span.add_child(llm_span)
    # add the inputs and outputs of the whole chain to the root span
    root_span.add_inputs_and_outputs(
        inputs={"query": doc},
        outputs={"response": response_text})

    # update the Chain span's end time
    root_span._span.end_time_ms = llm_end_time_ms

    # add metadata to the trace table
    accuracy = 0.7  # placeholder: in practice, computed from the parsed answers vs. ground truth
    wandb.log({"accuracy": accuracy}, commit=False)

    # log all spans to W&B by logging the root span
    root_span.log(name="docs_trace")
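
And for your question 2, rather than (or in addition to) the commit=False call, the post-processing results could be folded into the root span itself before it is logged. Here is a sketch of the tail of the loop above, assuming the parse and compare_to_ground_truth helpers from your pseudo-python (keyed on doc here, since this example loops over raw documents rather than urls):

    # parse the final response and score it against the ground truth
    parsed_answers = parse(response_text)
    accuracy = compare_to_ground_truth(doc, parsed_answers)

    # attach the post-processed results to the root span before logging it
    root_span.add_inputs_and_outputs(
        inputs={"query": doc},
        outputs={"response": response_text,
                 "parsed_answers": str(parsed_answers),
                 "accuracy": accuracy})
    root_span._span.end_time_ms = llm_end_time_ms
    root_span.log(name="docs_trace")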

Docs here:

Hey @jucor! Just checking if the above message from Scott would solve your issue?

Hi Luis! Yes, speaking with Scott here and on Zoom was very helpful. This does not quite solve what I need, but it has given me enough understanding to make progress, with a mix of this approach and my own code.
I’ll update here once I’ve finished my workflow.
