Custom analysis over sweep

Similar to Import & Export Data to W&B, I am wondering whether there is a way to first write multiple runs to a single artifact (as presumably already happens when you view a data table online for a sweep over multiple seeds), and then download that artifact and export it as, say, a pandas DataFrame for further analysis. Is this possible, and if so, how would I go about doing it?
If it is not possible, is it possible to run arbitrary statistics from within Weave? It would be a fairly complex multi-stage regression that is implemented in Python with linearmodels/statsmodels, and would not be easy to do in, say, Vega-Lite.

I am dumb, I just saw the ‘Querying Multiple Sweep’ subheader.

I guess with that my question simplifies to, is there a way to easily programmatically access just the subset of runs that are within a particular sweep in the entity/project?

Hello @bkaplowitz

Let me look into this and see if there is a way; I will get back to you after I gather the necessary information.

Hi @bkaplowitz ,

Yes, you can programmatically access only the subset of runs that are within a specific sweep in the entity/project using the Weights & Biases API. Here is an example of how you can do this:

import wandb

api = wandb.Api()
sweep = api.sweep("<entity>/<project>/<sweep_id>")
runs = sweep.runs

In this example, replace <entity>, <project>, and <sweep_id> with your entity, project, and sweep ID respectively. sweep.runs will give you a list of all runs in the specified sweep.
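As an extension (a minimal sketch, untested, using the same placeholder entity/project/sweep IDs), the run list can also be flattened into a pandas DataFrame of each run's config and summary metrics for custom analysis:

import pandas as pd
import wandb

api = wandb.Api()
sweep = api.sweep("<entity>/<project>/<sweep_id>")  # placeholders, as above

rows = []
for run in sweep.runs:
    row = {"run_id": run.id, "run_name": run.name}
    row.update(run.config)              # hyperparameters for this run
    row.update(run.summary._json_dict)  # final summary metrics
    rows.append(row)

df = pd.DataFrame(rows)
print(df.head())

Each row then holds one run's hyperparameters and final scalar metrics, which is often enough for analyses that do not need the full logged tables.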

Hope this helps and feel free to write again for any concerns.

Hi @joana-marie thanks so much! I guess now all I need to do is fetch the wandb.Table from each run and merge them somehow (unless the sweep pre-merges them). Is there an easy way to do this? The Import & Export Data to W&B page covers fetching particular objects like summaries, but not artifacts like tables (or logs).

Best,
Brandon

In particular, do you have a sense as to how I can easily fetch and merge the stored tables? I have the paths of each table, but they are run-id specific.

For example, one path is given as:

media/table/charts/ganong_noel_table_118289_367c07ac39c726f6f19e.table.json

or alternatively:

wandb-client-artifact://nez6nxdsng9l0yepei3lv0o9nlurqv0uhuc369evklx4ozp850iv6ard2h3god27ytx5fnf8qencih27gg3ps81kte4i3ivnavzd4zocpxpu6fzuir5awep5s8ww945u/charts/ganong_noel_table.table.json.

I need to download each of these for the latest version that has all values that were iteratively added during the run, and then merge them together across the different runs in the sweep.

What I am currently trying is something like:

sweep = api.sweep(f"{ENTITY}/{PROJECT}/le2zghgt")
current_analysis = wandb.init()

runs = sweep.runs
tables = []  # collected per-run DataFrames

for run in runs:
    table_remote = current_analysis.use_artifact(
        f"{ENTITY}/{PROJECT}/run-{run.id}-{TABLE_NAME}:{TAG}", type="run_table"
    )
    table_dir = table_remote.download()
    with open(f"{table_dir}/{CHART_DIR}/{TABLE_NAME_NO_DIR}.table.json") as f:
        table_json = json.load(f)
    table_wandb = wandb.Table.from_json(table_json, table_remote)
    table_local = pd.DataFrame(data=table_wandb.data, columns=table_wandb.columns)
    table_local["run_id"] = run.id
    tables.append(table_local)

table_ganong_noel = pd.concat(tables, ignore_index=True)
table_ganong_noel.to_parquet("ganong_noel_table.parquet")

but I have no idea if this is the best way to do it.
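One alternative sketch (untested; it assumes each run logs its table as an artifact of type "run_table", as in the snippet above) would be to discover the table artifacts via run.logged_artifacts() and glob the download directory for .table.json files, rather than reconstructing the artifact name by hand:

import json
from pathlib import Path

import pandas as pd
import wandb

api = wandb.Api()
sweep = api.sweep(f"{ENTITY}/{PROJECT}/le2zghgt")  # same placeholder sweep as above

tables = []
for run in sweep.runs:
    for artifact in run.logged_artifacts():
        if artifact.type != "run_table":  # assumed artifact type, matching the snippet above
            continue
        table_dir = Path(artifact.download())
        # Pick up every logged table file regardless of its run-id-specific path
        for table_path in table_dir.rglob("*.table.json"):
            with open(table_path) as f:
                table_json = json.load(f)
            table_wandb = wandb.Table.from_json(table_json, artifact)
            table_local = pd.DataFrame(table_wandb.data, columns=table_wandb.columns)
            table_local["run_id"] = run.id
            tables.append(table_local)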

Hello @bkaplowitz ,

Let us take a look at the details/code you provided and we will get back to you.

Hello @bkaplowitz !

I looked through your code and this looks like a solid way to do it; it is likely the fastest way to download and append all the tables into one. Have you had success using this script?

I have, and the download is now going smoothly. I had to modify things slightly to enable parallel downloads, so I used concurrent.futures.

In case this arises as an issue in the future for someone, the final code looks like:

import argparse
from pathlib import Path
import pandas as pd
import json
import wandb
from tqdm import tqdm
import concurrent.futures

wandb.login()
api = wandb.Api()
# ENTITY = ...
# PROJECT = ...
# SWEEP_ID = ...
# TABLE_NAME = ...
# TAG = ...
# CHART_DIR = ... [remote path to charts]
# TABLE_NAME_NO_DIR  = ... [remote table name as found in artifacts tab of run/sweep]

sweep: wandb.sweep = api.sweep(f"{ENTITY}/{PROJECT}/{SWEEP_ID}")
current_analysis = wandb.init()
runs = sweep.runs

def download_table(run):
    try:
        run_name = f"run-{run.id}"
        table_remote = current_analysis.use_artifact(
            f"{ENTITY}/{PROJECT}/{run_name}-{TABLE_NAME}:{TAG}", type="run_table"
        )
        table_dir = table_remote.download()
        with open(f"{table_dir}/{CHART_DIR}/{TABLE_NAME_NO_DIR}.table.json") as f:
            table_json = json.load(f)
        table_wandb = wandb.Table.from_json(table_json, table_remote)
        table_local = pd.DataFrame(
            data=table_wandb.data, columns=table_wandb.columns
        ).copy()
        table_local["run_id"] = run.id
        return table_local
    except Exception as e:
        print(f"Error downloading table for run {run.id}: {e}")
        return None


def download_tables(runs):
    tables = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=30) as executor:
        futures = [executor.submit(download_table, run) for run in runs]
        for future in tqdm(
            concurrent.futures.as_completed(futures), total=len(futures)
        ):
            table_local = future.result()
            if table_local is not None:
                tables.append(table_local)
    return tables


def cache_tables(tables):
    # Cache the downloaded tables locally; file names must match load_cached_tables()
    for i, table in enumerate(tables):
        table.to_csv(f"table_{sweep.id}_{i}.csv", index=False)


def load_cached_tables():
    tables = []
    files_exist = []
    for i in range(len(runs)):
        file_path = Path(f"table_{sweep.id}_{i}.csv")
        file_exists = file_path.is_file()
        files_exist.append(file_exists)
        if file_exists:
            table = pd.read_csv(f"table_{sweep.id}_{i}.csv")
            tables.append(table)
    return tables, all(files_exist)


# Check if tables are already cached
tables, cache_success = load_cached_tables()
if cache_success:
    print("Tables loaded from cache.")
else:
    tables = download_tables(runs)
    cache_tables(tables)
    print("Tables downloaded and cached.")

table_ganong_noel_backup = pd.concat(tables, ignore_index=True)

Uploading is still a bottleneck, since I think at most 3 instances can run in parallel before the server returns an error, and the upload time delays sequential runs (right now each run takes under half a second). I'm effectively 'hacking' the sweep config to do Monte Carlo runs over different seeds plus regular sweeping over params, so this is 400+ seeds.
Launching this automatically has also been a bit of a problem: when I use the subprocess library to launch the command-line version, a completion signal is returned as soon as the wandb run is launched, so I immediately jump to the next step.

Perhaps there is an easier way to do all of this within the python api for sweeps instead?
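One possible direction (a minimal sketch, untested; the train() function body and sweep_config contents are placeholders) is to drive the sweep entirely from Python with wandb.sweep and wandb.agent instead of shelling out with subprocess, which sidesteps the premature completion signal since wandb.agent blocks until the agent finishes:

import wandb

# Placeholder sweep definition; a grid over seeds terminates on its own
sweep_config = {
    "method": "grid",
    "parameters": {"seed": {"values": list(range(400))}},
}

def train():
    run = wandb.init()
    seed = run.config["seed"]
    # ... run the model for this seed and log the table ...
    run.finish()

sweep_id = wandb.sweep(sweep_config, entity=ENTITY, project=PROJECT)
wandb.agent(sweep_id, function=train)  # blocks until the grid is exhausted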

If you wanted to do this within the UI, you could also do it in Weave and then export as CSV. Are all your tables named the same? If so, the query would be as simple as runs.summary["NAME_OF_TABLE"]. This wouldn't be as direct as downloading the tables, but you would be able to merge tables with the same name within the same project and do analysis on them.

We have a list of Weave Queries that could help you do analysis on the wandb side of things as well.

Hi Brandon, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

Sorry, I just saw this. The idea of doing it in Weave is pretty nice, thanks! The primary bottleneck at the moment is on the upload side, not the downloads/analysis, which actually run pretty fast.

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.