Best practices for many quick runs?

I have a project where I am doing many, many runs across seeds, none of which takes a particularly long time, and for all of which I would like to log metrics (both the individual run metrics and the group metrics are relevant to me). Unfortunately, my compute environment is such that I must run W&B in offline mode (the compute nodes are not connected to the internet), and as a result I have found syncing to be an extreme bottleneck in my work. Has anyone encountered this kind of issue before and come up with a way to deal with it?

Hi @evanv , thanks for writing in. We’re looking into this for you.

Hey @evanv,

I’m sorry to hear that the offline-to-online sync for wandb has been a bottleneck for you. Would you be able to share more context on what commands you’re running, or a minimal example that reproduces the issue?

In the meantime, have you tried adjusting the arguments to wandb sync (wandb sync - Documentation) to send batches of runs at a time? Using glob patterns, or cleaning up runs you no longer need, you can reduce the load of syncing all runs at once.
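For example, shell globbing lets you pass one batch of offline run directories at a time; the timestamp prefix here is just a placeholder:

wandb sync wandb/offline-run-20211208_15*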

Hi @a-sh0ts , sure thing!
A minimal example for what I am doing looks like this:

import torch
import wandb

def do_hyperparam_search():
    configs = [
        {'lr': lr, 'lambda': lmbda, 'model_type': model_type}
        for lr in [1, 0.1, 0.001]
        for lmbda in [1, 10, 100]
        for model_type in ['foo', 'bar']
    ]
    for config in configs:
        run_config(config)


def run_config(config):
    for seed in range(100):
        do_wandb_run(seed, config)

def do_wandb_run(seed, config):
    torch.manual_seed(seed)
    wandb.init(config=config, mode='offline')
    for epoch in range(100):
        for cls in range(5):
            # value is a placeholder for the real per-class metric
            wandb.log({f'some_metric_for_class_{cls}': value}, step=epoch)
    wandb.finish()

I have a particular algorithm I am testing which is only sometimes convergent, so it is necessary to run it over many different seeds and view both the average behavior and the variance in that behavior. The issue I am running into is that even a single experiment with a particular config produces hundreds of runs, and testing over different hyperparameter configurations multiplies the issue. Unfortunately my compute nodes are all offline, so I have to manually sync all my runs. Although I have been using wandb sync --sync-all, with thousands of runs that also becomes untenable. Is there a better way I can run these experiments?

Hi @evanv!
You could try using W&B Sweeps with the bayes or random search strategies. In those cases you wouldn’t be searching the entire space, but you would still get a good picture of the search landscape, which would reduce the number of runs you’d have to sync with wandb.
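For reference, a minimal sketch of a random-search sweep over your grid (the metric name, project name, and run cap are placeholders, and the sweep agent does need to be able to reach a W&B server):

import wandb

sweep_config = {
    'method': 'random',  # or 'bayes' for Bayesian search
    'metric': {'name': 'some_metric', 'goal': 'minimize'},
    'parameters': {
        'lr': {'values': [1, 0.1, 0.001]},
        'lambda': {'values': [1, 10, 100]},
        'model_type': {'values': ['foo', 'bar']},
    },
}

def train():
    run = wandb.init()
    # train using run.config['lr'], run.config['model_type'], etc.,
    # and wandb.log the metric named in the sweep config
    run.finish()

sweep_id = wandb.sweep(sweep_config, project='my-project')
wandb.agent(sweep_id, function=train, count=20)  # cap the number of runs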

Hey @_scott , thanks so much for the advice! I have been wanting to integrate Sweeps for a while but was not quite clear how they would work: if I create a sweep, will it only require me to sync one file (for the sweep), or all the runs associated with the sweep? The primary multiplicity in my code comes from the fact that I have to run 100 seeds per configuration.

You’ll still have to sync your runs if you want to view them in W&B.

The primary multiplicity in my code comes from the fact that I have to run 100 seeds per configuration.

Wow, that’s a lot of seeds. I’ll move this back into “W&B Best Practises” and hopefully someone in the community has seen this and can give some advice. You could also try wandb local, which would give you your own self-hosted W&B; this would require a bit of upfront time investment but would likely ease that syncing bottleneck.
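For reference, starting the self-hosted server is a single command, assuming Docker is available on the machine:

wandb local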


Got it. Thanks for the advice. I will look into wandb local to see whether it is tenable to set it up on my system. If anyone else has advice, it is welcomed too 🙂


Hey @evanv!

@_scott 's advice on using W&B Sweeps and/or wandb local should hopefully have helped with some of your issues logging and syncing large volumes of runs in an offline setting. The engineering team is aware of the problems you’re facing and would love to hear suggestions on desired behavior around this!

Hey @a-sh0ts , thanks so much for making the engineering team aware! I have been thinking a bit about what optimal functionality would look like for me. Day to day, I mostly care about some statistic across the seeds I am collecting (mean, median, max, etc.) as well as standard errors of that statistic across runs. Perhaps functionality to store only these statistics, rather than all the information from all runs, would allow for more efficient storage and processing of the data?


Hey @evanv,
In the case where you want to limit the data you’re logging, you could aggregate that data locally and only log the summary metrics to W&B for comparison.
You would do this by just setting

wandb.run.summary['my_metric'] = the_final_metric_you_care_about

I’m aware that leaves a bit of work on your end: you would replace your wandb.log calls with appends to a list or np.array, then do the different calculations yourself and log them to wandb.run.summary after the run is complete.
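For example, a minimal sketch of that pattern, where compute_metric is a hypothetical stand-in for your real per-epoch computation:

import numpy as np
import wandb

run = wandb.init(mode='offline')
values = []
for epoch in range(100):
    values.append(compute_metric())  # hypothetical per-epoch metric
# aggregate locally instead of logging every step
run.summary['metric_mean'] = float(np.mean(values))
run.summary['metric_sem'] = float(np.std(values) / np.sqrt(len(values)))
run.finish()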
Hope this helps

Hey @_scott ,

Thanks so much for the suggestion! Unfortunately I am not sure that would solve the issue I am dealing with, since the bottleneck is really the uploading of so many run files (each of which is relatively small) to wandb. Currently what I do is run all seeds, aggregate locally, and then upload the aggregate of all runs to wandb, but unfortunately this loses the distributional information about statistics that wandb provides when I upload all the runs. Another (maybe better?) solution to this problem could be a method to batch-upload wandb runs rather than having to sync each one individually. Is there functionality like that which already exists?

That exact functionality doesn’t exist yet, but you can use --include-globs to do it yourself batch by batch. This isn’t ideal, but it could speed up your syncing.

Here’s an example command for that, run from inside your wandb directory:

wandb sync --include-globs offline-run-20211208_15*
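If globbing by timestamp is too coarse, a small driver script can chunk the offline run directories explicitly. A sketch, assuming the default ./wandb offline directory and the wandb CLI on your PATH (the chunk size is arbitrary):

import glob
import subprocess

run_dirs = sorted(glob.glob('wandb/offline-run-*'))
chunk = 50  # tune to your environment
for i in range(0, len(run_dirs), chunk):
    # wandb sync accepts multiple run directories per invocation
    subprocess.run(['wandb', 'sync', *run_dirs[i:i + chunk]], check=True)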
