Workflow for running an ensemble of experiments with different initial conditions

Any ideas on the right workflow for running sweeps/groups over many variations of the initial conditions, in order to see an ensemble of results? The Group Runs documentation seems to describe a natural candidate for this, but I am not sure of the right approach or how it interacts with sweeps in this sort of use case.

To set up the scenario I have in mind: I have a script I want to run a hundred times on my local machine with pretty much all parameters fixed except the neural network initial conditions. I can control that by incrementing a --seed argument or by not setting a default seed at all. After running those experiments, it is nice to see pretty pictures of distributions in wandb, but I also want to be able to collect the results/assets later as a group and do things like plot a histogram of val_loss to put in a research paper.

Is the way to do this a combination of sweeps and run groups? For example, can I run a bunch of these in a sweep after setting the WANDB_RUN_GROUP environment variable? Maybe with a sweep file like

program: train.py
method: grid
parameters:
  seed:
    min: 2
    max: 102

Where --seed is used internally to set the seed for the experiment? Are there any better approaches?

If that works, do I just need to set the WANDB_RUN_GROUP environment variable on every machine that I will run an agent on so that the runs are grouped? And can I then pull down all of the assets for these runs using the WANDB_RUN_GROUP value? I couldn’t figure out from the docs how to get all of the logged results (and the artifacts, if there are any) for a group.

Hi @jlperla, thank you for writing in! It is indeed feasible to group the Sweeps in various ways. If you name your Sweeps (by passing a distinct value to the name argument), then in the UI you will be able to group by Sweep. Alternatively, you can use the group argument in the wandb.init() call (or WANDB_RUN_GROUP, as you mentioned). Another workaround for grouping the Sweeps (and Runs in general), though not the most recommended for your case, would be to use Tags. Finally, you can also use your config values to group the Runs.
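As a rough sketch of the group option, a training script could read the seed passed by the sweep agent and hand a group name to wandb.init(). The script layout, the "ensemble-seeds" fallback group name, and the placeholder val_loss value are assumptions for illustration, not part of the original setup.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the --seed flag that the sweep agent passes to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, default=0)
    return parser.parse_args(argv)


def resolve_group(default="ensemble-seeds"):
    """Use WANDB_RUN_GROUP if it is set, else a hypothetical default name."""
    return os.environ.get("WANDB_RUN_GROUP", default)


def main(argv=None):
    import wandb  # imported lazily so the helpers above work without wandb

    args = parse_args(argv)
    # group= puts every seed of the ensemble under one group in the UI
    run = wandb.init(group=resolve_group(), config={"seed": args.seed})
    # ... seed the model and train with args.seed here ...
    run.log({"val_loss": 0.0})  # placeholder, replace with the real metric
    run.finish()


# In the real script, call main() at the bottom of the file.
```

Passing group explicitly keeps the grouping inside the script, so it does not depend on every agent machine exporting the same environment variable.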

Have you tried any of these solutions, and did you run into any issues? Or, if you tell us which way you would prefer, we can look further into that option. Would you prefer to have your Sweeps preconfigured from your script, or to group them afterwards in the UI by their config values?

Thanks. That is very helpful. What ended up working really well was a sweep file such as sweep_ensemble.yaml

program: my_script.py
project: my_project
method: grid
parameters:
  seed:
    min: 10
    max: 50

Called with wandb sweep --name my_sweep_name sweep_ensemble.yaml

Here I am using PyTorch Lightning’s LightningCLI to run my experiments, so --seed=15 etc. will change the seed and then run the experiment. I can then set the tags argument of WandbLogger (see the PyTorch Lightning 1.7.5 documentation) if it helps for searching.

Different variations of this call seem to work well. Where I am getting a little more stuck is the best way to programmatically get all of the details of the sweep with a query in the Python interface.

I can get the list of sweeps, but I can’t figure out how to look one up by my chosen display name (i.e., my_sweep_name above) as opposed to the underlying sweep identifier. Is there a way to use runs = api.runs("myentity/my_project", filters= ???) or something like that to get the list of runs for a given display name, which I can then query for summary, config, and artifacts?

Alternatively, if tags are easier and I set one up, is there an easy way to get all of the runs which have a particular tag?

Hi @jlperla, thank you for the detailed information, and great to hear that the grouping issue has now been resolved. Regarding your question about using the API to filter runs, you can indeed do that with the following command, where sweep_id is the underlying sweep identifier:
runs = api.runs("entity/project", filters={"sweep": "sweep_id"})
Alternatively, you can use the API to tag all your runs with the my_sweep_name identifier and then query the runs as follows:
runs = api.runs("entity/project", filters={"tags": "my_sweep_name"})
Is my_sweep_name defined in your config? In that case you could use filters={"config.sweep_name": "my_sweep_name"}.
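To tie the filters above to the original goal of collecting val_loss across the ensemble, here is a minimal sketch. The helper names are hypothetical, and sweep_ids_by_name assumes each sweep object exposes .name (the display name) and .id, which is worth checking against the installed wandb version. The demo at the bottom uses stand-in objects, so it runs without network access.

```python
from types import SimpleNamespace


def sweep_filter(sweep_id):
    """Filter for runs belonging to one sweep, by underlying sweep ID."""
    return {"sweep": sweep_id}


def tag_filter(tag):
    """Filter for runs carrying a given tag."""
    return {"tags": tag}


def sweep_ids_by_name(sweeps, display_name):
    """Map a display name to sweep IDs (assumes .name and .id attributes)."""
    return [s.id for s in sweeps if s.name == display_name]


def collect_metric(runs, key="val_loss"):
    """Gather one summary value per run, skipping runs that never logged it."""
    rows = []
    for run in runs:
        value = run.summary.get(key)
        if value is not None:
            rows.append({"name": run.name, "seed": run.config.get("seed"), key: value})
    return rows


def fetch_runs(path, filters):
    """Query the W&B public API; path is 'entity/project'."""
    import wandb  # imported lazily so the helpers above work offline

    return wandb.Api().runs(path, filters=filters)


# Offline demo with stand-in objects shaped like sweeps and runs.
fake_sweeps = [SimpleNamespace(id="abc123", name="my_sweep_name")]
fake_runs = [
    SimpleNamespace(name="run-a", summary={"val_loss": 0.1}, config={"seed": 10}),
    SimpleNamespace(name="run-b", summary={}, config={"seed": 11}),  # crashed early
]
ids = sweep_ids_by_name(fake_sweeps, "my_sweep_name")
rows = collect_metric(fake_runs)
```

With real data, something like rows = collect_metric(fetch_runs("entity/project", sweep_filter(ids[0]))) would give a list whose val_loss values can be passed to a plotting library for the histogram. Artifacts logged by a run should be reachable from the run object as well, for example through run.logged_artifacts().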

Would any of these work for you? Please let me know if you have any further questions or issues with this!
