In our project we have a multi-stage pipeline in which each stage has its own set of hyperparameters. It looks something like this:
preprocessing -> vectorization -> clustering
For the sake of simplicity, let’s assume that each stage has a single hyperparameter:
preprocessing: cutoff_threshold
vectorization: model_name
clustering: num_clusters
Currently each stage logs everything to a separate run. All stage runs from a single pipeline execution are collected into a single group.
At the moment we do the hyperparameter sweep naively: with two candidate values per hyperparameter, we end up running the whole pipeline 8 times (2x2x2). Using the grouping feature, we can nicely compare how changing the hyperparameters of each stage affects the end results.
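To make the current setup concrete, here is a minimal sketch of the naive sweep. The candidate values and the group naming scheme are made up for illustration, and the wandb call is only indicated in a comment (`group`, `job_type` and `config` are arguments of `wandb.init`), so the snippet runs without wandb installed.

```python
from itertools import product

# Hypothetical candidate values: the only given is two options per hyperparameter.
grid = {
    "cutoff_threshold": [0.1, 0.5],
    "model_name": ["model_a", "model_b"],
    "num_clusters": [5, 10],
}
stages = ["preprocessing", "vectorization", "clustering"]

runs = []
for i, values in enumerate(product(*grid.values())):
    config = dict(zip(grid.keys(), values))
    group = f"pipeline-{i}"  # hypothetical name: one group per full pipeline execution
    for stage in stages:
        # In the real pipeline, each stage would start its own run here, e.g.
        # wandb.init(group=group, job_type=stage, config=config)
        runs.append({"group": group, "stage": stage, "config": config})

# Count how many times each stage is executed across the whole sweep.
per_stage = {s: sum(r["stage"] == s for r in runs) for s in stages}
print(per_stage)  # {'preprocessing': 8, 'vectorization': 8, 'clustering': 8}
```

Every stage is executed once per hyperparameter combination, which is where the 8 runs of each stage come from.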
However, the preprocessing and vectorization steps are quite compute-intensive. In theory, in the described setup we would only need to run preprocessing twice and vectorization 4 times (instead of 8 runs of each). But then we cannot (or at least I can’t think of a way to) group such runs so that we get a nice sweep view, i.e. we cannot tell wandb which run of clustering was based on which run of
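As a rough illustration of the counts mentioned above, the sketch below caches the upstream stages so that each one is executed only once per distinct set of upstream hyperparameters. The stage functions, the candidate values and the lineage strings are all placeholders invented for this example; it says nothing about how wandb should be told about these dependencies, which is exactly the open question.

```python
from functools import lru_cache
from itertools import product

# Counts how often each stage body is actually executed.
calls = {"preprocessing": 0, "vectorization": 0, "clustering": 0}

@lru_cache(maxsize=None)
def preprocess(cutoff_threshold):
    calls["preprocessing"] += 1
    return f"prep(cutoff={cutoff_threshold})"

@lru_cache(maxsize=None)
def vectorize(cutoff_threshold, model_name):
    calls["vectorization"] += 1
    upstream = preprocess(cutoff_threshold)  # reused from the cache when possible
    return f"vec(model={model_name}) <- {upstream}"

def cluster(cutoff_threshold, model_name, num_clusters):
    calls["clustering"] += 1
    upstream = vectorize(cutoff_threshold, model_name)
    return f"clu(k={num_clusters}) <- {upstream}"

# Hypothetical candidate values, two per hyperparameter.
grid = product([0.1, 0.5], ["model_a", "model_b"], [5, 10])
lineage = [cluster(c, m, k) for c, m, k in grid]

print(calls)  # {'preprocessing': 2, 'vectorization': 4, 'clustering': 8}
```

Each entry in `lineage` records which vectorization and preprocessing result a clustering run was built on. That is the dependency information I would like to be able to express in wandb.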
I wonder what the best way is to set up such a pipeline in wandb?