What is the best way to log multi-stage pipelines?

Hi,
In the project we have a multi-stage pipeline where each stage has a set of hyperparameters. It looks something like this:
preprocessing -> vectorization -> clustering

For the sake of simplicity, let’s assume that each stage has a single hyperparameter:

preprocessing: cutoff_threshold
vectorization: model_name
clustering: num_clusters

Currently each stage logs everything in a separate run, and all stages from a single pipeline execution are grouped into a single group.

At the moment we are doing the hyperparameter sweep naively: we run the whole pipeline 8 times (2×2×2), and using the grouping feature we can nicely compare how changing the hyperparameters of each stage affected the end results.
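For concreteness, the naive 2×2×2 sweep amounts to enumerating every combination of the three hyperparameters (the values below are made up, and the per-stage wandb calls are only indicated in a comment):

```python
import itertools

# Hypothetical values for each stage's single hyperparameter
cutoff_thresholds = [0.3, 0.5]
model_names = ["model-a", "model-b"]
num_clusters_options = [4, 8]

combos = list(itertools.product(cutoff_thresholds, model_names, num_clusters_options))
print(len(combos))  # 8 full pipeline executions in the naive sweep

for i, (cutoff, model, k) in enumerate(combos):
    group = f"sweep-{i}"
    # For each stage, one run sharing the same group, e.g.:
    #   wandb.init(project=..., group=group, job_type="preprocessing",
    #              config={"cutoff_threshold": cutoff})
    # ...run the stage, log metrics, finish the run...
```

The shared `group` per combination is what makes the grouped comparison view work, at the cost of re-running every stage 8 times.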

However, the preprocessing and vectorization steps are quite compute-intensive. In theory, in the described setup we would only need to run preprocessing twice and vectorization 4 times (instead of 8 runs of each). But then we cannot (or at least I can’t think of a way to) group such runs so that we get a nice sweep view, i.e. we cannot tell wandb which run of clustering was based on which run of vectorization and preprocessing.

I wonder what the best way is to set up such a pipeline in wandb?

Thanks,
Michał


Thanks for joining the forum and posting this great question; this is the exact type of thing we had in mind for this category.

I think I need some clarification to properly understand your question. Are you hoping for sweeps to help you with ablation studies, doing conditional steps based on earlier parts of the pipeline? I wonder if you could use tags to help with this, so that each run doesn’t have to be in only one collection.

I’m going to ask around and have a think myself to see if we can find a way to do this nicely; otherwise I can +1 this to make it into the product roadmap to support this kind of effort.

Hi Scott, thanks for your quick reply, and sorry for such a slow response on my part.

Yes, that’s pretty much spot on. I have multiple steps that are conditional only on the previous steps (a fairly typical data processing + training DAG). More than one step in such a DAG is parameterized, and there is no obvious way to tell in advance how the parameters from each step interact and how the composition of specific hyperparameter values affects the final model performance. Ideally, I would like a summarised view of all the hyperparameter combinations used in the pipeline in a table, sort it by the final metric score, and then see what I should focus on and what I shouldn’t.
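The summary view I have in mind is essentially this operation (a toy sketch; the metric name and values are made up for illustration):

```python
# One row per hyperparameter combination, sorted by the final metric.
# In practice these rows would come from wandb, not be hard-coded.
results = [
    {"cutoff_threshold": 0.3, "model_name": "model-a", "num_clusters": 4, "silhouette": 0.41},
    {"cutoff_threshold": 0.5, "model_name": "model-b", "num_clusters": 8, "silhouette": 0.57},
    {"cutoff_threshold": 0.3, "model_name": "model-b", "num_clusters": 4, "silhouette": 0.49},
]
table = sorted(results, key=lambda r: r["silhouette"], reverse=True)
print(table[0])  # best-performing combination first
```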

Currently the only ways I can think of are to either run the whole pipeline each time and then group the results, or to log the hyperparameters for each stage offline (e.g. to a file) and push them to wandb only in the final stage. Neither is ideal: the first is quite slow, and the second seems cumbersome (although it does work).
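The second option, logging hyperparameters offline and pushing them at the end, could look roughly like this (a minimal sketch; the helper name and file layout are made up, and the final wandb call is only indicated in a comment):

```python
import json
import os
import tempfile

def log_stage_params(path, stage, params):
    """Append one stage's hyperparameters to a shared JSON file."""
    state = {}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
    state[stage] = params
    with open(path, "w") as f:
        json.dump(state, f)

path = os.path.join(tempfile.mkdtemp(), "pipeline_params.json")
log_stage_params(path, "preprocessing", {"cutoff_threshold": 0.5})
log_stage_params(path, "vectorization", {"model_name": "model-a"})
log_stage_params(path, "clustering", {"num_clusters": 8})

with open(path) as f:
    merged = json.load(f)
# Only the final stage talks to wandb, e.g.:
#   wandb.init(project=..., config=merged)
print(sorted(merged))
```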

I also thought about somehow using tags; however, if I understand correctly, tags can only be used for filtering. So if I have a 3-stage grid search of 2×2×2, I would like to have 8 entries in the table/sweep view, but if I can only filter (reduce the number of runs), I would end up with a maximum of 2 entries (as I have only 2 runs of the first stage).


An option here would be to use W&B Artifacts to capture the state of the expensive parts of the pipeline. You could build logic that saves the output of each combination of steps, and then consume those artifacts whenever that ablation has already been computed. This is an idea I haven’t fully explored, but I think it would be a really useful approach that would avoid a lot of redundant computation.
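The caching logic behind that idea can be sketched in plain Python: key each stage's output by a hash of its hyperparameters, and skip the computation when a result for that key already exists. Here a local directory stands in for the artifact store; in a real pipeline you would log the output with `wandb.Artifact` and fetch it with `run.use_artifact(...)` instead (all names below are illustrative):

```python
import hashlib
import json
import os
import tempfile

CACHE_DIR = tempfile.mkdtemp()  # stand-in for the W&B Artifacts store

def cache_key(stage, params):
    """Deterministic key for a (stage, hyperparameters) combination."""
    blob = json.dumps({"stage": stage, "params": params}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()[:16]

def run_stage(stage, params, compute):
    """Reuse a stored result if this ablation already ran, else compute it."""
    key = cache_key(stage, params)
    path = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(path):
        with open(path) as f:     # cache hit: no recomputation
            return json.load(f)
    result = compute()            # expensive computation
    with open(path, "w") as f:
        json.dump(result, f)
    return result

calls = []
def preprocess():
    calls.append(1)               # count how often we actually compute
    return {"rows": 1000}

params = {"cutoff_threshold": 0.5}
run_stage("preprocessing", params, preprocess)
run_stage("preprocessing", params, preprocess)  # second call hits the cache
print(len(calls))  # the expensive step ran only once
```

With this pattern, an 8-combination sweep would compute preprocessing twice and vectorization four times, exactly as described above, while every downstream run can still record which upstream artifact it consumed.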