Join over different tables in a run

Hello,

I am looking at this example where, at each epoch, a table is generated containing the dataset (images, ground truth) together with the model predictions. The table is then logged so that the predictions can be visualized at every epoch.

It looks redundant and bandwidth-hungry to log the images at every epoch. I would like to log the dataset as a table only once, with the columns (id, image, ground truth), then at every epoch log only a table with the model predictions, i.e. with columns (id, prediction), and finally join the two tables in the UI on the “id” key.
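To make the idea concrete, here is a minimal sketch of the join in plain Python. It makes no W&B calls; the file names, labels, and the `join_on_id` helper are made up for illustration.

```python
# Hypothetical data: the dataset is stored once, keyed by "id".
dataset = {
    0: {"image": "img_0.png", "ground_truth": "cat"},
    1: {"image": "img_1.png", "ground_truth": "dog"},
}

# One small predictions table per epoch, holding only (id, prediction).
predictions_by_epoch = {
    0: [(0, "dog"), (1, "dog")],
    1: [(0, "cat"), (1, "dog")],
}


def join_on_id(dataset, predictions):
    """Inner-join prediction rows with dataset rows on the shared id."""
    joined = []
    for row_id, prediction in predictions:
        if row_id in dataset:
            record = dataset[row_id]
            joined.append(
                (row_id, record["image"], record["ground_truth"], prediction)
            )
    return joined


# View for epoch 1: images come from the single dataset copy.
epoch_1_view = join_on_id(dataset, predictions_by_epoch[1])
```

The images live in one place and each epoch only adds a small (id, prediction) table. What I am after is the equivalent of `join_on_id` done in the UI.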

This does not seem to be possible at the moment. Has anyone tried something similar? Is it really standard to log a whole dataset at every evaluation step?

Thanks!

There is a way to log the images just once. You log a table without the model predictions and then log a new table that references these images. The integrations with Lightning and Keras already do this.

You do this in three steps.

  • Log a Table into an Artifact
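import wandb

# assumes a run is active (wandb.init() was called) and that
# `data` is a list of [image, label] rows
at = wandb.Artifact("evaluation_data", type="data")
ds_table = wandb.Table(columns=["image", "label"], data=data)
at.add(ds_table, "dataset_table")
wandb.log_artifact(at)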
  • Then you grab this artifact and recover the table:
at = wandb.use_artifact("evaluation_data", type='data')

# grab the ds table
ds_table = at.get("dataset_table")
index = ds_table.get_index()
  • Finally, you create a new Table and reference (index) the values from the referenced table.
# create a new predictions table
preds_table = wandb.Table(columns=["image", "label", "predictions"])

# then we fill the new table with the values from the `ds_table`;
# `preds` holds the model outputs, indexable by row index
for idx in index:
    pred = preds[idx]
    row = [ds_table.data[idx][0], ds_table.data[idx][1], pred.argmax()]
    preds_table.add_data(*row)

# finally we log the new predictions table to a new Artifact
pred_artifact = wandb.Artifact(f"run_{wandb.run.id}_preds", type="evaluation")
pred_artifact.add(preds_table, "model_predictions")
wandb.log_artifact(pred_artifact)

It is pretty verbose, but it keeps track of the lineage.
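
To tie this back to the per-epoch question, step 3 can be wrapped in a helper that is called once per epoch. This is only a rough sketch: `build_prediction_rows` and `log_epoch_predictions` are hypothetical names, the per-epoch table name is my own choice, and it assumes an active run, a `ds_table` recovered as in step 2, and `preds` being a sequence of per-row score lists.

```python
def build_prediction_rows(ds_rows, index, preds):
    """Pair each dataset row with its predicted class.

    ds_rows: rows of the dataset table, e.g. ds_table.data ([image, label]).
    index:   row indices, e.g. ds_table.get_index().
    preds:   per-row score sequences, indexable by row index (assumption).
    """
    rows = []
    for idx in index:
        image, label = ds_rows[idx][0], ds_rows[idx][1]
        scores = list(preds[idx])
        predicted_class = scores.index(max(scores))  # plain-Python argmax
        rows.append([image, label, predicted_class])
    return rows


def log_epoch_predictions(ds_table, preds, epoch):
    """Log one predictions table for this epoch (needs an active W&B run)."""
    import wandb  # imported here so the helper above stays dependency-free

    rows = build_prediction_rows(ds_table.data, ds_table.get_index(), preds)
    preds_table = wandb.Table(columns=["image", "label", "predictions"], data=rows)
    pred_artifact = wandb.Artifact(f"run_{wandb.run.id}_preds", type="evaluation")
    # hypothetical naming scheme: one table entry per epoch
    pred_artifact.add(preds_table, f"model_predictions_epoch_{epoch}")
    wandb.log_artifact(pred_artifact)
```

In a training loop you would call `log_epoch_predictions(ds_table, preds, epoch)` after each evaluation pass. As far as I know, logging an artifact under the same name again produces a new version of it, so the predictions from earlier epochs should remain available.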

Hey @skandermoalla, it’s possible to log the dataset only once and, in subsequent epochs, reference the logged dataset instead of uploading it again.

This approach is already used in MMDetection, MMSegmentation, MMClassification and the new W&B Keras Eval callback: