Use W&B in a Jupyter notebook to load a dataset

Hi all,

after a few years working in the field and recommending W&B to other people, I’m excited to finally get to use it myself :grinning:

As part of a remote team, I’m doing an EDA (Exploratory Data Analysis) in Jupyter. We’re storing the dataset as a W&B artifact, and I need my notebook to download the dataset locally, so I wrote something like:

from pathlib import Path

import wandb

artifact_file = "my_entity/my_project/my_dataset:v0"
data_dir = Path('.').parent / 'data'

# Download data from W&B
data = wandb.use_artifact(artifact_file)
data.download(root=data_dir)

However, when I run the cells I get the error:

Error: You must call wandb.init() before wandb.use_artifact()

Two questions:

  1. how do I fix this? Would something like this suffice?
run = wandb.init(
    reinit=True,
    project="my_project",
    entity="my_entity",
    group="eda",
)

# Download data from W&B
data = wandb.use_artifact(artifact_file)
data.download(root=data_dir)
  2. Since I called wandb.init(), I guess I should call run.finish() at the end of my EDA; otherwise the background process will keep running forever (or, more realistically, until some timeout). In a typical training script, where all the code is written and debugged before I launch the wandb background process, this would be easy: I would just add run.finish() as the last line of the script. Here, however, I keep editing and adding code as the analysis goes on (it’s Jupyter). So what’s the best practice? Do I carry on with my analysis and put run.finish() in the last cell? Or do I call run.finish() immediately after downloading the data to data_dir?

In other words, I know the standard workflow for the W&B logger and artifacts in non-interactive mode (Python scripts), but I’m not so familiar with the W&B workflow for interactive analyses (Jupyter notebooks). Can you help me?

Thanks,

Andrea


Great question @andreapi.

Yes, Artifacts are connected to Runs in wandb – that’s what lets us give you that nice graph showing which run used which artifact.

Your solution works, and the data will be available locally at data_dir.

In terms of best practice, it depends on how you’re using W&B with your EDA.

If your EDA is not logged to W&B, that is, if you’re just using Artifacts to store versioned data rather than versioned analysis results, then you should close the run immediately after the download. I would make sure you create the run with the job_type argument to wandb.init set to download or something like that.
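
For the download-only case, a minimal sketch could look like the following. The helper name download_dataset and its parameters are made up for illustration, and "download" is just a label for job_type, not a special value. The try/finally is there so the run gets closed even if the download fails.

```python
from pathlib import Path


def download_dataset(artifact_name, data_dir, project, entity):
    """Download an artifact inside a short-lived run and return the local path.

    Hypothetical helper: the name and signature are illustrative only.
    """
    import wandb  # imported here so that defining the helper has no side effects

    # job_type is a free-form label; "download" simply marks what this run was for
    run = wandb.init(project=project, entity=entity, job_type="download")
    try:
        # use_artifact records that this run consumed the artifact
        artifact = run.use_artifact(artifact_name)
        local_path = artifact.download(root=str(data_dir))
    finally:
        # close the run right away, even if the download raised
        run.finish()
    return Path(local_path)


# Example call in a notebook cell (placeholder names from the question):
# data_path = download_dataset(
#     "my_entity/my_project/my_dataset:v0", Path("data"), "my_project", "my_entity"
# )
```

After that cell, no run is left open, so the rest of the notebook can keep changing without any run.finish() bookkeeping.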

If your EDA is logged to W&B, that is if you’re also going to be tracking things you do inside the EDA to W&B, then you should wait to close the run until you’re done with the analysis you’ll be tracking. This has the benefit of actually logging (if you have code saving turned on!) all the cells you run and their outputs, which has saved my :bacon: when doing EDA and other prototyping in a notebook. You can also log charts and media and dataframes/tables while you do your EDA. These features are in very active development, so we’d love to hear how they work for you and your team!
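
For the tracked case, here is a rough sketch of how the cells might be laid out. The helper names (start_eda_run, log_missing_values), the table columns, and the "missing_values" key are all invented for illustration; save_code=True is, as far as I know, one way to switch code saving on for a single run, and it can also be enabled in your account settings.

```python
def start_eda_run(project, entity, artifact_name, data_dir):
    """First cell: open a run for the whole EDA session and fetch the data.

    Hypothetical helper: the name and signature are illustrative only.
    """
    import wandb  # imported here so that defining the helper has no side effects

    # save_code=True asks W&B to save the code for this run (assumption: this is
    # the toggle you want; it can also be set in the W&B settings page)
    run = wandb.init(project=project, entity=entity, job_type="eda", save_code=True)
    artifact = run.use_artifact(artifact_name)
    local_path = artifact.download(root=str(data_dir))
    return run, local_path


def log_missing_values(run, rows):
    """Any later cell: log a small table of [column_name, n_missing] rows."""
    import wandb

    table = wandb.Table(columns=["column", "n_missing"], data=rows)
    run.log({"missing_values": table})


# Typical cell order (placeholder names from the question):
# run, data_path = start_eda_run("my_project", "my_entity",
#                                "my_entity/my_project/my_dataset:v0", "data")
# ... analysis cells, calling log_missing_values(run, rows) and friends ...
# run.finish()   # last cell, once the analysis you want tracked is done
```

The run stays open across cells, so the only thing to remember is the run.finish() call in the final cell.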


Hi Charles,

thanks for the suggestion! In the meantime I found a better solution where I don’t even need to start a run. Look ma, no init! :stuck_out_tongue_winking_eye:

import wandb

artifact_URI = ...
data_dir = ...

api = wandb.Api()
artifact = api.artifact(artifact_URI)
artifact.checkout(data_dir)

Not bad, eh? All thanks to your very good documentation and the very responsive wandb/client repository!


:tada: ! Indeed, you can do a lot via the public API, which avoids the need to make runs. It’s convenient and allows you to do things like change the history to correct errors.

I think, btw, you can/should use artifact.download instead of checkout. The difference is the same as that between just downloading the files (which is non-destructive) and doing a “checkout” in git – which guarantees that the state of the directory locally is exactly the same as it is in the remote, meaning files will get deleted/clobbered. Depends on your workflow, but most folks want download.
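
To make the download/checkout choice explicit, something like this small wrapper could work. The function name fetch_artifact and the exact_mirror flag are made up for illustration; the only W&B calls used are wandb.Api(), api.artifact(), artifact.download() and artifact.checkout().

```python
def fetch_artifact(artifact_uri, data_dir, exact_mirror=False):
    """Fetch an artifact through the public API, without starting a run.

    Hypothetical helper: the name and the exact_mirror flag are illustrative only.
    """
    import wandb  # imported here so that defining the helper has no side effects

    api = wandb.Api()
    artifact = api.artifact(artifact_uri)
    if exact_mirror:
        # checkout makes data_dir match the artifact exactly, which means
        # local files that are not part of the artifact get deleted
        return artifact.checkout(root=str(data_dir))
    # download only adds the artifact's files and leaves other files alone
    return artifact.download(root=str(data_dir))


# Example call (placeholder name from the question):
# path = fetch_artifact("my_entity/my_project/my_dataset:v0", "data")
```

Defaulting to download keeps the safe, non-destructive behaviour unless you ask for an exact mirror.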

One more plug: this works to pull down artifacts for use without logging anything to W&B, but if you want your EDA tracked (including the session history of your notebook, as I mentioned above), you’d need to treat the EDA as a “run”. If you track it and log your results with us, you could share what you find as a Report!
