Dataset artifact organization

jasdeep06 · December 9, 2021, 9:07am

Before wandb, my usual approach to organizing datasets was to have lots of subfolders:

mnist
     complete
          augmented-mild
          augmented-heavy
     sampled-examples
          mnist-1000
               augmented-mild
               augmented-heavy
          mnist-10k
               augmented-mild
               augmented-heavy
     sampled-class-examples
          mnist-1000-5cls
          mnist-10k-5cls

After going through the wandb Artifacts docs, it seems best to have a flattened structure for dataset versioning. How much flattening is ideal? Complete flattening would mean that each of the datasets above gets a different name and the same type (say “balanced-dataset”). Completely flattening the dataset hierarchy seems to take away the “versioning” ability of wandb, since they would all become different artifacts.

_scott · December 9, 2021, 10:50am

One option, if you want each stage of your preprocessing and splitting to be versioned by Artifacts, is to create a new Artifact for each stage. You would have one Artifact for the complete dataset, and then one for each split of your dataset. You would use run.use_artifact("your_artifact_name:latest") to download the complete dataset, then log new Artifacts after you’ve split it.
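A rough sketch of that flow, in case it helps. The artifact names ("mnist-complete", "mnist-1000"), the project name, the "dataset" type, and both helper functions are made up for illustration; only wandb.init, run.use_artifact, Artifact.download, wandb.Artifact, Artifact.add_dir, and run.log_artifact are actual wandb calls. It also assumes the complete dataset was logged as a directory of files.

```python
import random
import shutil
from pathlib import Path


def sample_files(src_dir, dst_dir, n, seed=0):
    """Copy a reproducible random sample of up to n files from src_dir to dst_dir.

    Hypothetical helper: stands in for whatever splitting or sampling you do.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    files = sorted(p for p in src.rglob("*") if p.is_file())
    chosen = random.Random(seed).sample(files, min(n, len(files)))
    for path in chosen:
        # Keep the relative layout so files with equal names do not collide.
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    dst.mkdir(parents=True, exist_ok=True)
    return dst


def log_sampled_dataset(project, source_name, derived_name, n):
    """Download the complete dataset Artifact, sample it, log the result as a new Artifact."""
    import wandb  # imported here so sample_files works without wandb installed

    with wandb.init(project=project, job_type="sample-dataset") as run:
        # Declares the complete dataset as an input of this run and fetches it.
        source = run.use_artifact(f"{source_name}:latest")
        src_dir = source.download()

        out_dir = sample_files(src_dir, f"./{derived_name}", n)

        # Each stage gets its own Artifact name; "dataset" is an assumed type.
        derived = wandb.Artifact(derived_name, type="dataset")
        derived.add_dir(str(out_dir))
        run.log_artifact(derived)


if __name__ == "__main__":
    # Assumed names; requires a logged-in wandb session and an existing source Artifact.
    log_sampled_dataset("mnist-demo", "mnist-complete", "mnist-1000", n=1000)
```

Re-running this with a different sample logs a new version of the same "mnist-1000" Artifact instead of a new Artifact, so versioning is kept per stage.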

Here’s a W&B Report about that approach: Weights & Biases

system · April 20, 2022, 6:02pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
