Dataset artifact organization

Before wandb, my typical approach to organizing datasets was to have lots of subfolders:

mnist
    complete
        augmented-mild
        augmented-heavy
    sampled-examples
        mnist-1000
            augmented-mild
            augmented-heavy
        mnist-10k
            augmented-mild
            augmented-heavy
    sampled-class-examples
        mnist-1000-5cls
        mnist-10k-5cls

Going through the wandb Artifacts docs, it seems it is best to have a flattened structure for dataset versioning. How much flattening is ideal? Complete flattening would mean each of the folders above gets a different name with the same type (say "balanced-dataset"). But completely flattening the dataset hierarchy seems to take away wandb's "versioning" ability, since all of them are now separate artifacts.

One option, if you want each stage of your preprocessing and splitting to be versioned by Artifacts, is to create a new Artifact for each stage. You would have one complete dataset Artifact, and then one for each split of your dataset. You would call run.use_artifact("your_artifact_name:latest") to download the complete dataset, then log new Artifacts after you've split it.
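A minimal sketch of that workflow, assuming the wandb Python client; the project name, artifact names ("mnist-complete", etc.), and the `split_fn` helper are all hypothetical placeholders, not from the original post:

```python
def split_and_log(split_fn):
    """Download the complete dataset Artifact, split it, and log each
    split as its own Artifact so every stage is versioned independently.

    split_fn takes the downloaded directory and returns a dict mapping
    a split name (e.g. "mnist-1000") to the directory it wrote.
    """
    import wandb  # imported here so the sketch is easy to read standalone

    run = wandb.init(project="mnist-artifacts", job_type="split")

    # Pull the latest version of the complete dataset Artifact
    artifact = run.use_artifact("mnist-complete:latest")
    data_dir = artifact.download()

    # Produce the splits (hypothetical helper)
    split_dirs = split_fn(data_dir)

    # Log each split as its own Artifact; re-running with changed
    # contents creates a new version (v0, v1, ...) under the same name
    for name, path in split_dirs.items():
        split_artifact = wandb.Artifact(name, type="dataset")
        split_artifact.add_dir(path)
        run.log_artifact(split_artifact)

    run.finish()
```

Because each split keeps a stable Artifact name, wandb versions it over time, which is exactly the ability that full flattening into many one-off names would lose.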

Here’s a W&B Report about that approach: Weights & Biases