I have a large dataset consisting of multiple files on my local file system that I would like to track. It is not part of a GitHub repository since the files are quite large (around 30 GB in total, with each file about 0.5 GB).
I added references to these files in W&B as a reference artifact (a sketch of the command is shown below).
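Roughly, a minimal sketch of what I'm running; the project name and dataset path are placeholders, assuming the standard `wandb.Artifact.add_reference` API:

```python
import wandb

# Minimal sketch: track local files by reference (metadata only).
# "reference-dataset-example" and /data/my_dataset are placeholders.
run = wandb.init(project="reference-dataset-example", job_type="upload")

artifact = wandb.Artifact("my-dataset", type="dataset")
# file:// references record checksums and sizes but do not upload the data itself.
artifact.add_reference("file:///data/my_dataset")

run.log_artifact(artifact)
run.finish()
```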
Now, if I change these files and log them in a run, I can see that the version of the artifact changes on the web UI. In the future, if I want to use an older version of this dataset, is there a way to do so?
I’m assuming not, since W&B is only tracking the references, and so there would be no way of going back to the old dataset.
Hi @chaitanya-kolluru, thanks for your question! When using reference artifacts, we only keep track of the metadata associated with the files, not the files themselves. If your bucket has object versioning enabled, we will retrieve the object version corresponding to the state of the file at the time the artifact was logged. This means that as you evolve the contents of your bucket, you can still point to the exact iteration of your data a given model was trained on, since the artifact serves as a snapshot of your bucket at the time of training. Please let me know if this is helpful, and don’t hesitate to ask any other questions you may have!
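For example, here is a minimal sketch of pulling a pinned version back into a later run; the artifact and project names are the same placeholders as above, and retrieval assumes the referenced files are still available at the recorded version:

```python
import wandb

# Minimal sketch: use a specific, older version of the dataset in a later run.
run = wandb.init(project="reference-dataset-example", job_type="train")

# ":v0" pins the first logged version; ":latest" resolves to the newest one.
artifact = run.use_artifact("my-dataset:v0")

# For reference artifacts, download() resolves the stored references and copies
# the files locally, provided the referenced location is still reachable.
data_dir = artifact.download()

run.finish()
```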