We are new to WandB and working out best practices for using referenced artifacts.
We have an S3 bucket where we keep our data corpus so that it can be shared between machines.
We want to use WandB to track these files and to download/synchronize copies of the datasets to local machines.
There seem to be two ways to add files to an artifact:
- By group: create each file locally and upload it to S3; repeat until all the data files are in place. Then call `artifact.add_reference()` once, pointing it at the S3 prefix/directory for the files. This adds the directory and its files to the artifact. The artifact will report that only a single "file" exists (since we only added the directory), which I think is weird, by the way, but all the files seem to be there.
- One by one: create each file locally, upload it to S3, and immediately add that single S3 object to the artifact with `artifact.add_reference()`. Repeat until all files are done, then close the artifact and the run. The artifact will now properly report that n files were added. (A sketch of both approaches is below.)
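For reference, here is roughly what we are doing in both cases. The project, artifact, bucket, and key names are placeholders for our real ones:

```python
import wandb

# Approach 1: by group — a single reference to the whole S3 prefix.
# The files were already uploaded to S3 (e.g. via boto3 or the aws cli).
with wandb.init(project="my-project", job_type="upload") as run:
    artifact = wandb.Artifact("my-dataset", type="dataset")
    artifact.add_reference("s3://my-bucket/really/deep/s3/path/to/my/dataset/files")
    run.log_artifact(artifact)

# Approach 2: one by one — a separate reference per S3 object.
with wandb.init(project="my-project", job_type="upload") as run:
    artifact = wandb.Artifact("my-dataset", type="dataset")
    for key in ["part-0001.parquet", "part-0002.parquet"]:  # placeholder keys
        artifact.add_reference(
            f"s3://my-bucket/really/deep/s3/path/to/my/dataset/files/{key}"
        )
    run.log_artifact(artifact)
```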
The real question is: when I later execute a `download(root=my_local_path)` operation, will I be able to cleanly load the files from the artifact into my local directory, i.e. without having to fight a path mismatch between the S3 paths and my local paths?
That is, if the S3 path is: `/really/deep/s3/path/to/my/dataset/files`
and my local path is: `/Users/user/`,
can the files end up here: `/Users/user/dataset/files`?
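For context, this is the consumer-side call we have in mind; the artifact name and local root are placeholders, and whether the downloaded layout is relative to the reference prefix or mirrors the full S3 path is exactly what we are unsure about:

```python
import wandb

# Consumer side: pull the referenced dataset down to a chosen local root.
with wandb.init(project="my-project", job_type="download") as run:
    artifact = run.use_artifact("my-dataset:latest")
    local_dir = artifact.download(root="/Users/user")
    print("files downloaded to:", local_dir)
```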
Thank you,
Kevin