I am trying to make wandb work with Azure for versioning my datasets.
My dataset is too big to upload, so I keep it in Azure and add it by reference.
I am using the file-based reference (file:///) for a folder that is mounted to the compute instance.
Registering the dataset, checksumming it, etc. all works fine.
My problem now is how to USE the artifact.
Since Azure mounts the folder under a randomly generated name each time, I cannot use the stored reference name. What I am doing right now is using the keys of the manifest entries:
artifact.manifest.entries.keys()
This gives me all the filenames, which I then manually concatenate onto the path of the mounted folder.
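For reference, the workaround looks roughly like the sketch below. The helper name `resolve_entries` and the `mount_root` argument are placeholders made up for this post, not part of wandb; the only wandb call involved is the `artifact.manifest.entries.keys()` shown above, and it appears only in a comment so the snippet runs on its own.

```python
from pathlib import Path
from typing import Iterable, List


def resolve_entries(entry_names: Iterable[str], mount_root: str) -> List[Path]:
    """Join each manifest entry name onto the current mount point.

    `entry_names` is expected to be the keys of artifact.manifest.entries,
    i.e. paths relative to the artifact root.
    """
    root = Path(mount_root)
    # Sort so the order is stable across runs.
    return [root / name for name in sorted(entry_names)]


# With a real artifact this would be called as (not executed here):
#   paths = resolve_entries(artifact.manifest.entries.keys(), mount_root)
# where mount_root is whatever path Azure assigned to the mount this time.
```

This only rebuilds the paths. It does not check that the files under the mount still match what was registered.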
Is there a better, less hacky way of doing it? (Or even a better way to use Azure, given that wandb supports S3 and Google Cloud Storage?)
.download() is not an option, since the dataset is too big and the mounted folder works fine.
.checkout() does not work, since the folder also contains other files which I do not want to delete.
.get() and similar also don't work, since I don't know the file paths.
In my ideal world I would just have a function artifact.files(root="mount_path", verify=True) which returns a list of all filenames and verifies via checksum that they are correct. Then I could just use the dataset and be sure it is the same one.
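To make the wish concrete, here is a rough sketch of the behaviour I have in mind, written as a standalone helper rather than a real wandb method. It rests on one assumption that should be checked against the wandb source: that the digest stored for a checksummed file reference is the base64-encoded MD5 of the file contents. The names `files`, `md5_b64`, and the `digests` mapping are invented for illustration.

```python
import base64
import hashlib
from pathlib import Path
from typing import List, Mapping


def md5_b64(path: Path, chunk_size: int = 1 << 20) -> str:
    """Base64-encoded MD5 of a file, read in chunks to keep memory use flat."""
    md5 = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return base64.b64encode(md5.digest()).decode("ascii")


def files(digests: Mapping[str, str], root: str, verify: bool = True) -> List[Path]:
    """Resolve entry names against `root`, optionally checking each digest.

    `digests` maps entry name -> expected digest. ASSUMPTION: the expected
    digest is a base64-encoded MD5 of the file contents.
    """
    resolved = []
    for name, expected in digests.items():
        path = Path(root) / name
        if verify:
            if not path.is_file():
                raise FileNotFoundError(f"missing file for entry {name!r}: {path}")
            actual = md5_b64(path)
            if actual != expected:
                raise ValueError(
                    f"digest mismatch for {name!r}: expected {expected}, got {actual}"
                )
        resolved.append(path)
    return resolved


# With a real artifact the mapping could be built like this (not executed here):
#   digests = {name: entry.digest for name, entry in artifact.manifest.entries.items()}
#   paths = files(digests, root=mount_root, verify=True)
```

Unlike .checkout(), this never touches files in the folder that are not part of the artifact. With verify=True it does read every file once, so it costs a full pass over the dataset.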
Thank you! Artifacts are such a great addition to wandb, and I would love to use them.