We are running a long data preparation job (30+ hours) to pre-build source files for training. However, part of the dataset was not ready and was excluded from the current run (which is now 20+ hours in). I would like to process the remaining data and ADD it to the current artifact.
I note that whenever I run this code, it creates a new version of the artifact.
How can I append new data to an existing artifact?
Second question: Can I add new data in parallel with the original job? That is, can two different processes add data to the same artifact at the same time?
Hi @kevinashaw, thank you for writing in, we will be happy to help here. As per my response to your email inquiry, I am discussing with the team how best to approach the above. Once I hear back, I will provide you with an update. Thanks
I have spoken with the team. You have the option of using our wandb artifact upsert calls to append to a non-finalized artifact as the output of a run, see here. However, we highly recommend you use S3 URI references instead, as that would be the more straightforward approach. Add all data to the S3 bucket, then set up a reference URI to generate the new artifact with all your processed data. If you run into any issues, please let me know.
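For the upsert option, a minimal sketch of how several writer processes might contribute to one pending artifact version is shown below. It assumes the `Run.upsert_artifact` and `Run.finish_artifact` methods from the W&B Python SDK, with every writer passing the same `distributed_id`. The project name, artifact name, and `distributed_id` value are placeholders, not taken from the original job. It also only applies while the version has not been finalized, so it would not help if the original run commits its artifact with `log_artifact`.

```python
def append_references(run, artifact, s3_uris, distributed_id):
    """Add S3 references to a shared, not-yet-finalized artifact version."""
    for uri in s3_uris:
        # add_reference records a pointer to the object(s); no data is copied.
        artifact.add_reference(uri)
    # Writers that pass the same distributed_id contribute to the same pending version.
    run.upsert_artifact(artifact, distributed_id=distributed_id)


def finalize(run, artifact, distributed_id):
    """Commit the pending version. Call once, after all writers have finished."""
    run.finish_artifact(artifact, distributed_id=distributed_id)


DIST_ID = "prep-run-01"            # placeholder: shared by all writers
ARTIFACT_NAME = "training-sources"  # placeholder artifact name


def writer_job(s3_uris):
    """One writer process; several of these may run in parallel."""
    import wandb  # imported lazily so the helpers above work without W&B installed

    with wandb.init(project="data-prep", group=DIST_ID, job_type="writer") as run:
        artifact = wandb.Artifact(ARTIFACT_NAME, type="dataset")
        append_references(run, artifact, s3_uris, DIST_ID)


def finalizer_job():
    """A separate, final process that commits the version."""
    import wandb

    with wandb.init(project="data-prep", group=DIST_ID, job_type="finalize") as run:
        artifact = wandb.Artifact(ARTIFACT_NAME, type="dataset")
        finalize(run, artifact, DIST_ID)
```

Each process would call `writer_job([...])` with its own S3 URIs, and `finalizer_job()` would run once at the end. Until the finalizer runs, the version is not committed and should not be consumed by training jobs.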
Thank you. I think that I am starting to understand that Artifacts are intended as locked or frozen containers of objects. And since projects and experiments will reference them historically, they can't be changed once finalized, since that would break the historical references.
We are already using s3 references for all our files in the datasets.
Thank you, Kevin
Thank you @kevinashaw for confirming you are using S3 references for all your files.
With regard to Artifacts, your understanding is correct: Artifacts are containers of objects that track the history of changes, and they can be thought of as a versioned directory. This makes it easy to get a complete and auditable history of changes to your files. Anytime you change the contents of an Artifact, W&B will create a new version of your artifact instead of overwriting the previous contents (maintaining the historical reference you mentioned). Please do reach out again anytime you have additional questions.
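With S3 references, this versioning model means that "appending" amounts to logging a new version under the same artifact name whose references cover both the original and the late data. A minimal sketch, assuming `wandb.Artifact.add_reference` and `Run.log_artifact` from the W&B Python SDK; the bucket paths, labels, project name, and artifact name are all hypothetical:

```python
def build_dataset_artifact(artifact_cls, name, labeled_prefixes):
    """Create a dataset artifact that references each S3 prefix under its own label."""
    artifact = artifact_cls(name=name, type="dataset")
    for label, uri in labeled_prefixes.items():
        # name=label keeps entries from different prefixes apart in the manifest.
        artifact.add_reference(uri, name=label)
    return artifact


def log_new_version(name, labeled_prefixes):
    """Log the artifact; changed contents yield a new version under the same name."""
    import wandb  # imported lazily so build_dataset_artifact works without W&B installed

    with wandb.init(project="data-prep", job_type="register-dataset") as run:
        artifact = build_dataset_artifact(wandb.Artifact, name, labeled_prefixes)
        run.log_artifact(artifact)


# Hypothetical layout: the original output plus the part that was not ready in time.
ALL_PARTS = {
    "main": "s3://my-bucket/prepared/main/",
    "late": "s3://my-bucket/prepared/late/",
}
```

Calling `log_new_version("training-sources", ALL_PARTS)` after the late data lands in the bucket would register the combined dataset as the next version, while earlier versions stay intact for any runs that already reference them.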