Artefact upload very slow

I’m encountering a similar issue to the one reported here: Programmatically accessing artifact object very slow for first call for large artifacts

I’ve found artefacts to be an excellent way to store the full outputs of my models for later debugging. However, as I’m training information retrieval models, my artefacts are rather large (~300MB). I’m only storing the titles of my documents, but even so each evaluation example has around 300 titles as output.

At the end of each WANDB run it takes a couple of hours for the run to sync. I’m running the experiments on GCP VMs so internet speed should not be an issue.

Do you have any ideas on how I could speed up the sync time?

As I’m running multiple experiments sequentially, atm the experiments are blocked by WANDB upload time. As a quick workaround, I’m thinking of disabling automatic syncing in my scripts and running a wandb sync; sleep loop in a parallel process in the same directory. Does that sound like a reasonable way to go forward?
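A rough sketch of that workaround, under a few assumptions: the training scripts run with the WANDB_MODE=offline environment variable set, offline runs land in directories named offline-run-* under ./wandb, and `wandb sync <dir>` uploads one of them. The helper names (sync_once, sync_loop) and the 300-second interval are made up for illustration, and whether it is safe to sync a run that is still being written is not something this sketch checks.

```python
import subprocess
import time
from pathlib import Path


def sync_once(wandb_dir="wandb", run_cmd=subprocess.run):
    """Call `wandb sync <dir>` for every offline run directory found."""
    synced = []
    for run_dir in sorted(Path(wandb_dir).glob("offline-run-*")):
        # check=False: one failed upload should not stop the whole pass.
        run_cmd(["wandb", "sync", str(run_dir)], check=False)
        synced.append(run_dir.name)
    return synced


def sync_loop(wandb_dir="wandb", interval_s=300, rounds=None,
              run_cmd=subprocess.run, sleep=time.sleep):
    """Repeat sync_once, sleeping between passes; rounds=None loops forever."""
    done = 0
    while rounds is None or done < rounds:
        sync_once(wandb_dir, run_cmd)
        done += 1
        sleep(interval_s)
```

Calling sync_loop() from a separate process in the experiment directory would then upload in the background while the next experiment starts. The run_cmd and sleep parameters are only there so the loop can be exercised without the wandb CLI installed.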

Hi @boscience, thank you for writing in and providing insight/feedback about artifact uploads. Our eng team is prioritizing improvements to artifact usage and the upload workflow that will significantly reduce upload times. These improvements will roll out early next year.

Regarding the problem you are facing, wandb writes artifacts through the cache. Files are uploaded or downloaded asynchronously when you call log_artifact, so the upload shouldn’t be blocking your experiments.

  • Are you using the latest wandb client version?
  • Are you using artifact.wait() anywhere in your script?
  • Which method calls are you using to upload artifacts?
  • Are the large artifacts single models or model checkpoint versions being constantly uploaded?

Hi @mohammadbakir,

Are you using the latest wandb client version?

Yes, I’m using the Python client, version 0.13.5.

Are you using artifact.wait() anywhere in your script?

No

Which method calls are you using to upload artifacts?

I’m using only wandb.log statements, as the model only requires a single training step after setup. I make multiple calls to wandb.log with commit=False and then a single call with commit=True. That final call is where I log the artifact, which is a wandb.Table.

Are the large artifacts single models or model checkpoint versions being constantly uploaded?

The large artifact is a wandb.Table.

At the end of my experiments, it hangs at the syncing step:

wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run hopeful-firebrand-60
wandb: ⭐️ View project at https://wandb.ai/boclips/search-eval
wandb: 🚀 View run at https://wandb.ai/boclips/search-eval/runs/2iid9htu

Thank you for the update, @boscience. Could you provide us with the debug.log and debug-internal.log files for the runs that are hanging? They will provide additional clues as to what is occurring. Please send them to support@wandb.ai and include my name in the subject line. Thank you.
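If it helps to locate those files, here is a small sketch that collects them. It assumes the usual local layout, where each run has its own directory under ./wandb (named something like run-<timestamp>-<run id>) with a logs/ subfolder; find_debug_logs is a hypothetical helper name, not part of the wandb client.

```python
from pathlib import Path


def find_debug_logs(wandb_dir="wandb", run_id=None):
    """List debug.log and debug-internal.log paths under wandb_dir."""
    # Run directories are assumed to end with the run id.
    pattern = f"*{run_id}" if run_id else "*"
    logs = []
    for run_dir in sorted(Path(wandb_dir).glob(pattern)):
        if run_dir.is_symlink():
            # Skip symlinked shortcuts so no log is listed twice.
            continue
        for name in ("debug.log", "debug-internal.log"):
            candidate = run_dir / "logs" / name
            if candidate.is_file():
                logs.append(candidate)
    return logs
```

For the run linked above, find_debug_logs(run_id="2iid9htu") would return the matching paths if that run directory exists locally.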

Hi @boscience, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!