Slow Artifact Upload

My GPU machine has 3.5 GB/s upload speed, but uploading my artifacts to W&B takes several minutes whenever I save a training checkpoint. The checkpoint is 15 GB, so the upload should take only about 5 seconds.
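As a rough sanity check on those numbers, the expected transfer time can be worked out directly. This is only a sketch: the 180-second observed duration is a hypothetical stand-in for "several minutes", and the second calculation assumes the link speed might actually be quoted in gigabits rather than gigabytes per second.

```python
def transfer_seconds(size_gb: float, rate_gb_per_s: float) -> float:
    """Ideal time to move size_gb gigabytes at rate_gb_per_s gigabytes per second."""
    return size_gb / rate_gb_per_s


CHECKPOINT_GB = 15.0   # checkpoint size from the post
LINK_GB_PER_S = 3.5    # quoted upload speed, read as gigabytes per second

# Best case if the link really sustains 3.5 GB/s.
ideal = transfer_seconds(CHECKPOINT_GB, LINK_GB_PER_S)

# Assumption: if the figure is 3.5 gigabits per second, divide by 8 to get GB/s.
ideal_if_gbit = transfer_seconds(CHECKPOINT_GB, LINK_GB_PER_S / 8)

# Hypothetical observed duration standing in for "several minutes".
OBSERVED_SECONDS = 180.0
effective_gb_per_s = CHECKPOINT_GB / OBSERVED_SECONDS

print(f"ideal at 3.5 GB/s:   {ideal:.1f} s")
print(f"ideal at 3.5 Gbit/s: {ideal_if_gbit:.1f} s")
print(f"effective rate over {OBSERVED_SECONDS:.0f} s: {effective_gb_per_s:.3f} GB/s")
```

Even under the gigabit reading, the ideal time is well under a minute, so a multi-minute upload points to a bottleneck other than raw link speed.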

The upload is so slow that it causes NCCL timeouts and kills my training runs.

How can I debug this slowdown? Is it possible to have W&B use multiple processes/cores to upload the files to improve throughput on uploads?
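One possible first step in debugging is to time the checkpoint write and the upload call separately, to confirm that the upload itself is where the minutes go. The sketch below is an assumption about how that could look; `save_checkpoint` and `upload_checkpoint` are hypothetical placeholders for the real training code and contain no W&B calls.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per labelled phase.
timings = {}


@contextmanager
def timed(label):
    """Add the wall-clock time spent inside the with-block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + (time.perf_counter() - start)


def save_checkpoint():
    # Hypothetical placeholder: write the checkpoint to local disk here.
    time.sleep(0.01)


def upload_checkpoint():
    # Hypothetical placeholder: put the artifact logging code here.
    time.sleep(0.02)


with timed("save"):
    save_checkpoint()
with timed("upload"):
    upload_checkpoint()

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

If the "save" phase dominates, the slowdown is in local disk I/O or serialization rather than in the network transfer.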

Could you please provide the debug.log and debug-internal.log files associated with the run where you are running into this issue? These files should be located in the wandb folder relative to your working directory.
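If it is unclear where those files ended up, a small script along these lines may help locate them. It is a sketch that only assumes the logs sit somewhere under a `wandb` folder in the working directory; `find_wandb_debug_logs` is a hypothetical helper, not part of the W&B library.

```python
from pathlib import Path


def find_wandb_debug_logs(workdir="."):
    """Return debug.log / debug-internal.log paths under <workdir>/wandb, newest first."""
    wandb_dir = Path(workdir) / "wandb"
    if not wandb_dir.is_dir():
        return []
    wanted = {"debug.log", "debug-internal.log"}
    # Search recursively, since the logs may sit in per-run subfolders.
    # p.exists() skips dangling symlinks, which would make stat() fail.
    found = [
        p for p in wandb_dir.rglob("debug*.log")
        if p.name in wanted and p.exists()
    ]
    return sorted(found, key=lambda p: p.stat().st_mtime, reverse=True)


for path in find_wandb_debug_logs():
    print(path)
```

Run it from the directory where training was launched. The most recently modified files are listed first and are usually the ones for the latest run.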

