My GPU machine has 3.5 GB/s of upload bandwidth, but uploading my artifacts to W&B takes several minutes whenever I save a training checkpoint. The checkpoint is 15 GB, so this should take only about 5 seconds.
The upload is so slow that it causes NCCL timeouts and kills my training runs.
How can I debug this slowdown? Is it possible to have W&B use multiple processes/cores to upload the files to improve throughput on uploads?
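One way to start narrowing this down is to compare the ideal transfer time with what each step actually takes. The sketch below is only an illustration: it reads the figures in the question as 15 GB and 3.5 GB/s, and the helper names (`expected_seconds`, `effective_throughput`, `timed`) are hypothetical and not part of W&B.

```python
import time
from typing import Any, Callable, Tuple

GB = 1000 ** 3  # decimal gigabyte, assumed to match the units in the question


def expected_seconds(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Ideal transfer time if the link were fully saturated."""
    return size_bytes / bandwidth_bytes_per_s


def effective_throughput(size_bytes: float, elapsed_s: float) -> float:
    """Observed throughput in bytes per second for a measured transfer."""
    return size_bytes / elapsed_s


def timed(label: str, fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn once, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed


if __name__ == "__main__":
    ideal = expected_seconds(15 * GB, 3.5 * GB)
    print(f"ideal upload time: {ideal:.1f} s")
    # Hypothetical measurement: a 15 GB upload that took 3 minutes.
    observed = effective_throughput(15 * GB, 180)
    print(f"observed throughput: {observed / GB:.3f} GB/s")
```

Wrapping the checkpoint write to disk and the artifact logging call (for example `run.log_artifact(...)`, if that is what the script uses) in separate `timed(...)` calls would show which step dominates, and whether the rank doing the upload is blocking long enough for the other ranks to hit the NCCL timeout.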
Hi @schopra-linum-ai! Thank you for writing in!
Could you please provide the debug.log and debug-internal.log files associated with the run where you are running into this issue? These files should be located in the wandb folder relative to your working directory.
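If it helps to locate those files, here is a small sketch that searches the `wandb` folder under the working directory for them. It assumes only that the files are named `debug.log` and `debug-internal.log` somewhere beneath that folder; the function name `find_wandb_logs` is hypothetical, and the exact subfolder layout may differ between W&B versions.

```python
from pathlib import Path
from typing import List

LOG_NAMES = ("debug.log", "debug-internal.log")


def find_wandb_logs(workdir: str = ".") -> List[Path]:
    """Return debug log files under <workdir>/wandb, newest first."""
    root = Path(workdir) / "wandb"
    if not root.is_dir():
        return []
    unique = {}
    for path in root.rglob("*"):
        if path.name in LOG_NAMES and path.is_file():
            # Resolve symlinks so the same file is not listed twice.
            unique[path.resolve()] = path
    return sorted(unique.values(), key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for log_path in find_wandb_logs():
        print(log_path)
```

The most recently modified entries are usually the ones belonging to the latest run, but it is worth checking that the run folder matches the run that showed the slow upload.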
Hi there, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.
Hi, since we have not heard back from you, we are going to close this request. If you would like to reopen the conversation, please let us know! Unfortunately, at the moment, we do not receive notifications if a thread reopens on Discourse. So, please feel free to create a new ticket regarding your concern if you’d like to continue the conversation.