I’m experiencing a persistent issue with wandb where a 0.190 MB file upload has been stuck for over 6 hours, preventing me from running any further code. This occurred after I interrupted a training session.
Hi @eman-ali9191, happy to help.
Could you provide the following:
- More detail on how you are uploading and which file(s) you are uploading. Is it via the application or the API?
- Are you logging files as artifacts or saving files to your runs?
- Briefly describe how you interrupted the training session.
- The wandb version you are using.
- A code snippet showing how you are uploading these files.
- The debug.log and debug-internal.log files for the affected (stuck) run(s). They are located in the working directory under the wandb/run-<date>_<time>-<run-id>/logs folder.
This will help us better understand the issue and investigate it further.
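If it is unclear which run folder is the stuck one, a short Python sketch like the one below can list the log files for each run. It assumes only the wandb/run-<date>_<time>-<run-id>/logs layout described above; the helper name find_run_logs is hypothetical, and sorting by folder name gives chronological order only if the date and time are zero-padded (e.g. 20231201_120000).

```python
from pathlib import Path


def find_run_logs(wandb_dir="wandb"):
    """Return (run_name, debug.log path, debug-internal.log path) for each run.

    Hypothetical helper: assumes the wandb/run-<date>_<time>-<run-id>/logs
    layout. Folders are sorted by name, so with zero-padded timestamps the
    most recent run comes last.
    """
    results = []
    for run_dir in sorted(Path(wandb_dir).glob("run-*")):
        logs = run_dir / "logs"
        debug = logs / "debug.log"
        internal = logs / "debug-internal.log"
        # Skip run folders that contain neither log file.
        if debug.is_file() or internal.is_file():
            results.append((run_dir.name, debug, internal))
    return results


if __name__ == "__main__":
    for name, debug, internal in find_run_logs():
        print(name)
        print("  ", debug, "(exists)" if debug.exists() else "(missing)")
        print("  ", internal, "(exists)" if internal.exists() else "(missing)")
```

Run it from the training script's working directory; the last entry printed is usually the run that was interrupted.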
Hello @eman-ali9191, may we follow up on the previous request, please? Thanks~
- Are you logging files as artifacts or saving files to your runs? Saving files to my runs.
- Briefly describe how you interrupted the training session: Ctrl+C.
- The wandb version you are using: 0.16.0
- A code snippet showing how you are uploading these files:
wandb.log({"loss": avg_loss,
           "Accuracy": training_accuracy_student})
In the end, I restarted the machine, reran the code, and it worked.
Hello @phantrang3564, thank you for all the details you provided, and we're glad it worked after you restarted your machine. With that, we will mark this as resolved, but feel free to write in again if you encounter any issues.
I believe this happens when you hit the rate limit imposed on every user, after which the run gets stuck in the finish function. Try running top -u <your-user-name> to find the wandb-service process, then kill it with kill -9 <the-service-PID>.
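To script the same steps, here is a minimal, Linux-only Python sketch that scans /proc for the current user's processes whose name or command line contains wandb-service and sends them SIGKILL (the equivalent of kill -9). It assumes the service process is identifiable by that string, as it appears in top per the reply above; the helper names are hypothetical, and it defaults to a dry run that only prints the PIDs it would kill.

```python
import os
import signal
from pathlib import Path


def read_processes(proc_root="/proc"):
    """Yield (pid, uid, comm, cmdline) for each process under /proc (Linux only)."""
    root = Path(proc_root)
    if not root.is_dir():
        return  # no /proc filesystem (e.g. not Linux): nothing to scan
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/self
        try:
            uid = entry.stat().st_uid  # /proc/<pid> is owned by the process's user
            comm = (entry / "comm").read_bytes().decode(errors="replace").strip()
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # process exited meanwhile or is not readable
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        yield int(entry.name), uid, comm, cmdline


def find_wandb_service(processes, uid, pattern="wandb-service"):
    """Return sorted PIDs owned by `uid` whose name or command line contains `pattern`."""
    return sorted(pid for pid, owner, comm, cmdline in processes
                  if owner == uid and (pattern in comm or pattern in cmdline))


def kill_wandb_service(dry_run=True):
    """Find the current user's wandb-service processes; SIGKILL them unless dry_run."""
    pids = find_wandb_service(read_processes(), os.getuid())
    for pid in pids:
        if dry_run:
            print("would kill", pid)
        else:
            os.kill(pid, signal.SIGKILL)  # equivalent to `kill -9 <pid>`
    return pids


if __name__ == "__main__":
    kill_wandb_service(dry_run=True)  # pass dry_run=False to actually kill
```

Note that SIGKILL gives the process no chance to clean up, so anything it had not yet uploaded may be lost.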