Hey @qx66 , the errors you’re encountering, particularly the HTTPError and `429 encountered (Filestream rate limit exceeded, retrying in 32.7 seconds)`, indicate that your runs are hitting the rate limits set by Weights & Biases (W&B). The 429 error specifically means you are sending more file-stream requests than the allowed rate limit. This causes your runs to be throttled, leading to delays and potentially to crashes if the retry logic is overwhelmed.
What is your use-case? If you could provide more context about what you are trying to do in your training script, that would be really helpful for us to investigate this further and recommend best practices accordingly.
Consideration:
You could reduce the frequency of logging to avoid hitting rate limits. Instead of logging every step, aggregate data locally and log it less frequently, for example once per epoch or every few steps (see the sketch below). More info and examples on this can be found in our docs.
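As a rough sketch of what that could look like (assuming a standard Python training loop; `train_loader`, `train_step`, the project name, and the `log_every` interval are placeholders for your own setup):

```python
import wandb

run = wandb.init(project="my-project")  # hypothetical project name

log_every = 50   # assumed aggregation interval; tune to your needs
loss_buffer = []

for step, batch in enumerate(train_loader):   # train_loader is a placeholder
    loss = train_step(batch)                  # train_step stands in for your training logic
    loss_buffer.append(loss)

    # Log an aggregated value every `log_every` steps instead of every step
    if (step + 1) % log_every == 0:
        wandb.log({"train/loss": sum(loss_buffer) / len(loss_buffer)}, step=step)
        loss_buffer.clear()

run.finish()
```

This keeps the number of requests to the W&B backend roughly proportional to `num_steps / log_every` rather than `num_steps`, which usually keeps runs well under the file-stream rate limit.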
I have also figured out the reason myself. It seems there was a period when jobs shown as crashed/failed on wandb were still running on SLURM. I didn’t notice this, kept submitting more and more batches of jobs, and eventually hit wandb’s file limit. Once I realized this, I manually cancelled the SLURM jobs that didn’t match the wandb state, and the situation improved.