Network error (HTTPError), entering retry loop

Dear Weights & Biases,

I have been getting crashes recently, particularly when starting runs and logging data. The error messages are:
wandb: Network error (HTTPError), entering retry loop
wandb: 429 encountered (Filestream rate limit exceeded, retrying in 32.7 seconds.), retrying requests

Could you help me figure out the cause and how to fix it? Thanks!

Hey @qx66, the errors you’re encountering, in particular the HTTPError and 429 encountered (Filestream rate limit exceeded, retrying in 32.7 seconds.), indicate that your runs are hitting the rate limits set by Weights & Biases (W&B). The 429 means you are making more requests than the rate limit for file streams allows. As a result your runs are throttled, which leads to delays and can cause crashes if the retry logic is overwhelmed.

  • Could you share your wandb username so that we can verify what your current rate limits are?
  • What is your use case? If you could share more context about what you are trying to do in your training script, that would help us investigate further and recommend best practices.


  • You could reduce the frequency of logging to avoid hitting rate limits. Instead of logging every step, aggregate data and log it less frequently. For example, log every epoch or every few steps. More info and examples on this can be found in our docs.
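
As a rough illustration of the "aggregate and log less often" suggestion, here is a minimal sketch. The ThrottledLogger class and its every_n_steps parameter are hypothetical names made up for this example, not part of the wandb library. It assumes you pass in a logging callable such as wandb.log, and that every metric is numeric and reported on every step.

```python
from collections import defaultdict


class ThrottledLogger:
    """Hypothetical helper: average metrics locally, log once every N steps."""

    def __init__(self, log_fn, every_n_steps=50):
        if every_n_steps < 1:
            raise ValueError("every_n_steps must be >= 1")
        self.log_fn = log_fn              # e.g. wandb.log in a real training script
        self.every_n_steps = every_n_steps
        self._sums = defaultdict(float)   # running sum per metric name
        self._count = 0                   # steps accumulated since the last flush

    def log(self, metrics, step):
        # Accumulate locally instead of sending one request per step.
        for name, value in metrics.items():
            self._sums[name] += float(value)
        self._count += 1
        if self._count >= self.every_n_steps:
            self.flush(step)

    def flush(self, step):
        # Send the averages of the buffered steps, then reset the buffer.
        if self._count == 0:
            return
        averaged = {name: total / self._count for name, total in self._sums.items()}
        self.log_fn(averaged, step=step)
        self._sums.clear()
        self._count = 0
```

In a training loop you would create it once, for example ThrottledLogger(wandb.log, every_n_steps=100), call logger.log({"loss": loss}, step=step) on each step, and call logger.flush(step) after the loop so the last partial window is not lost. The right interval depends on your rate limits and on how many runs are logging at the same time.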

Thanks for your reply!

I have also figured out the cause myself. For a while, jobs that wandb showed as crashed/failed were in fact still running on SLURM. I didn’t notice this and kept submitting more and more batches of jobs, until I reached the file limit of wandb. Once I realized this, I manually cancelled the SLURM jobs that didn’t match what wandb showed, and the situation improved.
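
For anyone hitting the same problem, the manual check described above could be sketched roughly like this. The function names are hypothetical, and the sketch assumes each SLURM job name is identical to its W&B run name, which may not match your setup. The set of crashed/failed run names is something you would collect yourself, for example from the W&B runs table.

```python
import subprocess


def parse_squeue(text):
    """Parse 'squeue -h -o "%i %j"' output into {job_id: job_name}."""
    jobs = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            jobs[parts[0]] = parts[1].strip()
    return jobs


def list_slurm_jobs(user):
    """Return the user's queued and running SLURM jobs as {job_id: job_name}."""
    result = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%i %j"],
        capture_output=True, text=True, check=True,
    )
    return parse_squeue(result.stdout)


def find_orphaned_jobs(slurm_jobs, inactive_run_names):
    """Job IDs still alive on SLURM whose run W&B shows as crashed/failed.

    Assumption: the SLURM job name equals the W&B run name.
    """
    return sorted(
        job_id for job_id, name in slurm_jobs.items()
        if name in inactive_run_names
    )
```

The resulting job IDs can be reviewed and then cancelled with scancel. It is worth checking the list by hand first, since a name collision would flag a healthy job.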

Thanks for the update, @qx66 . Do you have any other queries for us?

Hi Qian, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!