Sync issue after training

I am facing a weird issue where my runs stop syncing halfway through training (at no particular iteration). Wandb tells me to sync afterwards, but when I run:

wandb sync wandb/path/to/run-with-id

I get the following error:

I have tried generating a new run id and syncing to that, but I get the same error.

Any ideas what’s happening?

Hi @mlle - thank you for reaching out and welcome to the W&B community.

Would you mind sharing some additional information to help us troubleshoot this error you are seeing:

  • The version of the SDK you are currently on (`wandb --version`)
  • A snippet of the code that stops syncing halfway through
  • A URL to the Workspace for the affected Run
  • The debug.log and debug-internal.log files, which you can find in the `./wandb/run-<date_time>-<run_id>/logs` folder for the failed run (see the example commands below)
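
If it helps, this is roughly how you could gather the above from a terminal; just a sketch assuming a default local setup, with the run folder name containing your own date/time and run id:

# print the installed SDK version
wandb --version

# list the per-run log folders created under ./wandb
ls -d ./wandb/run-*/logs

# the two log files we need for the failed run
cat ./wandb/run-<date_time>-<run_id>/logs/debug.log
cat ./wandb/run-<date_time>-<run_id>/logs/debug-internal.log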

Hi @mlle, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

Hi Francisco,

Sorry, I forgot to reply. No, the problem is not resolved. Here is (almost all of) the information you asked for:

  1. Version: 0.17.3.
  2. It's not clear at which point syncing stops, and the code base is too large to share.
  3. Project: Weights & Biases. All the runs that say ‘crashed’ didn’t actually crash; they just stopped syncing.
  4. Here are the log files for the runs that failed most recently. I haven’t checked all of them, but the ones I have looked at show the same thing:

debug.log:

2024-07-19 07:19:55,363 INFO    MainThread:2306736 [wandb_setup.py:_flush():76] Current SDK version is 0.17.3
2024-07-19 07:19:55,363 INFO    MainThread:2306736 [wandb_setup.py:_flush():76] Configure stats pid to 2306736
2024-07-19 07:19:55,363 INFO    MainThread:2306736 [wandb_setup.py:_flush():76] Loading settings from /home/milton/.config/wandb/settings
2024-07-19 07:19:55,363 INFO    MainThread:2306736 [wandb_setup.py:_flush():76] Loading settings from /home/milton/workspace/es-hyper-dev/wandb/settings
2024-07-19 07:19:55,363 INFO    MainThread:2306736 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'_require_core': 'true'}
2024-07-19 07:19:55,363 INFO    MainThread:2306736 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2024-07-19 07:19:55,363 WARNING MainThread:2306736 [wandb_setup.py:_flush():76] Could not find program at -m bin.train
2024-07-19 07:19:55,363 INFO    MainThread:2306736 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': None, 'program': '-m bin.train'}
2024-07-19 07:19:55,363 INFO    MainThread:2306736 [wandb_setup.py:_flush():76] Applying login settings: {}
2024-07-19 07:19:55,363 INFO    MainThread:2306736 [wandb_init.py:_log_setup():520] Logging user logs to /home/milton/workspace/es-hyper-dev/wandb/run-20240719_071955-9vez94un/logs/debug.log
2024-07-19 07:19:55,363 INFO    MainThread:2306736 [wandb_init.py:_log_setup():521] Logging internal logs to /home/milton/workspace/es-hyper-dev/wandb/run-20240719_071955-9vez94un/logs/debug-internal.log
2024-07-19 07:19:55,363 INFO    MainThread:2306736 [wandb_init.py:init():560] calling init triggers
2024-07-19 07:19:55,363 INFO    MainThread:2306736 [wandb_init.py:init():567] wandb.init called with sweep_config: {}
config: {}
2024-07-19 07:19:55,363 INFO    MainThread:2306736 [wandb_init.py:init():610] starting backend
2024-07-19 07:19:55,363 INFO    MainThread:2306736 [wandb_init.py:init():614] setting up manager
2024-07-19 07:19:55,364 INFO    MainThread:2306736 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-07-19 07:19:55,366 INFO    MainThread:2306736 [wandb_init.py:init():622] backend started and connected
2024-07-19 07:19:55,367 INFO    MainThread:2306736 [wandb_init.py:init():711] updated telemetry
2024-07-19 07:19:55,373 INFO    MainThread:2306736 [wandb_init.py:init():744] communicating run to backend with 90.0 second timeout
2024-07-19 07:19:55,994 INFO    MainThread:2306736 [wandb_run.py:_on_init():2402] communicating current version
2024-07-19 07:19:55,996 INFO    MainThread:2306736 [wandb_run.py:_on_init():2411] got version response 
2024-07-19 07:19:55,996 INFO    MainThread:2306736 [wandb_init.py:init():795] starting run threads in backend
2024-07-19 07:19:56,242 INFO    MainThread:2306736 [wandb_run.py:_console_start():2380] atexit reg
2024-07-19 07:19:56,243 INFO    MainThread:2306736 [wandb_run.py:_redirect():2235] redirect: wrap_raw
2024-07-19 07:19:56,243 INFO    MainThread:2306736 [wandb_run.py:_redirect():2300] Wrapping output streams.
2024-07-19 07:19:56,243 INFO    MainThread:2306736 [wandb_run.py:_redirect():2325] Redirects installed.
2024-07-19 07:19:56,245 INFO    MainThread:2306736 [wandb_init.py:init():838] run started, returning control to user process

debug-internal.log:

{"time":"2024-07-19T07:19:55.368903672Z","level":"INFO","msg":"using version","core version":"0.17.3"}
{"time":"2024-07-19T07:19:55.368919191Z","level":"INFO","msg":"created symlink","path":"/home/milton/workspace/es-hyper-dev/wandb/run-20240719_071955-9vez94un/logs/debug-core.log"}
{"time":"2024-07-19T07:19:55.406138509Z","level":"INFO","msg":"created new stream","id":"9vez94un"}
{"time":"2024-07-19T07:19:55.406372167Z","level":"INFO","msg":"writer: Do: started","stream_id":{"value":"9vez94un"}}
{"time":"2024-07-19T07:19:55.406265627Z","level":"INFO","msg":"handler: started","stream_id":{"value":"9vez94un"}}
{"time":"2024-07-19T07:19:55.406358642Z","level":"INFO","msg":"sender: started","stream_id":{"value":"9vez94un"}}
{"time":"2024-07-19T07:19:56.000431657Z","level":"INFO","msg":"wandb-core","!BADKEY":null}
{"time":"2024-07-19T07:19:56.043456315Z","level":"INFO","msg":"Starting system monitor"}
{"time":"2024-07-19T10:22:50.13268988Z","level":"ERROR","msg":"HTTP error","status":408,"method":"POST","url":"https://api.wandb.ai/files/mlle/es_hypercube-rl/9vez94un/file_stream"}
{"time":"2024-07-19T10:22:50.132799376Z","level":"ERROR+4","msg":"filestream: fatal error: filestream: failed to upload: 408 Request Timeout"}

As far as I can tell, the above matches what I see. Syncing always starts correctly but then fails at some point. I don’t know whether this is an issue with our servers or something on wandb’s side, or how I could find out which it is in the first place.

Let me know if you need more info, though I may be slow to respond.

Hi @mlle, apologies for my late reply on this.

Looking at the logs you have sent:

{"time":"2024-07-19T10:22:50.13268988Z","level":"ERROR","msg":"HTTP error","status":408,"method":"POST","url":"https://api.wandb.ai/files/mlle/es_hypercube-rl/9vez94un/file_stream"}
{"time":"2024-07-19T10:22:50.132799376Z","level":"ERROR+4","msg":"filestream: fatal error: filestream: failed to upload: 408 Request Timeout"}

This 408 error is generally associated with a slow request timing out. What compute and network environment are you running your experiment from? Is it sitting behind a firewall, load balancer, or proxy by any chance?
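
To narrow down whether this is a network-level issue, a quick check you could run from the same host is sketched below; it assumes curl is available, and `your-proxy-host:port` is only a placeholder for whatever proxy your network may require (the standard HTTP_PROXY/HTTPS_PROXY environment variables are picked up by most HTTP clients, and the wandb SDK should honour them as well):

# prints the HTTP status code returned by the W&B API endpoint;
# a timeout or connection error here points to a network-level block
curl -sS -o /dev/null -w "%{http_code}\n" https://api.wandb.ai

# if your network requires a proxy, export the standard variables
# before launching training so requests are routed through it
export HTTPS_PROXY=http://your-proxy-host:port
export HTTP_PROXY=http://your-proxy-host:port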

Hi @mlle, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

Hi Francesco,

Apologies, I was on holiday. Yes, the server is behind a firewall. However, as far as I am aware, my colleagues do not have the same issue, and they are also using wandb. Maybe it is because I run several runs in parallel?

Best,

Milton