Hi Francisco,
Sorry, I forgot to reply. No, the problem is not resolved. Here is (almost all of) the information you asked for:
- Version: 0.17.3.
- It is not clear at which point during syncing it stops, and the codebase is too large to share.
- Project: Weights & Biases. All the runs marked ‘crashed’ didn’t actually crash; they just stopped syncing.
- Here are the log files for the runs that failed last. I haven’t checked all of them, but the ones I have checked show the same thing:
debug.log:
2024-07-19 07:19:55,363 INFO MainThread:2306736 [wandb_setup.py:_flush():76] Current SDK version is 0.17.3
2024-07-19 07:19:55,363 INFO MainThread:2306736 [wandb_setup.py:_flush():76] Configure stats pid to 2306736
2024-07-19 07:19:55,363 INFO MainThread:2306736 [wandb_setup.py:_flush():76] Loading settings from /home/milton/.config/wandb/settings
2024-07-19 07:19:55,363 INFO MainThread:2306736 [wandb_setup.py:_flush():76] Loading settings from /home/milton/workspace/es-hyper-dev/wandb/settings
2024-07-19 07:19:55,363 INFO MainThread:2306736 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'_require_core': 'true'}
2024-07-19 07:19:55,363 INFO MainThread:2306736 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2024-07-19 07:19:55,363 WARNING MainThread:2306736 [wandb_setup.py:_flush():76] Could not find program at -m bin.train
2024-07-19 07:19:55,363 INFO MainThread:2306736 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': None, 'program': '-m bin.train'}
2024-07-19 07:19:55,363 INFO MainThread:2306736 [wandb_setup.py:_flush():76] Applying login settings: {}
2024-07-19 07:19:55,363 INFO MainThread:2306736 [wandb_init.py:_log_setup():520] Logging user logs to /home/milton/workspace/es-hyper-dev/wandb/run-20240719_071955-9vez94un/logs/debug.log
2024-07-19 07:19:55,363 INFO MainThread:2306736 [wandb_init.py:_log_setup():521] Logging internal logs to /home/milton/workspace/es-hyper-dev/wandb/run-20240719_071955-9vez94un/logs/debug-internal.log
2024-07-19 07:19:55,363 INFO MainThread:2306736 [wandb_init.py:init():560] calling init triggers
2024-07-19 07:19:55,363 INFO MainThread:2306736 [wandb_init.py:init():567] wandb.init called with sweep_config: {}
config: {}
2024-07-19 07:19:55,363 INFO MainThread:2306736 [wandb_init.py:init():610] starting backend
2024-07-19 07:19:55,363 INFO MainThread:2306736 [wandb_init.py:init():614] setting up manager
2024-07-19 07:19:55,364 INFO MainThread:2306736 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-07-19 07:19:55,366 INFO MainThread:2306736 [wandb_init.py:init():622] backend started and connected
2024-07-19 07:19:55,367 INFO MainThread:2306736 [wandb_init.py:init():711] updated telemetry
2024-07-19 07:19:55,373 INFO MainThread:2306736 [wandb_init.py:init():744] communicating run to backend with 90.0 second timeout
2024-07-19 07:19:55,994 INFO MainThread:2306736 [wandb_run.py:_on_init():2402] communicating current version
2024-07-19 07:19:55,996 INFO MainThread:2306736 [wandb_run.py:_on_init():2411] got version response
2024-07-19 07:19:55,996 INFO MainThread:2306736 [wandb_init.py:init():795] starting run threads in backend
2024-07-19 07:19:56,242 INFO MainThread:2306736 [wandb_run.py:_console_start():2380] atexit reg
2024-07-19 07:19:56,243 INFO MainThread:2306736 [wandb_run.py:_redirect():2235] redirect: wrap_raw
2024-07-19 07:19:56,243 INFO MainThread:2306736 [wandb_run.py:_redirect():2300] Wrapping output streams.
2024-07-19 07:19:56,243 INFO MainThread:2306736 [wandb_run.py:_redirect():2325] Redirects installed.
2024-07-19 07:19:56,245 INFO MainThread:2306736 [wandb_init.py:init():838] run started, returning control to user process
debug-internal.log:
{"time":"2024-07-19T07:19:55.368903672Z","level":"INFO","msg":"using version","core version":"0.17.3"}
{"time":"2024-07-19T07:19:55.368919191Z","level":"INFO","msg":"created symlink","path":"/home/milton/workspace/es-hyper-dev/wandb/run-20240719_071955-9vez94un/logs/debug-core.log"}
{"time":"2024-07-19T07:19:55.406138509Z","level":"INFO","msg":"created new stream","id":"9vez94un"}
{"time":"2024-07-19T07:19:55.406372167Z","level":"INFO","msg":"writer: Do: started","stream_id":{"value":"9vez94un"}}
{"time":"2024-07-19T07:19:55.406265627Z","level":"INFO","msg":"handler: started","stream_id":{"value":"9vez94un"}}
{"time":"2024-07-19T07:19:55.406358642Z","level":"INFO","msg":"sender: started","stream_id":{"value":"9vez94un"}}
{"time":"2024-07-19T07:19:56.000431657Z","level":"INFO","msg":"wandb-core","!BADKEY":null}
{"time":"2024-07-19T07:19:56.043456315Z","level":"INFO","msg":"Starting system monitor"}
{"time":"2024-07-19T10:22:50.13268988Z","level":"ERROR","msg":"HTTP error","status":408,"method":"POST","url":"https://api.wandb.ai/files/mlle/es_hypercube-rl/9vez94un/file_stream"}
{"time":"2024-07-19T10:22:50.132799376Z","level":"ERROR+4","msg":"filestream: fatal error: filestream: failed to upload: 408 Request Timeout"}
As far as I can tell, the above matches what I see. Syncing always starts correctly but then fails at some point. I don’t know whether this is an issue with our servers or something related to wandb, or how I could find out which one it is to begin with.
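For checking the remaining runs, here is a minimal sketch that scans a debug-internal.log for its first ERROR-level entry and reports how long after the first log entry it occurred. It assumes every line is a JSON object with "time" and "level" fields, as in the excerpt above. The helper names parse_time and first_error are hypothetical and not part of wandb.

```python
import json
from datetime import datetime


def parse_time(stamp):
    # Timestamps carry up to nanosecond precision ("...55.368903672Z");
    # keep only microseconds, which is what datetime can store.
    head, _, frac = stamp.rstrip("Z").partition(".")
    base = datetime.strptime(head, "%Y-%m-%dT%H:%M:%S")
    return base.replace(microsecond=int((frac + "000000")[:6]))


def first_error(lines):
    """Return (seconds since the first entry, entry) for the first
    ERROR-level line, or None if no such line exists."""
    start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON line
        if not isinstance(entry, dict) or "time" not in entry:
            continue
        when = parse_time(entry["time"])
        if start is None:
            start = when
        # "ERROR" and "ERROR+4" both count as errors here.
        if str(entry.get("level", "")).startswith("ERROR"):
            return (when - start).total_seconds(), entry
    return None
```

Passing an open file works, for example first_error(open(path)) with path pointing at a run's logs/debug-internal.log. If every run fails with the same status and URL after a similar amount of time, that at least narrows down where to look next.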
Let me know if you need more info, though I may be slow to respond.