BrokenPipeError on Ubuntu machine

A VAE built with PyTorch trains without problems when I run it directly. However, when I run it through wandb sweeps, I get the BrokenPipeError shown below. According to some forum threads, the main cause seems to be setting num_workers to more than one in the DataLoader when running on Windows. However, I am on a DGX Station running Ubuntu, and I still get the error.

wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
Exception in thread ChkStopThr:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.8/threading.py", line 870, in run
    Exception in thread self._target(*self._args, **self._kwargs)NetStatThr
:
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 276, in check_stop_status
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self._loop_check_status(
      File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 214, in _loop_check_status
self.run()
  File "/opt/conda/lib/python3.8/threading.py", line 870, in run
    local_handle = request()
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 787, in deliver_stop_status
    self._target(*self._args, **self._kwargs)
return self._deliver_stop_status(status)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 258, in check_network_status
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 585, in _deliver_stop_status
    self._loop_check_status(
return self._deliver_record(record)  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 214, in _loop_check_status

  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 560, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
    local_handle = request()  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record

  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 795, in deliver_network_status
    interface._publish(record)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    return self._deliver_network_status(status)
      File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 601, in _deliver_network_status
self.send_server_request(server_req)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    return self._deliver_record(record)
      File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 560, in _deliver_record
self._sendall_with_error_handle(header + data)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    handle = mailbox._deliver_record(record, interface=self)
sent = self._sock.send(data)  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record

BrokenPipeError: [Errno 32] Broken pipe
    interface._publish(record)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe

Any help would be greatly appreciated. Thank you in advance!

Hello @janandd !

Would you be able to send the debug bundle for the run that is running into the BrokenPipeError?

The files should be in the wandb folder, in the same directory the script was run from. The wandb folder contains one subfolder per run, named run-DATETIME-ID. Could you retrieve the debug.log and debug-internal.log files from the subfolder for the run that is having issues?
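If it helps, here is a minimal sketch for locating those two files. It assumes the layout that wandb prints in its own log lines (wandb/run-DATETIME-ID/logs/debug.log); the helper name find_run_logs is made up for illustration and is not part of wandb.

```python
from pathlib import Path
from typing import Dict, List, Optional

# File names requested in this thread.
LOG_NAMES = ("debug.log", "debug-internal.log")


def find_run_logs(wandb_dir, run_id: Optional[str] = None) -> Dict[str, List[Path]]:
    """Map each run-DATETIME-ID folder name to the debug logs found inside it.

    Hypothetical helper: assumes logs live in <run folder>/logs/.
    If run_id is given, only the folder whose name ends in that ID is returned.
    """
    found: Dict[str, List[Path]] = {}
    for run_dir in sorted(Path(wandb_dir).glob("run-*")):
        if not run_dir.is_dir():
            continue
        if run_id is not None and not run_dir.name.endswith("-" + run_id):
            continue
        logs = [run_dir / "logs" / name for name in LOG_NAMES]
        logs = [p for p in logs if p.is_file()]
        if logs:
            found[run_dir.name] = logs
    return found
```

Calling find_run_logs("wandb", run_id="<your run ID>") from the directory where the script was launched should then list only the logs for the failing run.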

Thanks!

I tried running the sweep again today, and am getting the same error. The two requested log files are pasted below.

  1. debug.log
2023-04-05 01:59:48,222 INFO    MainThread:8836 [wandb_setup.py:_flush():76] Configure stats pid to 8836
2023-04-05 01:59:48,222 INFO    MainThread:8836 [wandb_setup.py:_flush():76] Loading settings from /root/.config/wandb/settings
2023-04-05 01:59:48,222 INFO    MainThread:8836 [wandb_setup.py:_flush():76] Loading settings from /root/src/my_data_dir/wandb/settings
2023-04-05 01:59:48,222 INFO    MainThread:8836 [wandb_setup.py:_flush():76] Loading settings from environment variables: {}
2023-04-05 01:59:48,222 INFO    MainThread:8836 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2023-04-05 01:59:48,223 INFO    MainThread:8836 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'vae_sweep.py', 'program': 'vae_sweep.py'}
2023-04-05 01:59:48,223 INFO    MainThread:8836 [wandb_init.py:_log_setup():506] Logging user logs to /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/logs/debug.log
2023-04-05 01:59:48,223 INFO    MainThread:8836 [wandb_init.py:_log_setup():507] Logging internal logs to /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/logs/debug-internal.log
2023-04-05 01:59:48,223 INFO    MainThread:8836 [wandb_init.py:init():546] calling init triggers
2023-04-05 01:59:48,223 INFO    MainThread:8836 [wandb_init.py:init():552] wandb.init called with sweep_config: {}
config: {}
2023-04-05 01:59:48,223 INFO    MainThread:8836 [wandb_init.py:init():602] starting backend
2023-04-05 01:59:48,223 INFO    MainThread:8836 [wandb_init.py:init():606] setting up manager
2023-04-05 01:59:48,229 INFO    MainThread:8836 [backend.py:_multiprocessing_setup():106] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2023-04-05 01:59:48,232 INFO    MainThread:8836 [wandb_init.py:init():613] backend started and connected
2023-04-05 01:59:48,235 INFO    MainThread:8836 [wandb_init.py:init():701] updated telemetry
2023-04-05 01:59:48,256 INFO    MainThread:8836 [wandb_init.py:init():741] communicating run to backend with 60.0 second timeout
2023-04-05 01:59:48,699 INFO    MainThread:8836 [wandb_run.py:_on_init():2133] communicating current version
2023-04-05 01:59:48,750 INFO    MainThread:8836 [wandb_run.py:_on_init():2142] got version response 
2023-04-05 01:59:48,750 INFO    MainThread:8836 [wandb_init.py:init():789] starting run threads in backend
2023-04-05 01:59:52,623 INFO    MainThread:8836 [wandb_run.py:_console_start():2114] atexit reg
2023-04-05 01:59:52,623 INFO    MainThread:8836 [wandb_run.py:_redirect():1969] redirect: SettingsConsole.WRAP_RAW
2023-04-05 01:59:52,687 INFO    MainThread:8836 [wandb_run.py:_redirect():2034] Wrapping output streams.
2023-04-05 01:59:52,687 INFO    MainThread:8836 [wandb_run.py:_redirect():2059] Redirects installed.
2023-04-05 01:59:52,688 INFO    MainThread:8836 [wandb_init.py:init():831] run started, returning control to user process
2023-04-05 01:59:54,368 INFO    MainThread:8836 [pyagent.py:run():314] Starting sweep agent: entity=None, project=None, count=1
2023-04-05 02:00:03,132 WARNING MsgRouterThr:8836 [router.py:message_loop():77] message_loop has been closed
2023-04-05 02:00:05,766 INFO    Thread-5  :8836 [wandb_setup.py:_flush():76] Configure stats pid to 8836
2023-04-05 02:00:05,767 INFO    Thread-5  :8836 [wandb_setup.py:_flush():76] Loading settings from /root/.config/wandb/settings
2023-04-05 02:00:05,767 INFO    Thread-5  :8836 [wandb_setup.py:_flush():76] Loading settings from /root/src/my_data_dir/wandb/settings
2023-04-05 02:00:05,767 INFO    Thread-5  :8836 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'project': 'yuzu_vae', 'entity': 'myname', 'root_dir': '/root/src/my_data_dir', 'sweep_id': 'avb3871x', 'run_id': 'pcl80d2k', 'sweep_param_path': '/root/src/my_data_dir/wandb/sweep-avb3871x/config-pcl80d2k.yaml'}
2023-04-05 02:00:05,767 INFO    Thread-5  :8836 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2023-04-05 02:00:05,767 INFO    Thread-5  :8836 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'vae_sweep.py', 'program': 'vae_sweep.py'}
2023-04-05 02:00:05,768 INFO    Thread-5  :8836 [wandb_init.py:_log_setup():506] Logging user logs to /root/src/my_data_dir/wandb/run-20230405_020005-pcl80d2k/logs/debug.log
2023-04-05 02:00:05,768 INFO    Thread-5  :8836 [wandb_init.py:_log_setup():507] Logging internal logs to /root/src/my_data_dir/wandb/run-20230405_020005-pcl80d2k/logs/debug-internal.log
2023-04-05 02:00:05,768 INFO    Thread-5  :8836 [wandb_init.py:init():546] calling init triggers
2023-04-05 02:00:05,768 INFO    Thread-5  :8836 [wandb_init.py:init():552] wandb.init called with sweep_config: {'batch_size': 64, 'epochs': 20, 'latent_dims': 132, 'learning_rate': 5.413127424880074e-06, 'optimizer': 'sgd'}
config: {}
2023-04-05 02:00:05,769 INFO    Thread-5  :8836 [wandb_init.py:init():597] wandb.init() called when a run is still active
2023-04-05 02:00:05,792 ERROR   Thread-5  :8836 [wandb_init.py:init():1171] error
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1144, in init
    run = wi.init()
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 599, in init
    tel.feature.init_return_run = True
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/telemetry.py", line 42, in __exit__
    self._run._telemetry_callback(self._obj)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 691, in _telemetry_callback
    self._telemetry_flush()
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 702, in _telemetry_flush
    self._backend.interface._publish_telemetry(self._telemetry_obj)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 101, in _publish_telemetry
    self._publish(rec)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
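
One possible reading of the debug.log above (an assumption, since the script itself is not shown): wandb.init is called once in MainThread with an empty sweep_config before the sweep agent starts, and the agent's thread then hits "wandb.init() called when a run is still active". A minimal sketch of a sweep script that calls wandb.init only inside the function passed to wandb.agent is given below. The parameter names come from the sweep_config in the log; the search method, the value ranges, the metric name "loss", and the placeholder training loop are illustrative assumptions, not the original script.

```python
def build_sweep_config():
    """Sweep definition; parameter names follow the log, ranges are assumed."""
    return {
        "method": "random",
        "metric": {"name": "loss", "goal": "minimize"},
        "parameters": {
            "batch_size": {"values": [32, 64, 128]},
            "epochs": {"values": [10, 20]},
            "latent_dims": {"values": [64, 132, 256]},
            "learning_rate": {"min": 1e-6, "max": 1e-3},
            "optimizer": {"values": ["adam", "sgd"]},
        },
    }


def train():
    """Function run by the agent: the only place wandb.init() is called."""
    import wandb  # imported lazily so this file can be loaded without wandb

    # The agent supplies the sampled hyperparameters through run.config.
    with wandb.init() as run:
        cfg = run.config
        for epoch in range(cfg.epochs):
            loss = 0.0  # placeholder: replace with the real VAE training step
            run.log({"epoch": epoch, "loss": loss})


def main():
    import wandb

    # No wandb.init() here: the sweep is created and the agent started directly.
    sweep_id = wandb.sweep(build_sweep_config(), project="yuzu_vae")
    wandb.agent(sweep_id, function=train, count=1)
```

Running main() would start a single sweep run; whether this removes the BrokenPipeError depends on whether the extra wandb.init call is in fact the cause.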
  2. debug-internal.log
2023-04-05 01:59:48,237 INFO    StreamThr :8851 [internal.py:wandb_internal():87] W&B internal server running at pid: 8851, started at: 2023-04-05 01:59:48.236369
2023-04-05 01:59:48,246 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: status
2023-04-05 01:59:48,247 INFO    WriterThread:8851 [datastore.py:open_for_write():85] open: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/run-4q1n1gag.wandb
2023-04-05 01:59:48,248 DEBUG   SenderThread:8851 [sender.py:send():336] send: header
2023-04-05 01:59:48,258 DEBUG   SenderThread:8851 [sender.py:send():336] send: run
2023-04-05 01:59:48,694 INFO    SenderThread:8851 [dir_watcher.py:__init__():219] watching files in: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files
2023-04-05 01:59:48,694 INFO    SenderThread:8851 [sender.py:_start_run_threads():1078] run started: 4q1n1gag with start time 1680659988.232314
2023-04-05 01:59:48,694 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: summary_record
2023-04-05 01:59:48,695 INFO    SenderThread:8851 [sender.py:_save_file():1332] saving file wandb-summary.json with policy end
2023-04-05 01:59:48,700 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: check_version
2023-04-05 01:59:48,700 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: check_version
2023-04-05 01:59:48,762 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: run_start
2023-04-05 01:59:48,767 DEBUG   HandlerThread:8851 [system_info.py:__init__():31] System info init
2023-04-05 01:59:48,767 DEBUG   HandlerThread:8851 [system_info.py:__init__():46] System info init done
2023-04-05 01:59:48,767 INFO    HandlerThread:8851 [system_monitor.py:start():183] Starting system monitor
2023-04-05 01:59:48,767 INFO    SystemMonitor:8851 [system_monitor.py:_start():147] Starting system asset monitoring threads
2023-04-05 01:59:48,768 INFO    HandlerThread:8851 [system_monitor.py:probe():204] Collecting system info
2023-04-05 01:59:48,768 INFO    SystemMonitor:8851 [interfaces.py:start():187] Started cpu monitoring
2023-04-05 01:59:48,769 INFO    SystemMonitor:8851 [interfaces.py:start():187] Started disk monitoring
2023-04-05 01:59:48,771 INFO    SystemMonitor:8851 [interfaces.py:start():187] Started gpu monitoring
2023-04-05 01:59:48,772 INFO    SystemMonitor:8851 [interfaces.py:start():187] Started memory monitoring
2023-04-05 01:59:48,773 INFO    SystemMonitor:8851 [interfaces.py:start():187] Started network monitoring
2023-04-05 01:59:49,594 DEBUG   HandlerThread:8851 [system_info.py:probe():195] Probing system
2023-04-05 01:59:49,604 DEBUG   HandlerThread:8851 [system_info.py:_probe_git():180] Probing git
2023-04-05 01:59:49,623 DEBUG   HandlerThread:8851 [system_info.py:_probe_git():188] Probing git done
2023-04-05 01:59:49,623 DEBUG   HandlerThread:8851 [system_info.py:probe():240] Probing system done
2023-04-05 01:59:49,623 DEBUG   HandlerThread:8851 [system_monitor.py:probe():213] {'os': 'Linux-5.4.0-131-generic-x86_64-with-glibc2.10', 'python': '3.8.5', 'heartbeatAt': '2023-04-05T01:59:49.594996', 'startedAt': '2023-04-05T01:59:48.218890', 'docker': None, 'cuda': None, 'args': (), 'state': 'running', 'program': 'vae_sweep.py', 'codePath': 'vae_sweep.py', 'git': {'remote': 'https://github.com/codjp/my_data_dir', 'commit': '61814805977cf5b5cd4a1583b97c0c8e76348dfa'}, 'email': None, 'root': '/root/src/my_data_dir', 'host': 'dbe83d4d908b', 'username': 'root', 'executable': '/opt/conda/bin/python3', 'cpu_count': 20, 'cpu_count_logical': 40, 'cpu_freq': {'current': 1271.777825, 'min': 1200.0, 'max': 3600.0}, 'cpu_freq_per_core': [{'current': 1199.494, 'min': 1200.0, 'max': 3600.0}, {'current': 1202.687, 'min': 1200.0, 'max': 3600.0}, {'current': 1202.113, 'min': 1200.0, 'max': 3600.0}, {'current': 1202.12, 'min': 1200.0, 'max': 3600.0}, {'current': 2207.272, 'min': 1200.0, 'max': 3600.0}, {'current': 1202.059, 'min': 1200.0, 'max': 3600.0}, {'current': 1204.13, 'min': 1200.0, 'max': 3600.0}, {'current': 1610.194, 'min': 1200.0, 'max': 3600.0}, {'current': 1202.236, 'min': 1200.0, 'max': 3600.0}, {'current': 1201.727, 'min': 1200.0, 'max': 3600.0}, {'current': 1203.113, 'min': 1200.0, 'max': 3600.0}, {'current': 1202.745, 'min': 1200.0, 'max': 3600.0}, {'current': 1201.371, 'min': 1200.0, 'max': 3600.0}, {'current': 1201.077, 'min': 1200.0, 'max': 3600.0}, {'current': 1199.408, 'min': 1200.0, 'max': 3600.0}, {'current': 1199.291, 'min': 1200.0, 'max': 3600.0}, {'current': 1201.383, 'min': 1200.0, 'max': 3600.0}, {'current': 1202.354, 'min': 1200.0, 'max': 3600.0}, {'current': 1203.227, 'min': 1200.0, 'max': 3600.0}, {'current': 1201.547, 'min': 1200.0, 'max': 3600.0}, {'current': 1202.978, 'min': 1200.0, 'max': 3600.0}, {'current': 1203.072, 'min': 1200.0, 'max': 3600.0}, {'current': 1201.825, 'min': 1200.0, 'max': 3600.0}, {'current': 1202.402, 'min': 1200.0, 
'max': 3600.0}, {'current': 2204.793, 'min': 1200.0, 'max': 3600.0}, {'current': 1199.906, 'min': 1200.0, 'max': 3600.0}, {'current': 1199.096, 'min': 1200.0, 'max': 3600.0}, {'current': 1199.317, 'min': 1200.0, 'max': 3600.0}, {'current': 1199.291, 'min': 1200.0, 'max': 3600.0}, {'current': 1199.287, 'min': 1200.0, 'max': 3600.0}, {'current': 1199.292, 'min': 1200.0, 'max': 3600.0}, {'current': 1198.959, 'min': 1200.0, 'max': 3600.0}, {'current': 1199.288, 'min': 1200.0, 'max': 3600.0}, {'current': 1199.291, 'min': 1200.0, 'max': 3600.0}, {'current': 1199.277, 'min': 1200.0, 'max': 3600.0}, {'current': 1199.299, 'min': 1200.0, 'max': 3600.0}, {'current': 1199.295, 'min': 1200.0, 'max': 3600.0}, {'current': 1198.99, 'min': 1200.0, 'max': 3600.0}, {'current': 1199.873, 'min': 1200.0, 'max': 3600.0}, {'current': 1198.496, 'min': 1200.0, 'max': 3600.0}], 'disk': {'total': 1759.7716484069824, 'used': 1582.3762550354004}, 'gpu': 'Tesla V100-DGXS-32GB', 'gpu_count': 3, 'gpu_devices': [{'name': 'Tesla V100-DGXS-32GB', 'memory_total': 34078457856}, {'name': 'Tesla V100-DGXS-32GB', 'memory_total': 34087305216}, {'name': 'Tesla V100-DGXS-32GB', 'memory_total': 34087305216}], 'memory': {'total': 251.62277603149414}}
2023-04-05 01:59:49,624 INFO    HandlerThread:8851 [system_monitor.py:probe():214] Finished collecting system info
2023-04-05 01:59:49,624 INFO    HandlerThread:8851 [system_monitor.py:probe():217] Publishing system info
2023-04-05 01:59:49,624 DEBUG   HandlerThread:8851 [system_info.py:_save_pip():51] Saving list of pip packages installed into the current environment
2023-04-05 01:59:49,625 DEBUG   HandlerThread:8851 [system_info.py:_save_pip():67] Saving pip packages done
2023-04-05 01:59:49,625 DEBUG   HandlerThread:8851 [system_info.py:_save_conda():74] Saving list of conda packages installed into the current environment
2023-04-05 01:59:49,700 INFO    Thread-13 :8851 [dir_watcher.py:_on_file_created():278] file/dir created: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/conda-environment.yaml
2023-04-05 01:59:49,701 INFO    Thread-13 :8851 [dir_watcher.py:_on_file_created():278] file/dir created: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/wandb-summary.json
2023-04-05 01:59:49,701 INFO    Thread-13 :8851 [dir_watcher.py:_on_file_created():278] file/dir created: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/requirements.txt
2023-04-05 01:59:52,599 DEBUG   HandlerThread:8851 [system_info.py:_save_conda():86] Saving conda packages done
2023-04-05 01:59:52,602 INFO    HandlerThread:8851 [system_monitor.py:probe():219] Finished publishing system info
2023-04-05 01:59:52,616 DEBUG   SenderThread:8851 [sender.py:send():336] send: files
2023-04-05 01:59:52,617 INFO    SenderThread:8851 [sender.py:_save_file():1332] saving file wandb-metadata.json with policy now
2023-04-05 01:59:52,689 DEBUG   SenderThread:8851 [sender.py:send():336] send: telemetry
2023-04-05 01:59:52,702 INFO    Thread-13 :8851 [dir_watcher.py:_on_file_created():278] file/dir created: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/wandb-metadata.json
2023-04-05 01:59:52,711 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: stop_status
2023-04-05 01:59:52,712 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: stop_status
2023-04-05 01:59:53,228 INFO    wandb-upload_0:8851 [upload_job.py:push():138] Uploaded file /tmp/tmpxt7ytk9nwandb/wysk2xtq-wandb-metadata.json
2023-04-05 01:59:53,913 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: status_report
2023-04-05 01:59:54,702 INFO    Thread-13 :8851 [dir_watcher.py:_on_file_created():278] file/dir created: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/output.log
2023-04-05 01:59:54,952 DEBUG   SenderThread:8851 [sender.py:send():336] send: exit
2023-04-05 01:59:54,953 INFO    SenderThread:8851 [sender.py:send_exit():559] handling exit code: 0
2023-04-05 01:59:54,953 INFO    SenderThread:8851 [sender.py:send_exit():561] handling runtime: 6
2023-04-05 01:59:54,957 INFO    SenderThread:8851 [sender.py:_save_file():1332] saving file wandb-summary.json with policy end
2023-04-05 01:59:54,958 INFO    SenderThread:8851 [sender.py:send_exit():567] send defer
2023-04-05 01:59:54,958 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: defer
2023-04-05 01:59:54,959 INFO    HandlerThread:8851 [handler.py:handle_request_defer():170] handle defer: 0
2023-04-05 01:59:54,959 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: defer
2023-04-05 01:59:54,959 INFO    SenderThread:8851 [sender.py:send_request_defer():583] handle sender defer: 0
2023-04-05 01:59:54,959 INFO    SenderThread:8851 [sender.py:transition_state():587] send defer: 1
2023-04-05 01:59:54,960 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: defer
2023-04-05 01:59:54,960 INFO    HandlerThread:8851 [handler.py:handle_request_defer():170] handle defer: 1
2023-04-05 01:59:54,960 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: defer
2023-04-05 01:59:54,961 INFO    SenderThread:8851 [sender.py:send_request_defer():583] handle sender defer: 1
2023-04-05 01:59:54,961 INFO    SenderThread:8851 [sender.py:transition_state():587] send defer: 2
2023-04-05 01:59:54,961 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: defer
2023-04-05 01:59:54,961 INFO    HandlerThread:8851 [handler.py:handle_request_defer():170] handle defer: 2
2023-04-05 01:59:54,961 INFO    HandlerThread:8851 [system_monitor.py:finish():193] Stopping system monitor
2023-04-05 01:59:54,962 DEBUG   SystemMonitor:8851 [system_monitor.py:_start():161] Starting system metrics aggregation loop
2023-04-05 01:59:54,963 DEBUG   SystemMonitor:8851 [system_monitor.py:_start():168] Finished system metrics aggregation loop
2023-04-05 01:59:54,964 DEBUG   SystemMonitor:8851 [system_monitor.py:_start():172] Publishing last batch of metrics
2023-04-05 01:59:54,966 INFO    HandlerThread:8851 [interfaces.py:finish():199] Joined cpu monitor
2023-04-05 01:59:54,967 INFO    HandlerThread:8851 [interfaces.py:finish():199] Joined disk monitor
2023-04-05 01:59:54,993 INFO    HandlerThread:8851 [interfaces.py:finish():199] Joined gpu monitor
2023-04-05 01:59:54,994 INFO    HandlerThread:8851 [interfaces.py:finish():199] Joined memory monitor
2023-04-05 01:59:54,994 INFO    HandlerThread:8851 [interfaces.py:finish():199] Joined network monitor
2023-04-05 01:59:54,995 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: defer
2023-04-05 01:59:54,995 INFO    SenderThread:8851 [sender.py:send_request_defer():583] handle sender defer: 2
2023-04-05 01:59:54,995 INFO    SenderThread:8851 [sender.py:transition_state():587] send defer: 3
2023-04-05 01:59:54,996 DEBUG   SenderThread:8851 [sender.py:send():336] send: stats
2023-04-05 01:59:54,996 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: defer
2023-04-05 01:59:54,997 INFO    HandlerThread:8851 [handler.py:handle_request_defer():170] handle defer: 3
2023-04-05 01:59:54,998 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: defer
2023-04-05 01:59:54,998 INFO    SenderThread:8851 [sender.py:send_request_defer():583] handle sender defer: 3
2023-04-05 01:59:54,998 INFO    SenderThread:8851 [sender.py:transition_state():587] send defer: 4
2023-04-05 01:59:54,998 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: defer
2023-04-05 01:59:54,998 INFO    HandlerThread:8851 [handler.py:handle_request_defer():170] handle defer: 4
2023-04-05 01:59:54,999 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: defer
2023-04-05 01:59:54,999 INFO    SenderThread:8851 [sender.py:send_request_defer():583] handle sender defer: 4
2023-04-05 01:59:54,999 INFO    SenderThread:8851 [sender.py:transition_state():587] send defer: 5
2023-04-05 01:59:54,999 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: defer
2023-04-05 01:59:54,999 INFO    HandlerThread:8851 [handler.py:handle_request_defer():170] handle defer: 5
2023-04-05 01:59:55,000 DEBUG   SenderThread:8851 [sender.py:send():336] send: summary
2023-04-05 01:59:55,001 INFO    SenderThread:8851 [sender.py:_save_file():1332] saving file wandb-summary.json with policy end
2023-04-05 01:59:55,001 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: defer
2023-04-05 01:59:55,001 INFO    SenderThread:8851 [sender.py:send_request_defer():583] handle sender defer: 5
2023-04-05 01:59:55,001 INFO    SenderThread:8851 [sender.py:transition_state():587] send defer: 6
2023-04-05 01:59:55,002 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: defer
2023-04-05 01:59:55,002 INFO    HandlerThread:8851 [handler.py:handle_request_defer():170] handle defer: 6
2023-04-05 01:59:55,002 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: defer
2023-04-05 01:59:55,002 INFO    SenderThread:8851 [sender.py:send_request_defer():583] handle sender defer: 6
2023-04-05 01:59:55,008 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: status_report
2023-04-05 01:59:55,256 INFO    SenderThread:8851 [sender.py:transition_state():587] send defer: 7
2023-04-05 01:59:55,256 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: defer
2023-04-05 01:59:55,257 INFO    HandlerThread:8851 [handler.py:handle_request_defer():170] handle defer: 7
2023-04-05 01:59:55,257 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: defer
2023-04-05 01:59:55,257 INFO    SenderThread:8851 [sender.py:send_request_defer():583] handle sender defer: 7
2023-04-05 01:59:55,703 INFO    Thread-13 :8851 [dir_watcher.py:_on_file_modified():295] file/dir modified: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/config.yaml
2023-04-05 01:59:55,704 INFO    Thread-13 :8851 [dir_watcher.py:_on_file_modified():295] file/dir modified: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/wandb-summary.json
2023-04-05 01:59:55,953 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: poll_exit
2023-04-05 01:59:56,376 INFO    SenderThread:8851 [sender.py:transition_state():587] send defer: 8
2023-04-05 01:59:56,377 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: poll_exit
2023-04-05 01:59:56,377 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: defer
2023-04-05 01:59:56,378 INFO    HandlerThread:8851 [handler.py:handle_request_defer():170] handle defer: 8
2023-04-05 01:59:56,378 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: defer
2023-04-05 01:59:56,378 INFO    SenderThread:8851 [sender.py:send_request_defer():583] handle sender defer: 8
2023-04-05 01:59:56,389 INFO    SenderThread:8851 [sender.py:transition_state():587] send defer: 9
2023-04-05 01:59:56,389 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: defer
2023-04-05 01:59:56,390 DEBUG   SenderThread:8851 [sender.py:send():336] send: artifact
2023-04-05 01:59:56,390 INFO    HandlerThread:8851 [handler.py:handle_request_defer():170] handle defer: 9
2023-04-05 01:59:56,704 INFO    Thread-13 :8851 [dir_watcher.py:_on_file_modified():295] file/dir modified: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/output.log
2023-04-05 01:59:56,954 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: poll_exit
2023-04-05 01:59:57,943 INFO    wandb-upload_0:8851 [upload_job.py:push():96] Uploaded file /root/.local/share/wandb/artifacts/staging/tmpl5hz4tzp
2023-04-05 01:59:57,950 INFO    wandb-upload_1:8851 [upload_job.py:push():96] Uploaded file /root/.local/share/wandb/artifacts/staging/tmpydxrxzl1
2023-04-05 01:59:59,852 INFO    SenderThread:8851 [sender.py:send_artifact():1428] sent artifact job-https___github.com_codjp_my_data_dir_vae_sweep.py - {'id': 'QXJ0aWZhY3Q6NDEzODk0MDU3', 'digest': 'ec88d1c400e70291591f8965d41aea40', 'state': 'PENDING', 'aliases': [], 'artifactSequence': {'id': 'QXJ0aWZhY3RDb2xsZWN0aW9uOjYwNDEzMzY0', 'latestArtifact': None}, 'version': 'latest'}
2023-04-05 01:59:59,852 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: defer
2023-04-05 01:59:59,852 INFO    SenderThread:8851 [sender.py:send_request_defer():583] handle sender defer: 9
2023-04-05 01:59:59,852 INFO    SenderThread:8851 [dir_watcher.py:finish():365] shutting down directory watcher
2023-04-05 02:00:00,705 INFO    SenderThread:8851 [dir_watcher.py:finish():395] scan: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files
2023-04-05 02:00:00,706 INFO    SenderThread:8851 [dir_watcher.py:finish():409] scan save: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/wandb-metadata.json wandb-metadata.json
2023-04-05 02:00:00,706 INFO    SenderThread:8851 [dir_watcher.py:finish():409] scan save: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/config.yaml config.yaml
2023-04-05 02:00:00,706 INFO    SenderThread:8851 [dir_watcher.py:finish():409] scan save: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/conda-environment.yaml conda-environment.yaml
2023-04-05 02:00:00,717 INFO    SenderThread:8851 [dir_watcher.py:finish():409] scan save: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/wandb-summary.json wandb-summary.json
2023-04-05 02:00:00,718 INFO    SenderThread:8851 [dir_watcher.py:finish():409] scan save: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/output.log output.log
2023-04-05 02:00:00,718 INFO    SenderThread:8851 [dir_watcher.py:finish():409] scan save: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/requirements.txt requirements.txt
2023-04-05 02:00:00,727 INFO    SenderThread:8851 [sender.py:transition_state():587] send defer: 10
2023-04-05 02:00:00,727 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: poll_exit
2023-04-05 02:00:00,732 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: defer
2023-04-05 02:00:00,738 INFO    HandlerThread:8851 [handler.py:handle_request_defer():170] handle defer: 10
2023-04-05 02:00:00,748 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: defer
2023-04-05 02:00:00,748 INFO    SenderThread:8851 [sender.py:send_request_defer():583] handle sender defer: 10
2023-04-05 02:00:00,749 INFO    SenderThread:8851 [file_pusher.py:finish():164] shutting down file pusher
2023-04-05 02:00:01,262 INFO    wandb-upload_1:8851 [upload_job.py:push():138] Uploaded file /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/wandb-summary.json
2023-04-05 02:00:01,348 INFO    wandb-upload_0:8851 [upload_job.py:push():138] Uploaded file /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/config.yaml
2023-04-05 02:00:01,357 INFO    wandb-upload_2:8851 [upload_job.py:push():138] Uploaded file /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/output.log
2023-04-05 02:00:01,533 INFO    wandb-upload_3:8851 [upload_job.py:push():138] Uploaded file /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/files/requirements.txt
2023-04-05 02:00:01,733 INFO    Thread-12 :8851 [sender.py:transition_state():587] send defer: 11
2023-04-05 02:00:01,734 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: defer
2023-04-05 02:00:01,734 INFO    HandlerThread:8851 [handler.py:handle_request_defer():170] handle defer: 11
2023-04-05 02:00:01,735 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: defer
2023-04-05 02:00:01,735 INFO    SenderThread:8851 [sender.py:send_request_defer():583] handle sender defer: 11
2023-04-05 02:00:01,735 INFO    SenderThread:8851 [file_pusher.py:join():169] waiting for file pusher
2023-04-05 02:00:01,735 INFO    SenderThread:8851 [sender.py:transition_state():587] send defer: 12
2023-04-05 02:00:01,735 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: defer
2023-04-05 02:00:01,735 INFO    HandlerThread:8851 [handler.py:handle_request_defer():170] handle defer: 12
2023-04-05 02:00:01,736 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: defer
2023-04-05 02:00:01,736 INFO    SenderThread:8851 [sender.py:send_request_defer():583] handle sender defer: 12
2023-04-05 02:00:01,922 INFO    SenderThread:8851 [sender.py:transition_state():587] send defer: 13
2023-04-05 02:00:01,923 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: defer
2023-04-05 02:00:01,923 INFO    HandlerThread:8851 [handler.py:handle_request_defer():170] handle defer: 13
2023-04-05 02:00:01,923 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: defer
2023-04-05 02:00:01,924 INFO    SenderThread:8851 [sender.py:send_request_defer():583] handle sender defer: 13
2023-04-05 02:00:01,924 INFO    SenderThread:8851 [sender.py:transition_state():587] send defer: 14
2023-04-05 02:00:01,925 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: defer
2023-04-05 02:00:01,925 DEBUG   SenderThread:8851 [sender.py:send():336] send: final
2023-04-05 02:00:01,925 INFO    HandlerThread:8851 [handler.py:handle_request_defer():170] handle defer: 14
2023-04-05 02:00:01,926 DEBUG   SenderThread:8851 [sender.py:send():336] send: footer
2023-04-05 02:00:01,926 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: defer
2023-04-05 02:00:01,926 INFO    SenderThread:8851 [sender.py:send_request_defer():583] handle sender defer: 14
2023-04-05 02:00:01,928 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: poll_exit
2023-04-05 02:00:01,928 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: poll_exit
2023-04-05 02:00:01,929 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: server_info
2023-04-05 02:00:01,930 DEBUG   SenderThread:8851 [sender.py:send_request():363] send_request: server_info
2023-04-05 02:00:01,937 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: get_summary
2023-04-05 02:00:01,939 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: sampled_history
2023-04-05 02:00:02,129 INFO    MainThread:8851 [wandb_run.py:_footer_history_summary_info():3422] rendering history
2023-04-05 02:00:02,129 INFO    MainThread:8851 [wandb_run.py:_footer_history_summary_info():3454] rendering summary
2023-04-05 02:00:02,130 INFO    MainThread:8851 [wandb_run.py:_footer_sync_info():3380] logging synced files
2023-04-05 02:00:02,130 DEBUG   HandlerThread:8851 [handler.py:handle_request():144] handle_request: shutdown
2023-04-05 02:00:02,131 INFO    HandlerThread:8851 [handler.py:finish():842] shutting down handler
2023-04-05 02:00:02,930 INFO    WriterThread:8851 [datastore.py:close():298] close: /root/src/my_data_dir/wandb/run-20230405_015948-4q1n1gag/run-4q1n1gag.wandb
2023-04-05 02:00:03,129 INFO    SenderThread:8851 [sender.py:finish():1504] shutting down sender
2023-04-05 02:00:03,129 INFO    SenderThread:8851 [file_pusher.py:finish():164] shutting down file pusher
2023-04-05 02:00:03,129 INFO    SenderThread:8851 [file_pusher.py:join():169] waiting for file pusher

Thank you!

Hello! It looks like there is a connection issue between your machine and the wandb server. Is your machine behind a load balancer, a VPN, or a proxy that may be blocking the connection? I ask because sent = self._sock.send(data) is the main error in the stack trace, which means the client is struggling to send data to the server.
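One quick way to test this from the training machine is to try opening a TCP connection yourself. The snippet below is only a rough sketch: the helper name can_reach is made up for illustration, and it assumes you are using the default public endpoint api.wandb.ai on port 443 (a self-hosted server would have a different host). A successful connection only shows the port is reachable; it does not rule out a proxy interfering later.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration; not part of the wandb package.
    """
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, running print(can_reach("api.wandb.ai")) inside the same container or environment that launches the sweep should print True if the server is reachable from there.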


That most likely seems to be the reason. The machine running my code is behind a VPN, and it may not be able to reach the wandb server.
