I found that when I launch a Weights & Biases (wandb) service with simulated data alone, service communication works without issues. However, as soon as I also load a model onto the GPU, the wandb service stops immediately (with the same error as mentioned above). If I restart the wandb service at that point, it stops again after a fixed interval (about one minute). Could this be related to the load balancer?
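In the meantime, the workaround I am experimenting with is to retry the log call when the service socket breaks, since a restarted service can sometimes pick the connection back up. This is only a minimal sketch using the standard library; `log_fn` is a placeholder for whatever actually performs the logging (in my case `accelerator.log`), not part of either API:

```python
import time

def log_with_retry(log_fn, values, step, retries=3, delay=2.0):
    """Call log_fn(values, step=step), retrying on BrokenPipeError.

    Returns True if a call eventually succeeded, False if every
    attempt hit a broken socket (e.g. the wandb service died).
    """
    for attempt in range(retries):
        try:
            log_fn(values, step=step)
            return True
        except BrokenPipeError:
            # The service process is gone; wait briefly and retry,
            # in case a restarted service reattaches the socket.
            time.sleep(delay)
    return False
```

This does not fix the underlying crash, of course; it only keeps the training loop from dying on a single failed log call.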
Training /chenhui/zhangwuhan/stage2/trained_model/qwen1.5_7b_5_5e-5_2_1k_plugin 0
0%| | 0/14272 [00:06<?, ?it/s, train_loss=4.85]2024-04-27 17:18:41,774 - DEBUG - Successfully logged to WandB
0%| | 1/14272 [00:11<25:10:01, 6.35s/it, train_loss=2.9]2024-04-27 17:18:47,054 - DEBUG - Successfully logged to WandB
0%| | 2/14272 [00:16<22:40:19, 5.72s/it, train_loss=2.25]2024-04-27 17:18:52,202 - DEBUG - Successfully logged to WandB
0%| | 3/14272 [00:22<21:38:07, 5.46s/it, train_loss=2.09]2024-04-27 17:18:57,456 - DEBUG - Successfully logged to WandB
0%| | 4/14272 [00:27<21:18:49, 5.38s/it, train_loss=2.02]2024-04-27 17:19:02,601 - DEBUG - Successfully logged to WandB
0%| | 5/14272 [00:32<20:58:47, 5.29s/it, train_loss=1.88]2024-04-27 17:19:07,853 - DEBUG - Successfully logged to WandB
0%| | 6/14272 [00:38<25:34:01, 6.45s/it, train_loss=1.87]
Traceback (most recent call last):
File "/chenhui/zhangwuhan/stage2/FastChat/fastchat/train/qwen1.5_7b_5_5e-5_2_1k_plugin.py", line 290, in <module>
main()
File "/chenhui/zhangwuhan/stage2/FastChat/fastchat/train/qwen1.5_7b_5_5e-5_2_1k_plugin.py", line 283, in main
accelerator.log({"train_loss": loss.item()}, step=batch_idx)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 602, in _inner
return PartialState().on_main_process(function)(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2267, in log
tracker.log(values, step=step, **log_kwargs.get(tracker.name, {}))
File "/usr/local/lib/python3.10/dist-packages/accelerate/tracking.py", line 86, in execute_on_main_process
return function(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/tracking.py", line 333, in log
self.run.log(values, step=step, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 420, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 371, in wrapper_fn
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 361, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1838, in log
self._log(data=data, step=step, commit=commit)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1602, in _log
self._partial_history_callback(data, step, commit)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1474, in _partial_history_callback
self._backend.interface.publish_partial_history(
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 602, in publish_partial_history
self._publish_partial_history(partial_history)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 89, in _publish_partial_history
self._publish(rec)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
wandb: While tearing down the service manager. The following error has occurred: [Errno 32] Broken pipe
2024-04-27 17:19:14,144 - DEBUG - Starting new HTTPS connection (1): o151352.ingest.sentry.io:443
2024-04-27 17:19:16,144 - DEBUG - Attempting to acquire lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,144 - DEBUG - Lock 140175619989824 acquired on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Attempting to release lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Lock 140175619989824 released on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Attempting to acquire lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Lock 140175619989824 acquired on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-04-27 17:19:16,146 - DEBUG - Attempting to release lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-04-27 17:19:16,146 - DEBUG - Lock 140175619989824 released on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
I hope the maintainers can resolve this issue as soon as possible.