Resolving Wandb Service Interruption When Using GPU for Model Training

I found that when I run the Weights & Biases (wandb) service with simulated data alone, service communication works without issues. However, as soon as I also load a model onto the GPU, the wandb service stops immediately (with the same error as mentioned above). If I restart the wandb service at that point, it automatically stops again after a fixed period (about one minute). Could this be related to the load balancer?
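
For context, the failing call is the accelerator.log(...) line shown in the traceback below; a minimal sketch of that logging pattern, with a placeholder model, data, and project name rather than the actual training script, looks roughly like this:

# Minimal sketch of the Accelerate + wandb logging pattern from the traceback.
# Model, data, and project name are placeholders, not the real training script.
import torch
from torch import nn
from accelerate import Accelerator

accelerator = Accelerator(log_with="wandb")            # route accelerator.log() to wandb
accelerator.init_trackers(project_name="debug-run")    # placeholder project name

model = nn.Linear(16, 1)                               # stand-in for the real checkpoint
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model, optimizer = accelerator.prepare(model, optimizer)

for batch_idx in range(100):
    x = torch.randn(8, 16, device=accelerator.device)
    loss = model(x).pow(2).mean()
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
    # This is the call that raises BrokenPipeError once the wandb service dies.
    accelerator.log({"train_loss": loss.item()}, step=batch_idx)

accelerator.end_training()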

Training /chenhui/zhangwuhan/stage2/trained_model/qwen1.5_7b_5_5e-5_2_1k_plugin 0
  0%|                                                                                                 | 0/14272 [00:06<?, ?it/s, train_loss=4.85]2024-04-27 17:18:41,774 - DEBUG - Successfully logged to WandB
  0%|                                                                                       | 1/14272 [00:11<25:10:01,  6.35s/it, train_loss=2.9]2024-04-27 17:18:47,054 - DEBUG - Successfully logged to WandB
  0%|                                                                                      | 2/14272 [00:16<22:40:19,  5.72s/it, train_loss=2.25]2024-04-27 17:18:52,202 - DEBUG - Successfully logged to WandB
  0%|                                                                                      | 3/14272 [00:22<21:38:07,  5.46s/it, train_loss=2.09]2024-04-27 17:18:57,456 - DEBUG - Successfully logged to WandB
  0%|                                                                                      | 4/14272 [00:27<21:18:49,  5.38s/it, train_loss=2.02]2024-04-27 17:19:02,601 - DEBUG - Successfully logged to WandB
  0%|                                                                                      | 5/14272 [00:32<20:58:47,  5.29s/it, train_loss=1.88]2024-04-27 17:19:07,853 - DEBUG - Successfully logged to WandB
  0%|                                                                                      | 6/14272 [00:38<25:34:01,  6.45s/it, train_loss=1.87]
Traceback (most recent call last):
  File "/chenhui/zhangwuhan/stage2/FastChat/fastchat/train/qwen1.5_7b_5_5e-5_2_1k_plugin.py", line 290, in <module>
    main()
  File "/chenhui/zhangwuhan/stage2/FastChat/fastchat/train/qwen1.5_7b_5_5e-5_2_1k_plugin.py", line 283, in main
    accelerator.log({"train_loss":  loss.item()}, step=batch_idx)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 602, in _inner
    return PartialState().on_main_process(function)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2267, in log
    tracker.log(values, step=step, **log_kwargs.get(tracker.name, {}))
  File "/usr/local/lib/python3.10/dist-packages/accelerate/tracking.py", line 86, in execute_on_main_process
    return function(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/tracking.py", line 333, in log
    self.run.log(values, step=step, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 420, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 371, in wrapper_fn
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 361, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1838, in log
    self._log(data=data, step=step, commit=commit)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1602, in _log
    self._partial_history_callback(data, step, commit)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1474, in _partial_history_callback
    self._backend.interface.publish_partial_history(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 602, in publish_partial_history
    self._publish_partial_history(partial_history)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 89, in _publish_partial_history
    self._publish(rec)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
wandb: While tearing down the service manager. The following error has occurred: [Errno 32] Broken pipe
2024-04-27 17:19:14,144 - DEBUG - Starting new HTTPS connection (1): o151352.ingest.sentry.io:443
2024-04-27 17:19:16,144 - DEBUG - Attempting to acquire lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,144 - DEBUG - Lock 140175619989824 acquired on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Attempting to release lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Lock 140175619989824 released on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Attempting to acquire lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Lock 140175619989824 acquired on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-04-27 17:19:16,146 - DEBUG - Attempting to release lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-04-27 17:19:16,146 - DEBUG - Lock 140175619989824 released on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock

I hope the wandb team can resolve this issue as soon as possible.

Hello @endnone, thank you for reaching out and happy to help. Could you please provide the following so we can investigate further (a quick way to gather the first two items is sketched below the list):

- SDK debug logs
- The output of wandb --version
- Code snippets
- The notebook environment you are currently using
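
In case it helps, one quick way to collect the version and debug-log details (the file paths below are the wandb defaults; the run directory may differ on your setup):

# Print the SDK version and point at the default debug-log locations.
import platform
import wandb

print("wandb version :", wandb.__version__)        # same value as wandb --version
print("python version:", platform.python_version())
# Per-run SDK debug logs are written next to the training script, typically at
#   ./wandb/debug.log           (client-side log of the latest run)
#   ./wandb/debug-internal.log  (service-side log of the latest run)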

@joana-marie
I’ve uploaded the script and additional details. You can access them here: wandb_debug.

Please let me know if there’s anything else you need from my end to assist further.

Additionally, some similar issues have been discussed in this thread on the wandb GitHub issues page: wandb/issues/6449. It might provide some useful insights.

Hi @endnone, thank you for the detailed information. Just to clarify, is the Python script in this GitHub repo the one raising that BrokenPipeError for you? Additionally, from these logs it seems that the Python process was killed due to a lack of available memory. May I ask what your current RAM size is, and whether you run into a similar issue with a training job that doesn't require as many resources?

The issue mainly occurs when the wandb service is automatically terminated after a checkpoint is loaded onto the GPU. If the wandb service is not started, there is enough memory to run the script to completion. RAM information is as follows:

MemTotal:       792264260 kB
MemFree:        514097244 kB
MemAvailable:   733915564 kB
Buffers:          150484 kB
Cached:         222294576 kB
SwapCached:            0 kB
Active:         223218892 kB
Inactive:        6503552 kB
Active(anon):    7331772 kB
Inactive(anon):  2845412 kB
Active(file):   215887120 kB
Inactive(file):  3658140 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               140 kB
Writeback:             0 kB
AnonPages:       7278700 kB
Mapped:           912796 kB
Shmem:           2899788 kB
Slab:           22179064 kB
SReclaimable:    1695732 kB
SUnreclaim:     20483332 kB
KernelStack:       59296 kB
PageTables:        48668 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    396132128 kB
Committed_AS:   25350352 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     2931444 kB
VmallocChunk:   33683423228 kB
Percpu:           127488 kB
HardwareCorrupted:     0 kB
AnonHugePages:   2586624 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:    767213396 kB
DirectMap2M:    33757184 kB
DirectMap1G:     4194304 kB
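
As a side note, while this is being investigated, a possible stopgap (a sketch only, not an official fix) is to run wandb in offline mode so the training loop does not depend on the local service socket, guard the logging call, and upload the run later with wandb sync:

# Stopgap sketch, not an official fix: log offline and guard the log call so a
# dead wandb service cannot kill the training loop. Offline runs can be
# uploaded afterwards with the wandb sync CLI command.
import os

os.environ["WANDB_MODE"] = "offline"    # must be set before wandb is initialized

from accelerate import Accelerator

accelerator = Accelerator(log_with="wandb")
accelerator.init_trackers(project_name="debug-run")   # placeholder project name

def safe_log(values, step):
    """Log to wandb, but never let a broken service socket crash training."""
    try:
        accelerator.log(values, step=step)
    except BrokenPipeError as err:
        print(f"wandb logging skipped at step {step}: {err}")

Replacing the direct accelerator.log(...) call with safe_log(...) keeps the training loop alive even if the service process exits mid-run.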

Thank you for the additional information, @endnone, it looks like you have sufficient memory resources. I am wondering how you're restarting the service you mentioned in your original post here? Also, it seems you're on an older SDK version, 0.15.5, and I was wondering if you're still noticing the same issue after upgrading to our most recent SDK release, 0.17.0?
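
In case it helps when testing the upgrade, a small check (a sketch, assuming the upgrade is done with pip install --upgrade wandb) to confirm which SDK build the training environment actually imports:

# Confirm the wandb build the training environment imports after upgrading.
import wandb

major, minor = (int(p) for p in wandb.__version__.split(".")[:2])
print("wandb version:", wandb.__version__)
if (major, minor) < (0, 17):
    print("Warning: an older wandb build is still being imported.")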