requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection r

Hi all,
I am facing a problem while training my model.
The training crashed because of the wandb run.
Could anyone explain why this error happened and how I can avoid it?

The following is the wandb debug-internal.log:

2023-07-30 21:32:45,718 INFO    StreamThr :3775545 [internal.py:wandb_internal():86] W&B internal server running at pid: 3775545, started at: 2023-07-30 21:32:45.716165
2023-07-30 21:32:45,721 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: status
2023-07-30 21:32:45,725 INFO    WriterThread:3775545 [datastore.py:open_for_write():85] open: /ai/mnt/code/YOLOX/wandb/run-20230730_213245-42ltpe8i/run-42ltpe8i.wandb
2023-07-30 21:32:45,728 DEBUG   SenderThread:3775545 [sender.py:send():379] send: header
2023-07-30 21:32:45,800 DEBUG   SenderThread:3775545 [sender.py:send():379] send: run
2023-07-30 21:32:45,823 INFO    SenderThread:3775545 [sender.py:_maybe_setup_resume():758] checking resume status for None/YOLOX/42ltpe8i
2023-07-30 21:32:46,822 INFO    SenderThread:3775545 [dir_watcher.py:__init__():211] watching files in: /ai/mnt/code/YOLOX/wandb/run-20230730_213245-42ltpe8i/files
2023-07-30 21:32:46,822 INFO    SenderThread:3775545 [sender.py:_start_run_threads():1122] run started: 42ltpe8i with start time 1690723965.717156
2023-07-30 21:32:46,822 DEBUG   SenderThread:3775545 [sender.py:send_request():406] send_request: summary_record
2023-07-30 21:32:46,824 INFO    SenderThread:3775545 [sender.py:_save_file():1376] saving file wandb-summary.json with policy end
2023-07-30 21:32:46,839 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: check_version
2023-07-30 21:32:46,840 DEBUG   SenderThread:3775545 [sender.py:send_request():406] send_request: check_version
2023-07-30 21:32:47,276 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: run_start
2023-07-30 21:32:47,291 DEBUG   HandlerThread:3775545 [system_info.py:__init__():31] System info init
2023-07-30 21:32:47,291 DEBUG   HandlerThread:3775545 [system_info.py:__init__():46] System info init done
2023-07-30 21:32:47,291 INFO    HandlerThread:3775545 [system_monitor.py:start():181] Starting system monitor
2023-07-30 21:32:47,292 INFO    SystemMonitor:3775545 [system_monitor.py:_start():145] Starting system asset monitoring threads
2023-07-30 21:32:47,292 INFO    HandlerThread:3775545 [system_monitor.py:probe():201] Collecting system info
2023-07-30 21:32:47,292 INFO    SystemMonitor:3775545 [interfaces.py:start():190] Started cpu monitoring
2023-07-30 21:32:47,293 INFO    SystemMonitor:3775545 [interfaces.py:start():190] Started disk monitoring
2023-07-30 21:32:47,294 INFO    SystemMonitor:3775545 [interfaces.py:start():190] Started gpu monitoring
2023-07-30 21:32:47,295 INFO    SystemMonitor:3775545 [interfaces.py:start():190] Started memory monitoring
2023-07-30 21:32:47,295 INFO    SystemMonitor:3775545 [interfaces.py:start():190] Started network monitoring
2023-07-30 21:32:47,448 DEBUG   HandlerThread:3775545 [system_info.py:probe():195] Probing system
2023-07-30 21:32:47,453 DEBUG   HandlerThread:3775545 [system_info.py:_probe_git():180] Probing git
2023-07-30 21:32:47,485 DEBUG   HandlerThread:3775545 [system_info.py:_probe_git():188] Probing git done
2023-07-30 21:32:47,486 DEBUG   HandlerThread:3775545 [system_info.py:probe():240] Probing system done
2023-07-30 21:32:47,486 DEBUG   HandlerThread:3775545 [system_monitor.py:probe():210] {'os': 'Linux-5.8.0-43-generic-x86_64-with-glibc2.31', 'python': '3.9.1', 'heartbeatAt': '2023-07-30T13:32:47.449454', 'startedAt': '2023-07-30T13:32:45.684482', 'docker': None, 'cuda': None, 'args': (), 'state': 'running', 'program': '/ai/mnt/code/YOLOX/train.py', 'codePath': 'train.py', 'git': {'remote': , 'commit': 'e73eb9f8b3c0c70493bcc38f4fcf3fbdfaa6da07'}, 'email': 'x, 'root': '/ai/mnt/code/YOLOX', 'host': 'ai', 'username': 'root', 'executable': '/root/miniconda3/bin/python', 'cpu_count': 64, 'cpu_count_logical': 128, 'cpu_freq': {'current': 977.1614218749996, 'min': 800.0, 'max': 3400.0}, 'cpu_freq_per_core': [{'current': 839.44, 'min': 800.0, 'max': 3400.0}, {'current': 799.629, 'min': 800.0, 'max': 3400.0},], 'disk': {'total': 7096.5516357421875, 'used': 371.7781410217285}, 'gpu': 'NVIDIA GeForce RTX 3090', 'gpu_count': 1, 'gpu_devices': [{'name': 'NVIDIA GeForce RTX 3090', 'memory_total': 25769803776}], 'memory': {'total': 48.0}}
2023-07-30 21:32:47,487 INFO    HandlerThread:3775545 [system_monitor.py:probe():211] Finished collecting system info
2023-07-30 21:32:47,487 INFO    HandlerThread:3775545 [system_monitor.py:probe():214] Publishing system info
2023-07-30 21:32:47,487 DEBUG   HandlerThread:3775545 [system_info.py:_save_pip():51] Saving list of pip packages installed into the current environment
2023-07-30 21:32:47,489 DEBUG   HandlerThread:3775545 [system_info.py:_save_pip():67] Saving pip packages done
2023-07-30 21:32:47,489 DEBUG   HandlerThread:3775545 [system_info.py:_save_conda():74] Saving list of conda packages installed into the current environment
2023-07-30 21:32:47,830 INFO    Thread-12 :3775545 [dir_watcher.py:_on_file_created():272] file/dir created: /ai/mnt/code/YOLOX/wandb/run-20230730_213245-42ltpe8i/files/requirements.txt
2023-07-30 21:32:47,831 INFO    Thread-12 :3775545 [dir_watcher.py:_on_file_created():272] file/dir created: /ai/mnt/code/YOLOX/wandb/run-20230730_213245-42ltpe8i/files/wandb-summary.json
2023-07-30 21:32:47,832 INFO    Thread-12 :3775545 [dir_watcher.py:_on_file_created():272] file/dir created: /ai/mnt/code/YOLOX/wandb/run-20230730_213245-42ltpe8i/files/conda-environment.yaml
2023-07-30 21:32:48,829 INFO    Thread-12 :3775545 [dir_watcher.py:_on_file_modified():289] file/dir modified: /ai/mnt/code/YOLOX/wandb/run-20230730_213245-42ltpe8i/files/requirements.txt
2023-07-30 21:32:48,830 INFO    Thread-12 :3775545 [dir_watcher.py:_on_file_modified():289] file/dir modified: /ai/mnt/code/YOLOX/wandb/run-20230730_213245-42ltpe8i/files/conda-environment.yaml
2023-07-30 21:32:52,509 DEBUG   HandlerThread:3775545 [system_info.py:_save_conda():86] Saving conda packages done
2023-07-30 21:32:52,514 INFO    HandlerThread:3775545 [system_monitor.py:probe():216] Finished publishing system info
2023-07-30 21:32:52,526 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: status_report
2023-07-30 21:32:52,527 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: keepalive
2023-07-30 21:32:52,528 DEBUG   SenderThread:3775545 [sender.py:send():379] send: files
2023-07-30 21:32:52,528 INFO    SenderThread:3775545 [sender.py:_save_file():1376] saving file wandb-metadata.json with policy now
2023-07-30 21:32:52,541 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: stop_status
2023-07-30 21:32:52,542 DEBUG   SenderThread:3775545 [sender.py:send_request():406] send_request: stop_status
2023-07-30 21:32:52,835 INFO    Thread-12 :3775545 [dir_watcher.py:_on_file_created():272] file/dir created: /ai/mnt/code/YOLOX/wandb/run-20230730_213245-42ltpe8i/files/wandb-metadata.json
2023-07-30 21:32:52,948 DEBUG   SenderThread:3775545 [sender.py:send():379] send: telemetry
2023-07-30 21:32:52,949 DEBUG   SenderThread:3775545 [sender.py:send():379] send: config
2023-07-30 21:32:52,950 DEBUG   SenderThread:3775545 [sender.py:send():379] send: metric
2023-07-30 21:32:52,951 DEBUG   SenderThread:3775545 [sender.py:send():379] send: telemetry
2023-07-30 21:32:52,951 DEBUG   SenderThread:3775545 [sender.py:send():379] send: metric
2023-07-30 21:32:52,951 WARNING SenderThread:3775545 [sender.py:send_metric():1327] Seen metric with glob (shouldn't happen)
2023-07-30 21:32:52,952 DEBUG   SenderThread:3775545 [sender.py:send():379] send: metric
2023-07-30 21:32:52,952 DEBUG   SenderThread:3775545 [sender.py:send():379] send: metric
2023-07-30 21:32:52,952 WARNING SenderThread:3775545 [sender.py:send_metric():1327] Seen metric with glob (shouldn't happen)
2023-07-30 21:32:53,158 ERROR   wandb-upload_0:3775545 [internal_api.py:upload_file():2290] upload_file exception https://storage.googleapis.com/wandb-production.appspot.com/scottxu-wandb/YOLOX/42ltpe8i/wandb-metadata.json?Expires=1690810372&GoogleAccessId=gorilla-files-url-signer-man%40wandb-
2023-07-30 21:32:57,252 ERROR   wandb-upload_0:3775545 [internal_api.py:upload_file():2290] upload_file exception https://storage.googleapis.com/wandb-production.appspot.com/scottxu-wandb/YOLOX/42ltpe8i/wandb-metadata.json?Expires=1690810372&GoogleAccessId=gorilla-files-url-signer-man%40wandb-'Connection reset by peer'))
2023-07-30 21:32:57,252 ERROR   wandb-upload_0:3775545 [internal_api.py:upload_file():2292] upload_file request headers: {'User-Agent': 'python-requests/2.29.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '14580'}
2023-07-30 21:32:57,253 ERROR   wandb-upload_0:3775545 [internal_api.py:upload_file():2294] upload_file response body: 
2023-07-30 21:32:57,253 INFO    wandb-upload_0:3775545 [retry.py:__call__():172] Retry attempt failed:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/root/miniconda3/lib/python3.9/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/root/miniconda3/lib/python3.9/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/root/miniconda3/lib/python3.9/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ConnectionResetError: [Errno 104] Connection reset by peer
wandb.sdk.lib.retry.TransientError: None
2023-07-30 21:33:01,487 ERROR   wandb-upload_0:3775545 [internal_api.py:upload_file():2290] upload_file exception https://storage.googleapis.com/wandb-production.appspot.com/scottxu-wandb/YOLOX/42ltpe8i/wandb-metadata.json?Expires=1690810372&GoogleAccessId=gorilla-files-url-signer-man%40wandb-: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
2023-07-30 21:33:01,487 ERROR   wandb-upload_0:3775545 [internal_api.py:upload_file():2292] upload_file request headers: {'User-Agent': 'python-requests/2.29.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '14580'}
2023-07-30 21:33:01,487 ERROR   wandb-upload_0:3775545 [internal_api.py:upload_file():2294] upload_file response body: 
2023-07-30 21:33:01,955 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: status_report
2023-07-30 21:33:06,955 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: status_report
2023-07-30 21:33:07,542 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: stop_status
2023-07-30 21:33:07,543 DEBUG   SenderThread:3775545 [sender.py:send_request():406] send_request: stop_status
2023-07-30 21:33:10,363 INFO    wandb-upload_0:3775545 [upload_job.py:push():131] Uploaded file /tmp/tmpct3libqbwandb/uwou19xg-wandb-metadata.json
2023-07-30 21:33:12,799 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: status_report
2023-07-30 21:33:15,972 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: log_artifact
2023-07-30 21:33:15,974 DEBUG   SenderThread:3775545 [sender.py:send_request():406] send_request: log_artifact
2023-07-30 21:33:17,317 INFO    SenderThread:3775545 [sender.py:send_request_log_artifact():1442] logged artifact validation_images - {'id': 'QXJ0aWZhY3Q6NDg2ODcwNDEz', 'digest': 'c2b2d644acc9d8f163b2c555a6207f98', 'state': 'COMMITTED', 'aliases': [{'artifactCollectionName': 'validation_images', 'alias': 'latest'}, {'artifactCollectionName': 'validation_images', 'alias': 'v1'}], 'artifactSequence': {'id': 'QXJ0aWZhY3RDb2xsZWN0aW9uOjc2MzYyMTU4', 'latestArtifact': {'id': 'QXJ0aWZhY3Q6NDg2ODcwNDEz', 'versionIndex': 1}}, 'version': 'v1'}
2023-07-30 21:33:17,871 DEBUG   SenderThread:3775545 [sender.py:send():379] send: exit
2023-07-30 21:33:17,872 INFO    SenderThread:3775545 [sender.py:send_exit():584] handling exit code: 0
2023-07-30 21:33:17,872 INFO    SenderThread:3775545 [sender.py:send_exit():586] handling runtime: 30
2023-07-30 21:33:17,877 INFO    SenderThread:3775545 [sender.py:_save_file():1376] saving file wandb-summary.json with policy end
2023-07-30 21:33:17,877 INFO    SenderThread:3775545 [sender.py:send_exit():592] send defer
2023-07-30 21:33:17,883 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: defer
2023-07-30 21:33:17,886 INFO    HandlerThread:3775545 [handler.py:handle_request_defer():170] handle defer: 0
2023-07-30 21:33:17,887 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: status_report
2023-07-30 21:33:18,326 DEBUG   SenderThread:3775545 [sender.py:send_request():406] send_request: defer
2023-07-30 21:33:18,327 INFO    SenderThread:3775545 [sender.py:send_request_defer():608] handle sender defer: 0
2023-07-30 21:33:18,327 INFO    SenderThread:3775545 [sender.py:transition_state():612] send defer: 1
2023-07-30 21:33:18,327 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: defer
2023-07-30 21:33:18,327 INFO    HandlerThread:3775545 [handler.py:handle_request_defer():170] handle defer: 1
2023-07-30 21:33:18,328 DEBUG   SenderThread:3775545 [sender.py:send_request():406] send_request: defer
2023-07-30 21:33:18,329 INFO    SenderThread:3775545 [sender.py:send_request_defer():608] handle sender defer: 1
2023-07-30 21:33:18,329 INFO    SenderThread:3775545 [sender.py:transition_state():612] send defer: 2
2023-07-30 21:33:18,329 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: defer
2023-07-30 21:33:18,329 INFO    HandlerThread:3775545 [handler.py:handle_request_defer():170] handle defer: 2
2023-07-30 21:33:18,329 INFO    HandlerThread:3775545 [system_monitor.py:finish():190] Stopping system monitor
2023-07-30 21:33:18,330 DEBUG   SystemMonitor:3775545 [system_monitor.py:_start():159] Starting system metrics aggregation loop
2023-07-30 21:33:18,330 DEBUG   SystemMonitor:3775545 [system_monitor.py:_start():166] Finished system metrics aggregation loop
2023-07-30 21:33:18,331 INFO    HandlerThread:3775545 [interfaces.py:finish():202] Joined cpu monitor
2023-07-30 21:33:18,332 DEBUG   SystemMonitor:3775545 [system_monitor.py:_start():170] Publishing last batch of metrics
2023-07-30 21:33:18,332 INFO    HandlerThread:3775545 [interfaces.py:finish():202] Joined disk monitor
2023-07-30 21:33:18,356 INFO    HandlerThread:3775545 [interfaces.py:finish():202] Joined gpu monitor
2023-07-30 21:33:18,357 INFO    HandlerThread:3775545 [interfaces.py:finish():202] Joined memory monitor
2023-07-30 21:33:18,357 INFO    HandlerThread:3775545 [interfaces.py:finish():202] Joined network monitor
2023-07-30 21:33:18,358 DEBUG   SenderThread:3775545 [sender.py:send_request():406] send_request: defer
2023-07-30 21:33:18,359 INFO    SenderThread:3775545 [sender.py:send_request_defer():608] handle sender defer: 2
2023-07-30 21:33:18,359 INFO    SenderThread:3775545 [sender.py:transition_state():612] send defer: 3
2023-07-30 21:33:18,359 DEBUG   SenderThread:3775545 [sender.py:send():379] send: stats
2023-07-30 21:33:18,359 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: defer
2023-07-30 21:33:18,362 INFO    HandlerThread:3775545 [handler.py:handle_request_defer():170] handle defer: 3
2023-07-30 21:33:18,363 DEBUG   SenderThread:3775545 [sender.py:send_request():406] send_request: defer
2023-07-30 21:33:18,363 INFO    SenderThread:3775545 [sender.py:send_request_defer():608] handle sender defer: 3
2023-07-30 21:33:18,364 INFO    SenderThread:3775545 [sender.py:transition_state():612] send defer: 4
2023-07-30 21:33:18,364 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: defer
2023-07-30 21:33:18,364 INFO    HandlerThread:3775545 [handler.py:handle_request_defer():170] handle defer: 4
2023-07-30 21:33:18,365 DEBUG   SenderThread:3775545 [sender.py:send_request():406] send_request: defer
2023-07-30 21:33:18,365 INFO    SenderThread:3775545 [sender.py:send_request_defer():608] handle sender defer: 4
2023-07-30 21:33:18,365 INFO    SenderThread:3775545 [sender.py:transition_state():612] send defer: 5
2023-07-30 21:33:18,365 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: defer
2023-07-30 21:33:18,365 INFO    HandlerThread:3775545 [handler.py:handle_request_defer():170] handle defer: 5
2023-07-30 21:33:18,366 DEBUG   SenderThread:3775545 [sender.py:send():379] send: summary
2023-07-30 21:33:18,367 INFO    SenderThread:3775545 [sender.py:_save_file():1376] saving file wandb-summary.json with policy end
2023-07-30 21:33:18,368 DEBUG   SenderThread:3775545 [sender.py:send_request():406] send_request: defer
2023-07-30 21:33:18,368 INFO    SenderThread:3775545 [sender.py:send_request_defer():608] handle sender defer: 5
2023-07-30 21:33:18,368 INFO    SenderThread:3775545 [sender.py:transition_state():612] send defer: 6
2023-07-30 21:33:18,368 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: defer
2023-07-30 21:33:18,369 INFO    HandlerThread:3775545 [handler.py:handle_request_defer():170] handle defer: 6
2023-07-30 21:33:18,369 DEBUG   SenderThread:3775545 [sender.py:send_request():406] send_request: defer
2023-07-30 21:33:18,369 INFO    SenderThread:3775545 [sender.py:send_request_defer():608] handle sender defer: 6
2023-07-30 21:33:18,370 INFO    SenderThread:3775545 [sender.py:transition_state():612] send defer: 7
2023-07-30 21:33:18,370 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: status_report
2023-07-30 21:33:18,370 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: defer
2023-07-30 21:33:18,370 INFO    HandlerThread:3775545 [handler.py:handle_request_defer():170] handle defer: 7
2023-07-30 21:33:18,371 DEBUG   SenderThread:3775545 [sender.py:send_request():406] send_request: defer
2023-07-30 21:33:18,371 INFO    SenderThread:3775545 [sender.py:send_request_defer():608] handle sender defer: 7
2023-07-30 21:33:18,371 INFO    SenderThread:3775545 [sender.py:transition_state():612] send defer: 8
2023-07-30 21:33:18,371 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: defer
2023-07-30 21:33:18,371 INFO    HandlerThread:3775545 [handler.py:handle_request_defer():170] handle defer: 8
2023-07-30 21:33:18,372 DEBUG   SenderThread:3775545 [sender.py:send_request():406] send_request: defer
2023-07-30 21:33:18,372 INFO    SenderThread:3775545 [sender.py:send_request_defer():608] handle sender defer: 8
2023-07-30 21:33:18,372 INFO    SenderThread:3775545 [job_builder.py:build():280] Attempting to build job artifact
2023-07-30 21:33:18,374 INFO    SenderThread:3775545 [job_builder.py:_get_source_type():389] is repo sourced job
2023-07-30 21:33:18,376 INFO    SenderThread:3775545 [job_builder.py:build():363] adding wandb-job metadata file
2023-07-30 21:33:18,380 INFO    SenderThread:3775545 [sender.py:transition_state():612] send defer: 9
2023-07-30 21:33:18,381 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: defer
2023-07-30 21:33:18,381 DEBUG   SenderThread:3775545 [sender.py:send():379] send: artifact
2023-07-30 21:33:18,381 INFO    HandlerThread:3775545 [handler.py:handle_request_defer():170] handle defer: 9
2023-07-30 21:33:18,871 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: poll_exit
2023-07-30 21:33:18,874 INFO    Thread-12 :3775545 [dir_watcher.py:_on_file_modified():289] file/dir modified: /ai/mnt/code/YOLOX/wandb/run-20230730_213245-42ltpe8i/files/config.yaml
2023-07-30 21:33:18,874 INFO    Thread-12 :3775545 [dir_watcher.py:_on_file_modified():289] file/dir modified: /ai/mnt/code/YOLOX/wandb/run-20230730_213245-42ltpe8i/files/wandb-summary.json
2023-07-30 21:33:19,913 ERROR   wandb-upload_1:3775545 [internal_api.py:upload_file():2290] upload_file exception https://storage.googleapis.com/wandb-artifacts-prod/wandb_artifacts/76361891/527263450/d2b2332392269358dec15852e007353c?Expires=1690810399&GoogleAccessId=gorilla-files-url-signer-man%40wandb-: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
2023-07-30 21:33:19,914 ERROR   wandb-upload_1:3775545 [internal_api.py:upload_file():2292] upload_file request headers: {'User-Agent': 'python-requests/2.29.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-MD5': '0rIzI5Imk1jewVhS4Ac1PA==', 'Content-Type': 'application/json', 'Content-Length': '10080'}
2023-07-30 21:33:19,914 ERROR   wandb-upload_1:3775545 [internal_api.py:upload_file():2294] upload_file response body: 
2023-07-30 21:33:20,338 INFO    wandb-upload_0:3775545 [upload_job.py:push():89] Uploaded file /root/.local/share/wandb/artifacts/staging/tmpwnmf_iz9
2023-07-30 21:33:21,197 ERROR   wandb-upload_1:3775545 [internal_api.py:upload_file():2290] upload_file exception https://storage.googleapis.com/wandb-artifacts-prod/wandb_artifacts/76361891/527263450/d2b2332392269358dec15852e007353c?Expires=1690810399&GoogleAccessId=gorilla-files-url-signer-man%40wandb-: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
2023-07-30 21:33:21,198 ERROR   wandb-upload_1:3775545 [internal_api.py:upload_file():2292] upload_file request headers: {'User-Agent': 'python-requests/2.29.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-MD5': '0rIzI5Imk1jewVhS4Ac1PA==', 'Content-Type': 'application/json', 'Content-Length': '10080'}
2023-07-30 21:33:21,198 ERROR   wandb-upload_1:3775545 [internal_api.py:upload_file():2294] upload_file response body: 
2023-07-30 21:33:23,494 ERROR   wandb-upload_1:3775545 [internal_api.py:upload_file():2290] upload_file exception https://storage.googleapis.com/wandb-artifacts-prod/wandb_artifacts/76361891/527263450/d2b2332392269358dec15852e007353c?Expires=1690810399&GoogleAccessId=gorilla-files-url-signer-man%40wandb-production.iam.gserviceaccount.com&Signature=X6HJ%2FHtwdMO8K9ht21fWYUr0NtQOY9yxvPcSRDxVaLmB%2BKZOCwqtaAikdVU8q5I%2BGWhRQLhSo1UV6mN4FLlQm2F2a%2BJ06VR5IsjsywA1iWDNMSh2LEl3aNO8edgmr%2FSl4nSKzB4zdTbXvTIQhHEMTWPRnm6miS%2BhfWF0cPYCXegQIjigHPXAI%2Bm3Xp08jlha8CEEJyF9rA80eEOU5abXcnUojNtY9oCaUkxveEKzCD45f7ZBkmzL4na66Zvpi41xfTAuKYADYAL9rrHIf4y%2Bs2hzeBb3SG5cU91brfw7NZylL9kj1%2Frt7eZvR9hUNVGM%2BAy9NaLFln8BYKwvAOXusQ%3D%3D: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
2023-07-30 21:33:23,495 ERROR   wandb-upload_1:3775545 [internal_api.py:upload_file():2292] upload_file request headers: {'User-Agent': 'python-requests/2.29.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-MD5': '0rIzI5Imk1jewVhS4Ac1PA==', 'Content-Type': 'application/json', 'Content-Length': '10080'}
2023-07-30 21:33:23,495 ERROR   wandb-upload_1:3775545 [internal_api.py:upload_file():2294] upload_file response body: 
2023-07-30 21:33:23,495 INFO    wandb-upload_1:3775545 [retry.py:__call__():172] Retry attempt failed:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/root/miniconda3/lib/python3.9/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/root/miniconda3/lib/python3.9/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/root/miniconda3/lib/python3.9/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/requests/adapters.py", line 487, in send
    resp = conn.urlopen(
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/root/miniconda3/lib/python3.9/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/root/miniconda3/lib/python3.9/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/root/miniconda3/lib/python3.9/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/root/miniconda3/lib/python3.9/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/wandb/sdk/internal/internal_api.py", line 2285, in upload_file
    response = self._upload_file_session.put(
  File "/root/miniconda3/lib/python3.9/site-packages/requests/sessions.py", line 647, in put
    return self.request("PUT", url, data=data, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/requests/adapters.py", line 502, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

2023-07-30 21:33:23,873 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: keepalive
2023-07-30 21:33:28,495 ERROR   wandb-upload_1:3775545 [internal_api.py:upload_file():2290] upload_file exception https://storage.googleapis.com/wandb-artifacts-prod/wandb_artifacts/76361891/527263450/d2b2332392269358dec15852e007353c?Expires=1690810399&GoogleAccessId=gorilla-files-url-signer-man%40wandb-production.iam.gserviceaccount.com&Signature=X6HJ%2FHtwdMO8K9ht21fWYUr0NtQOY9yxvPcSRDxVaLmB%2BKZOCwqtaAikdVU8q5I%2BGWhRQLhSo1UV6mN4FLlQm2F2a%2BJ06VR5IsjsywA1iWDNMSh2LEl3aNO8edgmr%2FSl4nSKzB4zdTbXvTIQhHEMTWPRnm6miS%2BhfWF0cPYCXegQIjigHPXAI%2Bm3Xp08jlha8CEEJyF9rA80eEOU5abXcnUojNtY9oCaUkxveEKzCD45f7ZBkmzL4na66Zvpi41xfTAuKYADYAL9rrHIf4y%2Bs2hzeBb3SG5cU91brfw7NZylL9kj1%2Frt7eZvR9hUNVGM%2BAy9NaLFln8BYKwvAOXusQ%3D%3D: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
2023-07-30 21:33:28,495 ERROR   wandb-upload_1:3775545 [internal_api.py:upload_file():2292] upload_file request headers: {'User-Agent': 'python-requests/2.29.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-MD5': '0rIzI5Imk1jewVhS4Ac1PA==', 'Content-Type': 'application/json', 'Content-Length': '10080'}
2023-07-30 21:33:28,495 ERROR   wandb-upload_1:3775545 [internal_api.py:upload_file():2294] upload_file response body: 
2023-07-30 21:33:28,875 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: keepalive
2023-07-30 21:33:33,878 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: keepalive
2023-07-30 21:33:38,552 ERROR   wandb-upload_1:3775545 [internal_api.py:upload_file():2290] upload_file exception https://storage.googleapis.com/wandb-artifacts-prod/wandb_artifacts/76361891/527263450/d2b2332392269358dec15852e007353c?Expires=1690810399&GoogleAccessId=gorilla-files-url-signer-man%40wandb-production.iam.gserviceaccount.com&Signature=X6HJ%2FHtwdMO8K9ht21fWYUr0NtQOY9yxvPcSRDxVaLmB%2BKZOCwqtaAikdVU8q5I%2BGWhRQLhSo1UV6mN4FLlQm2F2a%2BJ06VR5IsjsywA1iWDNMSh2LEl3aNO8edgmr%2FSl4nSKzB4zdTbXvTIQhHEMTWPRnm6miS%2BhfWF0cPYCXegQIjigHPXAI%2Bm3Xp08jlha8CEEJyF9rA80eEOU5abXcnUojNtY9oCaUkxveEKzCD45f7ZBkmzL4na66Zvpi41xfTAuKYADYAL9rrHIf4y%2Bs2hzeBb3SG5cU91brfw7NZylL9kj1%2Frt7eZvR9hUNVGM%2BAy9NaLFln8BYKwvAOXusQ%3D%3D: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
2023-07-30 21:33:38,554 ERROR   wandb-upload_1:3775545 [internal_api.py:upload_file():2292] upload_file request headers: {'User-Agent': 'python-requests/2.29.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-MD5': '0rIzI5Imk1jewVhS4Ac1PA==', 'Content-Type': 'application/json', 'Content-Length': '10080'}
2023-07-30 21:33:38,554 ERROR   wandb-upload_1:3775545 [internal_api.py:upload_file():2294] upload_file response body: 
2023-07-30 21:33:38,880 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: keepalive
2023-07-30 21:33:43,882 DEBUG   HandlerThread:3775545 [handler.py:handle_request():144] handle_request: keepalive

Thank you!

Hi @scottxu-wandb, could you let me know what version of wandb you are using?

Also, have you run any runs since this? It looks like the error is coming from the GCP API when we attempt to upload files to the bucket.
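
To see where the failures cluster, one option is to tally the upload_file exception lines in debug-internal.log by host and bucket. This is only a rough sketch: count_upload_errors and the regular expression are illustrative names of my own, not part of wandb, and the sample lines are shortened from the log above.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the "upload_file exception <url>" lines seen in debug-internal.log.
UPLOAD_ERROR = re.compile(r"upload_file exception (https?://\S+)")


def count_upload_errors(log_lines):
    """Count upload_file exceptions per (host, first path segment)."""
    counts = Counter()
    for line in log_lines:
        match = UPLOAD_ERROR.search(line)
        if match is None:
            continue
        parts = urlsplit(match.group(1))
        # The first path segment of these URLs is the storage bucket name.
        bucket = parts.path.lstrip("/").split("/", 1)[0]
        counts[(parts.hostname, bucket)] += 1
    return counts


# Sample lines shortened from the log in this thread (query strings trimmed).
sample = [
    "ERROR wandb-upload_0 [internal_api.py:upload_file():2290] upload_file exception "
    "https://storage.googleapis.com/wandb-production.appspot.com/scottxu-wandb/YOLOX/42ltpe8i/wandb-metadata.json?Expires=1690810372",
    "ERROR wandb-upload_1 [internal_api.py:upload_file():2290] upload_file exception "
    "https://storage.googleapis.com/wandb-artifacts-prod/wandb_artifacts/76361891/527263450/d2b2332392269358dec15852e007353c?Expires=1690810399",
    "DEBUG HandlerThread [handler.py:handle_request():144] handle_request: keepalive",
]

for (host, bucket), n in count_upload_errors(sample).most_common():
    print(f"{n}  {host}/{bucket}")
```

To run it on a real log, pass an open file object instead of the sample list, since the function only needs an iterable of lines.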

From the looks of it, it seems like we may need to catch this error and add a retry if this happens.
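
For reference, the general "catch and retry" pattern looks roughly like the sketch below. It is not wandb's implementation (the log already shows retry.py making retry attempts); retry_on_reset and its parameters are hypothetical names used only for illustration. Note that requests.exceptions.ConnectionError does not derive from Python's built-in ConnectionError, so it would have to be passed in through retry_on explicitly.

```python
import random
import time


def retry_on_reset(func, attempts=5, base_delay=1.0, max_delay=30.0,
                   retry_on=(ConnectionError,), sleep=time.sleep):
    """Call func(); on a connection error, back off exponentially and retry.

    The built-in ConnectionError covers ConnectionResetError (errno 104).
    The last failure is re-raised once `attempts` calls have been made.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            # Exponential backoff, capped at max_delay, plus some jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

A caller would wrap the upload in a zero-argument callable, for example retry_on_reset(lambda: do_upload(path)), where do_upload stands in for whatever performs the request. Retrying only helps with transient resets; if every attempt is reset, as in this log, the cause is more likely somewhere on the network path.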

Hi, thanks for the reply. I have tried 0.15.3 and 0.15.7, and both behave the same. Also, since this error appeared, I have tried several more runs, and they all showed the same error.

Thanks @scottxu-wandb. Are you going through a proxy or anything out of the ordinary on your network?
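
A quick way to check for a proxy from inside the same Python environment is to look at what the standard library picks up. This is a small sketch under my own assumptions: proxy_settings is a hypothetical helper name, and urllib.request.getproxies() only reports environment variables and OS-level settings, so a transparent proxy or firewall on the network would not show up here.

```python
import os
import urllib.request


def proxy_settings():
    """Return the proxy-related settings visible to this Python process."""
    # Maps scheme -> proxy URL, taken from *_proxy environment variables
    # (and from OS-level settings on some platforms).
    proxies = urllib.request.getproxies()
    return {
        "http": proxies.get("http"),
        "https": proxies.get("https"),
        "no_proxy": os.environ.get("NO_PROXY") or os.environ.get("no_proxy"),
    }


print(proxy_settings())
```

If all three values come back as None, no proxy is configured through the environment for this process.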

No, I do not use any proxy, and I can ping wandb.ai.

Hi @scottxu-wandb,

Following up on this thread. Your debug log shows a persistent failure in the file upload process caused by connection reset errors. The same error propagates up through the networking stack, from urllib3 to requests to the wandb library, and the tracebacks show it being raised during the TLS handshake (do_handshake). This suggests that the issue is related either to the server's response to the upload requests or to other networking issues affecting communication between the client and the server. The common thread is that it happens every time you attempt to upload files.

Could you share an example file/folder you are attempting to upload (send to mohammad.bakir@wandb.com) and a code snippet showing how you are uploading to wandb? Thanks
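
Since the tracebacks fail inside do_handshake, a successful ping is not conclusive: ping does not exercise TCP port 443 or TLS. A standalone check of the handshake itself could look like the sketch below. It uses only the Python standard library; tls_handshake_ok is a hypothetical helper name, and the timeout value is an arbitrary choice.

```python
import socket
import ssl


def tls_handshake_ok(host, port=443, timeout=10.0):
    """Attempt a TCP connect plus TLS handshake; return (ok, detail)."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # wrap_socket performs the handshake, the same step that raised
            # ConnectionResetError in the tracebacks above.
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.version()
    except OSError as exc:
        # OSError covers ConnectionResetError, ssl.SSLError, DNS failures
        # and timeouts.
        return False, f"{type(exc).__name__}: {exc}"
```

Calling tls_handshake_ok("storage.googleapis.com") from the training machine and comparing it with tls_handshake_ok("wandb.ai") would show whether resets are specific to the storage host that appears in the failing upload URLs.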

Hi @scottxu-wandb, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!
