Wandb: ERROR Failed to sample metric: psutil.NoSuchProcess process no longer exists (pid=453)

I am running some NLP models and using wandb to log the errors during training. I receive the following error while logging:

wandb: ERROR Failed to sample metric: psutil.NoSuchProcess process no longer exists (pid=453)

I appreciate your help in fixing it.

Hi @faizelkhan-umn, happy to help you look into this but we will need additional info. Could you please provide the following:

  • A brief description of your experiment setup and which integrations, if any, you are using. Please also describe the structure of your runs, including whether you are running anything in parallel or using multiple GPUs.
  • The complete traceback of your error.
  • The debug.log and debug-internal.log files for the crashing runs. These are in the project's working directory, under wandb, inside the folder for the specific run.
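
If it helps to locate those files, here is a minimal sketch, assuming the default layout in which each run lives in a `wandb/run-<timestamp>-<id>/logs/` folder under the project directory. The function name `find_wandb_debug_logs` is hypothetical (it is not part of wandb), and runs stored under a different folder name are not covered.

```python
from pathlib import Path


def find_wandb_debug_logs(project_dir="."):
    """Return debug.log / debug-internal.log paths, newest run first.

    Assumes the layout <project_dir>/wandb/run-<timestamp>-<id>/logs/.
    """
    wandb_dir = Path(project_dir) / "wandb"
    if not wandb_dir.is_dir():
        return []

    logs = []
    # Run folder names start with a timestamp, so a reverse
    # lexicographic sort puts the most recent run first.
    for run_dir in sorted(wandb_dir.glob("run-*"), reverse=True):
        for name in ("debug.log", "debug-internal.log"):
            candidate = run_dir / "logs" / name
            if candidate.is_file():
                logs.append(candidate)
    return logs


if __name__ == "__main__":
    for path in find_wandb_debug_logs("."):
        print(path)
```

Run it from the directory you launched training from; the first two paths printed should belong to the most recent run.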

Hi @faizelkhan-umn, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!

I don’t know if @faizelkhan-umn fixed his issue or not, but I’m also facing the same issue.

I’m not using any special integrations, or multiple GPUs.

Here’s the trace of my error:

WARNING:root:Failed to import geometry msgs in rigid_transformations.py.
WARNING:root:Failed to import ros dependencies in rigid_transforms.py
WARNING:root:autolab_core not installed as catkin package, RigidTransform ros methods will be unavailable
wandb: Currently logged in as: ******. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.13.5
wandb: Run data is saved locally in ./wandb/run-20221116_134736-3a4w8w3w
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run wandering-music-644
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
ConditionalAE(
  (encoder): Sequential(
    (0): Linear(in_features=8, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=256, bias=True)
    (5): ReLU()
    (6): Linear(in_features=256, out_features=2, bias=True)
  )
  (decoder): Sequential(
    (0): Linear(in_features=6, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=256, bias=True)
    (5): ReLU()
    (6): Linear(in_features=256, out_features=4, bias=True)
  )
  (dropout): Dropout(p=0.5, inplace=False)
)
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
Loading `train_dataloader` to estimate number of stepping batches.
/home3/shivam/miniconda3/envs/l_a/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:236: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(

  | Name    | Type       | Params
---------------------------------------
0 | encoder | Sequential | 265 K
1 | decoder | Sequential | 265 K
2 | dropout | Dropout    | 0
---------------------------------------
531 K     Trainable params
0         Non-trainable params
531 K     Total params
2.126     Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/home3/shivam/miniconda3/envs/l_a/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:236: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Sanity Checking DataLoader 0:   0%|                       | 0/2 [00:00<?, ?it/s]
Killed
wandb: ERROR Failed to sample metric: process no longer exists (pid=805480)
Exception in thread MsgRouterThr:
Traceback (most recent call last):
  File "/home3/shivam/miniconda3/envs/l_a/lib/python3.10/threading.py", line 1016, in _bootstrap_inner

As for the debug outputs:
debug.log

2022-11-16 13:47:36,086 INFO    MainThread:805480 [wandb_setup.py:_flush():68] Configure stats pid to 805480
2022-11-16 13:47:36,087 INFO    MainThread:805480 [wandb_setup.py:_flush():68] Loading settings from /home3/shivam/.config/wandb/settings
2022-11-16 13:47:36,087 INFO    MainThread:805480 [wandb_setup.py:_flush():68] Loading settings from /home3/shivam/latent-actions/wandb/settings
2022-11-16 13:47:36,087 INFO    MainThread:805480 [wandb_setup.py:_flush():68] Loading settings from environment variables: {'_require_service': 'True'}
2022-11-16 13:47:36,087 INFO    MainThread:805480 [wandb_setup.py:_flush():68] Inferring run settings from compute environment: {'program_relpath': 'train.py', 'program': '/home3/shivam/latent-actions/train.py'}
2022-11-16 13:47:36,087 INFO    MainThread:805480 [wandb_init.py:_log_setup():476] Logging user logs to ./wandb/run-20221116_134736-3a4w8w3w/logs/debug.log
2022-11-16 13:47:36,087 INFO    MainThread:805480 [wandb_init.py:_log_setup():477] Logging internal logs to ./wandb/run-20221116_134736-3a4w8w3w/logs/debug-internal.log
2022-11-16 13:47:36,087 INFO    MainThread:805480 [wandb_init.py:init():516] calling init triggers
2022-11-16 13:47:36,087 INFO    MainThread:805480 [wandb_init.py:init():519] wandb.init called with sweep_config: {}
config: {}
2022-11-16 13:47:36,088 INFO    MainThread:805480 [wandb_init.py:init():569] starting backend
2022-11-16 13:47:36,088 INFO    MainThread:805480 [wandb_init.py:init():573] setting up manager
2022-11-16 13:47:36,091 INFO    MainThread:805480 [backend.py:_multiprocessing_setup():102] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2022-11-16 13:47:36,096 INFO    MainThread:805480 [wandb_init.py:init():580] backend started and connected
2022-11-16 13:47:36,099 INFO    MainThread:805480 [wandb_init.py:init():658] updated telemetry
2022-11-16 13:47:36,104 INFO    MainThread:805480 [wandb_init.py:init():693] communicating run to backend with 60 second timeout
2022-11-16 13:47:36,367 INFO    MainThread:805480 [wandb_run.py:_on_init():2000] communicating current version
2022-11-16 13:47:36,409 INFO    MainThread:805480 [wandb_run.py:_on_init():2004] got version response 
2022-11-16 13:47:36,409 INFO    MainThread:805480 [wandb_init.py:init():728] starting run threads in backend
2022-11-16 13:47:38,000 INFO    MainThread:805480 [wandb_run.py:_console_start():1980] atexit reg
2022-11-16 13:47:38,000 INFO    MainThread:805480 [wandb_run.py:_redirect():1838] redirect: SettingsConsole.WRAP_RAW
2022-11-16 13:47:38,000 INFO    MainThread:805480 [wandb_run.py:_redirect():1903] Wrapping output streams.
2022-11-16 13:47:38,000 INFO    MainThread:805480 [wandb_run.py:_redirect():1925] Redirects installed.
2022-11-16 13:47:38,001 INFO    MainThread:805480 [wandb_init.py:init():765] run started, returning control to user process
2022-11-16 13:47:42,335 INFO    MainThread:805480 [wandb_run.py:_config_callback():1160] config_cb None None {'total_parameters': 531462, 'trainable_parameters': 531462, 'dataset_size': 1976600}
2022-11-16 13:47:42,731 INFO    MainThread:805480 [wandb_run.py:_config_callback():1160] config_cb None None {'latent_dim': 2, 'enc_dims': [256, 512, 256], 'dec_dims': [256, 512, 256], 'lr': 0.01, 'kl_coeff': 1.0, 'kl_schedule': 'cyclical', 'activation': 'relu', 'context_dim': 4, 'action_dim': 4, 'include_joint_angles': False, 'fixed_point_coeff': 0.001, 'dropout': 0.5, 'compute_divergence': False, 'div_coeff': 0, 'div_clip': inf, 'decode': True, 'align': False, 'model_class': 'cAE', 'batch_size': 64, 'max_epochs': 100, 'no_wandb': False, 'data_path': 'data/rpnp_traj_with_pose_and_noise.pkl', 'exclude_context_feature_joint_angles': True, 'exclude_context_feature_gripper_width': False, 'exclude_context_feature_pose': True, 'exclude_context_feature_ee_pos': False, 'exclude_context_feature_ee_rot': True, 'exclude_gripper': False, 'action_space': 'ee', 'size_limit': 'None'}

debug-internal.log

2022-11-16 13:47:36,096 INFO    StreamThr :824981 [internal.py:wandb_internal():88] W&B internal server running at pid: 824981, started at: 2022-11-16 13:47:36.096186
2022-11-16 13:47:36,105 DEBUG   HandlerThread:824981 [handler.py:handle_request():139] handle_request: status
2022-11-16 13:47:36,105 DEBUG   SenderThread:824981 [sender.py:send_request():317] send_request: status
2022-11-16 13:47:36,106 INFO    WriterThread:824981 [datastore.py:open_for_write():75] open: ./wandb/run-20221116_134736-3a4w8w3w/run-3a4w8w3w.wandb
2022-11-16 13:47:36,106 DEBUG   SenderThread:824981 [sender.py:send():303] send: header
2022-11-16 13:47:36,107 DEBUG   SenderThread:824981 [sender.py:send():303] send: run
2022-11-16 13:47:36,108 INFO    SenderThread:824981 [sender.py:_maybe_setup_resume():593] checking resume status for ucla-ncel-robotics/latent-action/3a4w8w3w
2022-11-16 13:47:36,368 DEBUG   HandlerThread:824981 [handler.py:handle_request():139] handle_request: check_version
2022-11-16 13:47:36,374 INFO    SenderThread:824981 [dir_watcher.py:__init__():216] watching files in: ./wandb/run-20221116_134736-3a4w8w3w/files
2022-11-16 13:47:36,375 INFO    SenderThread:824981 [sender.py:_start_run_threads():928] run started: 3a4w8w3w with start time 1668635256.096683
2022-11-16 13:47:36,375 DEBUG   SenderThread:824981 [sender.py:send():303] send: summary
2022-11-16 13:47:36,375 INFO    SenderThread:824981 [sender.py:_save_file():1171] saving file wandb-summary.json with policy end
2022-11-16 13:47:36,375 DEBUG   SenderThread:824981 [sender.py:send_request():317] send_request: check_version
2022-11-16 13:47:36,414 DEBUG   HandlerThread:824981 [handler.py:handle_request():139] handle_request: run_start
2022-11-16 13:47:36,425 DEBUG   HandlerThread:824981 [system_info.py:__init__():31] System info init
2022-11-16 13:47:36,425 DEBUG   HandlerThread:824981 [system_info.py:__init__():46] System info init done
2022-11-16 13:47:36,425 INFO    HandlerThread:824981 [system_monitor.py:start():150] Starting system monitor
2022-11-16 13:47:36,425 INFO    SystemMonitor:824981 [system_monitor.py:_start():116] Starting system asset monitoring threads
2022-11-16 13:47:36,425 INFO    SystemMonitor:824981 [interfaces.py:start():168] Started cpu
2022-11-16 13:47:36,425 INFO    HandlerThread:824981 [system_monitor.py:probe():168] Collecting system info
2022-11-16 13:47:36,426 INFO    SystemMonitor:824981 [interfaces.py:start():168] Started disk
2022-11-16 13:47:36,426 INFO    SystemMonitor:824981 [interfaces.py:start():168] Started gpu
2022-11-16 13:47:36,427 INFO    SystemMonitor:824981 [interfaces.py:start():168] Started memory
2022-11-16 13:47:36,427 INFO    SystemMonitor:824981 [interfaces.py:start():168] Started network
2022-11-16 13:47:36,465 DEBUG   HandlerThread:824981 [system_info.py:probe():195] Probing system
2022-11-16 13:47:36,467 DEBUG   HandlerThread:824981 [system_info.py:_probe_git():180] Probing git
2022-11-16 13:47:36,471 DEBUG   HandlerThread:824981 [system_info.py:_probe_git():188] Probing git done
2022-11-16 13:47:36,471 DEBUG   HandlerThread:824981 [system_info.py:probe():241] Probing system done
2022-11-16 13:47:36,471 DEBUG   HandlerThread:824981 [system_monitor.py:probe():177] {'os': 'Linux-5.15.0-48-generic-x86_64-with-glibc2.31', 'python': '3.10.6', 'heartbeatAt': '2022-11-16T21:47:36.465175', 'startedAt': '2022-11-16T21:47:36.083219', 'docker': None, 'cuda': None, 'args': ('--decode', '--model_class', 'cAE', '--max_epochs', '100', '--data_path', 'data/rpnp_traj_with_pose_and_noise.pkl', '--action_space', 'ee', '--enc_dims', '256', '512', '256', '--dec_dims', '256', '512', '256', '--exclude_context_feature_pose', '--exclude_context_feature_joint_angles', '--exclude_context_feature_ee_rot', '--dropout', '0.5', '--fixed_point_coeff', '0.001', '--batch_size', '64'), 'state': 'running', 'program': '/home3/shivam/latent-actions/train.py', 'codePath': 'train.py', 'git': {'remote': 'https://github.com/shivampatel712/latent-actions', 'commit': '1d68621ead248ab96deb84c6420a39aca3d9081c'}, 'email': 'shivambpatel712@gmail.com', 'root': '/home3/shivam/latent-actions', 'host': 'obiwan', 'username': 'shivam', 'executable': '/home3/shivam/miniconda3/envs/l_a/bin/python', 'cpu_count': 24, 'cpu_count_logical': 48, 'cpu_freq': {'current': 4000.9778541666674, 'min': 2200.0, 'max': 3800.0}, 'cpu_freq_per_core': [{'current': 4006.182, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.044, 'min': 2200.0, 'max': 3800.0}, {'current': 4005.846, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.476, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.603, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.54, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.326, 'min': 2200.0, 'max': 3800.0}, {'current': 3800.0, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.215, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.614, 'min': 2200.0, 'max': 3800.0}, {'current': 4004.054, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.672, 'min': 2200.0, 'max': 3800.0}, {'current': 4005.822, 'min': 2200.0, 'max': 3800.0}, {'current': 4005.662, 'min': 2200.0, 'max': 3800.0}, {'current': 3996.473, 'min': 
2200.0, 'max': 3800.0}, {'current': 4007.573, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.093, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.465, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.39, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.244, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.625, 'min': 2200.0, 'max': 3800.0}, {'current': 4004.173, 'min': 2200.0, 'max': 3800.0}, {'current': 4004.033, 'min': 2200.0, 'max': 3800.0}, {'current': 4004.174, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.828, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.708, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.514, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.304, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.476, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.415, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.905, 'min': 2200.0, 'max': 3800.0}, {'current': 4005.451, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.746, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.435, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.914, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.545, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.094, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.284, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.247, 'min': 2200.0, 'max': 3800.0}, {'current': 4008.135, 'min': 2200.0, 'max': 3800.0}, {'current': 3996.988, 'min': 2200.0, 'max': 3800.0}, {'current': 4008.027, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.263, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.791, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.211, 'min': 2200.0, 'max': 3800.0}, {'current': 4004.831, 'min': 2200.0, 'max': 3800.0}, {'current': 4004.694, 'min': 2200.0, 'max': 3800.0}, {'current': 4004.832, 'min': 2200.0, 'max': 3800.0}], 'disk': {'total': 915.3232879638672, 'used': 52.240882873535156}, 'gpu': 'NVIDIA GeForce RTX 3090', 'gpu_count': 3, 'gpu_devices': [{'name': 'NVIDIA GeForce RTX 3090', 'memory_total': 25447170048}, {'name': 'NVIDIA GeForce RTX 
3090', 'memory_total': 25447170048}, {'name': 'NVIDIA GeForce RTX 3090', 'memory_total': 25438322688}], 'memory': {'total': 125.64818572998047}}
2022-11-16 13:47:36,471 INFO    HandlerThread:824981 [system_monitor.py:probe():178] Finished collecting system info
2022-11-16 13:47:36,471 INFO    HandlerThread:824981 [system_monitor.py:probe():181] Publishing system info
2022-11-16 13:47:36,471 DEBUG   HandlerThread:824981 [system_info.py:_save_pip():51] Saving list of pip packages installed into the current environment
2022-11-16 13:47:36,471 DEBUG   HandlerThread:824981 [system_info.py:_save_pip():67] Saving pip packages done
2022-11-16 13:47:36,472 DEBUG   HandlerThread:824981 [system_info.py:_save_conda():74] Saving list of conda packages installed into the current environment
2022-11-16 13:47:37,375 INFO    Thread-13 :824981 [dir_watcher.py:_on_file_created():275] file/dir created: ./wandb/run-20221116_134736-3a4w8w3w/files/wandb-summary.json
2022-11-16 13:47:37,376 INFO    Thread-13 :824981 [dir_watcher.py:_on_file_created():275] file/dir created: ./wandb/run-20221116_134736-3a4w8w3w/files/requirements.txt
2022-11-16 13:47:37,376 INFO    Thread-13 :824981 [dir_watcher.py:_on_file_created():275] file/dir created: ./wandb/run-20221116_134736-3a4w8w3w/files/conda-environment.yaml
2022-11-16 13:47:37,926 DEBUG   HandlerThread:824981 [system_info.py:_save_conda():86] Saving conda packages done
2022-11-16 13:47:37,926 INFO    HandlerThread:824981 [system_monitor.py:probe():183] Finished publishing system info
2022-11-16 13:47:37,995 DEBUG   SenderThread:824981 [sender.py:send():303] send: files
2022-11-16 13:47:37,996 INFO    SenderThread:824981 [sender.py:_save_file():1171] saving file wandb-metadata.json with policy now
2022-11-16 13:47:38,000 DEBUG   HandlerThread:824981 [handler.py:handle_request():139] handle_request: stop_status
2022-11-16 13:47:38,000 DEBUG   SenderThread:824981 [sender.py:send_request():317] send_request: stop_status
2022-11-16 13:47:38,088 DEBUG   SenderThread:824981 [sender.py:send():303] send: telemetry
2022-11-16 13:47:38,088 DEBUG   SenderThread:824981 [sender.py:send():303] send: metric
2022-11-16 13:47:38,088 DEBUG   SenderThread:824981 [sender.py:send():303] send: telemetry
2022-11-16 13:47:38,088 DEBUG   SenderThread:824981 [sender.py:send():303] send: metric
2022-11-16 13:47:38,088 WARNING SenderThread:824981 [sender.py:send_metric():1127] Seen metric with glob (shouldn't happen)
2022-11-16 13:47:38,346 INFO    Thread-16 :824981 [upload_job.py:push():143] Uploaded file /tmp/tmp10ff2xhawandb/byh38mac-wandb-metadata.json
2022-11-16 13:47:38,375 INFO    Thread-13 :824981 [dir_watcher.py:_on_file_modified():292] file/dir modified: ./wandb/run-20221116_134736-3a4w8w3w/files/conda-environment.yaml
2022-11-16 13:47:38,375 INFO    Thread-13 :824981 [dir_watcher.py:_on_file_created():275] file/dir created: ./wandb/run-20221116_134736-3a4w8w3w/files/wandb-metadata.json
2022-11-16 13:47:41,387 INFO    Thread-13 :824981 [dir_watcher.py:_on_file_created():275] file/dir created: ./wandb/run-20221116_134736-3a4w8w3w/files/output.log
2022-11-16 13:47:42,336 DEBUG   SenderThread:824981 [sender.py:send():303] send: config
2022-11-16 13:47:42,732 DEBUG   SenderThread:824981 [sender.py:send():303] send: config
2022-11-16 13:47:43,403 INFO    Thread-13 :824981 [dir_watcher.py:_on_file_modified():292] file/dir modified: ./wandb/run-20221116_134736-3a4w8w3w/files/output.log
2022-11-16 13:48:53,773 INFO    Thread-13 :824981 [dir_watcher.py:_on_file_modified():292] file/dir modified: ./wandb/run-20221116_134736-3a4w8w3w/files/output.log
2022-11-16 13:48:53,775 DEBUG   SystemMonitor:824981 [system_monitor.py:_start():130] Starting system metrics aggregation loop
2022-11-16 13:48:54,755 WARNING StreamThr :824981 [internal.py:is_dead():385] Internal process exiting, parent pid 805480 disappeared
2022-11-16 13:48:54,755 ERROR   StreamThr :824981 [internal.py:wandb_internal():147] Internal process shutdown.
2022-11-16 13:48:54,773 INFO    Thread-13 :824981 [dir_watcher.py:_on_file_modified():292] file/dir modified: ./wandb/run-20221116_134736-3a4w8w3w/files/config.yaml
2022-11-16 13:48:54,887 INFO    SenderThread:824981 [sender.py:finish():1331] shutting down sender
2022-11-16 13:48:55,490 INFO    HandlerThread:824981 [handler.py:finish():814] shutting down handler
2022-11-16 13:48:55,739 INFO    WriterThread:824981 [datastore.py:close():279] close: ./wandb/run-20221116_134736-3a4w8w3w/run-3a4w8w3w.wandb
2022-11-16 13:48:56,227 INFO    SenderThread:824981 [dir_watcher.py:finish():362] shutting down directory watcher
2022-11-16 13:48:56,769 INFO    MainThread:824981 [internal.py:handle_exit():78] Internal process exited

Thanks for your help.
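
For anyone triaging a log like the debug-internal.log above, here is a small sketch that pulls out only the WARNING and ERROR entries. It assumes each entry follows the `timestamp LEVEL thread:pid [file:function():line] message` shape seen in this thread; the names `LOG_LINE` and `warnings_and_errors` are made up for illustration and are not part of wandb. Lines that do not match that shape, such as wrapped continuation lines, are skipped.

```python
import re

# Matches entries shaped like the ones in this thread, e.g.
# "2022-11-16 13:48:54,755 WARNING StreamThr :824981 [internal.py:is_dead():385] ..."
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<thread>.+?)\s*:(?P<pid>\d+)\s+"
    r"\[(?P<where>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)


def warnings_and_errors(log_text):
    """Return (time, level, message) for every WARNING or ERROR entry."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match and match.group("level") in ("WARNING", "ERROR"):
            hits.append(
                (match.group("time"), match.group("level"), match.group("message"))
            )
    return hits


# Three entries copied from the debug-internal.log in this thread.
SAMPLE = """\
2022-11-16 13:48:53,775 DEBUG   SystemMonitor:824981 [system_monitor.py:_start():130] Starting system metrics aggregation loop
2022-11-16 13:48:54,755 WARNING StreamThr :824981 [internal.py:is_dead():385] Internal process exiting, parent pid 805480 disappeared
2022-11-16 13:48:54,755 ERROR   StreamThr :824981 [internal.py:wandb_internal():147] Internal process shutdown.
"""

if __name__ == "__main__":
    for time, level, message in warnings_and_errors(SAMPLE):
        print(time, level, message)
```

On the sample, this leaves the two entries logged at 13:48:54. The WARNING reports that parent pid 805480 disappeared, which is the same pid named in the console error and appears right after `Killed` in the console output.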