Wandb: ERROR Failed to sample metric: psutil.NoSuchProcess process no longer exists (pid=453)

I am training some NLP models and using wandb to log the errors during training. I receive the following error while logging:

wandb: ERROR Failed to sample metric: psutil.NoSuchProcess process no longer exists (pid=453)

I appreciate your help in fixing it.

Hi @faizelkhan-umn, happy to help you look into this but we will need additional info. Could you please provide the following:

  • A brief description of your experiment setup and any integrations you are using. Please also describe the structure of your runs, including whether you are running anything in parallel or using multiple GPUs.
  • Complete traceback of your error
  • The debug.log and debug-internal.log files for the crashing runs. These are found in the project's working directory, under wandb, inside the folder for the specific run.
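
If it helps with gathering the debug log files, here is a minimal Python sketch that lists them per run. It assumes the logs sit somewhere below `./wandb/run-*` in the project's working directory (the exact subfolder may differ between wandb versions, so it searches recursively). The helper name `find_debug_logs` and its parameters are made up for this example and are not part of wandb.

```python
from pathlib import Path


def find_debug_logs(project_dir=".", run_glob="run-*"):
    """Map each wandb run folder name to the debug log files found inside it.

    Hypothetical helper: assumes run folders match `run_glob` under
    `<project_dir>/wandb` and that log files are named debug*.log.
    """
    wandb_dir = Path(project_dir) / "wandb"
    if not wandb_dir.is_dir():
        return {}

    found = {}
    # Run folder names start with a timestamp, so reverse name order
    # puts the most recent run first.
    for run_dir in sorted(wandb_dir.glob(run_glob), reverse=True):
        if not run_dir.is_dir():
            continue
        # Search recursively and compare names case-insensitively.
        logs = sorted(
            p
            for p in run_dir.rglob("*")
            if p.is_file()
            and p.name.lower().startswith("debug")
            and p.suffix.lower() == ".log"
        )
        if logs:
            found[run_dir.name] = logs
    return found


if __name__ == "__main__":
    for run_name, logs in find_debug_logs(".").items():
        print(run_name)
        for log_path in logs:
            print("   ", log_path)
```

Run it from the directory where the training script was launched, then attach the files listed for the crashing run.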

Hi @faizelkhan-umn, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!

I don’t know if @faizelkhan-umn fixed his issue or not, but I’m also facing the same issue.

I’m not using any special integrations, or multiple GPUs.

Here’s the trace of my error:

WARNING:root:Failed to import geometry msgs in rigid_transformations.py.
WARNING:root:Failed to import ros dependencies in rigid_transforms.py
WARNING:root:autolab_core not installed as catkin package, RigidTransform ros methods will be unavailable
wandb: Currently logged in as: ******. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.13.5
wandb: Run data is saved locally in ./wandb/run-20221116_134736-3a4w8w3w
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run wandering-music-644
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
  (encoder): Sequential(
    (0): Linear(in_features=8, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=256, bias=True)
    (5): ReLU()
    (6): Linear(in_features=256, out_features=2, bias=True)
  (decoder): Sequential(
    (0): Linear(in_features=6, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=256, bias=True)
    (5): ReLU()
    (6): Linear(in_features=256, out_features=4, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
Loading `train_dataloader` to estimate number of stepping batches.
/trainer/connectors/data_connector.py:236: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.

  | Name    | Type       | Params
0 | encoder | Sequential | 265 K
1 | decoder | Sequential | 265 K
2 | dropout | Dropout    | 0
531 K     Trainable params
0         Non-trainable params
531 K     Total params
2.126     Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]
/home3/shivam/miniconda3/envs/l_a/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:236: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
Sanity Checking DataLoader 0:   0%|                       | 0/2 [00:00<?, ?it/s]
wandb: ERROR Failed to sample metric: process no longer exists (pid=805480)
Exception in thread MsgRouterThr:
Traceback (most recent call last):
  File "/home3/shivam/miniconda3/envs/l_a/lib/python3.10/threading.py", line 1016, in _bootstrap_inner

As for the debug outputs:

Thanks for your help.

