I don’t know whether @faizelkhan-umn fixed his issue, but I’m facing the same one.
I’m not using any special integrations or multiple GPUs.
Here’s the trace of my error:
WARNING:root:Failed to import geometry msgs in rigid_transformations.py.
WARNING:root:Failed to import ros dependencies in rigid_transforms.py
WARNING:root:autolab_core not installed as catkin package, RigidTransform ros methods will be unavailable
wandb: Currently logged in as: ******. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.13.5
wandb: Run data is saved locally in ./wandb/run-20221116_134736-3a4w8w3w
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run wandering-music-644
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
ConditionalAE(
  (encoder): Sequential(
    (0): Linear(in_features=8, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=256, bias=True)
    (5): ReLU()
    (6): Linear(in_features=256, out_features=2, bias=True)
  )
  (decoder): Sequential(
    (0): Linear(in_features=6, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=256, bias=True)
    (5): ReLU()
    (6): Linear(in_features=256, out_features=4, bias=True)
  )
  (dropout): Dropout(p=0.5, inplace=False)
)
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
Loading `train_dataloader` to estimate number of stepping batches.
/home3/shivam/miniconda3/envs/l_a/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:236: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
  | Name    | Type       | Params
---------------------------------------
0 | encoder | Sequential | 265 K
1 | decoder | Sequential | 265 K
2 | dropout | Dropout    | 0
---------------------------------------
531 K     Trainable params
0         Non-trainable params
531 K     Total params
2.126     Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/home3/shivam/miniconda3/envs/l_a/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:236: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
Killed
wandb: ERROR Failed to sample metric: process no longer exists (pid=805480)
Exception in thread MsgRouterThr:
Traceback (most recent call last):
  File "/home3/shivam/miniconda3/envs/l_a/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
As for the debug outputs:
debug.log
2022-11-16 13:47:36,086 INFO MainThread:805480 [wandb_setup.py:_flush():68] Configure stats pid to 805480
2022-11-16 13:47:36,087 INFO MainThread:805480 [wandb_setup.py:_flush():68] Loading settings from /home3/shivam/.config/wandb/settings
2022-11-16 13:47:36,087 INFO MainThread:805480 [wandb_setup.py:_flush():68] Loading settings from /home3/shivam/latent-actions/wandb/settings
2022-11-16 13:47:36,087 INFO MainThread:805480 [wandb_setup.py:_flush():68] Loading settings from environment variables: {'_require_service': 'True'}
2022-11-16 13:47:36,087 INFO MainThread:805480 [wandb_setup.py:_flush():68] Inferring run settings from compute environment: {'program_relpath': 'train.py', 'program': '/home3/shivam/latent-actions/train.py'}
2022-11-16 13:47:36,087 INFO MainThread:805480 [wandb_init.py:_log_setup():476] Logging user logs to ./wandb/run-20221116_134736-3a4w8w3w/logs/debug.log
2022-11-16 13:47:36,087 INFO MainThread:805480 [wandb_init.py:_log_setup():477] Logging internal logs to ./wandb/run-20221116_134736-3a4w8w3w/logs/debug-internal.log
2022-11-16 13:47:36,087 INFO MainThread:805480 [wandb_init.py:init():516] calling init triggers
2022-11-16 13:47:36,087 INFO MainThread:805480 [wandb_init.py:init():519] wandb.init called with sweep_config: {}
config: {}
2022-11-16 13:47:36,088 INFO MainThread:805480 [wandb_init.py:init():569] starting backend
2022-11-16 13:47:36,088 INFO MainThread:805480 [wandb_init.py:init():573] setting up manager
2022-11-16 13:47:36,091 INFO MainThread:805480 [backend.py:_multiprocessing_setup():102] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2022-11-16 13:47:36,096 INFO MainThread:805480 [wandb_init.py:init():580] backend started and connected
2022-11-16 13:47:36,099 INFO MainThread:805480 [wandb_init.py:init():658] updated telemetry
2022-11-16 13:47:36,104 INFO MainThread:805480 [wandb_init.py:init():693] communicating run to backend with 60 second timeout
2022-11-16 13:47:36,367 INFO MainThread:805480 [wandb_run.py:_on_init():2000] communicating current version
2022-11-16 13:47:36,409 INFO MainThread:805480 [wandb_run.py:_on_init():2004] got version response
2022-11-16 13:47:36,409 INFO MainThread:805480 [wandb_init.py:init():728] starting run threads in backend
2022-11-16 13:47:38,000 INFO MainThread:805480 [wandb_run.py:_console_start():1980] atexit reg
2022-11-16 13:47:38,000 INFO MainThread:805480 [wandb_run.py:_redirect():1838] redirect: SettingsConsole.WRAP_RAW
2022-11-16 13:47:38,000 INFO MainThread:805480 [wandb_run.py:_redirect():1903] Wrapping output streams.
2022-11-16 13:47:38,000 INFO MainThread:805480 [wandb_run.py:_redirect():1925] Redirects installed.
2022-11-16 13:47:38,001 INFO MainThread:805480 [wandb_init.py:init():765] run started, returning control to user process
2022-11-16 13:47:42,335 INFO MainThread:805480 [wandb_run.py:_config_callback():1160] config_cb None None {'total_parameters': 531462, 'trainable_parameters': 531462, 'dataset_size': 1976600}
2022-11-16 13:47:42,731 INFO MainThread:805480 [wandb_run.py:_config_callback():1160] config_cb None None {'latent_dim': 2, 'enc_dims': [256, 512, 256], 'dec_dims': [256, 512, 256], 'lr': 0.01, 'kl_coeff': 1.0, 'kl_schedule': 'cyclical', 'activation': 'relu', 'context_dim': 4, 'action_dim': 4, 'include_joint_angles': False, 'fixed_point_coeff': 0.001, 'dropout': 0.5, 'compute_divergence': False, 'div_coeff': 0, 'div_clip': inf, 'decode': True, 'align': False, 'model_class': 'cAE', 'batch_size': 64, 'max_epochs': 100, 'no_wandb': False, 'data_path': 'data/rpnp_traj_with_pose_and_noise.pkl', 'exclude_context_feature_joint_angles': True, 'exclude_context_feature_gripper_width': False, 'exclude_context_feature_pose': True, 'exclude_context_feature_ee_pos': False, 'exclude_context_feature_ee_rot': True, 'exclude_gripper': False, 'action_space': 'ee', 'size_limit': 'None'}
debug-internal.log
2022-11-16 13:47:36,096 INFO StreamThr :824981 [internal.py:wandb_internal():88] W&B internal server running at pid: 824981, started at: 2022-11-16 13:47:36.096186
2022-11-16 13:47:36,105 DEBUG HandlerThread:824981 [handler.py:handle_request():139] handle_request: status
2022-11-16 13:47:36,105 DEBUG SenderThread:824981 [sender.py:send_request():317] send_request: status
2022-11-16 13:47:36,106 INFO WriterThread:824981 [datastore.py:open_for_write():75] open: ./wandb/run-20221116_134736-3a4w8w3w/run-3a4w8w3w.wandb
2022-11-16 13:47:36,106 DEBUG SenderThread:824981 [sender.py:send():303] send: header
2022-11-16 13:47:36,107 DEBUG SenderThread:824981 [sender.py:send():303] send: run
2022-11-16 13:47:36,108 INFO SenderThread:824981 [sender.py:_maybe_setup_resume():593] checking resume status for ucla-ncel-robotics/latent-action/3a4w8w3w
2022-11-16 13:47:36,368 DEBUG HandlerThread:824981 [handler.py:handle_request():139] handle_request: check_version
2022-11-16 13:47:36,374 INFO SenderThread:824981 [dir_watcher.py:__init__():216] watching files in: ./wandb/run-20221116_134736-3a4w8w3w/files
2022-11-16 13:47:36,375 INFO SenderThread:824981 [sender.py:_start_run_threads():928] run started: 3a4w8w3w with start time 1668635256.096683
2022-11-16 13:47:36,375 DEBUG SenderThread:824981 [sender.py:send():303] send: summary
2022-11-16 13:47:36,375 INFO SenderThread:824981 [sender.py:_save_file():1171] saving file wandb-summary.json with policy end
2022-11-16 13:47:36,375 DEBUG SenderThread:824981 [sender.py:send_request():317] send_request: check_version
2022-11-16 13:47:36,414 DEBUG HandlerThread:824981 [handler.py:handle_request():139] handle_request: run_start
2022-11-16 13:47:36,425 DEBUG HandlerThread:824981 [system_info.py:__init__():31] System info init
2022-11-16 13:47:36,425 DEBUG HandlerThread:824981 [system_info.py:__init__():46] System info init done
2022-11-16 13:47:36,425 INFO HandlerThread:824981 [system_monitor.py:start():150] Starting system monitor
2022-11-16 13:47:36,425 INFO SystemMonitor:824981 [system_monitor.py:_start():116] Starting system asset monitoring threads
2022-11-16 13:47:36,425 INFO SystemMonitor:824981 [interfaces.py:start():168] Started cpu
2022-11-16 13:47:36,425 INFO HandlerThread:824981 [system_monitor.py:probe():168] Collecting system info
2022-11-16 13:47:36,426 INFO SystemMonitor:824981 [interfaces.py:start():168] Started disk
2022-11-16 13:47:36,426 INFO SystemMonitor:824981 [interfaces.py:start():168] Started gpu
2022-11-16 13:47:36,427 INFO SystemMonitor:824981 [interfaces.py:start():168] Started memory
2022-11-16 13:47:36,427 INFO SystemMonitor:824981 [interfaces.py:start():168] Started network
2022-11-16 13:47:36,465 DEBUG HandlerThread:824981 [system_info.py:probe():195] Probing system
2022-11-16 13:47:36,467 DEBUG HandlerThread:824981 [system_info.py:_probe_git():180] Probing git
2022-11-16 13:47:36,471 DEBUG HandlerThread:824981 [system_info.py:_probe_git():188] Probing git done
2022-11-16 13:47:36,471 DEBUG HandlerThread:824981 [system_info.py:probe():241] Probing system done
2022-11-16 13:47:36,471 DEBUG HandlerThread:824981 [system_monitor.py:probe():177] {'os': 'Linux-5.15.0-48-generic-x86_64-with-glibc2.31', 'python': '3.10.6', 'heartbeatAt': '2022-11-16T21:47:36.465175', 'startedAt': '2022-11-16T21:47:36.083219', 'docker': None, 'cuda': None, 'args': ('--decode', '--model_class', 'cAE', '--max_epochs', '100', '--data_path', 'data/rpnp_traj_with_pose_and_noise.pkl', '--action_space', 'ee', '--enc_dims', '256', '512', '256', '--dec_dims', '256', '512', '256', '--exclude_context_feature_pose', '--exclude_context_feature_joint_angles', '--exclude_context_feature_ee_rot', '--dropout', '0.5', '--fixed_point_coeff', '0.001', '--batch_size', '64'), 'state': 'running', 'program': '/home3/shivam/latent-actions/train.py', 'codePath': 'train.py', 'git': {'remote': 'https://github.com/shivampatel712/latent-actions', 'commit': '1d68621ead248ab96deb84c6420a39aca3d9081c'}, 'email': 'shivambpatel712@gmail.com', 'root': '/home3/shivam/latent-actions', 'host': 'obiwan', 'username': 'shivam', 'executable': '/home3/shivam/miniconda3/envs/l_a/bin/python', 'cpu_count': 24, 'cpu_count_logical': 48, 'cpu_freq': {'current': 4000.9778541666674, 'min': 2200.0, 'max': 3800.0}, 'cpu_freq_per_core': [{'current': 4006.182, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.044, 'min': 2200.0, 'max': 3800.0}, {'current': 4005.846, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.476, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.603, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.54, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.326, 'min': 2200.0, 'max': 3800.0}, {'current': 3800.0, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.215, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.614, 'min': 2200.0, 'max': 3800.0}, {'current': 4004.054, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.672, 'min': 2200.0, 'max': 3800.0}, {'current': 4005.822, 'min': 2200.0, 'max': 3800.0}, {'current': 4005.662, 'min': 2200.0, 'max': 3800.0}, {'current': 3996.473, 'min': 2200.0, 
'max': 3800.0}, {'current': 4007.573, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.093, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.465, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.39, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.244, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.625, 'min': 2200.0, 'max': 3800.0}, {'current': 4004.173, 'min': 2200.0, 'max': 3800.0}, {'current': 4004.033, 'min': 2200.0, 'max': 3800.0}, {'current': 4004.174, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.828, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.708, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.514, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.304, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.476, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.415, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.905, 'min': 2200.0, 'max': 3800.0}, {'current': 4005.451, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.746, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.435, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.914, 'min': 2200.0, 'max': 3800.0}, {'current': 4003.545, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.094, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.284, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.247, 'min': 2200.0, 'max': 3800.0}, {'current': 4008.135, 'min': 2200.0, 'max': 3800.0}, {'current': 3996.988, 'min': 2200.0, 'max': 3800.0}, {'current': 4008.027, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.263, 'min': 2200.0, 'max': 3800.0}, {'current': 4006.791, 'min': 2200.0, 'max': 3800.0}, {'current': 4007.211, 'min': 2200.0, 'max': 3800.0}, {'current': 4004.831, 'min': 2200.0, 'max': 3800.0}, {'current': 4004.694, 'min': 2200.0, 'max': 3800.0}, {'current': 4004.832, 'min': 2200.0, 'max': 3800.0}], 'disk': {'total': 915.3232879638672, 'used': 52.240882873535156}, 'gpu': 'NVIDIA GeForce RTX 3090', 'gpu_count': 3, 'gpu_devices': [{'name': 'NVIDIA GeForce RTX 3090', 'memory_total': 25447170048}, {'name': 'NVIDIA GeForce RTX 3090', 
'memory_total': 25447170048}, {'name': 'NVIDIA GeForce RTX 3090', 'memory_total': 25438322688}], 'memory': {'total': 125.64818572998047}}
2022-11-16 13:47:36,471 INFO HandlerThread:824981 [system_monitor.py:probe():178] Finished collecting system info
2022-11-16 13:47:36,471 INFO HandlerThread:824981 [system_monitor.py:probe():181] Publishing system info
2022-11-16 13:47:36,471 DEBUG HandlerThread:824981 [system_info.py:_save_pip():51] Saving list of pip packages installed into the current environment
2022-11-16 13:47:36,471 DEBUG HandlerThread:824981 [system_info.py:_save_pip():67] Saving pip packages done
2022-11-16 13:47:36,472 DEBUG HandlerThread:824981 [system_info.py:_save_conda():74] Saving list of conda packages installed into the current environment
2022-11-16 13:47:37,375 INFO Thread-13 :824981 [dir_watcher.py:_on_file_created():275] file/dir created: ./wandb/run-20221116_134736-3a4w8w3w/files/wandb-summary.json
2022-11-16 13:47:37,376 INFO Thread-13 :824981 [dir_watcher.py:_on_file_created():275] file/dir created: ./wandb/run-20221116_134736-3a4w8w3w/files/requirements.txt
2022-11-16 13:47:37,376 INFO Thread-13 :824981 [dir_watcher.py:_on_file_created():275] file/dir created: ./wandb/run-20221116_134736-3a4w8w3w/files/conda-environment.yaml
2022-11-16 13:47:37,926 DEBUG HandlerThread:824981 [system_info.py:_save_conda():86] Saving conda packages done
2022-11-16 13:47:37,926 INFO HandlerThread:824981 [system_monitor.py:probe():183] Finished publishing system info
2022-11-16 13:47:37,995 DEBUG SenderThread:824981 [sender.py:send():303] send: files
2022-11-16 13:47:37,996 INFO SenderThread:824981 [sender.py:_save_file():1171] saving file wandb-metadata.json with policy now
2022-11-16 13:47:38,000 DEBUG HandlerThread:824981 [handler.py:handle_request():139] handle_request: stop_status
2022-11-16 13:47:38,000 DEBUG SenderThread:824981 [sender.py:send_request():317] send_request: stop_status
2022-11-16 13:47:38,088 DEBUG SenderThread:824981 [sender.py:send():303] send: telemetry
2022-11-16 13:47:38,088 DEBUG SenderThread:824981 [sender.py:send():303] send: metric
2022-11-16 13:47:38,088 DEBUG SenderThread:824981 [sender.py:send():303] send: telemetry
2022-11-16 13:47:38,088 DEBUG SenderThread:824981 [sender.py:send():303] send: metric
2022-11-16 13:47:38,088 WARNING SenderThread:824981 [sender.py:send_metric():1127] Seen metric with glob (shouldn't happen)
2022-11-16 13:47:38,346 INFO Thread-16 :824981 [upload_job.py:push():143] Uploaded file /tmp/tmp10ff2xhawandb/byh38mac-wandb-metadata.json
2022-11-16 13:47:38,375 INFO Thread-13 :824981 [dir_watcher.py:_on_file_modified():292] file/dir modified: ./wandb/run-20221116_134736-3a4w8w3w/files/conda-environment.yaml
2022-11-16 13:47:38,375 INFO Thread-13 :824981 [dir_watcher.py:_on_file_created():275] file/dir created: ./wandb/run-20221116_134736-3a4w8w3w/files/wandb-metadata.json
2022-11-16 13:47:41,387 INFO Thread-13 :824981 [dir_watcher.py:_on_file_created():275] file/dir created: ./wandb/run-20221116_134736-3a4w8w3w/files/output.log
2022-11-16 13:47:42,336 DEBUG SenderThread:824981 [sender.py:send():303] send: config
2022-11-16 13:47:42,732 DEBUG SenderThread:824981 [sender.py:send():303] send: config
2022-11-16 13:47:43,403 INFO Thread-13 :824981 [dir_watcher.py:_on_file_modified():292] file/dir modified: ./wandb/run-20221116_134736-3a4w8w3w/files/output.log
2022-11-16 13:48:53,773 INFO Thread-13 :824981 [dir_watcher.py:_on_file_modified():292] file/dir modified: ./wandb/run-20221116_134736-3a4w8w3w/files/output.log
2022-11-16 13:48:53,775 DEBUG SystemMonitor:824981 [system_monitor.py:_start():130] Starting system metrics aggregation loop
2022-11-16 13:48:54,755 WARNING StreamThr :824981 [internal.py:is_dead():385] Internal process exiting, parent pid 805480 disappeared
2022-11-16 13:48:54,755 ERROR StreamThr :824981 [internal.py:wandb_internal():147] Internal process shutdown.
2022-11-16 13:48:54,773 INFO Thread-13 :824981 [dir_watcher.py:_on_file_modified():292] file/dir modified: ./wandb/run-20221116_134736-3a4w8w3w/files/config.yaml
2022-11-16 13:48:54,887 INFO SenderThread:824981 [sender.py:finish():1331] shutting down sender
2022-11-16 13:48:55,490 INFO HandlerThread:824981 [handler.py:finish():814] shutting down handler
2022-11-16 13:48:55,739 INFO WriterThread:824981 [datastore.py:close():279] close: ./wandb/run-20221116_134736-3a4w8w3w/run-3a4w8w3w.wandb
2022-11-16 13:48:56,227 INFO SenderThread:824981 [dir_watcher.py:finish():362] shutting down directory watcher
2022-11-16 13:48:56,769 INFO MainThread:824981 [internal.py:handle_exit():78] Internal process exited
Thanks for your help.