Hi,
I’m logging my experiments with wandb. It records all the GPU usage information on my local machine, but after I moved to my school’s HPC environment (same virtual env), all the GPU usage metrics are gone (the GPU hardware info is still recognized normally). Can anyone help me locate the issue?
Hi @edwardjjj! Thank you for writing in.
Could you please tell us a bit about your environment, and whether you are using any external libraries for your training with wandb, such as TensorFlow or PyTorch Lightning?
Could you please provide the debug.log and debug-internal.log files associated with the run where you are running into this issue? These files should be located in the wandb folder relative to your working directory.
Hi @artsiom, thank you for reaching out. Our HPC system uses Altair Grid Engine for managing job submissions. I’m using PyTorch, and I wrote my own training loop rather than using a trainer framework.
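For reference, my training setup is structured roughly like this (a simplified sketch with a stand-in model and random data, not my exact code):

import torch
import wandb

# wandb normally collects GPU utilization/memory automatically in a
# background system-metrics process once the run starts.
run = wandb.init(project="sefarnlp", config={"lr": 1e-4, "epochs": 3})

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 20).to(device)      # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=run.config.lr)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(run.config.epochs):
    x = torch.randn(32, 128, device=device)      # stand-in for a real batch
    y = torch.randint(0, 20, (32,), device=device)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    wandb.log({"train/loss": loss.item()})

run.finish()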
debug.log
2024-04-30 08:47:18,577 INFO MainThread:31528 [wandb_setup.py:_flush():76] Current SDK version is 0.16.6
2024-04-30 08:47:18,577 INFO MainThread:31528 [wandb_setup.py:_flush():76] Configure stats pid to 31528
2024-04-30 08:47:18,578 INFO MainThread:31528 [wandb_setup.py:_flush():76] Loading settings from /home/7/ud02257/.config/wandb/settings
2024-04-30 08:47:18,578 INFO MainThread:31528 [wandb_setup.py:_flush():76] Loading settings from /gs/fs/tga-aklab/edward/projects/sefarnlp/wandb/settings
2024-04-30 08:47:18,578 INFO MainThread:31528 [wandb_setup.py:_flush():76] Loading settings from environment variables: {}
2024-04-30 08:47:18,578 INFO MainThread:31528 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2024-04-30 08:47:18,578 INFO MainThread:31528 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'llama.py', 'program_abspath': '/gs/fs/tga-aklab/edward/projects/sefarnlp/llama.py', 'program': 'llama.py'}
2024-04-30 08:47:18,578 INFO MainThread:31528 [wandb_setup.py:_flush():76] Applying login settings: {}
2024-04-30 08:47:18,578 INFO MainThread:31528 [wandb_init.py:_log_setup():521] Logging user logs to /gs/fs/tga-aklab/edward/projects/sefarnlp/wandb/run-20240430_084718-s8qdupai/logs/debug.log
2024-04-30 08:47:18,578 INFO MainThread:31528 [wandb_init.py:_log_setup():522] Logging internal logs to /gs/fs/tga-aklab/edward/projects/sefarnlp/wandb/run-20240430_084718-s8qdupai/logs/debug-internal.log
2024-04-30 08:47:18,578 INFO MainThread:31528 [wandb_init.py:init():561] calling init triggers
2024-04-30 08:47:18,578 INFO MainThread:31528 [wandb_init.py:init():568] wandb.init called with sweep_config: {}
config: {'model': {'path': 'facebook/opt-350m', 'num_labels': 20}, 'lora': {'r': 64, 'lora_dropout': 0.01, 'lora_alpha': 16}, 'dataset': {'name': 'rungalileo/20_Newsgroups_Fixed', 'finetuning_num': 200}, 'experiment': {'rounds': 10, 'epochs': 51, 'interval': 5, 'lr': 0.0001, 'log_interval': 100, 'seed': 42}}
2024-04-30 08:47:18,578 INFO MainThread:31528 [wandb_init.py:init():611] starting backend
2024-04-30 08:47:18,578 INFO MainThread:31528 [wandb_init.py:init():615] setting up manager
2024-04-30 08:47:18,581 INFO MainThread:31528 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-04-30 08:47:18,584 INFO MainThread:31528 [wandb_init.py:init():623] backend started and connected
2024-04-30 08:47:18,590 INFO MainThread:31528 [wandb_init.py:init():715] updated telemetry
2024-04-30 08:47:18,615 INFO MainThread:31528 [wandb_init.py:init():748] communicating run to backend with 90.0 second timeout
2024-04-30 08:47:19,110 INFO MainThread:31528 [wandb_run.py:_on_init():2357] communicating current version
2024-04-30 08:47:19,155 INFO MainThread:31528 [wandb_run.py:_on_init():2366] got version response
2024-04-30 08:47:19,155 INFO MainThread:31528 [wandb_init.py:init():799] starting run threads in backend
2024-04-30 08:47:21,809 INFO MainThread:31528 [wandb_run.py:_console_start():2335] atexit reg
2024-04-30 08:47:21,809 INFO MainThread:31528 [wandb_run.py:_redirect():2190] redirect: wrap_raw
2024-04-30 08:47:21,809 INFO MainThread:31528 [wandb_run.py:_redirect():2255] Wrapping output streams.
2024-04-30 08:47:21,809 INFO MainThread:31528 [wandb_run.py:_redirect():2280] Redirects installed.
2024-04-30 08:47:21,810 INFO MainThread:31528 [wandb_init.py:init():842] run started, returning control to user process
2024-04-30 08:47:29,192 INFO MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 08:50:59,510 INFO MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 08:54:29,229 INFO MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 08:57:58,774 INFO MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 09:01:27,743 INFO MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 09:04:56,233 INFO MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 09:08:23,644 INFO MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 09:11:50,740 INFO MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 09:15:19,674 INFO MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 09:18:48,951 INFO MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 09:22:22,496 WARNING MsgRouterThr:31528 [router.py:message_loop():77] message_loop has been closed
The debug-internal.log is 50,000 lines long. After a bit of digging, I found the following error, which may be the cause:
2024-04-30 08:47:49,347 ERROR gpu :31661 [interfaces.py:aggregate():161] Failed to serialize metric: [Errno 1] Operation not permitted: '/proc/1/stat'
2024-04-30 08:47:49,349 ERROR gpu :31661 [interfaces.py:aggregate():161] Failed to serialize metric: [Errno 1] Operation not permitted: '/proc/1/stat'
2024-04-30 08:47:49,351 ERROR gpu :31661 [interfaces.py:aggregate():161] Failed to serialize metric: [Errno 1] Operation not permitted: '/proc/1/stat'
2024-04-30 08:47:49,353 ERROR gpu :31661 [interfaces.py:aggregate():161] Failed to serialize metric: [Errno 1] Operation not permitted: '/proc/1/stat'
2024-04-30 08:47:49,355 ERROR gpu :31661 [interfaces.py:aggregate():161] Failed to serialize metric: [Errno 1] Operation not permitted: '/proc/1/stat'
2024-04-30 08:47:49,357 ERROR gpu :31661 [interfaces.py:aggregate():161] Failed to serialize metric: [Errno 1] Operation not permitted: '/proc/1/stat'
2024-04-30 08:47:49,358 ERROR gpu :31661 [interfaces.py:aggregate():161] Failed to serialize metric: [Errno 1] Operation not permitted: '/proc/1/stat'
The permissions on /proc/1/ look like the following, but I couldn’t cd into it:
dr-xr-xr-x 9 root root 0 Apr 24 10:41 1
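For what it’s worth, the same read can be tried directly, outside of wandb (a minimal check, assuming Python 3 on the compute node):

# Try the same read that wandb's system-metrics process performs.
# On the HPC node this raises PermissionError, matching debug-internal.log.
try:
    with open("/proc/1/stat") as f:
        print(f.read())
except PermissionError as e:
    print(f"cannot read /proc/1/stat: {e}")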
I think that’s exactly where the issue is coming from. This definitely has something to do with the school cluster, since it works without an issue on your personal machine.
I wonder if it’s possible for you to get higher permissions on your cluster.
Hi, I wanted to follow up with you regarding this thread. Have you had a chance to look into the permissions issue yet?
Hi, thank you for checking in. Unfortunately, the system admins are too busy right now and don’t have the bandwidth for my problem yet. I will update as soon as I hear back from them.
No worries at all. I will go ahead and close this ticket out from our side for internal tracking purposes. If the fix does not work on your side, you are welcome to write back in here or create a new Discourse thread with a reference to this one.
Hi @artsiom. I have tried using torch.cuda.max_memory_allocated() to check my memory consumption, and it works fine. Does wandb use a different backend than torch for collecting GPU metrics?
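In the meantime, would manually logging the values from torch be a reasonable workaround? Something like this (a rough sketch, assuming a single CUDA device; the metric names are just placeholders):

import torch
import wandb

def log_gpu_memory(step=None):
    # Log GPU memory stats reported by torch as regular wandb metrics,
    # bypassing the built-in system-metrics collection.
    if not torch.cuda.is_available():
        return
    wandb.log(
        {
            "gpu/memory_allocated_MB": torch.cuda.memory_allocated() / 2**20,
            "gpu/max_memory_allocated_MB": torch.cuda.max_memory_allocated() / 2**20,
        },
        step=step,
    )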