No GPU usage on HPC environment

Hi,
I’m logging my experiments with wandb. It logs all the GPU usage information on my local machine, but after I moved to my school’s HPC environment (same virtual env), all the GPU usage metrics are gone (the GPU hardware info is still recognized normally). Can anyone help me locate the issue?

Hi @edwardjjj! Thank you for writing in.

Could you please tell us a bit about your environment, and whether you are using any external libraries for your training with wandb, such as TensorFlow or PyTorch Lightning?

Could you please provide the debug.log and debug-internal.log files associated with the run where you are running into this issue? These files should be located in the wandb folder relative to your working directory.

Hi @artsiom, thank you for reaching out. Our HPC system uses Altair Grid Engine for managing job submissions. I’m using PyTorch, and I wrote my training loop myself rather than using a trainer framework.
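
For context, the setup is roughly the following sketch (the model, data, and hyperparameters here are placeholders, not my actual script): a plain PyTorch loop with wandb.init/wandb.watch/wandb.log and no Lightning or Trainer involved. The repeated “Watching” lines in the debug.log below come from the wandb.watch call.

import torch
import torch.nn as nn
import wandb

run = wandb.init(project="sefarnlp", config={"lr": 1e-4, "epochs": 51})

model = nn.Linear(16, 20)                              # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=run.config.lr)
loss_fn = nn.CrossEntropyLoss()
wandb.watch(model, log="all")                          # produces the "Watching" log lines

for epoch in range(run.config.epochs):
    x = torch.randn(32, 16)                            # dummy batch
    y = torch.randint(0, 20, (32,))
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    wandb.log({"loss": loss.item()})                   # metric logging works; GPU stats don't show up

run.finish()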

debug.log

2024-04-30 08:47:18,577 INFO    MainThread:31528 [wandb_setup.py:_flush():76] Current SDK version is 0.16.6
2024-04-30 08:47:18,577 INFO    MainThread:31528 [wandb_setup.py:_flush():76] Configure stats pid to 31528
2024-04-30 08:47:18,578 INFO    MainThread:31528 [wandb_setup.py:_flush():76] Loading settings from /home/7/ud02257/.config/wandb/settings
2024-04-30 08:47:18,578 INFO    MainThread:31528 [wandb_setup.py:_flush():76] Loading settings from /gs/fs/tga-aklab/edward/projects/sefarnlp/wandb/settings
2024-04-30 08:47:18,578 INFO    MainThread:31528 [wandb_setup.py:_flush():76] Loading settings from environment variables: {}
2024-04-30 08:47:18,578 INFO    MainThread:31528 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2024-04-30 08:47:18,578 INFO    MainThread:31528 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'llama.py', 'program_abspath': '/gs/fs/tga-aklab/edward/projects/sefarnlp/llama.py', 'program': 'llama.py'}
2024-04-30 08:47:18,578 INFO    MainThread:31528 [wandb_setup.py:_flush():76] Applying login settings: {}
2024-04-30 08:47:18,578 INFO    MainThread:31528 [wandb_init.py:_log_setup():521] Logging user logs to /gs/fs/tga-aklab/edward/projects/sefarnlp/wandb/run-20240430_084718-s8qdupai/logs/debug.log
2024-04-30 08:47:18,578 INFO    MainThread:31528 [wandb_init.py:_log_setup():522] Logging internal logs to /gs/fs/tga-aklab/edward/projects/sefarnlp/wandb/run-20240430_084718-s8qdupai/logs/debug-internal.log
2024-04-30 08:47:18,578 INFO    MainThread:31528 [wandb_init.py:init():561] calling init triggers
2024-04-30 08:47:18,578 INFO    MainThread:31528 [wandb_init.py:init():568] wandb.init called with sweep_config: {}
config: {'model': {'path': 'facebook/opt-350m', 'num_labels': 20}, 'lora': {'r': 64, 'lora_dropout': 0.01, 'lora_alpha': 16}, 'dataset': {'name': 'rungalileo/20_Newsgroups_Fixed', 'finetuning_num': 200}, 'experiment': {'rounds': 10, 'epochs': 51, 'interval': 5, 'lr': 0.0001, 'log_interval': 100, 'seed': 42}}
2024-04-30 08:47:18,578 INFO    MainThread:31528 [wandb_init.py:init():611] starting backend
2024-04-30 08:47:18,578 INFO    MainThread:31528 [wandb_init.py:init():615] setting up manager
2024-04-30 08:47:18,581 INFO    MainThread:31528 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-04-30 08:47:18,584 INFO    MainThread:31528 [wandb_init.py:init():623] backend started and connected
2024-04-30 08:47:18,590 INFO    MainThread:31528 [wandb_init.py:init():715] updated telemetry
2024-04-30 08:47:18,615 INFO    MainThread:31528 [wandb_init.py:init():748] communicating run to backend with 90.0 second timeout
2024-04-30 08:47:19,110 INFO    MainThread:31528 [wandb_run.py:_on_init():2357] communicating current version
2024-04-30 08:47:19,155 INFO    MainThread:31528 [wandb_run.py:_on_init():2366] got version response 
2024-04-30 08:47:19,155 INFO    MainThread:31528 [wandb_init.py:init():799] starting run threads in backend
2024-04-30 08:47:21,809 INFO    MainThread:31528 [wandb_run.py:_console_start():2335] atexit reg
2024-04-30 08:47:21,809 INFO    MainThread:31528 [wandb_run.py:_redirect():2190] redirect: wrap_raw
2024-04-30 08:47:21,809 INFO    MainThread:31528 [wandb_run.py:_redirect():2255] Wrapping output streams.
2024-04-30 08:47:21,809 INFO    MainThread:31528 [wandb_run.py:_redirect():2280] Redirects installed.
2024-04-30 08:47:21,810 INFO    MainThread:31528 [wandb_init.py:init():842] run started, returning control to user process
2024-04-30 08:47:29,192 INFO    MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 08:50:59,510 INFO    MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 08:54:29,229 INFO    MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 08:57:58,774 INFO    MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 09:01:27,743 INFO    MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 09:04:56,233 INFO    MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 09:08:23,644 INFO    MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 09:11:50,740 INFO    MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 09:15:19,674 INFO    MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 09:18:48,951 INFO    MainThread:31528 [wandb_watch.py:watch():51] Watching
2024-04-30 09:22:22,496 WARNING MsgRouterThr:31528 [router.py:message_loop():77] message_loop has been closed

The debug-internal.log is 50,000 lines long; after a bit of digging, I found the following error that may be the cause:

2024-04-30 08:47:49,347 ERROR   gpu       :31661 [interfaces.py:aggregate():161] Failed to serialize metric: [Errno 1] Operation not permitted: '/proc/1/stat'
2024-04-30 08:47:49,349 ERROR   gpu       :31661 [interfaces.py:aggregate():161] Failed to serialize metric: [Errno 1] Operation not permitted: '/proc/1/stat'
2024-04-30 08:47:49,351 ERROR   gpu       :31661 [interfaces.py:aggregate():161] Failed to serialize metric: [Errno 1] Operation not permitted: '/proc/1/stat'
2024-04-30 08:47:49,353 ERROR   gpu       :31661 [interfaces.py:aggregate():161] Failed to serialize metric: [Errno 1] Operation not permitted: '/proc/1/stat'
2024-04-30 08:47:49,355 ERROR   gpu       :31661 [interfaces.py:aggregate():161] Failed to serialize metric: [Errno 1] Operation not permitted: '/proc/1/stat'
2024-04-30 08:47:49,357 ERROR   gpu       :31661 [interfaces.py:aggregate():161] Failed to serialize metric: [Errno 1] Operation not permitted: '/proc/1/stat'
2024-04-30 08:47:49,358 ERROR   gpu       :31661 [interfaces.py:aggregate():161] Failed to serialize metric: [Errno 1] Operation not permitted: '/proc/1/stat'

The permissions on /proc/1 look like the following, but I couldn’t cd into it.
dr-xr-xr-x 9 root root 0 Apr 24 10:41 1
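
For reference, the same failure can be reproduced outside wandb with a snippet along these lines (my own quick check, not taken from wandb’s code):

try:
    with open("/proc/1/stat") as f:              # same path the gpu monitor fails on
        print(f.read())
except OSError as e:
    print(f"cannot read /proc/1/stat: {e}")      # on this cluster: [Errno 1] Operation not permitted, matching the log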

I think that’s exactly where the issue is coming from. This definitely has something to do with the school cluster, since it works without an issue on your personal machine.

I wonder if it’s possible for you to get elevated permissions on your cluster.

Hi, I wanted to follow up with you regarding this thread. Have you had a chance to look into the permissions issue yet?

Hi, thank you for checking in. Unfortunately the system admins are too busy right now and don’t have bandwidth for my problem yet. I will update as soon as I hear back from them.

No worries at all. I will go ahead and close this ticket out from our side for internal tracking purposes. If the fix does not work on your side, you are welcome to write back in here or create a new Discourse thread with a reference to this one.

Hi @artsiom. I have tried using torch.cuda.max_memory_allocated() to check my memory consumption, and it works fine. Is wandb using a different backend than torch?
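
My current understanding, happy to be corrected: torch.cuda.max_memory_allocated() asks the CUDA caching allocator inside my own process, while wandb’s system-metrics agent samples the GPU from a separate process, typically via NVML plus /proc for per-process attribution, which would explain why only the wandb side hits the permission error. A rough sketch of that kind of out-of-process query (using pynvml directly; wandb’s internals may differ):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # device-wide GPU/memory utilization in %
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # device-wide memory counters in bytes
print(f"gpu util: {util.gpu}%, mem used: {mem.used / 2**20:.0f} MiB")

# Attributing usage to a specific run needs the PIDs active on the GPU plus
# readable /proc entries, which is where a restricted /proc on a shared
# cluster can fail with "Operation not permitted".
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    print(proc.pid, proc.usedGpuMemory)

pynvml.nvmlShutdown()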