I wanted to run a set of wandb agents, but the server I have access to does not have SLURM or any other workload manager. How do I have the agent code select a CUDA_VISIBLE_DEVICES value automatically, according to the GPU memory I need and what is actually available?
I was going to do something hacky like a while loop that checked whether a memory error happened, looping over all GPUs until the total number of agents the user desired was reached. But I was hoping to avoid this, since automating that kind of thing seems like the entire reason I am using wandb in the first place.
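Roughly the kind of hack I had in mind, as a sketch (hypothetical: it launches each agent as a child process so CUDA_VISIBLE_DEVICES takes effect before CUDA initializes, and treats a non-zero exit as "that GPU was full"):
import os
import subprocess

NUM_GPUS = 8  # assumption: number of GPUs on the shared server

def launch_agent_on_gpu(gpu_idx: int, sweep_id: str) -> subprocess.Popen:
    # CUDA_VISIBLE_DEVICES must be set before the child initializes CUDA,
    # so each agent runs as its own pinned subprocess
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_idx))
    return subprocess.Popen(["wandb", "agent", sweep_id], env=env)

def run_agents_hacky(sweep_id: str, num_agents_desired: int) -> None:
    launched = 0
    while launched < num_agents_desired:
        for gpu_idx in range(NUM_GPUS):
            proc = launch_agent_on_gpu(gpu_idx, sweep_id)
            if proc.wait() == 0:  # non-zero exit ~ OOM on that GPU
                launched += 1
            if launched >= num_agents_desired:
                break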
Some starter code
"""
Main Idea:
- create sweep with a sweep config & get sweep_id for the agents (note, this creates a sweep in wandb's website)
- create agent to run a setting of hps by giving it the sweep_id (that matches the sweep in the wandb website)
- keep running agents with sweep_id until you're done
note:
- Each individual training session with a specific set of hyperparameters in a sweep is considered a wandb run.
ref:
- read: https://docs.wandb.ai/guides/sweeps
"""
import wandb
from pprint import pprint
import math
import torch
sweep_config: dict = {
    "project": "playground",
    "entity": "your_wandb_username",
    "name": "my-ultimate-sweep",
    "metric": {"name": "train_loss", "goal": "minimize"},
    "method": "random",
    "parameters": None,  # set below
}
parameters = {
    'optimizer': {'values': ['adam', 'adafactor']},
    'scheduler': {'values': ['cosine', 'none']},  # todo, think how to do
    'lr': {
        "distribution": "log_uniform_values",
        "min": 1e-6,
        "max": 0.2,
    },
    'batch_size': {
        # integers between 32 and 256 with evenly-distributed logarithms
        'distribution': 'q_log_uniform_values',
        'q': 8,
        'min': 32,
        'max': 256,
    },
    # it's often the case that some hps we don't want to vary in the run, e.g. num_its
    'num_its': {'value': 5},
}
sweep_config['parameters'] = parameters
pprint(sweep_config)
# create the sweep on wandb's website & get a sweep_id so agents can be spawned, each running one set of hps
sweep_id = wandb.sweep(sweep_config)
print(f'{sweep_id=}')
def my_train_func():
    # read the current hyperparameter values chosen by the sweep from wandb.config
    # (no explicit group is needed: the sweep name already groups the runs)
    run = wandb.init(config=sweep_config)
    print(f'{run=}')
    pprint(f'{wandb.config=}')
    lr = wandb.config.lr
    num_its = wandb.config.num_its
    train_loss: float = 8.0 + torch.rand(1).item()
    for i in range(num_its):
        # fake an update step: lr times a random number drawn from [0.0, 1.0) using torch
        update_step: float = lr * torch.rand(1).item()
        wandb.log({"lr": lr, "train_loss": train_loss - update_step})
    run.finish()

# run the sweep: launch an agent that runs my_train_func 5 times, using the
# randomly generated hyperparameter values returned by the Sweep Controller
wandb.agent(sweep_id, function=my_train_func, count=5)
Hi @brando, to select a specific GPU before running a wandb agent, you can set the CUDA_VISIBLE_DEVICES environment variable. This variable determines which GPU(s) your script will use. Here's an example of how to set CUDA_VISIBLE_DEVICES and run a wandb agent on a specific GPU:
bash
CUDA_VISIBLE_DEVICES=0 wandb agent sweep_ID
In this example, replace sweep_ID with your actual sweep ID. The CUDA_VISIBLE_DEVICES=0 part sets the first GPU (index 0) as the visible device for your script. If you have multiple GPUs and want to use another one, change the index accordingly (e.g., CUDA_VISIBLE_DEVICES=1 for the second GPU). Make sure you have the NVIDIA CUDA Toolkit installed on your system to use CUDA_VISIBLE_DEVICES.
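If you prefer to stay in Python, an equivalent minimal sketch is to set the variable in os.environ before anything initializes CUDA (this reuses sweep_id and my_train_func from the starter code above):
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must happen before torch/CUDA is initialized

import wandb
wandb.agent(sweep_id, function=my_train_func, count=5)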
Hi, I just want to distinguish the differences between W&B and SLURM. W&B focuses on experiment tracking, visualization, and collaboration, while SLURM is a workload manager for managing resources and job scheduling on a cluster. W&B does not handle workload management or job scheduling like SLURM.
You can run Sweeps on SLURM, and there are third-party resources that cover this use case.
slurm was an analogy. I don’t have slurm (unfortunately, otherwise I wouldn’t be bothering you with it). I know how to use slurm :). I just have direct access to a single A100 server I share with other people, and I want to make it painless to allocate GPU jobs.
Do you think this strategy is a good idea for the short term:
select a GPU
set the number of counts to, say, 10 or some large number
only the jobs that can get GPU memory will run; the rest will fail
This seems reasonable for the short term. By setting the count parameter in wandb.agent() or using wandb agent --count, you can limit the number of trials each agent runs. This way, only the jobs that can get GPU memory will run, and the rest will fail.
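A rough sketch of that pattern (the sweep ID and GPU count are hypothetical): pin one agent per GPU with a generous count, and let runs that cannot get memory crash while the sweep controller keeps feeding the surviving agents:
import os
import subprocess

sweep_id = "your_entity/playground/<sweep_id>"  # hypothetical sweep ID
num_gpus = 4          # assumption: GPUs on the server
runs_per_agent = 10   # large-ish; extra runs simply fail if memory is unavailable

procs = []
for gpu_idx in range(num_gpus):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_idx))
    procs.append(subprocess.Popen(
        ["wandb", "agent", "--count", str(runs_per_agent), sweep_id], env=env))
for p in procs:
    p.wait()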
what is a good long-term solution? Note that I don’t have sudo privileges on my cluster/servers, and GPT-4 doesn’t suggest a reasonably easy way to install slurm myself:
How do I install slurm locally with minimum configurations on a single server?
To install Slurm locally with minimum configurations on a single server, follow these steps:
Install dependencies:
Slurm has several dependencies that need to be installed before you can install Slurm. On a Debian-based system (e.g., Ubuntu), run the following commands:
bash
sudo apt-get update
sudo apt-get install -y gcc make munge libmunge-dev libmunge2 libpam0g-dev libmysqlclient-dev libssl-dev libncurses5-dev
Create a Munge key:
Munge is an authentication mechanism used by Slurm. You need to create a secret key and distribute it to all nodes. In this case, you only have one node.
bash
sudo dd if=/dev/urandom of=/etc/munge/munge.key bs=1 count=1024
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
Start the Munge service:
bash
sudo systemctl enable munge
sudo systemctl start munge
Download and compile Slurm:
bash
wget https://download.schedmd.com/slurm/slurm-21.08.5.tar.bz2
tar xjf slurm-21.08.5.tar.bz2
cd slurm-21.08.5
./configure
make
sudo make install
Create a minimal Slurm configuration:
Create a minimal slurm.conf file, replacing <your_hostname> with the output of the hostname command.
bash
sudo mkdir -p /etc/slurm
sudo nano /etc/slurm/slurm.conf
Paste the following content into the file, adjusting it to your system:
bash
ControlMachine=<your_hostname>
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# Define a single partition
PartitionName=debug Nodes=<your_hostname> Default=YES MaxTime=INFINITE State=UP
# Define the single node
NodeName=<your_hostname> CPUs=1 State=UNKNOWN
Create the Slurm directories:
bash
sudo mkdir -p /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
sudo chown slurm: /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
Start the Slurm services:
bash
sudo systemctl enable slurmd.service slurmctld.service
sudo systemctl start slurmd.service slurmctld.service
Now Slurm should be installed and running on your single server. To check the status, use the sinfo command.
Brando Miranda
try one more time with this constraint. I don't have sudo privileges. So how do I install slurm then on a single server locally?
If you don't have sudo privileges, you can still install Slurm locally in your home directory. Here's how you can do it:
Install dependencies:
Ask your system administrator to install the necessary dependencies if they aren't already installed on the system:
gcc
make
munge
libmunge-dev
libmunge2
libpam0g-dev
libmysqlclient-dev
libssl-dev
libncurses5-dev
Configure Munge:
Ask your system administrator to configure Munge, create a secret key, and start the Munge service. This is necessary for Slurm to work.
Download and compile Slurm:
bash
wget https://download.schedmd.com/slurm/slurm-21.08.5.tar.bz2
tar xjf slurm-21.08.5.tar.bz2
cd slurm-21.08.5
./configure --prefix=$HOME/slurm
make
make install
Update your PATH variable:
bash
echo 'export PATH=$HOME/slurm/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
Create a minimal Slurm configuration:
Create a minimal slurm.conf file, replacing <your_hostname> with the output of the hostname command.
bash
mkdir -p $HOME/slurm/etc
nano $HOME/slurm/etc/slurm.conf
Paste the following content into the file, adjusting it to your system (slurm.conf does not expand shell variables, so write absolute paths and your username out literally):
bash
ControlMachine=<your_hostname>
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/home/<your_username>/slurm/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/home/<your_username>/slurm/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/home/<your_username>/slurm/var/spool/slurmd
SlurmUser=<your_username>
StateSaveLocation=/home/<your_username>/slurm/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# Define a single partition
PartitionName=debug Nodes=<your_hostname> Default=YES MaxTime=INFINITE State=UP
# Define the single node
NodeName=<your_hostname> CPUs=1 State=UNKNOWN
Create the Slurm directories:
bash
mkdir -p $HOME/slurm/var/run $HOME/slurm/var/spool/slurmctld $HOME/slurm/var/spool/slurmd $HOME/slurm/var/log
Start the Slurm services:
bash
slurmctld -D -c -f $HOME/slurm/etc/slurm.conf &
slurmd -D -c -f $HOME/slurm/etc/slurm.conf &
The -D flag runs the services in the foreground. If you want to run them in the background, you can remove the -D flag.
Now Slurm should be installed and running locally in your home directory on the single server. To check the status, use the sinfo command.
Brando Miranda
Give me the commands to install the dependencies locally without sudo privileges
gcc
make
munge
libmunge-dev
libmunge2
libpam0g-dev
libmysqlclient-dev
libssl-dev
libncurses5-dev
Installing system-level dependencies without sudo privileges can be challenging. However, you can try to build these dependencies from source and install them locally in your home directory.
Keep in mind that building all these packages from source can be time-consuming and may require additional dependencies. Here is a general outline of the steps to build each dependency from source:
Create a local directory for the dependencies:
bash
mkdir -p $HOME/local
Set environment variables:
bash
echo 'export PATH=$HOME/local/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$HOME/local/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
echo 'export PKG_CONFIG_PATH=$HOME/local/lib/pkgconfig:$PKG_CONFIG_PATH' >> ~/.bashrc
source ~/.bashrc
Download, build, and install each dependency:
For each dependency, you generally follow these steps:
Download the source code.
Extract the archive.
Enter the source directory.
Configure the build system, specifying the installation prefix as $HOME/local.
Build the software using make.
Install the software using make install.
For example, here is the process for building and installing gcc locally:
bash
wget https://ftp.gnu.org/gnu/gcc/gcc-11.2.0/gcc-11.2.0.tar.gz
tar xzf gcc-11.2.0.tar.gz
cd gcc-11.2.0
./contrib/download_prerequisites
mkdir build && cd build
../configure --prefix=$HOME/local --enable-languages=c,c++ --disable-multilib
make -j$(nproc)
make install
You need to follow similar steps for the other dependencies. However, each dependency may have its specific requirements and additional flags for the configure script. Consult the documentation for each dependency for the exact steps to build and install it.
Keep in mind that this process can be complex, and some dependencies may require additional libraries or tools. If you encounter difficulties or are unable to install certain dependencies, consider reaching out to your system administrator for assistance.
Brando Miranda
also tell me how to do this step locally myself without sudo privileges
"Configure Munge:
Ask your system administrator to configure Munge, create a secret key, and start the Munge service. This is necessary for Slurm to work."
To configure Munge locally without sudo privileges, follow these steps:
Download and compile Munge:
bash
wget https://github.com/dun/munge/releases/download/munge-0.5.14/munge-0.5.14.tar.xz
tar xf munge-0.5.14.tar.xz
cd munge-0.5.14
./configure --prefix=$HOME/local --sysconfdir=$HOME/local/etc --localstatedir=$HOME/local/var
make
make install
Update your PATH variable:
bash
echo 'export PATH=$HOME/local/sbin:$PATH' >> ~/.bashrc
source ~/.bashrc
Create a Munge key:
bash
dd if=/dev/urandom of=$HOME/local/etc/munge.key bs=1 count=1024
chmod 400 $HOME/local/etc/munge.key
Create necessary directories:
bash
mkdir -p $HOME/local/etc/munge $HOME/local/var/run/munge $HOME/local/var/log/munge
Set up Munge environment variables:
bash
echo 'export MUNGEUSER=$(whoami)' >> ~/.bashrc
echo 'export MUNGE_PID_FILE=$HOME/local/var/run/munge/munged.pid' >> ~/.bashrc
echo 'export MUNGE_LOG_FILE=$HOME/local/var/log/munge/munged.log' >> ~/.bashrc
source ~/.bashrc
Start the Munge service:
bash
munged --key-file=$HOME/local/etc/munge.key --socket=$HOME/local/var/run/munge/munge.socket.2 --num-threads=2 --pid-file=$MUNGE_PID_FILE --log-file=$MUNGE_LOG_FILE
Munge should now be running locally in your home directory (munged runs as the invoking user; check munged --help for the exact flags your build supports). Keep in mind that since you don't have sudo privileges, you'll need to use different port numbers for Slurm, which you can configure in your slurm.conf file. Make sure to update the SlurmctldPort and SlurmdPort settings to use unused, high-numbered ports (e.g., 50000 and 50001).
In case you have a static set of GPUs you are trying to allocate, you should be able to use your sweep’s config.yaml to supply the CUDA_VISIBLE_DEVICES environment variable as an $envvar to each agent.
I would suggest checking our sweep configuration documentation as well.
Unfortunately, W&B does not support resource management directly from within the SDK at this time.
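For example, one way to read the static-pinning suggestion above, assuming your agents run a script rather than a Python function: the sweep command section supports macros, and since ${env} expands to /usr/bin/env, a literal VAR=value entry after it should pin the GPU for every run of that sweep. A sketch under those assumptions (train.py is a hypothetical script), written as the Python-dict form of the config used in the starter code:
sweep_config["program"] = "train.py"  # hypothetical training script
sweep_config["command"] = [
    "${env}",                   # expands to /usr/bin/env
    "CUDA_VISIBLE_DEVICES=0",   # static pinning: one agent/sweep per GPU
    "${interpreter}",           # expands to python
    "${program}",
    "${args}",
]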
For a longer-term solution, you could wrap the wandb agent call in a bash script that checks available GPUs and allocates them accordingly. I don’t have a server with multiple GPUs to test this on, but here is a script I whipped up (with some help from our friend GPT-4), which should give you a reasonable starting point:
#!/bin/bash
# Check if the nvidia-smi command exists
if ! command -v nvidia-smi &> /dev/null; then
  echo "nvidia-smi command not found. Please install the NVIDIA drivers."
  exit 1
fi

# Check if a memory limit argument is provided
if [ -z "$1" ]; then
  echo "Usage: $0 <minimum_memory_in_MB>"
  exit 1
fi
minimum_memory_in_MB=$1
# List the NVIDIA GPUs whose currently free memory exceeds the requested amount
# (memory.free rather than memory.total, since we care about what is available right now)
nvidia-smi --query-gpu=index,name,memory.free --format=csv,noheader | while IFS=',' read -r index name memory; do
  memory_in_MB=${memory% MiB*}
  if [ "$memory_in_MB" -gt "$minimum_memory_in_MB" ]; then
    echo "Index: $index, Name: $name, Free memory: $memory"
  fi
done
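To go one step further and actually pin an agent to the freest GPU, here is a Python sketch in the same spirit (again not battle-tested; it assumes nvidia-smi is on the PATH, and the helper names are made up):
import os
import subprocess

def pick_gpu(min_free_mb: int) -> int:
    # query free memory per GPU via nvidia-smi and pick the GPU with the most free memory
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        text=True)
    free_mb = [int(x) for x in out.strip().splitlines()]
    best = max(range(len(free_mb)), key=lambda i: free_mb[i])
    if free_mb[best] < min_free_mb:
        raise RuntimeError(f"no GPU has at least {min_free_mb} MB free")
    return best

def launch_agent(sweep_id: str, min_free_mb: int) -> None:
    gpu = pick_gpu(min_free_mb)
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    subprocess.run(["wandb", "agent", sweep_id], env=env, check=True)

Note the obvious race: two agents launched at the same moment can pick the same GPU. On a shared box that is usually acceptable, since the loser just OOMs and you can retry.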
Hi Brando, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!