Hello,
I am getting stuck trying to run a job using a custom docker image and would like some advice.
I am trying to use my own custom image based off the nvidia modulus image (nvcr.io/nvidia/modulus/modulus) as I need to newer code than in the base image.
This is my dockerfile to generate my image
ARG PYT_VER=23.08
FROM nvcr.io/nvidia/modulus/modulus:$PYT_VER as builder
RUN python -m pip install tensorflow
RUN python -m pip uninstall -y nvidia-modulus nvidia-modulus.sym nvidia-modulus.launch
ENV PYTHONPATH=/modulus/:/modulus-sym/:/modulus-launch/
WORKDIR /modulus-launch/examples/cfd/vortex_shedding_mgn
ENTRYPOINT ["sh", "launch.sh"]
It just uninstalls the existing modulus code and installs tensorflow.
The build isn’t fancy
docker build -t my_modulus:latest -f Dockerfile .
For reference my launch.sh
python -m pip uninstall nvidia-modulus nvidia-modulus.sym nvidia-modulus.launch -y
cd /modulus/
python -m pip install -e .
cd /modulus-sym/
python -m pip install -e .
cd /modulus-launch/
python -m pip install -e .
cd /modulus-launch/examples/cfd/vortex_shedding_mgn/
git config --global --add safe.directory /modulus-launch
pip install wandb --upgrade
python /modulus-launch/examples/cfd/vortex_shedding_mgn/wandb_train.py "$@"
This makes sure the container has modulus uninstalled and then installs a local version from the mounting points. Finally it run the training script.
To test my image I have used
docker run -e WANDB_API_KEY=<my api key> -e WANDB_DOCKER="my_modulus:latest" --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --runtime nvidia -v <my path to modulus-launch>:/modulus-launch -v <my path to modulus>:/modulus -v <my path to modulus-sym>:/modulus-sym -v <my path to my dataset>:/datasets/ -v <my path to my workspace>:/workspace/ -it --rm my_modulus:latest --project <project name> --entity <my entity>
This works fine and I get a job created on wandb.
I have setup a docker queue
env:
- PYTHONPATH=/modulus/:/modulus-sym/:/modulus-launch/
gpus: all
volume:
- <local path>:/modulus-launch
- <local path>:/modulus
- <local path>:/modulus-sym
- <local path>:/datasets/
- <local path>:/workspace/
builder:
accelerator:
base_image: my_modulus:latest
I have tried with and without the builder.
When launching the job from the website I use these options
{
"args": [
"--project",
"<my project>",
"--entity",
"<my entitiy>"
],
"run_config": {
"epochs": 25,
"ckpt_path": "/workspace/checkpoints_training_6"
},
"entry_point":
[
]
}
This is to test changing the number of epoch and to save the checkpoints to a different folder.
The error I get is this
wandb: launch: Launching run in docker with command: docker run --rm -e WANDB_BASE_URL=https://api.wandb.ai -e WANDB_API_KEY -e WANDB_PROJECT=<project> -e WANDB_ENTITY=<my entity> -e WANDB_LAUNCH=True -e WANDB_RUN_ID=7dyx6mlk -e WANDB_USERNAME=<my username> -e WANDB_CONFIG='{"epochs": 25, "ckpt_path": "/workspace/checkpoints_training_6"}' -e WANDB_ARTIFACTS='{"_wandb_job": "<entity>/<project>/job-<job name>_wandb_train.py:latest"}' --env PYTHONPATH=/modulus/:/modulus-sym/:/modulus-launch/ --gpus all --volume <local path>:/modulus-launch --volume <local path>:/modulus --volume <local path>:/modulus-sym --volume <local path>:/datasets/ --volume <local path>:/workspace/ <>_wandb_train.py:7864ffe4
Traceback (most recent call last):
File "<>/examples/cfd/vortex_shedding_mgn/wandb_train.py", line 20, in <module>
import torch
ModuleNotFoundError: No module named 'torch'
Looking at the documentation it isn’t clear to me how to do this correctly. If anyone has some advice on how to use a docker image similar to what I have created above I would be grateful.
Thanks
LimitingFactor