Using a custom Docker image

Hello,

I am getting stuck trying to run a job using a custom docker image and would like some advice.

I am trying to use my own custom image based on the NVIDIA Modulus image (nvcr.io/nvidia/modulus/modulus), as I need newer code than is in the base image.

This is the Dockerfile I use to generate my image:

ARG PYT_VER=23.08
FROM nvcr.io/nvidia/modulus/modulus:$PYT_VER AS builder

# Add TensorFlow and strip the pre-installed Modulus packages so the
# bind-mounted sources take precedence at runtime.
RUN python -m pip install tensorflow
RUN python -m pip uninstall -y nvidia-modulus nvidia-modulus.sym nvidia-modulus.launch

ENV PYTHONPATH=/modulus/:/modulus-sym/:/modulus-launch/

WORKDIR /modulus-launch/examples/cfd/vortex_shedding_mgn

ENTRYPOINT ["sh", "launch.sh"]

It just installs TensorFlow and uninstalls the existing Modulus packages.

The build isn't fancy:

docker build -t my_modulus:latest -f Dockerfile .
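A side note (my own sketch, not something from the Modulus docs): since PYT_VER is declared as an ARG, the base tag can also be overridden at build time instead of editing the Dockerfile. The snippet keeps the command in a variable and only runs it when a Docker daemon and the Dockerfile are actually available:

```shell
# Sketch: --build-arg overrides the Dockerfile's PYT_VER default (23.08).
BUILD_CMD='docker build --build-arg PYT_VER=23.08 -t my_modulus:latest -f Dockerfile .'
if command -v docker >/dev/null 2>&1 && [ -f Dockerfile ]; then
  # Run the real build when docker and the Dockerfile are present.
  eval "$BUILD_CMD" || echo "build failed"
else
  # Otherwise just show the command that would be run.
  echo "would run: $BUILD_CMD"
fi
```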

For reference, my launch.sh:

python -m pip uninstall nvidia-modulus nvidia-modulus.sym nvidia-modulus.launch -y

cd /modulus/
python -m pip install -e .

cd /modulus-sym/
python -m pip install -e .

cd /modulus-launch/
python -m pip install -e .

cd /modulus-launch/examples/cfd/vortex_shedding_mgn/
git config --global --add safe.directory /modulus-launch

pip install wandb --upgrade

python /modulus-launch/examples/cfd/vortex_shedding_mgn/wandb_train.py "$@"

This makes sure the container has Modulus uninstalled and then installs local versions from the mount points. Finally, it runs the training script.

To test my image I have used:

docker run -e WANDB_API_KEY=<my api key> -e WANDB_DOCKER="my_modulus:latest" \
  --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --runtime nvidia \
  -v <my path to modulus-launch>:/modulus-launch \
  -v <my path to modulus>:/modulus \
  -v <my path to modulus-sym>:/modulus-sym \
  -v <my path to my dataset>:/datasets/ \
  -v <my path to my workspace>:/workspace/ \
  -it --rm my_modulus:latest --project <project name> --entity <my entity>

This works fine and I get a job created on wandb.
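As an extra sanity check on the image itself (a sketch of my own, not from any docs): overriding the ENTRYPOINT lets the container run python directly, to confirm torch is importable inside the custom image. The command is kept in a variable so it can still be printed where no Docker daemon is available:

```shell
# Sketch: bypass the image's ENTRYPOINT (launch.sh) and invoke python
# directly to check that torch is importable in my_modulus:latest.
CHECK_CMD='docker run --rm --entrypoint python my_modulus:latest -c "import torch; print(torch.__version__)"'
if command -v docker >/dev/null 2>&1; then
  eval "$CHECK_CMD" || echo "check failed (image not available here?)"
else
  echo "would run: $CHECK_CMD"
fi
```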

I have set up a Docker queue with this config:

env:
  - PYTHONPATH=/modulus/:/modulus-sym/:/modulus-launch/
gpus: all
volume:
  - <local path>:/modulus-launch
  - <local path>:/modulus
  - <local path>:/modulus-sym
  - <local path>:/datasets/
  - <local path>:/workspace/
builder:
  accelerator:
    base_image: my_modulus:latest

I have tried with and without the builder.

When launching the job from the website, I use these options:

{
    "args": [
        "--project",
        "<my project>",
        "--entity",
        "<my entity>"
    ],
    "run_config": {
        "epochs": 25,
        "ckpt_path": "/workspace/checkpoints_training_6"
    },
    "entry_point": []
}

This is to test changing the number of epochs and saving the checkpoints to a different folder.

The error I get is this:

wandb: launch: Launching run in docker with command: docker run --rm -e WANDB_BASE_URL=https://api.wandb.ai -e WANDB_API_KEY -e WANDB_PROJECT=<project> -e WANDB_ENTITY=<my entity> -e WANDB_LAUNCH=True -e WANDB_RUN_ID=7dyx6mlk -e WANDB_USERNAME=<my username> -e WANDB_CONFIG='{"epochs": 25, "ckpt_path": "/workspace/checkpoints_training_6"}' -e WANDB_ARTIFACTS='{"_wandb_job": "<entity>/<project>/job-<job name>_wandb_train.py:latest"}' --env PYTHONPATH=/modulus/:/modulus-sym/:/modulus-launch/ --gpus all --volume <local path>:/modulus-launch --volume <local path>:/modulus --volume <local path>:/modulus-sym --volume <local path>:/datasets/ --volume <local path>:/workspace/  <>_wandb_train.py:7864ffe4
Traceback (most recent call last):
  File "<>/examples/cfd/vortex_shedding_mgn/wandb_train.py", line 20, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
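One thing I notice in that log: the image being run is <>_wandb_train.py:7864ffe4, not my_modulus:latest, so it looks like Launch built its own image. A hedged sketch for narrowing this down is to try the failing import directly inside the built image (BUILT_IMAGE below is a placeholder for the tag from the log, which I have elided):

```shell
# Hypothetical check: BUILT_IMAGE stands in for the launch-built image tag.
# Override its entrypoint and try the same import that fails in the job.
BUILT_IMAGE='launch-built-image:placeholder'
IMPORT_CMD='docker run --rm --entrypoint python '"$BUILT_IMAGE"' -c "import torch"'
if command -v docker >/dev/null 2>&1; then
  eval "$IMPORT_CMD" || echo "import torch failed in $BUILT_IMAGE"
else
  echo "would run: $IMPORT_CMD"
fi
```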

Looking at the documentation, it isn't clear to me how to do this correctly. If anyone has advice on how to use a Docker image similar to the one I have created above, I would be grateful.

Thanks

LimitingFactor

Hey @limitingfactor - in your setup, are you explicitly installing PyTorch anywhere? We don't do this automatically, so I just wanted to check what happens if you explicitly install it beforehand or include it in your setup.

Hi @uma-wandb, I am using the NVIDIA Modulus image as a starting point, which already has PyTorch installed.

Hey @limitingfactor - were you able to get this up and running? Have you run into any further issues?