Using a custom docker image

Hello,

I am getting stuck trying to run a job using a custom docker image and would like some advice.

I am trying to use my own custom image based on the NVIDIA Modulus image (nvcr.io/nvidia/modulus/modulus), as I need newer code than is in the base image.

This is my Dockerfile for generating the image:

ARG PYT_VER=23.08
FROM nvcr.io/nvidia/modulus/modulus:$PYT_VER as builder

RUN python -m pip install tensorflow
RUN python -m pip uninstall -y nvidia-modulus nvidia-modulus.sym nvidia-modulus.launch

ENV PYTHONPATH=/modulus/:/modulus-sym/:/modulus-launch/ 

WORKDIR /modulus-launch/examples/cfd/vortex_shedding_mgn

ENTRYPOINT ["sh", "launch.sh"]

It just uninstalls the existing Modulus packages and installs TensorFlow.

The build isn't fancy:

docker build -t my_modulus:latest -f Dockerfile .

For reference, here is my launch.sh:

python -m pip uninstall nvidia-modulus nvidia-modulus.sym nvidia-modulus.launch -y

cd /modulus/
python -m pip install -e .

cd /modulus-sym/
python -m pip install -e .

cd /modulus-launch/
python -m pip install -e .

cd /modulus-launch/examples/cfd/vortex_shedding_mgn/
git config --global --add safe.directory /modulus-launch

pip install wandb --upgrade

python /modulus-launch/examples/cfd/vortex_shedding_mgn/wandb_train.py "$@"

This makes sure Modulus is uninstalled in the container and then installs the local versions from the mount points. Finally, it runs the training script.

To test my image I have used:

docker run -e WANDB_API_KEY=<my api key> -e WANDB_DOCKER="my_modulus:latest" --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --runtime nvidia -v <my path to modulus-launch>:/modulus-launch -v <my path to modulus>:/modulus -v <my path to modulus-sym>:/modulus-sym -v <my path to my dataset>:/datasets/ -v <my path to my workspace>:/workspace/ -it --rm my_modulus:latest --project <project name> --entity <my entity>

This works fine and I get a job created on wandb.

I have set up a Docker queue with the following config:

env:
  - PYTHONPATH=/modulus/:/modulus-sym/:/modulus-launch/
gpus: all
volume:
  - <local path>:/modulus-launch
  - <local path>:/modulus
  - <local path>:/modulus-sym
  - <local path>:/datasets/
  - <local path>:/workspace/
builder:
  accelerator:
    base_image: my_modulus:latest

I have tried this with and without the builder block.

When launching the job from the website, I use these options:

{
    "args": [
        "--project",
        "<my project>",
        "--entity",
        "<my entitiy>"
    ],
    "run_config": {
        "epochs": 25,
        "ckpt_path": "/workspace/checkpoints_training_6"
    },
    "entry_point":
    [
    ]
}

This is to test changing the number of epochs and saving the checkpoints to a different folder.
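
For context, the run_config overrides are passed into the container through the WANDB_CONFIG environment variable (visible in the docker command in the error output below) and surface through wandb.config once wandb.init() is called. Below is a minimal, hypothetical sketch of how a training script might pick them up; the actual wandb_train.py in modulus-launch may handle its config differently.

import wandb

# wandb.init() merges any overrides injected by Launch (via WANDB_CONFIG)
# into the run's config.
run = wandb.init(project="<my project>", entity="<my entity>")

cfg = dict(run.config)  # launch overrides end up here
epochs = cfg.get("epochs", 50)
ckpt_path = cfg.get("ckpt_path", "./checkpoints")

print(f"training for {epochs} epochs, checkpoints -> {ckpt_path}")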

The error I get is this:

wandb: launch: Launching run in docker with command: docker run --rm -e WANDB_BASE_URL=https://api.wandb.ai -e WANDB_API_KEY -e WANDB_PROJECT=<project> -e WANDB_ENTITY=<my entity> -e WANDB_LAUNCH=True -e WANDB_RUN_ID=7dyx6mlk -e WANDB_USERNAME=<my username> -e WANDB_CONFIG='{"epochs": 25, "ckpt_path": "/workspace/checkpoints_training_6"}' -e WANDB_ARTIFACTS='{"_wandb_job": "<entity>/<project>/job-<job name>_wandb_train.py:latest"}' --env PYTHONPATH=/modulus/:/modulus-sym/:/modulus-launch/ --gpus all --volume <local path>:/modulus-launch --volume <local path>:/modulus --volume <local path>:/modulus-sym --volume <local path>:/datasets/ --volume <local path>:/workspace/  <>_wandb_train.py:7864ffe4
Traceback (most recent call last):
  File "<>/examples/cfd/vortex_shedding_mgn/wandb_train.py", line 20, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'

Looking at the documentation, it isn't clear to me how to do this correctly. If anyone has advice on how to use a Docker image similar to the one I have created above, I would be grateful.

Thanks

LimitingFactor

Hey @limitingfactor - in your setup, are you explicitly installing PyTorch anywhere? We don't do this automatically, so I just wanted to check what happens if you explicitly install it beforehand or include it in your setup.
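
For example, a tiny hypothetical check like the sketch below, run inside whichever image Launch actually uses (the <>_wandb_train.py:7864ffe4 image from the error output rather than my_modulus:latest), would confirm whether torch is importable there at all:

# check_torch.py - hypothetical sanity check for the image Launch runs
import importlib.util

if importlib.util.find_spec("torch") is None:
    print("torch is NOT installed in this image")
else:
    import torch
    print("torch", torch.__version__, "- CUDA available:", torch.cuda.is_available())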

Hi @uma-wandb, I am using the NVIDIA Modulus image as a starting point, which already has PyTorch installed in it.

Hey @limitingfactor - were you able to get this up and running? Have you run into any further issues?
