Using a custom docker image


I am getting stuck trying to run a job using a custom docker image and would like some advice.

I am trying to use my own custom image based off the NVIDIA Modulus image, as I need newer code than is in the base image.

This is my Dockerfile used to generate my image:

FROM $PYT_VER as builder

RUN python -m pip install tensorflow
RUN python -m pip uninstall -y nvidia-modulus nvidia-modulus.sym nvidia-modulus.launch

ENV PYTHONPATH=/modulus/:/modulus-sym/:/modulus-launch/ 

WORKDIR /modulus-launch/examples/cfd/vortex_shedding_mgn

ENTRYPOINT ["sh", ""]

It just uninstalls the existing modulus code and installs tensorflow.

The build command isn’t fancy:

docker build -t my_modulus:latest -f Dockerfile .

For reference, my entrypoint script is:

python -m pip uninstall nvidia-modulus nvidia-modulus.sym nvidia-modulus.launch -y

cd /modulus/
python -m pip install -e .

cd /modulus-sym/
python -m pip install -e .

cd /modulus-launch/
python -m pip install -e .

cd /modulus-launch/examples/cfd/vortex_shedding_mgn/
git config --global --add safe.directory /modulus-launch

pip install wandb --upgrade

python /modulus-launch/examples/cfd/vortex_shedding_mgn/ "$@"

This makes sure the container has Modulus uninstalled and then installs a local version from the mount points. Finally, it runs the training script.
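As an aside, the reason the uninstall step matters can be sketched in plain Python: a directory prepended to sys.path (which is what setting PYTHONPATH does) shadows an installed package of the same name, so the mounted checkouts win. Here `mypkg` is a hypothetical stand-in for the modulus packages:

```python
import os
import sys
import tempfile

# A directory placed at the front of sys.path (the effect of PYTHONPATH)
# shadows any installed package with the same name -- which is why the
# entrypoint removes the packaged wheels before pointing at the mounts.
with tempfile.TemporaryDirectory() as d:
    pkg = os.path.join(d, "mypkg")  # "mypkg" stands in for modulus
    os.makedirs(pkg)
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write("SOURCE = 'local-checkout'\n")
    sys.path.insert(0, d)
    import mypkg
    print(mypkg.SOURCE)  # prints: local-checkout
```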

To test my image I have used

docker run \
  -e WANDB_API_KEY=<my api key> \
  -e WANDB_DOCKER="my_modulus:latest" \
  --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  --runtime nvidia \
  -v <my path to modulus-launch>:/modulus-launch \
  -v <my path to modulus>:/modulus \
  -v <my path to modulus-sym>:/modulus-sym \
  -v <my path to my dataset>:/datasets/ \
  -v <my path to my workspace>:/workspace/ \
  -it --rm my_modulus:latest \
  --project <project name> --entity <my entity>

This works fine and I get a job created on wandb.

I have set up a docker queue with this config:

env:
  - PYTHONPATH=/modulus/:/modulus-sym/:/modulus-launch/
gpus: all
volume:
  - <local path>:/modulus-launch
  - <local path>:/modulus
  - <local path>:/modulus-sym
  - <local path>:/datasets/
  - <local path>:/workspace/
builder:
    base_image: my_modulus:latest

I have tried with and without the builder.
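For what it's worth, the queue config keys map fairly directly onto the docker run flags the agent logs later (`--env`, `--gpus`, `--volume`). This hypothetical helper just illustrates that mapping; it is not the agent's actual code:

```python
# Hypothetical sketch: how env/gpus/volume entries in a docker queue config
# become docker run flags like those seen in the agent's logged command.
def docker_args(config):
    args = []
    for e in config.get("env", []):
        args += ["--env", e]
    if "gpus" in config:
        args += ["--gpus", config["gpus"]]
    for v in config.get("volume", []):
        args += ["--volume", v]
    return args

cfg = {
    "env": ["PYTHONPATH=/modulus/:/modulus-sym/:/modulus-launch/"],
    "gpus": "all",
    "volume": ["<local path>:/modulus-launch"],
}
print(docker_args(cfg))
```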

When launching the job from the website I use these options:

{
    "args": [
        "<my project>",
        "<my entity>"
    ],
    "run_config": {
        "epochs": 25,
        "ckpt_path": "/workspace/checkpoints_training_6"
    }
}
This is to test changing the number of epochs and saving the checkpoints to a different folder.
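The run_config overrides reach the container as a JSON string in the WANDB_CONFIG environment variable (it is visible in the docker run command logged below); wandb.init() normally consumes it for you. A minimal sketch of the payload format, using the values from this launch:

```python
import json
import os

# The launch agent passes run_config overrides to the container via the
# WANDB_CONFIG env var as JSON; wandb.init() reads this automatically.
# Simulate the value seen in the agent's logged docker run command:
os.environ["WANDB_CONFIG"] = (
    '{"epochs": 25, "ckpt_path": "/workspace/checkpoints_training_6"}'
)
overrides = json.loads(os.environ["WANDB_CONFIG"])
print(overrides["epochs"], overrides["ckpt_path"])
```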

The error I get is this

wandb: launch: Launching run in docker with command: docker run --rm -e WANDB_BASE_URL= -e WANDB_API_KEY -e WANDB_PROJECT=<project> -e WANDB_ENTITY=<my entity> -e WANDB_LAUNCH=True -e WANDB_RUN_ID=7dyx6mlk -e WANDB_USERNAME=<my username> -e WANDB_CONFIG='{"epochs": 25, "ckpt_path": "/workspace/checkpoints_training_6"}' -e WANDB_ARTIFACTS='{"_wandb_job": "<entity>/<project>/job-<job name>"}' --env PYTHONPATH=/modulus/:/modulus-sym/:/modulus-launch/ --gpus all --volume <local path>:/modulus-launch --volume <local path>:/modulus --volume <local path>:/modulus-sym --volume <local path>:/datasets/ --volume <local path>:/workspace/  <>
Traceback (most recent call last):
  File "<>/examples/cfd/vortex_shedding_mgn/", line 20, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
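To narrow down an error like this, a quick check run inside the launched container (with the same PYTHONPATH) can confirm which modules the interpreter actually sees. This is a hypothetical helper, not part of my setup:

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if `name` can be imported in the current interpreter."""
    return importlib.util.find_spec(name) is not None

# Run inside the container the agent launched to see whether torch
# (and the mounted modulus trees) are importable there:
print(module_available("torch"))
```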

Looking at the documentation, it isn’t clear to me how to do this correctly. If anyone has advice on using a docker image similar to the one I have created above, I would be grateful.



Hey @limitingfactor - in your setup, are you explicitly installing PyTorch anywhere? We don’t install it automatically, so I just wanted to check what happens if you explicitly install it beforehand or include it in your setup.

Hi @uma-wandb, I am using the NVIDIA Modulus image as a starting point, which already has PyTorch installed in it.

hey @limitingfactor - were you able to get this up and running? have you run into any further issues?