Agent build docker image failing to find file paths

Hi, I’m fairly new to using wandb and have chosen it for its great documentation! However, one area that has been confusing to me is using an agent to build a docker image.

Currently I’m logging my source code and files to wandb using run.log_code(), as outlined in the Python SDK here. My plan is to have the agent computer download the logged files, compile them into a docker image, and execute the training.

Where I’m confused is how to instruct the agent computer to download the logged files. With regard to the wandb docs, all I’m instructed to do is “add docker as a builder type”. When run, docker on the agent computer fails to build the docker image because it “cannot find the file paths specified” in the docker file. Based on this error and how fast docker claims to have downloaded all the artifacts from wandb, I’m unsure if the logged files are being downloaded in the first place.

I’d appreciate any insight, examples, or advice! I’m very confused as to where to go from here as I’ve exhausted the wandb docs. Thanks!

Hi @19kdc3 , thank you for reaching out and happy to help. Will review your question and circle back this week.

Hi @19kdc3 , following back up on this thread. I’ve had a chance to review and have the following feedback:

It sounds like you’re trying to integrate several components of the W&B ecosystem: logging code, building Docker images with an agent, and running training jobs.

  1. Logging Code: You’re using run.log_code() to log your source code to W&B.
  2. Using an Agent to Build a Docker Image: You want to use a W&B agent to build a Docker image that includes the code you’ve logged. The agent should download the logged files and use them to build the Docker image.
  3. Building the Docker Image: The Docker build process requires a Dockerfile that specifies how to build the image. This Dockerfile must be accessible to the Docker daemon on the agent machine. The error you’re encountering suggests that the Docker daemon cannot find the files specified in the Dockerfile.

Here’s what you need to ensure for the process to work correctly:

  • Dockerfile Location: Make sure that the Dockerfile is included in the files you log with run.log_code() or is otherwise accessible to the agent.
  • Artifact Download: Before building the Docker image, the agent needs to download the artifacts (which include your code) from W&B. This is typically done in the script that the agent runs before executing the docker build command.
  • Correct Paths: The Dockerfile should reference files using the paths where the agent downloads them. If the Dockerfile references files with incorrect paths, the build process will fail.
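One rough way to check the paths before building is to compare the Dockerfile’s COPY/ADD sources against the directory used as the build context. The sketch below is only an illustration: the function name is made up, and it assumes simple one-line instructions (no line continuations, no JSON array form, no paths with spaces).

```python
from pathlib import Path


def missing_copy_sources(dockerfile, context_dir):
    """List COPY/ADD sources that do not exist under the build context.

    Simplified: ignores line continuations, the JSON array form, and
    paths containing spaces.
    """
    context = Path(context_dir)
    missing = []
    for raw in Path(dockerfile).read_text().splitlines():
        parts = raw.split()
        if not parts or parts[0].upper() not in ("COPY", "ADD"):
            continue
        flags = [p for p in parts[1:] if p.startswith("--")]
        if any(f.startswith("--from=") for f in flags):
            continue  # copied from another build stage, not from the context
        args = [p for p in parts[1:] if not p.startswith("--")]
        if len(args) < 2 or args[0].startswith("["):
            continue
        for src in args[:-1]:  # the last argument is the destination
            if "://" in src:
                continue  # ADD can take a URL
            rel = src.lstrip("/")  # a leading slash still means "inside the context"
            if any(ch in rel for ch in "*?["):
                found = any(context.glob(rel))
            else:
                found = (context / rel).exists()
            if not found:
                missing.append(src)
    return missing
```

Calling missing_copy_sources("Dockerfile.wandb", artifact_dir) and getting back a non-empty list would point to the same kind of “cannot find the file paths specified” failure before docker is ever invoked.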

Here’s a simplified example of what the agent’s script might look like:

import wandb

# Initialize a W&B run
run = wandb.init()

# Use the artifact logged in a previous run
artifact = run.use_artifact('your_project/your_artifact:latest')

# Download the artifact to the local filesystem
artifact_dir = artifact.download()

# Now that you have a local directory with your code, you can build a Docker image.
# COPY sources in a Dockerfile are resolved relative to the build context,
# so use artifact_dir as the build context and reference files relative to it.
# For example, in your Dockerfile you might have:
# COPY . /app

# Build the Docker image using the Dockerfile and the downloaded artifact.
# You would typically run a shell command like:
# docker build -t your_image_name -f /path/to/Dockerfile /path/to/artifact_dir
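To make the last step concrete, here is a minimal sketch of running the build from Python with the downloaded directory as the build context. The helper names and the default Dockerfile name are assumptions for illustration; only the docker build flags -t and -f are standard.

```python
import subprocess
from pathlib import Path


def docker_build_command(context_dir, image_tag, dockerfile_name="Dockerfile"):
    """Build the argv for `docker build`, using context_dir as the build context."""
    context = Path(context_dir)
    return [
        "docker", "build",
        "-t", image_tag,
        "-f", str(context / dockerfile_name),  # Dockerfile inside the artifact dir
        str(context),                          # build context: COPY paths resolve here
    ]


def build_image(context_dir, image_tag, dockerfile_name="Dockerfile"):
    """Run the build; raises CalledProcessError if docker exits non-zero."""
    subprocess.run(
        docker_build_command(context_dir, image_tag, dockerfile_name),
        check=True,
    )
```

With the script above, this would be called as build_image(artifact_dir, "your_image_name") after artifact.download() returns.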

If you’re using a W&B launch agent, you would specify the builder type as Docker in your launch-config.yaml file, and the agent would handle the building process according to the configuration you provide.

Make sure that:

  • The agent has access to the W&B API key to download artifacts.
  • The Dockerfile is correctly set up to use the files from the downloaded artifact directory.
  • The paths in the Dockerfile match the structure of the downloaded artifact directory.

Do let me know if you have additional questions.

Hi @mohammadbakir, thanks for getting back to me with a detailed response!

You have my goal exactly right. I’ve followed everything you mentioned so far and believe my issue lies in your second bullet point regarding “Artifact Download”, specifically downloading all the artifacts before building the docker image. Thank you for providing your simplified agent script! Where do I specify the name of this script for the launch agent to run before building the docker image?

Thanks!

Since this agent script, which downloads the files, will live in the same online wandb artifact as the other files, how will it be downloaded and run? Ideally I don’t want this script file to live on the agent computer, as I’d need to manually update the artifact = run.use_artifact('your_project/your_artifact:latest') line somewhat often.

I’m confused about a couple of things the CLI reports about “downloading artifacts” that hopefully you can provide some insight into:

  1. I have a Dockerfile.wandb file that exists in the parent directory of the wandb artifact, and I can confirm that the agent computer automatically downloads it and begins to build the docker image. So are there already some pre-existing instructions for the launch agent to automatically download the dockerfile at least? Can these be modified to download all the other files as well?
  2. Related to question 1: in the CLI, as the launch agent begins execution, it reads the correct number of total files and total folder size (~75MB) from the wandb artifact, but says it downloaded them in 0.7 seconds, which is suspiciously fast. What is going on here?

Hi @mohammadbakir , wanted to follow up on this. Thanks!

Hi @19kdc3, my apologies for the long delay here. We’ve had an issue with our tracking systems that prevented tickets from syncing and this happened to fall through the cracks while we recovered.

The wandb agent can’t be programmed to download other files. You should run your code with log_code, which will get you a Job in the project. You can then set up a queue with your docker args and run agents on each machine, where each machine must have docker installed and running. Then you can queue up jobs and the agent will build an image automatically. You can either add logic to download any files into the container after the build or, if the files will exist on your agent machine, set up the dockerfile ahead of time to copy files from a certain directory.
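As a rough sketch of the “download into the container post build” option, the training entrypoint could fetch its inputs when the container starts, taking the artifact name from an environment variable so nothing has to be edited on the agent machine. TRAIN_ARTIFACT, the default artifact name, and the /app/data path are placeholders, not anything W&B defines.

```python
import os


def resolve_artifact_name(env=None, default="your_project/your_artifact:latest"):
    """Pick the artifact to download; TRAIN_ARTIFACT is a hypothetical variable."""
    env = os.environ if env is None else env
    return env.get("TRAIN_ARTIFACT", default)


def download_inputs(root="/app/data"):
    """Download the artifact inside the running container and return its path."""
    import wandb  # imported here so the helper above works without wandb installed

    run = wandb.init()
    artifact = run.use_artifact(resolve_artifact_name())
    return artifact.download(root=root)
```

The training script would call download_inputs() first, so the files only need to exist at run time rather than at image build time.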

As to your second point, this suggests the files already existed in the cache. We have a cache folder (defined by the WANDB_CACHE_DIR env variable) and an artifacts folder that gets created in the wandb directory. The artifacts folder is where your artifacts are placed when you call artifact.download() (unless you specify a download root, in which case the artifacts are downloaded to that root). When you try to download the files in your artifact:

  1. We first check whether the files already exist in the final download location, i.e., the artifacts folder or your root. If they do, we don’t download them again.
  2. If the files don’t exist in your download location, we check the cache. If they exist in the cache, we just copy them from the cache to the final download location.
  3. If the files don’t exist in the cache, we download them to the cache first, and then copy all the downloaded files from the cache to the final download location.
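The three steps above can be pictured with a small stand-alone sketch. This is not the wandb implementation: the function, the name-based cache layout, and the remote_fetch callback are all made up to illustrate the lookup order.

```python
import shutil
from pathlib import Path


def fetch_file(name, download_dir, cache_dir, remote_fetch):
    """Return (path, source); source is 'download_dir', 'cache', or 'remote'."""
    target = Path(download_dir) / name
    cached = Path(cache_dir) / name

    # Step 1: already in the final download location, nothing to do.
    if target.exists():
        return target, "download_dir"

    # Step 3 (only when step 2 would miss): populate the cache from the server.
    source = "cache"
    if not cached.exists():
        cached.parent.mkdir(parents=True, exist_ok=True)
        cached.write_bytes(remote_fetch(name))
        source = "remote"

    # Steps 2 and 3 both end by copying from the cache to the download location.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(cached, target)
    return target, source
```

A download that only ever hits the first two branches never touches the network, which would be consistent with ~75MB appearing to “download” in 0.7 seconds.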

It seems the files already exist in the cache, so we copy them over to the download directory. To test whether your cache is working, you should:

  • run wandb artifact cache cleanup <specify target size, e.g. 50GB>
  • delete your artifacts folder (wandb/artifacts where your artifacts are getting downloaded)

Then run your download again. The download should take longer than 0.7 seconds.
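If it helps, the download can be timed outside of the launch agent with a few lines of Python. This is only a sketch: the helper names are made up, and the artifact name you pass in is whatever your project actually uses.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def time_artifact_download(artifact_name):
    """Download an artifact through the public API and report how long it took."""
    import wandb  # imported here so `timed` can be used on its own

    artifact = wandb.Api().artifact(artifact_name)
    path, seconds = timed(artifact.download)
    print(f"downloaded to {path} in {seconds:.1f}s")
    return seconds
```

Running time_artifact_download("your_project/your_artifact:latest") once after clearing the cache and once more right after should show a clear difference if the cache is doing its job.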

Hi @19kdc3 , just wanted to follow back up on this to see if you’ve had any questions on the above?

Hi @mohammadbakir, thanks for following up!

I’m currently working on this, but wandb seems to be giving me an error when I try to view a specific job:


Before the “no access” issue came up I noticed a few things, though my testing is paused at the moment due to this error.

  1. My cache seemed to be stuck at some previous state and was not downloading my new artifacts, but I believe this was fixed by my second point.
  2. I didn’t see a WANDB_CACHE_DIR in my Windows environment variables, nor could I find the default cache folder ~/.cache/wandb. But after manually creating the environment variable and pointing it to a specific folder, I now see it creating the “cache” folder with the correct file size.

I’m currently going back and forth between this issue and our other discussion on GitHub.

@19kdc3 , if you authored the run/job you should have no problem viewing it. If you are part of a team where you are not an admin, and did not author the run/job, then you won’t be able to access or update anything associated with the job. Could you provide a link to this job so I can check it? You may keep it private as I will have support access.

Regarding WANDB_CACHE_DIR on Windows: we cache all data in your home directory by default, but it seems Windows might be mounting the tmp directory in a different location? The environment variable overrides the default, so it seems to be working now? I will need to get access to a Windows machine to test/reproduce the cache behavior and circle back.
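In the meantime, a small script along these lines can confirm what the process actually sees for WANDB_CACHE_DIR and how much data sits under that folder. The helper names are made up, and no default location is assumed when the variable is unset.

```python
import os
from pathlib import Path


def cache_dir_from_env(env=None):
    """Return the WANDB_CACHE_DIR override as a Path, or None if it is not set."""
    env = os.environ if env is None else env
    value = env.get("WANDB_CACHE_DIR")
    return Path(value) if value else None


def dir_size_bytes(path):
    """Total size of all regular files under path."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())


if __name__ == "__main__":
    cache_dir = cache_dir_from_env()
    if cache_dir is None:
        print("WANDB_CACHE_DIR is not set; the default location is in use")
    elif cache_dir.exists():
        print(f"{cache_dir}: {dir_size_bytes(cache_dir) / 1e6:.1f} MB")
    else:
        print(f"{cache_dir} does not exist yet")
```

Comparing the reported size with the ~75MB artifact after a download is a quick check that the override is being picked up.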

I did respond to the GitHub issue as well. To verify, the only open issues right now in this thread are:

  1. Check why you are getting permissions errors with a job
  2. Check where Windows is mounting the tmp directory for caching

Hi @19kdc3 , following up on my response. Are you still experiencing issues with permissions when viewing your jobs? I unfortunately couldn’t get access to a Windows machine to try to reproduce the cache behavior you mentioned, but as mentioned in my last response, it seems setting the directory explicitly resolved any problems you encountered?

Hi @mohammadbakir , thanks for following up! The “no access permission” error seems to have fixed itself. I took a break and came back to it working again… strange.

Yes, setting the cache directory manually as an environment variable in Windows solved the issue of not being able to find the cache folder.

I think this topic can be closed for now, as the issues specified here “appear” to be resolved. I say appear because I’m encountering the docker issue we’re discussing on GitHub, which may be overshadowing this issue at the moment.

Hi @19kdc3 , we can continue the conversation on GitHub regarding the discrepancies in how launch agents use uploaded dockerfiles. Will mark this closed.

Sounds great. Thanks for all the help @mohammadbakir !