Launch-agent crash without trace or error log

Hi Wandb,

Goal:
Run a launch job from the wandb website to a launch-agent. The launch-agent builds a Docker image from downloaded artifacts and executes a training run in an environment inside that image.

Issue:
The launch-agent “finishes” a launch job without any indication of a failure or crash. However, the logged run on the wandb website shows no logged results and has the state “crashed”. The Dockerfile’s entry point script can, however, run successfully on its own, separate from launch, logging data to wandb and actually completing the run. The “error.log”, “debug.log”, and “debug-internal.log” files show no indication of any errors either.

Dockerfile.wandb:

FROM python:3.8.12
WORKDIR /launch/Development
RUN apt-get update && apt-get install -y libgl1-mesa-glx

ENV WANDB_API_KEY=...

# copy requirements file to build environment
COPY Development/requirements.txt /launch/Development/requirements.txt

# install python dependencies
RUN pip install -r requirements.txt
RUN pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117

# copy the remaining project files
COPY . /launch 

ENTRYPOINT ["python", "learn_manual.py"]

learn_manual.py:

import tempfile
from pathlib import Path

import wandb
from wandb.integration.sb3 import WandbCallback
from stable_baselines3 import SAC
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv
# project-specific imports (PackCoordinator, TrainLogging, Evaluate, CustomCNN, ai_set) omitted

class Agent:
    def __init__(self, config, project, policy_kwargs):
        self.config = config
        self.policy_kwargs = policy_kwargs
        self.temp_dir = None
        
        #create temp working directory for model.zip
        self._create_temp_folder()
        
        #initialize wandb
        self.run = wandb.init(project=project, mode="online", config=self.config, sync_tensorboard=True)

        #generate environment
        self.env = DummyVecEnv([self._make_train_env]) #vectorize environment
        # self.env = make_vec_env(self._make_train_env, n_envs=2) #trying multiple environments

        #generate model
        self.model = SAC(
            policy=wandb.config.policy_type,
            env=self.env, 
            policy_kwargs=self.policy_kwargs, 
            # tensorboard_log=self.files_dict['log_dir'], 
            # action_noise= NormalActionNoise(mean=np.zeros(self.env.action_space.shape[-1]), sigma=0.1*np.ones(self.env.action_space.shape[-1])) if wandb.config.action_noise == True else None,
            # learning_starts=wandb.config.learning_starts,
            # learning_rate=wandb.config.learning_rate, 
            buffer_size=wandb.config.buffer_size,
            batch_size=wandb.config.batch_size
        )

        return

    def _make_train_env(self):
        #create environment
        env = PackCoordinator(skip_topo=True).create_env(env_render_mode="none")

        #wrap environment
        env = Monitor(env)
        env = TrainLogging(env)

        return env

    def _create_temp_folder(self):
        self.temp_dir = tempfile.TemporaryDirectory()
        
        return

    def train(self):
        '''
        Train and save the model
        '''
        self.model.learn(
            total_timesteps=wandb.config.total_timesteps, 
            log_interval=wandb.config.log_interval, 
            reset_num_timesteps=True, 
            progress_bar=True, 
            callback=WandbCallback(
                model_save_path=Path(self.temp_dir.name) #upload saved model file to wandb on completion
                )
            )
        
        self.model.save(Path(self.temp_dir.name)) #save model.zip to temp folder directory

        return
    
    def evaluate(self, eval_eps, num_samples):
        eval_run = Evaluate(self.model)
        eval_run.evaluate(eval_eps, num_samples)
        
        return
    
    def end_run(self):
        self.run.finish()

        self.temp_dir.cleanup() #clear the temp directory for uploaded model.zip

        return
    
config = {
    "log_interval": 1,
    "algo": "sac", #sac, td3, ppo, a2c
    "policy_type": "CnnPolicy", #MultiInputPolicy, CnnPolicy Note: Number of channels in ai_settings.py obs_shape will need to be adjusted
    "total_timesteps": 256, #less than 100 steps, train metric won't log
    # "learning_starts": 100,
    "learning_rate": 0.0003,
    "buffer_size": 256,
    "batch_size": 64
}

policy_kwargs = dict(
    features_extractor_class = CustomCNN,
    features_extractor_kwargs = dict(features_dim=ai_set.grid_mapper_dim[0]),
    normalize_images = False
)

sac_model = Agent(project="test", config=config, policy_kwargs=policy_kwargs) #initialize the model to be trained
sac_model.train() #train the model
sac_model.evaluate(eval_eps=20, num_samples=5) #evaluate the trained model
sac_model.end_run() #end the wandb run

Hi

The core issue seems to be a disconnect between how your training script works locally and how it behaves within the WandB Launch Agent and Docker container. Just to clarify, this is what you are seeing:

  • Successful Local Run: When you run learn_manual.py directly, everything functions correctly. This would indicate that your training script, dependencies, and environment setup (outside of Docker) are solid.
  • Crashed Launch Job: The “crashed” state in W&B suggests that the training process within the Docker container either terminated abruptly or encountered an error that didn’t get explicitly logged.

Let’s start by trying some debugging strategies. We can enhance the logging by adding more detail to your learn_manual.py script to capture:

  • Environment Variables: Ensure WANDB_API_KEY and any other necessary variables are correctly set within the Docker container. Use print(os.environ) to examine the environment.
  • Training Progress: Log metrics, iteration numbers, and other progress indicators frequently. This will help pinpoint where the crash occurs.
  • Exceptions: Wrap your training code in a try-except block to catch and log any exceptions that may not be showing up in standard logs.

We can also do some Docker image inspection.

  • Interactive Shell: Launch your Docker container with docker run -it <your_image_name> bash. This gives you an interactive shell within the container to check file permissions, installed packages, and environmental variables.
  • Run Script Directly: Once inside the container, try running python learn_manual.py directly (without WandB Launch). This can isolate whether the issue is specific to the WandB integration.

We should also check a couple of things in your W&B Launch agent configuration.

  • Resource Limits: Ensure the agent has sufficient resources (CPU, memory) to handle the training process.
  • Job Timeout: Check if there’s a timeout setting in your launch configuration. If the training takes longer than expected, the job might be prematurely terminated.

Modified Dockerfile Example:

FROM python:3.8.12
WORKDIR /launch/Development

# ... (rest of your Dockerfile)

# use -u for unbuffered output
ENTRYPOINT ["python", "-u", "learn_manual.py"]

The -u flag in the ENTRYPOINT ensures that log messages are sent to stdout immediately rather than being buffered, which helps with debugging.
  
Python
import os
import logging
import wandb

# ... (rest of your code)

def train(self):
    logging.basicConfig(level=logging.INFO)  # set the logging level

    logging.info("Starting training...")
    logging.info(f"Environment variables: {os.environ}")  # check that the WandB API key is set

    try:
        # ... (your training logic)
        pass
    except Exception as e:
        logging.error(f"Exception during training: {e}")
        raise  # re-raise the exception so the failure is surfaced to W&B

Let’s start here and see if we can find more clues as to what might be going on. Let me know how it goes, and I look forward to working with you!

Best,
Jason

Hi @jason-arkens17,

Thanks so much for your detailed response and examples! They were extremely helpful.

I’ve implemented all your suggestions (except for the “Docker image inspection” which I’m currently working on). Here are the outcomes of each point:

Environment Variables: Ensure WANDB_API_KEY and any other necessary variables are correctly set within the Docker container. Use print(os.environ) to examine the environment.

I confirmed through this that an API key is set and that the key is verified with my account. So I think we can rule this out.

Training Progress: Log metrics, iteration numbers, and other progress indicators frequently. This will help pinpoint where the crash occurs.

I currently log every episode, and each episode contains 3 steps. Additionally, I can see my print statements in the terminal window, so I can pinpoint from there where it crashes.

Exceptions: Wrap your training code in a try-except block to catch and log any exceptions that may not be showing up in standard logs.

Added this to the “train” function in learn_manual.py, as per your sample code. No errors have turned up from the “except” yet.

Resource Limits: Ensure the agent has sufficient resources (CPU, memory) to handle the training process.

As far as I know it should have enough resources. I haven’t changed any defaults in Docker, so I believe it has access to as many resources as it needs.

Job Timeout: Check if there’s a timeout setting in your launch configuration. If the training takes longer than expected, the job might be prematurely terminated.

I can’t rule this out yet, as the code seems to keep failing in the same spot (and after the same duration). I tried modifying the Docker ENTRYPOINT to ["timeout", "300", "python", "-u", "learn_manual.py"], knowing that my script shouldn’t take longer than 5 minutes to run (it crashes about 18 seconds from start, 10 seconds from when the docker run command is executed). This didn’t have any effect.

What I’ve discovered:
By littering my script with print statements (logging.info stopped working after the first run for some reason and no longer prints anything to the terminal), I’ve determined that the script “crashes” somewhere in the self.model.learn() function (a function from the Stable Baselines3 Python package I’m using for model training), called in the train(self) function of my originally posted learn_manual.py script. These are all the leads I have at the moment.

I’m going to litter the self.model.learn() function with print statements so I can further pinpoint the issue. I’m still not sure why I’m not seeing any error messages in the Docker terminal window like I do in the VS Code terminal.
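As one way to do that without editing the library itself, I’m considering a small Stable Baselines3 callback that prints (and flushes) on every step, so the last line visible in the Docker logs marks roughly where learn() dies. A rough sketch (the PrintProgressCallback name is just a placeholder):

from stable_baselines3.common.callbacks import BaseCallback

class PrintProgressCallback(BaseCallback):
    def _on_step(self) -> bool:
        # flush=True so the output reaches the container logs even if the process dies abruptly
        print(f"learn() step {self.num_timesteps}", flush=True)
        return True  # returning False would stop training early

# usage: pass it alongside the existing WandbCallback, e.g.
# self.model.learn(..., callback=[WandbCallback(...), PrintProgressCallback()])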

Edited: model.learn() is a function in the Stable Baselines3 Python package, not wandb.

Update:
Following your advice to interact with and debug through Docker, I was able to view the logs, which show that the learn() function runs for ~10 iterations (a different number each time) before crashing. Additionally, looking at the inspect tab, I’ve determined that the container is exiting with exit code 139 (128 + 11, i.e. the process is being killed by SIGSEGV, a segmentation fault). I tried implementing the .wslconfig file, found here, with kernelCommandLine = vsyscall=emulate added, but still no change. Will update further once I can solve this exit code.
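In the meantime I’m also going to try enabling Python’s built-in faulthandler near the top of learn_manual.py, since a segfault kills the process before any try/except can run, but faulthandler can still dump the Python traceback of the crashing thread to stderr. A minimal sketch (assuming the crash isn’t happening entirely inside native code with no Python frame to report):

import faulthandler
import sys

# on SIGSEGV (and SIGFPE/SIGABRT/SIGBUS/SIGILL), dump the Python traceback of all threads to stderr
faulthandler.enable(file=sys.stderr, all_threads=True)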

Sounds perfect! Let me know how things go :slightly_smiling_face:

Thanks to Jason for directing me to inspect the Docker image. From the Docker Desktop app I was able to inspect and log the script as it ran, where the errors were more detailed than what showed up in the output logs.
The issue was entirely resolved in my Dockerfile.wandb file, where I added a few items:

  1. Installing gdb (via a RUN instruction) to get more detailed error output.
  2. Installing xvfb (via a RUN instruction); xvfb turned out to be the missing dependency my script needed to render my environment (my script logs images of the environment).
  3. The final key was a modified ENTRYPOINT: ENTRYPOINT ["sh", "-c", "xvfb-run python my_learning_script.py"]

That is excellent and awesome to hear! I am so glad we were able to help solve this one for you. Hope you have a great rest of the week and don’t hesitate to reach out in the future if you ever need more help :slightly_smiling_face: