Trying to access model checkpoint raises wandb.errors.CommError

Hey community!

I’m working on an ML pipeline using PyTorch Lightning and W&B.

I’m trying to access model checkpoints that were pushed into W&B using lightning.pytorch.loggers.wandb.WandbLogger. The checkpoints are correctly shown in the W&B web UI, as well as locally on the machine that was used for training.

When I try to access the model checkpoint as instructed here and here, using either wandb.Api() or lightning.pytorch.loggers.wandb.WandbLogger respectively, both attempts fail with a wandb.errors.CommError stating:

wandb.errors.CommError: It appears that you do not have permission to access the requested resource. Please reach out to the project owner to grant you access. If you have the correct permissions, verify that there are no issues with your networking setup.(Error 404: Not Found)

I’m working on a W&B team project that was set up by our administrator, so I’m wondering whether there’s something on the permissions side that the admin has to configure. I also tried accessing the checkpoint via wandb.init() in offline mode to keep my trial runs from clogging our W&B project space, but downloading artifacts in offline mode doesn’t seem to be supported.

Here’s an example code snippet:

#!/usr/bin/env python3

from lightning.pytorch.loggers import WandbLogger
import wandb

if __name__ == "__main__":
    user = "<TEAM-NAME>" # Note, this is different to my W&B user name
    project = "<PROJECT-NAME>"
    run_name = "<RUN-NAME>"
    alias = "<ALIAS>"

    checkpoint_reference = f"{user}/{project}/{run_name}:{alias}"

    ## DOWNLOAD ARTIFACT

    # use wandb.Api()
    # https://docs.wandb.ai/ref/python/public-api/api
    api = wandb.Api()
    artifact = api.artifact(checkpoint_reference, type="model")  # PERMISSION DENIED!

    # use WandbLogger
    # https://docs.wandb.ai/guides/integrations/lightning#model-checkpointing
    wandb_logger = WandbLogger(project=project, save_dir="", offline=True)
    wandb_logger.download_artifact(
        checkpoint_reference, artifact_type="model"
    )  # PERMISSION DENIED!

    # trial - use wandb.init()
    # https://docs.wandb.ai/guides/integrations/lightning#model-checkpointing
    run = wandb.init(mode="offline")
    artifact = run.use_artifact(
        checkpoint_reference,
        type="model",
    )  # -> CAN'T `run.use_artifact()` IN OFFLINE MODE!
    artifact_dir = artifact.download()

This is a reply from our support bot designed to assist you with your WandB-related queries.
To reach a human please reply to this message.

It seems like you’re encountering a wandb.errors.CommError due to permission issues when trying to access model checkpoints in a Weights & Biases (W&B) team project. Let’s address your sub-queries to help resolve the issue:

How to resolve wandb.errors.CommError when accessing model checkpoints?

The CommError you’re experiencing suggests a permissions issue. Ensure that you have the necessary permissions to access the artifacts in the W&B project. If you’re part of a team project, the project administrator may need to adjust your permissions to allow access to the artifacts. Reach out to your W&B project administrator to confirm your access level.

What permissions are required to access model checkpoints in a W&B team project?

In a W&B team project, you need to have at least “Viewer” permissions to access the project’s artifacts, including model checkpoints. However, to download or manipulate artifacts, “Contributor” or higher permissions might be required. Again, your project administrator can adjust these permissions in the project settings on the W&B web UI.

Is it possible to download artifacts in offline mode with W&B?

No, downloading artifacts in offline mode is not supported by W&B. The offline mode is primarily designed for logging data without an immediate internet connection, and the data is synced the next time you connect to W&B online. To download artifacts, you need to be in an online mode and have appropriate permissions to access those artifacts.
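
As far as the public API goes, a download does not need a run at all: wandb.Api() fetches artifacts without calling wandb.init(), so nothing new should appear in the project. A minimal sketch, assuming you are logged in and online. The helper names split_artifact_path and download_without_run and the "artifacts" default directory are made up for illustration, and the path passed in must be the full artifact name as shown in the W&B UI:

```python
from pathlib import Path


def split_artifact_path(path):
    """Split '<entity>/<project>/<name>:<alias>' into its four parts."""
    parts = path.split("/")
    if len(parts) != 3 or ":" not in parts[2]:
        raise ValueError(
            f"expected '<entity>/<project>/<name>:<alias>', got {path!r}"
        )
    entity, project, name_alias = parts
    name, alias = name_alias.rsplit(":", 1)
    if not all((entity, project, name, alias)):
        raise ValueError(f"empty component in artifact path {path!r}")
    return entity, project, name, alias


def download_without_run(path, root="artifacts"):
    """Download an artifact through the public API (hypothetical helper)."""
    split_artifact_path(path)  # fail early on a malformed path

    import wandb  # imported lazily so the helpers above work without wandb

    # wandb.Api() reads the credentials stored by `wandb login`.
    artifact = wandb.Api().artifact(path, type="model")
    return Path(artifact.download(root=root))
```

The format check only catches malformed paths. A well-formed path that points at a non-existent artifact name would still fail on the server side.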

How to integrate PyTorch Lightning with W&B for model checkpointing?

To integrate PyTorch Lightning with W&B for model checkpointing, you can use the WandbLogger from PyTorch Lightning, which logs metrics to your W&B project and, when its log_model option is set, uploads model checkpoints as artifacts. Here’s a simplified example:

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.callbacks import ModelCheckpoint

# Initialize W&B logger; log_model="all" uploads checkpoints as W&B artifacts during training
wandb_logger = WandbLogger(project="<PROJECT-NAME>", entity="<TEAM-NAME>", log_model="all")

# Model checkpointing
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints",
    filename="{epoch:02d}-{val_loss:.2f}",
    save_top_k=3,
    verbose=True,
    monitor='val_loss',
    mode='min'
)

# Initialize Trainer
trainer = Trainer(logger=wandb_logger, callbacks=[checkpoint_callback])

# Train model (model, train_dataloader and val_dataloader are assumed to be defined elsewhere)
trainer.fit(model, train_dataloader, val_dataloader)

This setup logs your training metrics and model checkpoints to W&B. Ensure you replace <PROJECT-NAME> and <TEAM-NAME> with your actual project and team names.

In summary, to resolve your issue, check your permissions, ensure you’re in online mode when trying to download artifacts, and follow the best practices for integrating PyTorch Lightning with W&B. If you continue to face issues, consider reaching out to W&B support for more personalized assistance.

-WandBot :robot:

Thanks for the (automated) instructions.

I checked our W&B team page membership roles at a URL similar to https://wandb.ai/<TEAM-NAME>/members, and my role appears as Member.

According to the Teams Artifacts Docs, a `Team Member` should have permission to download artifacts.

As such, it would appear to me that the membership roles are correct?

Also, guided by the earlier post, I tried adding entity=user when initializing the WandbLogger, but this didn’t change the behavior.

wandb_logger = WandbLogger(entity=user, project=project, save_dir="", offline=True)

Not sure if I should Reply to this message explicitly to get human assistance, so just in case …

I also tried the wandb CLI as a workaround, but I get a similar-looking error in response:

Error: It appears that you do not have permission to access the requested resource. Please reach out to the project owner to grant you access. If you have the correct permissions, verify that there are no issues with your networking setup.(Error 404: Not Found)

Here’s the code snippet:

#!/bin/bash

user="<TEAM-NAME>"
project="<PROJECT-NAME>"
run_name="<RUN-NAME>"
alias="<ALIAS>"

checkpoint_reference="$user/$project/$run_name:$alias"
echo "Downloading $checkpoint_reference"

wandb artifact get "$checkpoint_reference"

Hey @01tot10! Thank you for writing in and apologies for the delay in reply. Could you please send me a link to the checkpoint you are interested in loading?

Could you also send me a filled-out code snippet of the actual attempt to access the checkpoint, with the entity, project name, etc.?

Hey @artsiom!

Thanks for getting back on the topic, and no worries regarding the delay! I was working around the problem by pulling the weights directly from the machine that did the training, which bypasses having to use the wandb APIs.

It’s a bit embarrassing to say, but compiling yet another minimal example for you to debug and looking at the issue with fresh eyes actually led me to resolve the problem!

Basically, it seems that each artifact type adds a short prefix to the run name to form the artifact name, so the bare <RUN-NAME> is not a valid artifact name. More specifically, assume a run that was logged via the PyTorch Lightning WandbLogger using:

wandb_logger = WandbLogger(
        entity="<TEAM-NAME>",
        project="<PROJECT-NAME>",
        version="<RUN-NAME>",
)

which can be accessed from the wandb web UI as:
https://wandb.ai/<TEAM-NAME>/<PROJECT-NAME>/runs/<RUN-NAME>.

To access the models from the aforementioned run, the checkpoint_reference shouldn’t be

<TEAM-NAME>/<PROJECT-NAME>/<RUN-NAME>:<ALIAS>

as originally proposed, but rather:

<TEAM-NAME>/<PROJECT-NAME>/model-<RUN-NAME>:<ALIAS>

Thus, changing checkpoint_reference in the original code snippet to

checkpoint_reference = f"{user}/{project}/model-{run_name}:{alias}"

fixes the issue…
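
The naming rule can be wrapped in a small helper so the prefix is not forgotten again. This is only a sketch, assuming the checkpoints were logged by WandbLogger and therefore follow the model-<RUN-NAME> pattern described above. checkpoint_reference_for is a hypothetical helper name, not part of wandb or Lightning, and the "latest" default alias is an assumption about which version you want:

```python
def checkpoint_reference_for(entity, project, run_name, alias="latest", prefix="model"):
    """Build '<entity>/<project>/<prefix>-<run_name>:<alias>'."""
    return f"{entity}/{project}/{prefix}-{run_name}:{alias}"


# Same placeholders as in the original snippet.
reference = checkpoint_reference_for(
    "<TEAM-NAME>", "<PROJECT-NAME>", "<RUN-NAME>", alias="<ALIAS>"
)
print(reference)  # <TEAM-NAME>/<PROJECT-NAME>/model-<RUN-NAME>:<ALIAS>
```

The resulting string can be passed unchanged to api.artifact(...) or wandb artifact get.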

This is also reflected in the instructions given in the Usage tab of the artifact,
https://wandb.ai/<TEAM-NAME>/<PROJECT-NAME>/artifacts/model/model-<RUN-NAME>/<ALIAS>/usage
but the URL had become so long that my eyes missed the change in the run name…
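
When the UI URL is hard to read, the artifact names can also be listed from the run itself. A rough sketch, assuming the run is reachable as <TEAM-NAME>/<PROJECT-NAME>/<RUN-NAME> and that Run.logged_artifacts() from the public API is available in your wandb version. names_of_type and list_model_artifacts are illustrative helper names only:

```python
def names_of_type(artifacts, artifact_type="model"):
    """Return the names of all artifacts whose .type matches artifact_type."""
    return [a.name for a in artifacts if a.type == artifact_type]


def list_model_artifacts(run_path):
    """List the model artifacts a run has logged (hypothetical helper).

    run_path is '<entity>/<project>/<run_id>'.
    """
    import wandb  # imported lazily so names_of_type works without wandb

    run = wandb.Api().run(run_path)
    # Each logged artifact exposes .name and .type; the name is what goes
    # between '<entity>/<project>/' and the alias in the reference string.
    return names_of_type(run.logged_artifacts())
```

Printing the result of list_model_artifacts("<TEAM-NAME>/<PROJECT-NAME>/<RUN-NAME>") should show the prefixed artifact name directly, without having to read the usage URL.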

Anyways, it appears that all is good, and I could potentially change our code base to use the wandb APIs!
