How to keep only last checkpoint artifact?

turian · August 27, 2022, 8:35am

How do I keep only the last checkpoint artifact in wandb?

I am using Lightning’s ModelCheckpoint to periodically save my checkpoint artifact to wandb. However, these artifacts are really large. If I keep multiple checkpoint artifact versions on wandb, my storage usage grows really quickly.

However, I can’t just checkpoint at the end of training. My GPUs occasionally terminate, so I need to checkpoint periodically.

How do I make sure that only the last checkpoint artifact is kept on wandb?

matt24 · August 28, 2022, 12:52pm

Hey @turian,
You need to define a checkpoint callback, which is straightforward:

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.callbacks import ModelCheckpoint

# define the W&B logger; log_model="all" uploads checkpoints during training
wandb_logger = WandbLogger(log_model="all")

# define the PyTorch Lightning checkpoint callback (one checkpoint per epoch)
checkpoint_callback = ModelCheckpoint(every_n_epochs=1)

# define the trainer
trainer = Trainer(logger=wandb_logger, callbacks=[checkpoint_callback])

In this example, a checkpoint is saved at the end of each epoch, but you can choose any interval. To checkpoint by step count or by elapsed time instead, set every_n_train_steps or train_time_interval, respectively.

If you’re looking for more specific information, I recommend checking the official docs:

Hope this helps.
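To make the three interval options concrete, here is a minimal sketch. The helper names checkpoint_trigger_kwargs and build_trainer are hypothetical and not part of Lightning or wandb. The helper insists on exactly one trigger, on the assumption that Lightning rejects combinations of these three arguments. Note that this only controls how often a checkpoint is written; it does not by itself limit how many artifact versions end up on wandb.

```python
from datetime import timedelta


def checkpoint_trigger_kwargs(every_n_epochs=None, every_n_train_steps=None,
                              train_time_interval=None):
    """Return ModelCheckpoint keyword arguments with exactly one trigger set.

    Hypothetical helper: requiring exactly one trigger is this sketch's choice.
    """
    triggers = {
        "every_n_epochs": every_n_epochs,
        "every_n_train_steps": every_n_train_steps,
        "train_time_interval": train_time_interval,
    }
    chosen = {name: value for name, value in triggers.items() if value is not None}
    if len(chosen) != 1:
        raise ValueError("set exactly one of: " + ", ".join(triggers))
    return chosen


def build_trainer(**trigger):
    """Build a Trainer that checkpoints on the given trigger (hypothetical helper)."""
    # Imported here so the helper above can be used without Lightning installed.
    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import ModelCheckpoint
    from pytorch_lightning.loggers import WandbLogger

    checkpoint_callback = ModelCheckpoint(**checkpoint_trigger_kwargs(**trigger))
    wandb_logger = WandbLogger(log_model="all")
    return Trainer(logger=wandb_logger, callbacks=[checkpoint_callback])


# Example: checkpoint every 30 minutes of wall-clock training time.
kwargs = checkpoint_trigger_kwargs(train_time_interval=timedelta(minutes=30))
print(kwargs)
```

With this in place, build_trainer(every_n_train_steps=500) would checkpoint every 500 training steps, and build_trainer(train_time_interval=timedelta(minutes=30)) every half hour.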

_scott · August 29, 2022, 9:17am

Also, if you want to delete artifacts after training, you can use the wandb.Api.

import wandb

"""
deletes all model artifact versions that do not have an alias attached

by default this means wandb will delete all but the "latest" or "best" models

set dry_run = False to actually delete...
"""
project_name='demo-project'
entity='_scott'
dry_run = True
api = wandb.Api(overrides={"project": project_name, "entity": entity})
project = api.project(project_name)
for artifact_type in project.artifacts_types():
    # only model artifacts are candidates for deletion
    if artifact_type.type != 'model':
        continue
    for artifact_collection in artifact_type.collections():
        for version in api.artifact_versions(artifact_type.type, artifact_collection.name):
            if len(version.aliases) > 0:
                # print out the name of the one we are keeping
                print(f'KEEPING {version.name}')
            else:
                print(f'DELETING {version.name}')
                if not dry_run:
                    version.delete()

Source for this snippet:
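The keep/delete rule in the snippet above can also be pulled out into a small function, which makes it easy to check the decision on fake data before touching real artifacts. This is only a sketch: split_by_alias and prune_collection are hypothetical names, and prune_collection assumes the same wandb.Api calls used above (api.artifact_versions, version.aliases, version.delete) behave as they do in that snippet.

```python
from types import SimpleNamespace


def split_by_alias(versions):
    """Partition versions into (keep, delete): keep anything that carries an alias."""
    keep, delete = [], []
    for version in versions:
        (keep if version.aliases else delete).append(version)
    return keep, delete


def prune_collection(entity, project, collection, artifact_type="model", dry_run=True):
    """Delete alias-less versions of one artifact collection (hypothetical helper)."""
    import wandb  # imported here so split_by_alias is usable without wandb installed

    api = wandb.Api(overrides={"project": project, "entity": entity})
    versions = api.artifact_versions(artifact_type, collection)
    keep, delete = split_by_alias(versions)
    for version in keep:
        print(f"KEEPING {version.name}")
    for version in delete:
        print(f"DELETING {version.name}")
        if not dry_run:
            version.delete()
    return [version.name for version in delete]


# Dry check on stand-in objects (not real wandb artifacts).
fake_versions = [
    SimpleNamespace(name="model-abc:v0", aliases=[]),
    SimpleNamespace(name="model-abc:v1", aliases=[]),
    SimpleNamespace(name="model-abc:v2", aliases=["latest"]),
]
kept, doomed = split_by_alias(fake_versions)
print([v.name for v in kept], [v.name for v in doomed])
```

Running prune_collection with dry_run=True first and reading the KEEPING/DELETING lines is a cheap way to confirm that only the aliased version would survive.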

system · August 31, 2022, 9:15am

Hi Joseph, thanks for your question! Would the solutions proposed by Matteo and Scott work for you?

system · September 6, 2022, 10:27am

Hi there, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

turian · September 10, 2022, 9:12am

Hi @_scott thanks for the code. As I mentioned here, this doesn’t appear to delete the artifacts any more, even with dry run disabled.

system · November 9, 2022, 9:12am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
