Deleting data from self-hosted server

I saw this post, but it doesn’t answer my question. We are running a self-hosted wandb instance. We have somewhat limited space, though we’ve been good about deleting old runs through the interface.

We thought deleting from the interface would also delete the physical files from the hard disk, but that doesn’t appear to be the case. For example, going into the minio folder on the server shows a particular project taking up 130 GB of space, while wandb reports 30 GB. That’s a big difference!
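To see where the space is actually going, a per-folder size report can be compared against what wandb shows. The following is a minimal sketch that uses only the Python standard library; the function names are hypothetical, and the base path you pass in depends on where your minio volume is mounted.

```python
import os


def dir_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted twice.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def report_usage(base):
    """Map each immediate subdirectory of base to its size in GB."""
    usage = {}
    for entry in sorted(os.listdir(base)):
        full = os.path.join(base, entry)
        if os.path.isdir(full):
            usage[entry] = dir_size_bytes(full) / 1e9
    return usage
```

Calling report_usage on the folder that holds your projects gives one number per project, which makes discrepancies like 130 GB on disk versus 30 GB reported easy to spot.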

How do we really really delete files from minio that wandb no longer knows anything about?

Suggestion: please have a central /usage/ URL for admins to look through usage from all teams and users rather than having to go through each one individually!

@tkott, I’m looking into this, but I believe there isn’t a way to physically delete the files from your hard drive through the UI. I’ll confirm this, and if that is indeed the case, I can put in a feature request to make it possible through the UI.

Thank you,
Nate

Hi Nate – so how do you suggest that we free up space generally then? Do we have to note the hash / digest of files and look for them on the minio server by hand?

I would expect that if you “delete” from the server with a big scary “this is a permanent operation” type warning, that the server will, in fact, remove the files. If it doesn’t remove the files, it isn’t actually a permanent operation since the files can be salvaged (albeit with some work).

@nathank any suggestions for how to do it programmatically through wandb in the meantime?

Hi @tkott,
Sorry for the delay here. A member of our team put together a script to do this for you here.

You can use this by running python file_cleanup.py -d 10, where the -d flag specifies how many days ago a run must have been deleted from the UI before the script deletes it from Minio storage.

Please note that this will only work if you are using our bare-metal Docker container setup. If you have connected an external database this will not work.

Also, we are working towards making this a default part of our local deployment: you will be able to set a retention policy for deleted runs, and the server will automatically clean them up after a certain amount of time.

Thank you,
Nate

Thanks! If I understand that gist right, I first need to use the UI to delete the run. At that point, the run files are still on the server, and I can run this script (or maybe schedule it to run weekly). When the script finds runs that were deleted more than 10 days ago (via the -d 10 option), it will look up their references and delete the files that live in the parent folder it finds ("/vol/minio/local-files/{}/{}/{}".format(entity_name, project_name, run_id)).
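To make that understanding concrete, here is a minimal sketch of the selection step as described above. It is not the gist itself: only the path template is quoted from it. The function name and the shape of the input rows are hypothetical stand-ins for whatever the script reads from the database.

```python
from datetime import datetime, timedelta

# Path template quoted from the gist; everything else here is illustrative.
RUN_DIR_TEMPLATE = "/vol/minio/local-files/{}/{}/{}"


def runs_to_purge(deleted_runs, days, now=None):
    """Return storage paths of runs deleted more than `days` days ago.

    deleted_runs: iterable of (entity_name, project_name, run_id, deleted_at)
    tuples, a hypothetical stand-in for rows read from the database.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=days)
    return [
        RUN_DIR_TEMPLATE.format(entity, project, run_id)
        for entity, project, run_id, deleted_at in deleted_runs
        if deleted_at < cutoff
    ]
```

With days=10, a run deleted 30 days ago is selected and a run deleted yesterday is left alone, which gives a grace period for undoing an accidental deletion.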

So a couple of questions:

  1. What happens to the artifacts associated with runs?

  2. What happens with child folders? (And why isn’t it all files and folders within the parent folder?)

  3. If child directories are present, are they now orphaned with no pointer from the database?

Thanks!

  1. The artifacts are not deleted. If there is an artifacts directory in the object storage, we keep it and delete all the other files.
  2. All child folders are also deleted from the object storage. We don’t delete the artifacts folder because artifacts are used not only by runs but also by other parts of the product, so we keep them to avoid breaking things elsewhere in the app.
  3. Once the deletion happens, there is no connection with the runs table in the database, but other tables might still refer to these artifacts.
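The behavior described in these answers (remove everything in a run's folder, child folders included, but keep the artifacts directory) can be sketched as follows. This is an illustration, not the team's script: the function name is hypothetical, and the directory name "artifacts" is an assumption that should be checked against your own storage layout. It defaults to a dry run that only lists what would be removed.

```python
import os
import shutil


def purge_run_dir(run_dir, keep=("artifacts",), dry_run=True):
    """Delete every entry in run_dir except names listed in `keep`.

    Returns the list of paths that were (or would be) removed.
    The default keep name "artifacts" is an assumption.
    """
    removed = []
    if not os.path.isdir(run_dir):
        return removed
    for name in sorted(os.listdir(run_dir)):
        if name in keep:
            continue
        path = os.path.join(run_dir, name)
        removed.append(path)
        if dry_run:
            continue
        if os.path.isdir(path) and not os.path.islink(path):
            # Child folders are removed together with their contents.
            shutil.rmtree(path)
        else:
            os.remove(path)
    return removed
```

Running it once with dry_run=True and reviewing the returned list before passing dry_run=False is a cheap safeguard, since files removed this way cannot be recovered.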

This can be run from outside or inside the container. In either case, it expects the mysql-connector-python Python package to be installed in the container. You can log in to the container using docker exec -it wandb-local bash and run pip install mysql-connector-python to install the dependency.

Thanks for confirming those questions!

Can you also then recommend a way of deleting older versions of artifacts? (which you can do from the UI as well, but presumably this doesn’t affect the underlying store in the same way that deleting runs doesn’t remove them from the store.)

@a-sh0ts or @nathank – any suggestions on dealing with artifacts in a similar way?

Hey @tkott , apologies for the delay here.
You can delete artifact versions programmatically via our public API. Here’s a sample script doing the same:

import wandb

# Set to False to actually delete; True only prints what would happen.
dry_run = True

api = wandb.Api()
project = api.project('project_name')

for artifact_type in project.artifacts_types():
    # Only clean up dataset artifacts; skip every other type.
    if artifact_type.type != 'dataset':
        continue
    for artifact_collection in artifact_type.collections():
        for version in artifact_collection.versions():
            if len(version.aliases) > 0:
                # Versions that still carry an alias are kept.
                print(f'KEEPING {version.name}')
            else:
                print(f'DELETING {version.name}')
                if not dry_run:
                    version.delete()

Please let me know if this helps.

@anmolmann thanks and I think that’s fine for deleting it from the UI, but my understanding is that this would not delete it from the actual physical drive through the minio interface. How do I do that?

@tkott, we identified this as a bug: deleting artifact versions via the API or UI doesn’t actually delete them from the object storage. Our team is working on a fix for this, and I’ll keep you posted as soon as I have an update on this issue.
Meanwhile, you can delete these artifact versions via a script similar to this one. You should remove lines 44 and 45, and update the elif in line 41 to elif os.path.isdir(f) so that if the folder contains any artifacts, that folder is deleted as well. You should also add one more directory path, perhaps by keeping a list of dir_paths in line 31, because artifacts are also stored in wandb_artifacts in local-files. Your additional dir_path would be dir_path_2 = "/vol/minio/local-files/wandb_artifacts/{}/{}".format(idx_1, idx_2), where idx_1 is the artifact index and idx_2 is the artifact version index.
Apologies for the inconvenience. I acknowledge that this workaround is a hacky way to get around the issue.
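The "list of dir_paths" suggestion could look roughly like the sketch below. Only the two path templates come from this thread. The function name is hypothetical, and how idx_1 and idx_2 are looked up for a given artifact version is not covered here and would have to come from the database.

```python
# Path templates quoted from this thread.
RUN_DIR = "/vol/minio/local-files/{}/{}/{}"
ARTIFACT_DIR = "/vol/minio/local-files/wandb_artifacts/{}/{}"


def cleanup_dir_paths(entity_name, project_name, run_id, idx_1, idx_2):
    """Return both directories the modified script would need to visit.

    idx_1 is the artifact index and idx_2 the artifact version index;
    obtaining them is outside the scope of this sketch.
    """
    return [
        RUN_DIR.format(entity_name, project_name, run_id),
        ARTIFACT_DIR.format(idx_1, idx_2),
    ]
```

Keeping the two templates separate makes it possible to clean up a single artifact version by visiting only the second path, without touching the run folder.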

@anmolmann Thanks for the suggestion. I’m a little confused because while we’ve used the gist you point out to remove old runs (yay!), if I understand it the artifacts are a separate thing. I don’t know which run to delete that would also (when the changes to the gist are made as you suggest) delete a specific artifact version, and only that specific artifact version. I’m also not sure that we necessarily want to delete the run at the same time as the artifact. Can you elaborate a bit more about the relationship between artifacts and runs and how the gist would handle that?
