Ok… I messed up.
I have a script that automatically deploys ML training jobs from a job queue, but I wasn't handling errors properly. At some point a job hit a CUDA out-of-memory error, and my auto-launch script didn't detect it. It kept either retrying the same job or picking up new jobs, all of which failed with the same error.
Now my wandb workspace has 277 runs with unique IDs. Can anyone help me find a way to delete these runs programmatically? I can identify them either by name or by reading the log file.
I was able to write a Python script to do it, so this thread can be closed. I'll paste the code for anyone who wants to do the same thing.
import wandb
import os

project_path = "<entity>/<project>"
log_file = "output.log"

api = wandb.Api()

# Get all runs in the project
runs = api.runs(project_path)

for run in runs:
    # Check whether this run's log contains the CUDA OOM error
    try:
        log = run.file(log_file).download(replace=True)
        with open(log.name, "r") as f:
            if "torch.OutOfMemoryError: CUDA out of memory. Tried to allocate" in f.read():
                print(f"Deleting run {run.id} with error. Its name is {run.name}")
                run.delete()
            else:
                print(f"Run {run.id} does not have the error")
    except Exception as e:
        print(f"Run {run.id} does not have the error or does not have the log file. Error: {e}")

# Clean up the last downloaded output.log file
if os.path.exists(log_file):
    os.remove(log_file)
else:
    print(f"{log_file} does not exist")
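For anyone who would rather match runs by name instead of downloading each log, here is a rough sketch using the same `wandb.Api()` with a MongoDB-style filter on the run's display name. The `"oom-"` prefix and the helper names below are placeholders, not part of the original script — adjust the filter to whatever naming pattern your failed runs share.

```python
def failed_run_filter(prefix: str) -> dict:
    # api.runs() accepts MongoDB-style filters; this one matches display
    # names that start with the given prefix.
    return {"display_name": {"$regex": f"^{prefix}"}}

def delete_runs_by_name(project_path: str, prefix: str) -> list:
    import wandb  # imported here so the filter helper is usable without wandb
    api = wandb.Api()
    deleted = []
    for run in api.runs(project_path, filters=failed_run_filter(prefix)):
        print(f"Deleting run {run.id} ({run.name})")
        run.delete()
        deleted.append(run.id)
    return deleted

# Usage (destructive -- double-check the filter before running):
# delete_runs_by_name("<entity>/<project>", "oom-")
```

This skips the per-run file download entirely, so it is much faster, but it only works if the bad runs are identifiable from their names alone.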