I’ve run into a problem, and I’m not sure it even has a solution. During training, I periodically create snapshots of the current state of the model and save them in the run. So my runs have files like:
model-snapshot-1.pth
model-snapshot-2.pth
…
in them. At the end of the training process I save the final state of the model and upload it as an artifact. Sometimes these runs crash during training, and the artifact is never created. In these cases the intermediate snapshots become valuable. Is there a way to promote these run-specific files into artifacts?
I’ve managed to write a function that does the trick for me. The only downside is that it resumes each run to log the artifact, so crashed runs lose their crash details, which I don’t mind. If someone could comment on it, that would be nice.
import re

import wandb


def upload_missing_artifacts(project):
    # model_score(history_row) is my own helper, defined elsewhere; higher is better.
    epoch_re = re.compile(r'model-.*-(\d+)\.pth')
    wandb_api = wandb.Api()
    for run in wandb_api.runs(project):
        run: wandb.apis.public.Run = run
        if run.state == 'running':
            continue
        # Skip runs that already logged their final model.
        artifacts = run.logged_artifacts()
        logged_model = False
        for art in artifacts:
            art: wandb.Artifact = art
            if art.name.startswith('final_model'):
                logged_model = True
        if logged_model:
            continue
        best_file = None
        best_score = None
        files = run.files()
        print(run.name)
        for file in files:
            file: wandb.apis.public.File = file
            match = epoch_re.match(file.name)
            if match:
                # Snapshot index -> history step (assumes one snapshot every 1000 steps).
                epoch = int(match.group(1)) * 1000 + 1
                # Default to None so a snapshot with no history rows is skipped
                # instead of raising StopIteration.
                history = next(run.scan_history(min_step=epoch, max_step=epoch + 3), None)
                if history is None:
                    continue
                score = model_score(history)
                print(f'\t{epoch} -> {score}')
                if best_file is None or score > best_score:
                    best_score = score
                    best_file = file
        if best_file:
            print(f'\tbest snapshot = {best_file.name}')
            artifact = wandb.Artifact('final_model', type='model')
            artifact.add_reference(best_file.url, name='trained.pth')
            # Use a separate name so the resumed run does not shadow the loop variable.
            with wandb.init(project=run.project, id=run.id, resume='must') as resumed_run:
                resumed_run.log_artifact(artifact)
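
For anyone who wants to check the snapshot selection logic without touching the W&B API, here is a minimal offline sketch of just that part. The names `snapshot_step`, `pick_best_snapshot`, and `steps_per_snapshot` are hypothetical and not part of wandb, and the scores are made-up numbers standing in for whatever `model_score` would return. It assumes, as the function above does, that snapshot N corresponds to history step N * 1000 + 1.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

# Same pattern as in upload_missing_artifacts: captures the snapshot index.
SNAPSHOT_RE = re.compile(r'model-.*-(\d+)\.pth')


def snapshot_step(file_name: str, steps_per_snapshot: int = 1000) -> Optional[int]:
    """Return the history step for a snapshot file, or None if the name does not match."""
    match = SNAPSHOT_RE.match(file_name)
    if match is None:
        return None
    return int(match.group(1)) * steps_per_snapshot + 1


def pick_best_snapshot(
    file_names: Iterable[str],
    score_at_step: Callable[[int], Optional[float]],
) -> Optional[Tuple[str, float]]:
    """Return (file_name, score) for the highest-scoring snapshot, or None if there is none."""
    best: Optional[Tuple[str, float]] = None
    for name in file_names:
        step = snapshot_step(name)
        if step is None:
            continue  # not a snapshot file
        score = score_at_step(step)
        if score is None:
            continue  # no history recorded at that step
        if best is None or score > best[1]:
            best = (name, score)
    return best


# Made-up scores keyed by history step, standing in for model_score(history).
scores = {1001: 0.71, 2001: 0.83, 3001: 0.79}
files = ['config.yaml', 'model-snapshot-1.pth', 'model-snapshot-2.pth', 'model-snapshot-3.pth']
print(pick_best_snapshot(files, scores.get))  # ('model-snapshot-2.pth', 0.83)
```

Keeping the selection separate from the upload makes it easy to do a dry run over a project and see which snapshot would be promoted before resuming any run.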