Creating an Artifact from files saved in a run

Hi All,

I’ve run into a problem that I’m not sure even has a solution. While my model trains, I periodically create snapshots of its current state and save them in the run, so my runs contain files like:

model-snapshot-1.pth
model-snapshot-2.pth

At the end of the training process I save the final state of the model and upload it as an artifact. Sometimes these runs crash during training, and the artifact is never created. In these cases the intermediate snapshots become valuable. I was wondering if there is a way to promote these run-specific files into artifacts?

I’ve managed to write a function that does the trick for me. The only downside is that it causes crashed runs to lose their crash details, which I don’t mind. If someone could comment on it, that would be nice.

import re

import wandb


def upload_missing_artifacts(project):
	# Matches snapshot files such as model-snapshot-2.pth and captures the number.
	epoch_re = re.compile(r'model-.*-(\d+)\.pth')
	wandb_api = wandb.Api()

	for run in wandb_api.runs(project):
		# Skip runs that are still in progress.
		if run.state == 'running':
			continue
		# Skip runs that already logged the final model artifact.
		if any(art.name.startswith('final_model') for art in run.logged_artifacts()):
			continue

		best_file = None
		best_score = None
		print(run.name)
		for file in run.files():
			match = epoch_re.match(file.name)
			if not match:
				continue
			# Convert the snapshot number into the history step to look up.
			step = int(match.group(1)) * 1000 + 1
			# Take the first history row in that step window; skip the file if there is none.
			history = next(iter(run.scan_history(min_step=step, max_step=step + 3)), None)
			if history is None:
				continue
			# model_score is my own helper (not shown) that scores a history row.
			score = model_score(history)
			print(f'\t{step} -> {score}')
			if best_score is None or score > best_score:
				best_score = score
				best_file = file
		if best_file:
			print(f'\tbest snapshot = {best_file.name}')
			artifact = wandb.Artifact('final_model', type='model')
			artifact.add_reference(best_file.url, name='trained.pth')

			# Resume the run to attach the artifact. A separate name avoids
			# shadowing the loop variable.
			with wandb.init(project=run.project, id=run.id, resume='must') as resumed_run:
				resumed_run.log_artifact(artifact)
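
The function above relies on model_score, which is not shown in this post. Below is a minimal sketch of what the selection logic could look like without any W&B calls, so it can be tried offline. The 'val_accuracy' metric name, the steps_per_snapshot parameter, and the snapshot_step and pick_best helpers are all assumptions for illustration. Plain dictionaries stand in for the rows that run.scan_history would return.

```python
import re

# Same pattern as in the function above: captures the trailing snapshot number.
SNAPSHOT_RE = re.compile(r'model-.*-(\d+)\.pth')


def snapshot_step(filename, steps_per_snapshot=1000):
    """Map a snapshot file name to the history step used for scoring, or None."""
    match = SNAPSHOT_RE.match(filename)
    if match is None:
        return None
    return int(match.group(1)) * steps_per_snapshot + 1


def model_score(history_row, metric='val_accuracy'):
    """Hypothetical scoring helper: higher is better, a missing metric scores lowest."""
    value = history_row.get(metric)
    return float('-inf') if value is None else float(value)


def pick_best(filenames, rows_by_step):
    """Return (filename, score) for the best-scoring snapshot, or None if none can be scored."""
    best = None
    for name in filenames:
        step = snapshot_step(name)
        row = rows_by_step.get(step)
        if row is None:
            # Not a snapshot file, or no history row was logged for its step.
            continue
        score = model_score(row)
        if best is None or score > best[1]:
            best = (name, score)
    return best


# Made-up history rows standing in for run.scan_history output.
files = ['model-snapshot-1.pth', 'model-snapshot-2.pth', 'config.yaml']
rows = {1001: {'val_accuracy': 0.71}, 2001: {'val_accuracy': 0.78}}
print(pick_best(files, rows))
```

Keeping the selection separate from the upload step makes it possible to check which snapshot would be promoted before resuming any run, which matters here because resuming is what overwrites the crash details.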
