I am currently running a sweep with different configurations for a ResNet model, and I noticed I was getting “CUDA out of memory” errors. This is more of a general question, but how can we manually handle errors in wandb, specifically runtime errors (RuntimeError)?
Let’s say I am loading a model and it runs out of memory, or, say, the shape is wrong. Wandb catches these errors and either moves on to another run instance or tries the same run over and over. Is there a way I can wrap this in a try-except clause?
I tried catching RuntimeError in my except clause, but it does not seem to catch it.
Example Code:
import gc

import torch
import wandb

# Fragment from inside a function (hence the return at the end)
try:
    model = load_model(wandb.config, pipeline_parameters['model_type'])
except RuntimeError as e:
    print('exception met:', e)
    # del X_train
    # del Y_train
    # del X_val
    # del Y_val
    gc.collect()
    torch.cuda.empty_cache()
    run.finish(exit_code=0)
    return 1
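
To make the pattern I am after easier to reproduce, here is a minimal, self-contained sketch that does not need wandb or a GPU. The names `is_oom_error`, `run_guarded`, and `fake_load_model` are hypothetical helpers I made up for illustration, and the "out of memory" string check is only a heuristic based on the error message I see, not an official API.

```python
import gc


def is_oom_error(exc: BaseException) -> bool:
    """Heuristic: treat a RuntimeError whose message mentions 'out of memory' as an OOM."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def run_guarded(step, cleanup=None):
    """Call step(); return 0 on success, 1 if a RuntimeError was caught."""
    try:
        step()
    except RuntimeError as e:
        kind = "out of memory" if is_oom_error(e) else "runtime error"
        print(f"caught {kind}: {e}")
        if cleanup is not None:
            # In the real script this is where the GPU cache would be emptied
            # and the wandb run finished.
            cleanup()
        gc.collect()
        return 1
    return 0


def fake_load_model():
    # Hypothetical stand-in for load_model(...): simulates the failure I am seeing.
    raise RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")


if __name__ == "__main__":
    status = run_guarded(fake_load_model)
    print("status:", status)
```

Run on its own, this catches the simulated error and returns 1, so the try-except itself behaves as I expect outside of a sweep.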