I am trying to sweep the hyperparameters of my TensorFlow model, using Bayes as the sweep method. My first run completes successfully, but the second run and all following runs do not start, and because of that the sweep gets killed. For the runs that are killed, the error message says that the resource has been exhausted.
Run 7rog48g2 errored: ResourceExhaustedError()
What should I do?
Would you be able to share the kernel so we can check it out and debug it?
Yes, I can. I don’t want to make the kernel public yet since it is part of an ongoing research project, but I can share it with your Kaggle account so you can take a look.
Hi Charles, sorry for the delay. I have shared the kernel and the dataset with you. Let me know if you have any problems accessing them.
Took a look. I don’t think your problem is in the wandb part of the code.
Have you checked to make sure that you can run the function sweep_train twice without hitting the same exhausted resource error?
I am thinking that maybe your cache step is causing the problem – try removing that, if you can, and seeing whether you still get the crash. Those resources might not be getting released in between runs.
I am not sure what you mean by running the sweep_train function twice. But I am going to remove the cache step and try it again.
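For reference, "running sweep_train twice" could mean calling it two times in a row in the same process, outside of the sweep agent, to see whether the second call hits the same error. The following is only a rough sketch under assumptions: the helper name run_repeatedly is hypothetical, sweep_train is assumed to be callable with no arguments, and whether tf.keras.backend.clear_session() actually frees the exhausted resource in this kernel is untested.

```python
import gc


def run_repeatedly(train_fn, n_runs=2, cleanup=None):
    """Call train_fn several times in one process (hypothetical helper).

    After every call, the optional cleanup callable is invoked and the
    garbage collector is run, so leftovers from one run are less likely
    to pile up before the next one starts.
    """
    results = []
    for _ in range(n_runs):
        try:
            results.append(("ok", train_fn()))
        except Exception as exc:  # e.g. tf.errors.ResourceExhaustedError
            results.append(("failed", repr(exc)))
            break  # stop at the first failing run
        finally:
            if cleanup is not None:
                cleanup()
            gc.collect()
    return results


# Possible usage in the kernel (assumes sweep_train takes no arguments):
#
#   import tensorflow as tf
#   print(run_repeatedly(sweep_train, cleanup=tf.keras.backend.clear_session))
```

If the second call fails here as well, the problem is likely in the training code rather than in the sweep itself, which would match the suggestion to look at the cache step.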