Sweeps using more RAM


I’m using Sweeps to optimize hyperparameters while training a PyTorch Lightning model.
Currently I’m running parallel agents, each of which runs more than 10 jobs in a container, and everything runs on Kubernetes. (Not the server itself; only the job containers run on Kubernetes.)

There’s no major problem, but each container uses about three times more CPU RAM than a normal PyTorchJob pod container.

At first, I assumed Sweeps runs some kind of training loop and that’s why memory usage was higher. But then I realized the memory usage doesn’t grow over time; it’s already high from the very first run.

Also, I use W&B for experiment tracking with the same model, and it doesn’t use that many resources.
So I think Sweeps uses some extra resources, but I’m not sure, and I don’t know how to reduce it.

I searched a lot, but I couldn’t find any related issues.
If anyone has seen or experienced this kind of issue, please let me know!

Thank you.

Hi @jymsungmi, are you creating the sweep through Python or by defining a sweep.yaml and using the CLI to start the sweep?

I ask because the Python implementation uses threading rather than multiprocessing, so memory might not be getting released at the end of each run.
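To make the distinction concrete, here is a small, hypothetical sketch (not W&B internals; all names are placeholders) of why a per-run process guarantees memory release where a thread inside a long-lived agent process may not:

```python
import subprocess
import sys

def run_isolated_trial():
    # Run one "trial" in a fresh Python process. When the child process
    # exits, the OS reclaims ALL of its memory unconditionally -- unlike a
    # thread, whose allocations stay inside the long-lived agent process
    # and are only freed if the allocator decides to return them.
    trial_code = "data = [0.0] * 1_000_000; print(sum(data))"
    out = subprocess.run(
        [sys.executable, "-c", trial_code],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout)
```

Process-per-run is roughly the isolation model you get from the CLI agent, whereas the Python `wandb.agent(...)` implementation keeps every run inside the same process.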

Thank you,

Hi @nathank ,

Yes, I’m creating the sweep through Python…
Okay, then I need to check the threading memory usage.

But W&B uses multiprocessing inside the library code, and it already calls thread.join to finish the thread.
Why doesn’t my process automatically release the memory?

Thank you!

Hi @jymsungmi, I don’t know the exact cause of the memory not getting released when the thread ends, but it typically seems to be resolved by using the CLI implementation of sweeps, so that wandb uses multiprocessing.

I’ve seen this happen with NumPy arrays, for example, where in certain circumstances they don’t release RAM when the thread completes.

Would it be an option to refactor so the sweep is defined with a YAML file and started via the CLI, to test whether this does indeed solve the memory issue?
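For reference, a minimal sweep config along those lines might look like the following; the program name, metric, and parameter names here are placeholders, not taken from your setup:

```yaml
# sweep.yaml -- illustrative only; program, metric, and parameter
# names are placeholders for your own training setup
program: train.py
method: bayes          # or: grid, random
metric:
  name: val_loss
  goal: minimize
parameters:
  learning_rate:
    min: 0.0001
    max: 0.1
  batch_size:
    values: [16, 32, 64]
```

You would then register it with `wandb sweep sweep.yaml` and start an agent with `wandb agent <entity>/<project>/<sweep-id>`; the CLI agent launches each run as a separate process.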

Hi @jymsungmi, I just wanted to follow up and see if you had a chance to try this?

Hi @nathank ,
thanks for your reply and for the help.

I didn’t try deploying with the CLI. I tried other things and it now seems fixed, so I’ve moved on.
I added gc.collect() to my training code, and after that the process’s memory usage stopped growing.
I also changed the deployment method:
before, I deployed training as just a Pod, but now I deploy it as a PyTorchJob in k8s.
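For anyone landing here later, the gc.collect() fix is roughly this shape; the function names are placeholders, and the real work would be a Lightning `trainer.fit(...)` call:

```python
import gc

def train_one_run(config):
    # Placeholder for the real training run (e.g. trainer.fit(model)).
    activations = [0.0] * 1_000_000  # stand-in for per-run tensor buffers
    loss = sum(activations)
    del activations
    return loss

def objective(config=None):
    # Hypothetical function handed to the sweep agent for each trial.
    loss = train_one_run(config)
    # Force a collection so cyclic garbage from this run is freed before
    # the agent's thread starts the next one.
    gc.collect()
    return loss
```

Since the Python agent runs every trial in the same process, an explicit collection at the end of each trial is what keeps leftover garbage from accumulating across runs.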

Thank you once again!
