Sweeps using more RAM

Hi!

I’m using Sweeps to optimize hyperparameters while training a PyTorch Lightning model.
Currently I’m running parallel agents, each of which runs more than 10 jobs in a container, and everything runs on Kubernetes (not the server itself; only the job containers run on Kubernetes).

There is no major problem otherwise, but each container uses about three times more CPU RAM than a normal PyTorchJob pod container.

At first, I thought that since Sweeps runs a kind of training loop, the memory usage grows over time. But I realized it doesn’t actually grow; it just uses high memory from the very first run.

Also, I’m using W&B purely for experiment tracking with the same model, and in that case it doesn’t use that many resources.
So I suspect Sweeps itself uses some extra resources, but I’m not sure, and I don’t know how to reduce that.

I searched a lot but couldn’t find any related issues.
So if anyone knows about or has experienced this kind of issue, please let me know!

Thank you.

Hi @jymsungmi, are you creating the sweep through Python or by defining a sweep.yaml and using the CLI to start the sweep?

I ask because the Python implementation uses threading rather than multiprocessing, so the memory might not be getting released at the end of each run.
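
For reference, this is a rough sketch of what I mean by the Python approach, where wandb.agent runs your training function in the same process (the project name, parameters, and the train() body are just placeholders):

```python
import wandb

# Placeholder sweep configuration; adjust method, metric, and parameters to your case.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {"lr": {"min": 1e-5, "max": 1e-2}},
}

def train():
    # Placeholder training function; in your case this would build the
    # PyTorch Lightning model and call trainer.fit(), reading hyperparameters
    # from wandb.config.
    run = wandb.init()
    # ... training code ...
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="my-project")  # "my-project" is a placeholder

# Every run launched this way executes inside this same Python process,
# so anything a run leaves behind stays in this process's memory.
wandb.agent(sweep_id, function=train, count=10)
```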

Thank you,
Nate

Hi @nathank ,

Yes, I’m creating the sweep through Python…
Okay, then I need to check the threading memory usage side of things.

But W&B uses multiprocessing inside the library code, and it already calls thread.join to finish the thread.
Why doesn’t my process automatically release the memory?

Thank you!

Hi @jymsungmi, I don’t know the exact cause of the memory not getting released when the thread ends, but it typically seems to be resolved by using the CLI implementation of sweeps so that wandb uses multiprocessing.

I’ve seen this happen with NumPy arrays, for example, where in certain circumstances they don’t release RAM when the thread completes.

Would it be an option to refactor to define the sweep with a YAML file and use the CLI, to test whether this does indeed solve the memory issue?
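
In case it helps, here is a minimal sketch of what that could look like (train.py and the parameter values are placeholders, not your actual setup):

```yaml
# sweep.yaml -- hypothetical example
program: train.py
method: bayes
metric:
  name: val_loss
  goal: minimize
parameters:
  lr:
    min: 0.00001
    max: 0.01
```

```
# Create the sweep, then start an agent. The CLI agent launches each run of
# train.py as its own process, so its memory is released when the run exits.
wandb sweep sweep.yaml
wandb agent <entity>/<project>/<sweep_id>
```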

Hi @jymsungmi, I just wanted to follow up and see if you had a chance to try this?

Hi @nathank ,
Thanks for your replies and for your help.

I didn’t try deploying with the CLI. I tried some other things and it now seems fixed, so I’ve moved on.
I added gc.collect() to my training code, and after that the process’s memory usage stopped increasing.
I also changed the deployment method: before, I deployed it as just a pod, but now I deploy training as a PyTorchJob in k8s.
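
Roughly, the gc.collect() change looks like this (simplified sketch; the actual training code is omitted and the sweep id is a placeholder):

```python
import gc

import wandb

def train():
    run = wandb.init()
    # ... build the Lightning model / dataloaders and call trainer.fit() ...
    run.finish()
    # Explicitly collect garbage so objects from this run are freed
    # before the agent starts the next run in the same process.
    gc.collect()

# The sweep is created the same way as before; "my-sweep-id" is a placeholder.
wandb.agent("my-sweep-id", function=train)
```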

Thank you once again!
