Without seeing the code or a W&B workspace, it’s hard to say, but here are some thoughts:
Running the sweep agent doesn't add enough overhead to cause a slowdown like the one you're seeing, so I suspect something else is going on.
Just switching to a GPU doesn't always speed things up, especially with eager execution: selecting and launching kernels can take longer than actually executing them on the device.
Could you check the per-epoch iteration time against the total experiment time (see the timing sketch below)? If the per-epoch time has increased by 50%, we'll need to look inside each epoch for the cause of the slowdown.
If the per-epoch time hasn't gone up, the slowdown is most likely at the end of the run. That can happen when the sweep runs log more data (large numbers of media or model files, possibly gigabytes), because the run won't terminate until everything has been uploaded.
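To make that check concrete, here is a minimal sketch (not your pipeline; the project name and the `time.sleep` stand-in are placeholders) of logging per-epoch wall-clock time to W&B so it can be compared against the total run duration shown in the UI:

```python
import time
import wandb

# Minimal sketch: time each epoch and log it, so the per-epoch cost can be
# compared with the total run time. Project name is a placeholder.
run = wandb.init(project="timing-debug")

num_epochs = 5
for epoch in range(num_epochs):
    start = time.perf_counter()
    time.sleep(0.1)  # stand-in for the actual training step
    wandb.log({"epoch": epoch, "epoch_seconds": time.perf_counter() - start})

run.finish()  # uploads happen here; time spent after the loop is not epoch time
```

If the logged `epoch_seconds` values stay flat while the total run duration grows, the extra time is outside the training loop, usually in artifact or media uploads at the end of the run.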
The original slowdown was noticed on the imbalanced Churn dataset. I built a logistic-regression model (a very simple network) and used W&B sweeps to search for a better classification threshold (roughly the setup sketched below).
I found that with a GPU on either Kaggle or Colab, the epoch time was much slower than on CPU (whether on my local machine, Kaggle, or Colab).
E.g. the Colab GPU was about 2x slower than the Colab CPU or my local CPU (10th-gen Intel Core i7).
The Kaggle GPU was better than Colab's, but still slower than my local CPU.
I also ran another experiment for comparison: MNIST with a CNN, including both solo runs and sweeps.
There I observed no GPU slowdown; the speedup was ~3x on Kaggle and only ~2x on Colab.
So my questions are:
Why was the GPU so much slower than the CPU in the first experiment? Is it related to the problem itself?
Is there any error in my models, calculations, or pipeline?
This GitHub repo contains the Jupyter notebooks and my pipeline code, as well as my related W&B projects.
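For context, my sweep setup looks roughly like this simplified sketch (not the exact config from the repo; the metric name, threshold values, and project name are only illustrative):

```python
import wandb

# Simplified, illustrative sketch of a threshold sweep (not the repo's exact config).
sweep_config = {
    "method": "grid",
    "metric": {"name": "val_f1", "goal": "maximize"},
    "parameters": {
        "threshold": {"values": [0.3, 0.4, 0.5, 0.6, 0.7]},
    },
}

def train():
    with wandb.init() as run:
        threshold = run.config.threshold
        # ... train the logistic-regression model, apply `threshold` to the
        # predicted probabilities, compute the validation metric, then log it:
        wandb.log({"val_f1": 0.0})  # placeholder value

sweep_id = wandb.sweep(sweep_config, project="churn-threshold")  # hypothetical project name
wandb.agent(sweep_id, function=train, count=5)
```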
Finally solved this issue.
I googled it and found that it's worth trying to increase the batch size (it was as low as 32 in my original experiments).
So I ran the same experiment again with much larger batch sizes (512, 1000) and, ta-da, there was no GPU slowdown anymore, neither on Kaggle nor on Colab.
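For anyone else hitting this: the behaviour is consistent with per-batch overhead (kernel launches, host-device transfers) dominating when both the model and the batch are tiny. Here is a rough micro-benchmark sketch (PyTorch; the layer sizes and iteration counts are arbitrary) to see the effect yourself:

```python
import time
import torch

def time_batches(device, batch_size, n_batches=200, n_features=20):
    """Time n_batches forward passes of a tiny linear model on one device."""
    model = torch.nn.Linear(n_features, 1).to(device)
    x = torch.randn(batch_size, n_features, device=device)
    with torch.no_grad():
        for _ in range(10):            # warm-up: the first CUDA calls are slow
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_batches):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()   # wait for queued kernels before stopping the clock
        return time.perf_counter() - start

devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
for batch_size in (32, 1000):
    for device in devices:
        t = time_batches(device, batch_size)
        print(f"batch={batch_size:4d}  device={device}: {t:.3f}s for 200 batches")
```

With a batch of 32 and a model this small, the GPU spends most of its time on launch and transfer overhead rather than compute; larger batches amortize that overhead, which matches what I saw.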