Without seeing the code or a W&B workspace, it’s hard to say, but here are some thoughts:
Running the sweep agent doesn't add enough overhead to cause a slowdown like the one you're seeing, so I suspect something else is going on.
Just switching to a GPU doesn't always speed things up, especially with eager execution: selecting and launching kernels can take longer than actually executing them on the device.
Could you check the per-epoch iteration time against the total experiment time (see the timing sketch below)? If the per-epoch time has increased by 50%, we'll need to look inside each epoch for the cause of the slowdown.
If the per-epoch time hasn't gone up, the slowdown is most likely at the end of the run. That can happen when the sweep runs log more data (large numbers of media or model files, possibly gigabytes), because the run won't terminate until everything has been uploaded.
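To make that check concrete, here is a minimal sketch (not your pipeline; the project name and the `time.sleep` stand-in are placeholders) of logging per-epoch wall-clock time to W&B so it can be compared against the total run duration shown in the UI:

```python
import time
import wandb

# Minimal sketch: time each epoch and log it, so the per-epoch cost can be
# compared with the total run time. Project name is a placeholder.
run = wandb.init(project="timing-debug")

num_epochs = 5
for epoch in range(num_epochs):
    start = time.perf_counter()
    time.sleep(0.1)  # stand-in for the actual training step
    wandb.log({"epoch": epoch, "epoch_seconds": time.perf_counter() - start})

run.finish()  # uploads happen here; time spent after the loop is not epoch time
```

If the logged `epoch_seconds` values stay flat while the total run duration grows, the extra time is outside the training loop, usually in artifact or media uploads at the end of the run.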
The original slowdown was noticed on the imbalanced Churn dataset. I built a logistic-regression model (a very simple network) and used W&B sweeps to search for a better classification threshold (roughly the setup sketched below).
I found that with a GPU on either Kaggle or Colab, the epoch time was much slower than on CPU (whether on my local machine, Kaggle, or Colab).
E.g. the Colab GPU was about 2x slower than the Colab CPU or my local CPU (10th-gen Intel Core i7).
The Kaggle GPU was better than Colab's, but still slower than my local CPU.
I also ran another experiment for comparison: MNIST with a CNN, including both solo runs and sweeps.
There I observed no GPU slowdown; the speedup was ~3x on Kaggle and only ~2x on Colab.
So my questions are:
Why was the GPU so much slower than the CPU in the first experiment? Is it related to the problem itself?
Is there any error in my models, calculations, or pipeline?
This GitHub repo contains the Jupyter notebooks and my pipeline code, as well as my related W&B projects.
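For context, my sweep setup looks roughly like this simplified sketch (not the exact config from the repo; the metric name, threshold values, and project name are only illustrative):

```python
import wandb

# Simplified, illustrative sketch of a threshold sweep (not the repo's exact config).
sweep_config = {
    "method": "grid",
    "metric": {"name": "val_f1", "goal": "maximize"},
    "parameters": {
        "threshold": {"values": [0.3, 0.4, 0.5, 0.6, 0.7]},
    },
}

def train():
    with wandb.init() as run:
        threshold = run.config.threshold
        # ... train the logistic-regression model, apply `threshold` to the
        # predicted probabilities, compute the validation metric, then log it:
        wandb.log({"val_f1": 0.0})  # placeholder value

sweep_id = wandb.sweep(sweep_config, project="churn-threshold")  # hypothetical project name
wandb.agent(sweep_id, function=train, count=5)
```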
Finally solved this issue.
I googled it and found that it's worth trying to increase the batch size (it was as low as 32 in my original experiments).
So I ran the same experiment again with much larger batch sizes (512, 1000) and, ta-da, there was no GPU slowdown anymore, neither on Kaggle nor on Colab.
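For anyone else hitting this: the behaviour is consistent with per-batch overhead (kernel launches, host-device transfers) dominating when both the model and the batch are tiny. Here is a rough micro-benchmark sketch (PyTorch; the layer sizes and iteration counts are arbitrary) to see the effect yourself:

```python
import time
import torch

def time_batches(device, batch_size, n_batches=200, n_features=20):
    """Time n_batches forward passes of a tiny linear model on one device."""
    model = torch.nn.Linear(n_features, 1).to(device)
    x = torch.randn(batch_size, n_features, device=device)
    with torch.no_grad():
        for _ in range(10):            # warm-up: the first CUDA calls are slow
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_batches):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()   # wait for queued kernels before stopping the clock
        return time.perf_counter() - start

devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
for batch_size in (32, 1000):
    for device in devices:
        t = time_batches(device, batch_size)
        print(f"batch={batch_size:4d}  device={device}: {t:.3f}s for 200 batches")
```

With a batch of 32 and a model this small, the GPU spends most of its time on launch and transfer overhead rather than compute; larger batches amortize that overhead, which matches what I saw.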