Elapsed time per epoch much slower for sweep than for individual runs

npaulson · March 15, 2023, 9:45pm

When I run a model as an individual run rather than as part of a sweep, the time elapsed per epoch is much lower. In one recent test the individual run was 3x faster (10min vs 30min). I am running a bayes sweep that minimizes val/loss and uses hyperband with min_iter = 1. Both jobs run on a single A100 40GB GPU. Because I am running on SLURM, I launch the agent with the following line:
wandb agent --count 1 SWEEP_ID
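
For reference, the sweep described above might look roughly like the following when written as a Python dictionary. This is only a sketch: the method, metric, and early-termination settings come from the post, but the `parameters` block and the `agent_command` helper are hypothetical placeholders, since the post does not list the hyperparameters being searched.

```python
# Sketch of a sweep definition matching the description in the post.
# Only method, metric, and early_terminate come from the post; the
# "parameters" block is a hypothetical placeholder.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val/loss", "goal": "minimize"},
    "early_terminate": {"type": "hyperband", "min_iter": 1},
    "parameters": {
        "batch_size": {"values": [32, 64, 128]},  # assumed, not from the post
    },
}


def agent_command(sweep_id: str, count: int = 1) -> str:
    """Build the CLI line used in the SLURM job script (hypothetical helper)."""
    return f"wandb agent --count {count} {sweep_id}"


if __name__ == "__main__":
    # Registering the sweep needs network access and a logged-in client:
    #   import wandb
    #   sweep_id = wandb.sweep(sweep_config, project="my-project")
    print(agent_command("SWEEP_ID"))
```

With `--count 1`, each SLURM job's agent runs a single trial and then exits, so every trial is started in a fresh process.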

artsiom · March 20, 2023, 7:30pm

Hi Noah,

Just to confirm: do you see this large difference in performance when comparing a sweep run against a regular run?

Cheers,
Artsiom

npaulson · March 21, 2023, 5:46pm

Hello Artsiom,

Yes, that is exactly my issue.

Best,
Noah

artsiom · April 2, 2023, 8:26pm

Could you share a code snippet so we can try to reproduce this on our side?

Warmly,
Artsiom

artsiom · April 10, 2023, 2:19pm

Hi Noah,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

Best,
Weights & Biases

artsiom · April 13, 2023, 4:04pm

Hi Noah, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

npaulson · May 2, 2023, 8:45pm

Hi artsiom,

Sorry for my slow response. I put together an example to try to reproduce the issue, but it does not show the slowdown.

I will include it here below in case there is something obvious I can do to stress the system more.

Thanks,
Noah

import os
import wandb

from argparse import ArgumentParser

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torchvision.datasets import MNIST
from torchvision import transforms
import pytorch_lightning as pl


class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # 784 -> 64 -> 3 encoder and the mirrored 3 -> 64 -> 784 decoder
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 64),
            nn.ReLU(),
            nn.Linear(64, 3))
        self.decoder = nn.Sequential(
            nn.Linear(3, 64),
            nn.ReLU(),
            nn.Linear(64, 28 * 28))

    def forward(self, x):
        embedding = self.encoder(x)
        return embedding

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        x = x.view(x.size(0), -1)  # flatten each image to a 784-vector
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        # Logged both through Lightning and directly to wandb
        self.log('train_loss', loss)
        wandb.log({"train_loss": loss})
        wandb.log({"epoch": self.current_epoch})
        return loss

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log('val_loss', loss)
        wandb.log({"val_loss": loss})


def main(args):

    project = os.getenv("TEST_WANDB_PROJ")
    entity = os.getenv("TEST_WANDB_ACCT")
    print(f'project: {project}, entity: {entity}')

    log_dir = os.getenv("TEST_LOG_DIR")
    if log_dir is None:
        log_dir = "./data/TEST_LOG_DIR"
        print(
            "Using default wandb log dir path of ./data/TEST_LOG_DIR. This can be adjusted with the environment variable `TEST_LOG_DIR`"
        )
    if not os.path.exists(log_dir):
        os.makedirs(log_dir)
    assert project is not None and entity is not None, (
        "Please set environment variables `TEST_WANDB_ACCT` and `TEST_WANDB_PROJ` with\n"
        "your wandb user/organization name and project title, respectively."
    )
    experiment = wandb.init(
        project=project,
        entity=entity,
        config=args,
        dir=log_dir,
        reinit=True,
    )
    config = wandb.config
    wandb.run.name = args.run_name
    wandb.run.save()

    # data
    dataset = MNIST('', train=True, download=True, transform=transforms.ToTensor())
    mnist_train, mnist_val = random_split(dataset, [55000, 5000])

    train_loader = DataLoader(mnist_train, batch_size=args.batch_size)
    val_loader = DataLoader(mnist_val, batch_size=args.batch_size)

    # model
    model = LitAutoEncoder()

    # Attach the existing wandb run to the Lightning logger
    logger = pl.loggers.WandbLogger(
        experiment=experiment, save_dir="./data/TEST_LOG_DIR")
    logger.log_hyperparams(config)

    # training
    # `gpus=1` is the PyTorch Lightning 1.x spelling; 2.x uses accelerator="gpu", devices=1
    trainer = pl.Trainer(gpus=1, num_nodes=1, precision=32, limit_train_batches=0.5, max_epochs=50)
    trainer.fit(model, train_loader, val_loader)

    experiment.finish()


if __name__ == '__main__':

    parser = ArgumentParser()
    parser.add_argument("--run_name", type=str, required=True)
    parser.add_argument("--batch_size", type=int, default=32)

    args = parser.parse_args()
    main(args)
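
To compare per-epoch wall-clock time between a sweep run and a standalone run of a script like the one above, one option is a small timer object. This is a minimal sketch and not part of the original script: `EpochTimer` and its method names are hypothetical, and it uses only the standard library.

```python
import time


class EpochTimer:
    """Collects wall-clock durations, one entry per timed epoch (hypothetical helper)."""

    def __init__(self):
        self.durations = []
        self._start = None

    def start(self):
        # perf_counter is monotonic, so it is safe for measuring intervals
        self._start = time.perf_counter()

    def stop(self):
        if self._start is None:
            raise RuntimeError("stop() called before start()")
        elapsed = time.perf_counter() - self._start
        self.durations.append(elapsed)
        self._start = None
        return elapsed

    def mean(self):
        # Average epoch duration in seconds, or 0.0 if nothing was timed
        return sum(self.durations) / len(self.durations) if self.durations else 0.0
```

In a LightningModule, `start()` could be called from `on_train_epoch_start` and `stop()` from `on_train_epoch_end`, with the returned value logged so the two kinds of run can be compared on the same chart.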

artsiom · May 15, 2023, 3:28pm

Hi Noah,

I have been trying to reproduce this and still have had no luck on my side. Could you send me a link to the workspace of a sweep and of a regular run for comparison, where one runs slower than the other?

Warmly,
Artsiom

artsiom · May 18, 2023, 5:17pm

Hi Noah,

We wanted to follow up with you regarding your support request as we have not heard back from you.

Best,
Weights & Biases

npaulson · May 22, 2023, 2:49pm

Hi Artsiom,

I ran a recent example and I cannot recreate the issue any more. I don’t have any explanation why. Thank you for the help - perhaps it is best if we close the question for now.

Thanks,
Noah

artsiom · May 23, 2023, 7:22pm

No problem!

I am very glad this has been solved.

Cheers,
Artsiom

system · July 21, 2023, 2:50pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
