Elapsed time per epoch much slower for sweep than for individual runs

When I run a model as part of a sweep, each epoch takes much longer than when I run the same model individually. In one recent test I saw a 3x difference (10 min for the individual run vs. 30 min in the sweep). I am running a bayes sweep that minimizes val/loss and uses hyperband early termination with min_iter = 1. Both jobs run on a single A100 40 GB GPU. Because I am running on SLURM, I also launch the agent with the following line:
wandb agent --count 1 SWEEP_ID
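
For reference, the setup described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `parameters` search space and the `run_one_trial` helper are hypothetical (the post does not list the swept hyperparameters), while `method`, `metric`, and `early_terminate` mirror the settings mentioned in the post.

```python
# Sketch of the sweep settings described in the post.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val/loss", "goal": "minimize"},
    "early_terminate": {"type": "hyperband", "min_iter": 1},
    # Hypothetical search space; the original post does not list its parameters.
    "parameters": {"batch_size": {"values": [32, 64, 128]}},
}


def run_one_trial(sweep_id, train_fn, project=None, entity=None):
    """Run a single trial of an existing sweep, like `wandb agent --count 1 SWEEP_ID`."""
    import wandb  # imported lazily so the config above can be inspected offline

    # The sweep itself would be registered once beforehand, e.g. with
    # wandb.sweep(sweep_config, project=project, entity=entity).
    wandb.agent(sweep_id, function=train_fn, project=project, entity=entity, count=1)
```

With `count=1`, each SLURM job picks up exactly one set of hyperparameters from the sweep and then exits.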

Hi Noah,

Do you see this large difference in per-epoch time when running the model in a sweep vs. as a regular run?

Cheers,
Artsiom

Hello Artsiom,

Yes, that is exactly my issue.

Best,
Noah

Could you share a code snippet so that we can try to reproduce this on our side?

Warmly,
Artsiom

Hi Noah,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

Best,
Weights & Biases

Hi Noah, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

Hi Artsiom,

Sorry for my slow response. I put together an example to see if I could reproduce the issue, but the example does not show the slowdown.

I will include it here below in case there is something obvious I can do to stress the system more.

Thanks,
Noah

import os
import wandb

from argparse import ArgumentParser

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torchvision.datasets import MNIST
from torchvision import transforms
import pytorch_lightning as pl

class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Encoder compresses a flattened 28x28 image to a 3-dimensional embedding.
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 64),
            nn.ReLU(),
            nn.Linear(64, 3))
        # Decoder reconstructs the flattened image from the embedding.
        self.decoder = nn.Sequential(
            nn.Linear(3, 64),
            nn.ReLU(),
            nn.Linear(64, 28 * 28))

    def forward(self, x):
        embedding = self.encoder(x)
        return embedding

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log('train_loss', loss)
        wandb.log({"train_loss": loss})
        wandb.log({"epoch": self.current_epoch})
        return loss

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log('val_loss', loss)
        wandb.log({"val_loss": loss})

def main(args):
    project = os.getenv("TEST_WANDB_PROJ")
    entity = os.getenv("TEST_WANDB_ACCT")
    print(f'project: {project}, entity: {entity}')

    log_dir = os.getenv("TEST_LOG_DIR")
    if log_dir is None:
        log_dir = "./data/TEST_LOG_DIR"
        print(
            "Using default wandb log dir path of ./data/TEST_LOG_DIR. This can be adjusted with the environment variable `TEST_LOG_DIR`"
        )
    if not os.path.exists(log_dir):
        os.makedirs(log_dir)
    assert project is not None and entity is not None, (
        "Please set environment variables `TEST_WANDB_ACCT` and `TEST_WANDB_PROJ` "
        "with your wandb user/organization name and project title, respectively."
    )
    experiment = wandb.init(
        project=project,
        entity=entity,
        config=args,
        dir=log_dir,
        reinit=True,
    )
    config = wandb.config
    wandb.run.name = args.run_name
    wandb.run.save()

    # data
    dataset = MNIST('', train=True, download=True, transform=transforms.ToTensor())
    mnist_train, mnist_val = random_split(dataset, [55000, 5000])

    train_loader = DataLoader(mnist_train, batch_size=args.batch_size)
    val_loader = DataLoader(mnist_val, batch_size=args.batch_size)

    # model
    model = LitAutoEncoder()

    # NOTE: this logger is not passed to the Trainer below, so metrics reach
    # W&B through the explicit wandb.log() calls in the model.
    logger = pl.loggers.WandbLogger(
        experiment=experiment, save_dir="./data/TEST_LOG_DIR")
    logger.log_hyperparams(config)

    # training
    trainer = pl.Trainer(gpus=1, num_nodes=1, precision=32, limit_train_batches=0.5, max_epochs=50)
    trainer.fit(model, train_loader, val_loader)

    experiment.finish()

if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--run_name", type=str, required=True)
    parser.add_argument("--batch_size", type=int, default=32)

    args = parser.parse_args()
    main(args)
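
One way to compare per-epoch time between a sweep run and an individual run is to record wall-clock time around each training epoch. The sketch below is not part of the original example: `EpochTimer` and `mean_seconds` are hypothetical names, and the code assumes the Lightning `Callback` hooks `on_train_epoch_start` and `on_train_epoch_end`. It falls back to a plain object when Lightning is not installed so that the timing logic can be tried on its own.

```python
import time

try:
    import pytorch_lightning as pl
    _Base = pl.Callback
except ImportError:  # lets the timer be exercised without Lightning installed
    _Base = object


class EpochTimer(_Base):
    """Records wall-clock seconds between the epoch start and end hooks."""

    def __init__(self):
        super().__init__()
        self.epoch_seconds = []
        self._start = None

    def on_train_epoch_start(self, trainer=None, pl_module=None):
        self._start = time.perf_counter()

    def on_train_epoch_end(self, trainer=None, pl_module=None):
        if self._start is not None:
            self.epoch_seconds.append(time.perf_counter() - self._start)
            self._start = None

    def mean_seconds(self):
        """Average epoch duration, or 0.0 if no epoch has finished yet."""
        if not self.epoch_seconds:
            return 0.0
        return sum(self.epoch_seconds) / len(self.epoch_seconds)
```

The timer could be passed to the trainer as `pl.Trainer(..., callbacks=[EpochTimer()])`, and the recorded durations compared between a run launched by the sweep agent and one launched directly.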

Hi Noah,

I tried to reproduce this and still had no luck on my side. Could you send me links to the workspaces of a sweep run and a regular run, where one is slower than the other, so I can compare them?

Warmly,
Artsiom

Hi Noah,

We wanted to follow up with you regarding your support request as we have not heard back from you.

Best,
Weights & Biases

Hi Artsiom,

I ran a recent example and I can no longer recreate the issue. I don’t have an explanation for why. Thank you for the help. Perhaps it is best if we close the question for now.

Thanks,
Noah

No problem!

I am very glad this has been solved. :slightly_smiling_face:

Cheers,
Artsiom
