When I run the same model individually, exactly as I would in a sweep, it is much faster in time elapsed per epoch: in one recent test I saw a 3x difference (10 min vs. 30 min per epoch). The sweep uses the bayes method, minimizing val/loss, with hyperband early termination and min_iter = 1. Both jobs run on a single A100 40GB GPU. Since I am running on SLURM, my job script launches the agent with:
wandb agent --count 1 SWEEP_ID
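For context, the sweep is defined along these lines; the script name train.py, the project name, and the batch_size search space below are placeholders rather than my exact configuration:

import wandb

# Roughly how the sweep is defined (placeholder values throughout):
# Bayesian search minimizing val/loss, with hyperband early
# termination and min_iter = 1.
sweep_config = {
    "program": "train.py",  # placeholder script name
    "method": "bayes",
    "metric": {"name": "val/loss", "goal": "minimize"},
    "early_terminate": {"type": "hyperband", "min_iter": 1},
    "parameters": {
        # placeholder search space
        "batch_size": {"values": [32, 64, 128]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="my-project")  # placeholder project
# each SLURM task then runs: wandb agent --count 1 <sweep_id>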
Hi Noah,
Do you notice this huge difference in performance when running a sweep versus a regular run?
Cheers,
Artsiom
Hello Artsiom,
Yes, that is exactly my issue.
Best,
Noah
Could you share a code snippet so we can see if we can reproduce this on our side?
Warmly,
Artsiom
Hi Noah,
We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.
Best,
Weights & Biases
Hi Noah, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!
Hi Artsiom,
Sorry for my slow response. I put together an example to see if I could reproduce the issue, but it doesn't seem to.
I will include it below in case there is something obvious I can do to stress the system more.
Thanks,
Noah
import os
import wandb
from argparse import ArgumentParser
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torchvision.datasets import MNIST
from torchvision import transforms
import pytorch_lightning as pl


class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 64),
            nn.ReLU(),
            nn.Linear(64, 3))
        self.decoder = nn.Sequential(
            nn.Linear(3, 64),
            nn.ReLU(),
            nn.Linear(64, 28 * 28))

    def forward(self, x):
        embedding = self.encoder(x)
        return embedding

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        # log through Lightning and directly through wandb
        self.log('train_loss', loss)
        wandb.log({"train_loss": loss})
        wandb.log({"epoch": self.current_epoch})
        return loss

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log('val_loss', loss)
        wandb.log({"val_loss": loss})
def main(args):
    project = os.getenv("TEST_WANDB_PROJ")
    entity = os.getenv("TEST_WANDB_ACCT")
    print(f'project: {project}, entity: {entity}')

    log_dir = os.getenv("TEST_LOG_DIR")
    if log_dir is None:
        log_dir = "./data/TEST_LOG_DIR"
        print(
            "Using default wandb log dir path of ./data/TEST_LOG_DIR. This can be "
            "adjusted with the environment variable `TEST_LOG_DIR`"
        )
    if not os.path.exists(log_dir):
        os.makedirs(log_dir)

    assert (
        project is not None and entity is not None
    ), "Please set environment variables `TEST_WANDB_ACCT` and `TEST_WANDB_PROJ` with \n\
your wandb user/organization name and project title, respectively."

    experiment = wandb.init(
        project=project,
        entity=entity,
        config=args,
        dir=log_dir,
        reinit=True,
    )
    config = wandb.config
    wandb.run.name = args.run_name
    wandb.run.save()

    # data
    dataset = MNIST('', train=True, download=True,
                    transform=transforms.ToTensor())
    mnist_train, mnist_val = random_split(dataset, [55000, 5000])
    train_loader = DataLoader(mnist_train, batch_size=args.batch_size)
    val_loader = DataLoader(mnist_val, batch_size=args.batch_size)

    # model
    model = LitAutoEncoder()
    logger = pl.loggers.WandbLogger(
        experiment=experiment, save_dir="./data/TEST_LOG_DIR")
    logger.log_hyperparams(config)

    # training
    trainer = pl.Trainer(gpus=1, num_nodes=1, precision=32,
                         limit_train_batches=0.5, max_epochs=50)
    trainer.fit(model, train_loader, val_loader)

    experiment.finish()
if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument("--run_name", type=str, required=True)
    parser.add_argument("--batch_size", type=int, default=32)
    args = parser.parse_args()
    main(args)
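(For reference, I save this as train.py, though the name is arbitrary. A standalone run looks like `python train.py --run_name baseline --batch_size 32`; in the sweep, the agent launches the same script with its chosen arguments.)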
Hi Noah,
I have been trying to reproduce this and still have no luck on my side. Could you send me a link to the workspace of a sweep and a regular run for comparison, where one runs slower than the other?
Warmly,
Artsiom
Hi Noah,
We wanted to follow up with you regarding your support request as we have not heard back from you.
Best,
Weights & Biases
Hi Artsiom,
I ran a recent example and I can no longer recreate the issue, and I don't have any explanation for why. Thank you for the help; perhaps it is best if we close the question for now.
Thanks,
Noah
No problem!
I am very glad this has been solved.
Cheers,
Artsiom