BrokenPipeError when doing sweeps

Multiple people seem to be facing this problem, and the debug logs are very uninformative. It appears to be sweeps-specific, since vanilla WandB runs work fine.

Still, I’ve attached the logs in case anyone wants to check them out.

Code

I’m doing something like:

    def init_wandb_sweep(self, args, wandb: Callable = None) -> tuple:
        '''
        Set up WandB sweep configs. Only run after wandb_logger() has been called.
        '''
        sweep_configuration = {
            "method": "random",
            "name": "sweep",
            "metric": {"goal": "maximize", "name": "Train/acc"},
            "parameters": {
                "lr": {"max": 1e-2, "min": 1e-5},
                "drop_rate": {"max": 0.2, "min": 0.0},
                "weight_decay": {"max": 1e-3, "min": 1e-5},
                "grad_clip": {"max": 1.0, "min": 0.1},
            },
        }
        assert wandb is not None, "Wandb logger not initialized"

        sweep_id = wandb.sweep(sweep=sweep_configuration, project="<...>",
                               entity='<...>')

        # pull the sweep-assigned hyperparameters into args
        args.lr = wandb.config.lr
        args.drop_rate = wandb.config.drop_rate
        args.weight_decay = wandb.config.weight_decay
        args.grad_clip = wandb.config.grad_clip

        return sweep_id, args

which is called by:

    sweep_id, new_args = logger.init_wandb_sweep(args, wandb_logger)

    if args.tune_hyperparams:
        args = new_args

    trainer = Trainer(args, logger=(my_logger, wandb_logger),
                      loaders=(trainloader, valloader),
                      decode_fn=train_dataset.tok.decode,
                      shard=shard,
                      key=key)

    wandb_logger.agent(sweep_id, function=trainer.train, count=1)

Logs

Can someone please help resolve this? All my work is on hold because I simply can’t do 50 sweep runs manually every time. :expressionless:

Hello!

Based on your code, it does seem that the sweeps are erroring out because you are creating a new sweep every time you run your script, which causes conflicts between the sweep and the wandb library.

I would recommend structuring your code like the example below: create a function that takes in arguments and spins up the Trainer instance, and let the sweep feed different hyperparameters into that function. Also, changing the agent’s count, e.g. wandb.agent(sweep_id, train, count=10), will increase the number of runs the agent executes to 10.

    sweep_config = {
        "method": "random",
        "name": "disaster-sweep",
        "metric": {"goal": "minimize", "name": "train/loss"},
        "parameters": {
            "epochs": {"values": [5, 10]},
            "batch_size": {"values": [8, 16, 32, 64]},
            "learning_rate": {"values": [0.005, 0.0001, 0.00005]},
            "weight_decay": {"values": [0.0001, 0.1]},
        },
    }

    # Training function with args
    def train(config=None):
        with wandb.init(config=config):
            # set sweep configuration
            config = wandb.config

            # set training arguments
            training_args = TrainingArguments(
                output_dir='./results',
                report_to='wandb',  # Turn on Weights & Biases logging
                num_train_epochs=config.epochs,
                learning_rate=config.learning_rate,
                weight_decay=config.weight_decay,
                per_device_train_batch_size=config.batch_size,
                per_device_eval_batch_size=config.batch_size,
                save_strategy='epoch',
                evaluation_strategy='epoch',
                logging_strategy='epoch',
                load_best_model_at_end=True,
                remove_unused_columns=False,
            )

            # define training loop
            trainer = Trainer(
                model=model,
                args=training_args,
                train_dataset=train_dataset,
                eval_dataset=valid_dataset,
                compute_metrics=compute_metrics,
            )

            # start training loop
            trainer.train()

    sweep_id = wandb.sweep(sweep_config, project='fun-sweep')
    wandb.agent(sweep_id, train, count=10)
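With this structure, the hyperparameters are read inside train() via wandb.init(), which picks up the values the sweep controller assigns to each run, so a single agent can execute several runs back to back without re-creating the sweep.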

This is not HF; Trainer is just my custom class that holds all the modules together.

Got it. In that case, could you try increasing the count in wandb_logger.agent(sweep_id, function=trainer.train, count=1) to more than 1? The count refers to the number of runs a single agent will execute before it stops, so increasing it is the first step toward getting more runs to appear in the UI.
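For example, here is a minimal sketch of how that could look with your own snippets (the sweep_train name is illustrative, and I’m assuming the Trainer constructor, loaders, and sweep_configuration from your code above):

    import wandb

    def sweep_train():
        # wandb.init() inside the function receives this run's sweep values
        with wandb.init():
            args.lr = wandb.config.lr
            args.drop_rate = wandb.config.drop_rate
            args.weight_decay = wandb.config.weight_decay
            args.grad_clip = wandb.config.grad_clip

            # build the Trainer per run so each run sees its own hyperparameters
            trainer = Trainer(args, logger=(my_logger, wandb_logger),
                              loaders=(trainloader, valloader),
                              decode_fn=train_dataset.tok.decode,
                              shard=shard,
                              key=key)
            trainer.train()

    # create the sweep once, then let one agent execute 10 runs
    sweep_id = wandb.sweep(sweep=sweep_configuration,
                           project="<...>", entity="<...>")
    wandb.agent(sweep_id, function=sweep_train, count=10)

The sweep is created once, and count controls how many runs the agent executes before stopping.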


Hi Neel, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!