BrokenPipeError when doing sweeps

Multiple people seem to be facing this problem, and the debug logs are very uninformative. It appears to be sweeps-specific, since vanilla WandB runs work fine.

Still, I’ve attached the logs in case anyone wants to check them out.

Code

I’m doing something like:

    def init_wandb_sweep(self, args, wandb: Callable = None) -> tuple:
        '''
        Set up WandB sweep configs. Only run after wandb_logger() has been called.
        '''
        sweep_configuration = {
            "method": "random",
            "name": "sweep",
            "metric": {"goal": "maximize", "name": "Train/acc"},
            "parameters": {
                "lr": {"max": 1e-2, "min": 1e-5},
                "drop_rate": {"max": 0.2, "min": 0.0},
                "weight_decay": {"max": 1e-3, "min": 1e-5},
                "grad_clip": {"max": 1.0, "min": 0.1},
            },
        }
        assert wandb is not None, "Wandb logger not initialized"

        sweep_id = wandb.sweep(sweep=sweep_configuration, project="<...>",
                               entity='<...>')

        # pull the sweep-assigned hyperparameters into args
        args.lr = wandb.config.lr
        args.drop_rate = wandb.config.drop_rate
        args.weight_decay = wandb.config.weight_decay
        args.grad_clip = wandb.config.grad_clip

        return sweep_id, args

which is called by:

    sweep_id, new_args = logger.init_wandb_sweep(args, wandb_logger)

    if args.tune_hyperparams:
        args = new_args

    trainer = Trainer(args, logger=(my_logger, wandb_logger),
                      loaders=(trainloader, valloader),
                      decode_fn=train_dataset.tok.decode,
                      shard=shard,
                      key=key)

    wandb_logger.agent(sweep_id, function=trainer.train, count=1)

Logs

Can someone please help resolve this? All my work is on hold because I simply can’t do 50 sweep runs manually every time. :expressionless:

Hello!

Based on your code, it does seem that the sweeps are erroring out because you are creating a new sweep every time you run your script, which causes conflicts between the sweep and the wandb library.

I would recommend structuring your code like the example below: create a function that takes in arguments and spins up the Trainer instance, and let the sweep feed different hyperparameters into that function. Also, changing the agent’s count, e.g. wandb.agent(sweep_id, train, count=10), will increase the number of runs the agent executes to 10.

    sweep_config = {
        "method": "random",
        "name": "disaster-sweep",
        "metric": {"goal": "minimize", "name": "train/loss"},
        "parameters": {
            "epochs": {"values": [5, 10]},
            "batch_size": {"values": [8, 16, 32, 64]},
            "learning_rate": {"values": [0.005, 0.0001, 0.00005]},
            "weight_decay": {"values": [0.0001, 0.1]},
        },
    }

    # Training function with args
    def train(config=None):
        with wandb.init(config=config):
            # set sweep configuration
            config = wandb.config

            # set training arguments
            training_args = TrainingArguments(
                output_dir='./results',
                report_to='wandb',  # Turn on Weights & Biases logging
                num_train_epochs=config.epochs,
                learning_rate=config.learning_rate,
                weight_decay=config.weight_decay,
                per_device_train_batch_size=config.batch_size,
                per_device_eval_batch_size=config.batch_size,
                save_strategy='epoch',
                evaluation_strategy='epoch',
                logging_strategy='epoch',
                load_best_model_at_end=True,
                remove_unused_columns=False,
            )

            # define training loop
            trainer = Trainer(
                model=model,
                args=training_args,
                train_dataset=train_dataset,
                eval_dataset=valid_dataset,
                compute_metrics=compute_metrics,
            )

            # start training loop
            trainer.train()

    sweep_id = wandb.sweep(sweep_config, project='fun-sweep')
    wandb.agent(sweep_id, train, count=10)
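With this structure, the hyperparameters are read inside train() via wandb.init(), which picks up the values the sweep controller assigns to each run, so a single agent can execute several runs back to back without re-creating the sweep.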

This is not HF; Trainer is just my custom class that holds all the modules together.

Got it. In that case, could you try increasing the count in wandb_logger.agent(sweep_id, function=trainer.train, count=1) to more than 1? The count refers to the number of runs a single agent will execute before it stops, so increasing it is the first step toward getting more runs to appear in the UI.
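For example, here is a minimal sketch of how that could look with your own snippets (the sweep_train name is illustrative, and I’m assuming the Trainer constructor, loaders, and sweep_configuration from your code above):

    import wandb

    def sweep_train():
        # wandb.init() inside the function receives this run's sweep values
        with wandb.init():
            args.lr = wandb.config.lr
            args.drop_rate = wandb.config.drop_rate
            args.weight_decay = wandb.config.weight_decay
            args.grad_clip = wandb.config.grad_clip

            # build the Trainer per run so each run sees its own hyperparameters
            trainer = Trainer(args, logger=(my_logger, wandb_logger),
                              loaders=(trainloader, valloader),
                              decode_fn=train_dataset.tok.decode,
                              shard=shard,
                              key=key)
            trainer.train()

    # create the sweep once, then let one agent execute 10 runs
    sweep_id = wandb.sweep(sweep=sweep_configuration,
                           project="<...>", entity="<...>")
    wandb.agent(sweep_id, function=sweep_train, count=10)

The sweep is created once, and count controls how many runs the agent executes before stopping.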


Hi Neel, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!