Hugging Face Accelerate + Sweeps

Hi,

I am struggling to get sweeps to work with Hugging Face’s Accelerate library. Specifically, the first run of the sweep works fine, but every subsequent run fails because the Accelerator is re-initialised for each run. From the 2nd run onwards, I get the error: AcceleratorState has already been initialized and cannot be changed, restart your runtime completely and pass mixed_precision='bf16' to Accelerate().

Below is a minimal example of a script which I’m launching using accelerate launch. I’d appreciate any suggestions. Thanks!

import os
from typing import Any, List, Tuple

from accelerate import Accelerator
from torch import Tensor
from torch.utils.data import Dataset, DataLoader
from transformers import (
    Adafactor,
    PreTrainedTokenizerFast,
    T5ForConditionalGeneration,
    T5TokenizerFast,
)
import wandb


class TestDataset(Dataset[Any]):
    def __init__(self, tokenizer: PreTrainedTokenizerFast) -> None:
        super().__init__()
        self._str_prompt = "This is a "
        self._str_target = "test."
        
        self._tokenizer = tokenizer
    
    def __len__(self) -> int:
        return 1

    def __getitem__(self, idx: int) -> Tuple[str, str]:
        return self._str_prompt, self._str_target
    
    def collate(self, batch: List[Tuple[str, str]]) -> Tuple[Tensor, Tensor]:
        prompts = [b[0] for b in batch]
        targets = [b[1] for b in batch]
        
        prompts_tokenized = self._tokenizer(prompts, return_tensors="pt")
        targets_tokenized = self._tokenizer(targets, return_tensors="pt")
        
        return prompts_tokenized["input_ids"], targets_tokenized["input_ids"]


def main() -> None:
    accelerator = Accelerator(log_with="wandb", mixed_precision="bf16")
    
    if accelerator.is_main_process:
        accelerator.init_trackers(os.environ.get("WANDB_PROJECT"))
    
    accelerator.wait_for_everyone()
    
    wandb_tracker = accelerator.get_tracker("wandb")
    multiplier = wandb_tracker.config["multiplier"]
    
    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    tokenizer = T5TokenizerFast.from_pretrained("t5-small")
    opt = Adafactor(params=model.parameters())
    
    dataset = TestDataset(tokenizer=tokenizer)
    data_loader = DataLoader(dataset=dataset, collate_fn=dataset.collate)
    
    model, opt, data_loader = accelerator.prepare(model, opt, data_loader)
    
    input_ids, labels = next(iter(data_loader))
    
    loss = model(input_ids=input_ids, labels=labels).loss
    
    loss_gathered = accelerator.gather_for_metrics(loss).mean()
    accelerator.log({"loss": loss_gathered.item() * multiplier})
    
    accelerator.end_training()


if __name__ == "__main__":
    sweep_configuration = {
        "method": "random",
        "metric": {"goal": "maximize", "name": "loss"},
        "parameters": {"multiplier": {"values": list(range(100))}},
    }
    
    sweep_id = wandb.sweep(
        sweep=sweep_configuration,
        project=os.environ.get("WANDB_PROJECT"),
    )
    wandb.agent(sweep_id, function=main, count=3)
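
(For reference, I launch this with something like WANDB_PROJECT=my-project accelerate launch sweep_test.py, where the project name and filename are just placeholders.)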

Hi @harshil, thanks for reporting this! I’ve tested your code in this colab and it seems to be working properly for me. Could you try upgrading wandb, accelerate and transformers to the latest version? Also, would it be possible for you to share the debug files under your local wandb folder so I can have a look at them and see what’s happening here? Thanks!

Hi @harshil, I just wanted to follow up here! Would it be possible for you to try upgrading wandb, accelerate and transformers to the latest version? If the issue persists, would it be possible to share the debug files under your local wandb folder? Thanks!

Hi @luis_bergua1, thanks for your reply! Actually I also shared this issue in the Accelerate repo and was advised that the Accelerator must be instantiated outside the main function, which then worked for me. So it’s interesting that it worked for you without doing so…

However, I will indeed try upgrading all the libraries to the latest version and get back to you on this :+1:

Hi @harshil, great to hear this worked for you! Yes feel free to reach out to me if you need something else.

Hi @luis_bergua1,

Just to let you know, with accelerate-0.16.0, transformers-4.26.1 & wandb-0.13.11 I still get the same issue as above - AcceleratorState has already been initialized and cannot be changed, restart your runtime completely and pass mixed_precision='bf16' to Accelerate().

It works when instantiating the Accelerator outside the main function.
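
For reference, here’s roughly what the working version looks like. This is just a sketch trimmed for brevity; everything inside main() other than the Accelerator construction is unchanged from the script above.

import os

from accelerate import Accelerator
import wandb

# Created once at module level, so wandb.agent can call main() repeatedly
# without re-initialising AcceleratorState on each sweep run.
accelerator = Accelerator(log_with="wandb", mixed_precision="bf16")


def main() -> None:
    # Re-use the module-level Accelerator instead of constructing a new one per run
    if accelerator.is_main_process:
        accelerator.init_trackers(os.environ.get("WANDB_PROJECT"))

    accelerator.wait_for_everyone()

    # ... model / data / training / logging code as in the original script ...

    accelerator.end_training()


if __name__ == "__main__":
    sweep_configuration = {
        "method": "random",
        "metric": {"goal": "maximize", "name": "loss"},
        "parameters": {"multiplier": {"values": list(range(100))}},
    }

    sweep_id = wandb.sweep(
        sweep=sweep_configuration,
        project=os.environ.get("WANDB_PROJECT"),
    )
    wandb.agent(sweep_id, function=main, count=3)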

@luis_bergua1 Related to this, I’m having an issue running on multiple GPUs. I would like to run a sweep where each run of the sweep uses all the GPUs on my machine (i.e. I do not want to parallelise the sweep itself across GPUs, with one run per GPU).

The issue is that when I try to run this with e.g. 2 GPUs, W&B actually creates 2 sweeps and, in total, twice as many runs are performed as I requested with count. So e.g. with the script above where I specified count=3, I get 2 sweeps each with 3 runs, rather than just 1 sweep with 3 runs.

Is there a way around this where instead I only get one sweep, and each run uses all the GPUs?

My accelerate config is as follows, in case it’s useful:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false
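
One pattern I’ve been wondering about (but haven’t verified) is to drive the sweep from the wandb CLI instead of wandb.agent, and let the sweep’s command wrap accelerate launch, so that only a single agent process talks to W&B while each run still uses all the GPUs. Roughly, with a sweep.yaml along these lines, registered via wandb sweep sweep.yaml and then run with wandb agent <sweep_id> from a plain shell (the filename and the shortened parameter list are placeholders, and the script would then need to pick up multiplier from the CLI args or wandb.config rather than from the function-based agent):

program: sweep_test.py
method: random
metric:
  goal: maximize
  name: loss
parameters:
  multiplier:
    values: [0, 1, 2]
command:
  - ${env}
  - accelerate
  - launch
  - ${program}
  - ${args}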

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.