It finally got past that stage, but now I am getting an OOM (out-of-memory) error.
Here is the content of the output file:
Script started at Thu Sep 19 06:12:18 UTC 2024
Start Hugging Face....................
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful
31880c0fbd53:30381:30381 [0] NCCL INFO cudaDriverVersion 12040
31880c0fbd53:30381:30381 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
31880c0fbd53:30381:30381 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
NCCL version 2.19.3+cuda12.1
31880c0fbd53:30381:31051 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
31880c0fbd53:30381:31051 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
31880c0fbd53:30381:31051 [0] NCCL INFO Using non-device net plugin version 0
31880c0fbd53:30381:31051 [0] NCCL INFO Using network Socket
31880c0fbd53:30381:31053 [2] NCCL INFO Using non-device net plugin version 0
31880c0fbd53:30381:31053 [2] NCCL INFO Using network Socket
31880c0fbd53:30381:31052 [1] NCCL INFO Using non-device net plugin version 0
31880c0fbd53:30381:31052 [1] NCCL INFO Using network Socket
31880c0fbd53:30381:31051 [0] NCCL INFO comm 0xcf27f60 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 1000 commId 0xbca86441730ea9b6 - Init START
31880c0fbd53:30381:31053 [2] NCCL INFO comm 0xcf2f460 rank 2 nranks 3 cudaDev 2 nvmlDev 2 busId 41000 commId 0xbca86441730ea9b6 - Init START
31880c0fbd53:30381:31052 [1] NCCL INFO comm 0xcf2b3d0 rank 1 nranks 3 cudaDev 1 nvmlDev 1 busId 22000 commId 0xbca86441730ea9b6 - Init START
31880c0fbd53:30381:31051 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
31880c0fbd53:30381:31051 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
31880c0fbd53:30381:31051 [0] NCCL INFO NVLS multicast support is not available on dev 0
31880c0fbd53:30381:31052 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
31880c0fbd53:30381:31052 [1] NCCL INFO NVLS multicast support is not available on dev 1
31880c0fbd53:30381:31053 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
31880c0fbd53:30381:31053 [2] NCCL INFO NVLS multicast support is not available on dev 2
31880c0fbd53:30381:31052 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
31880c0fbd53:30381:31051 [0] NCCL INFO Channel 00/04 : 0 1 2
31880c0fbd53:30381:31052 [1] NCCL INFO P2P Chunksize set to 131072
31880c0fbd53:30381:31051 [0] NCCL INFO Channel 01/04 : 0 1 2
31880c0fbd53:30381:31051 [0] NCCL INFO Channel 02/04 : 0 1 2
31880c0fbd53:30381:31051 [0] NCCL INFO Channel 03/04 : 0 1 2
31880c0fbd53:30381:31051 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
31880c0fbd53:30381:31051 [0] NCCL INFO P2P Chunksize set to 131072
31880c0fbd53:30381:31053 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1
31880c0fbd53:30381:31053 [2] NCCL INFO P2P Chunksize set to 131072
31880c0fbd53:30381:31051 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
31880c0fbd53:30381:31052 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct
31880c0fbd53:30381:31053 [2] NCCL INFO Channel 00 : 2[2] -> 0[0] via SHM/direct/direct
31880c0fbd53:30381:31051 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
31880c0fbd53:30381:31052 [1] NCCL INFO Channel 01 : 1[1] -> 2[2] via SHM/direct/direct
31880c0fbd53:30381:31053 [2] NCCL INFO Channel 01 : 2[2] -> 0[0] via SHM/direct/direct
31880c0fbd53:30381:31051 [0] NCCL INFO Channel 02 : 0[0] -> 1[1] via SHM/direct/direct
31880c0fbd53:30381:31052 [1] NCCL INFO Channel 02 : 1[1] -> 2[2] via SHM/direct/direct
31880c0fbd53:30381:31053 [2] NCCL INFO Channel 02 : 2[2] -> 0[0] via SHM/direct/direct
31880c0fbd53:30381:31051 [0] NCCL INFO Channel 03 : 0[0] -> 1[1] via SHM/direct/direct
31880c0fbd53:30381:31052 [1] NCCL INFO Channel 03 : 1[1] -> 2[2] via SHM/direct/direct
31880c0fbd53:30381:31053 [2] NCCL INFO Channel 03 : 2[2] -> 0[0] via SHM/direct/direct
31880c0fbd53:30381:31051 [0] NCCL INFO Connected all rings
31880c0fbd53:30381:31052 [1] NCCL INFO Connected all rings
31880c0fbd53:30381:31053 [2] NCCL INFO Connected all rings
31880c0fbd53:30381:31053 [2] NCCL INFO Channel 00 : 2[2] -> 1[1] via SHM/direct/direct
31880c0fbd53:30381:31053 [2] NCCL INFO Channel 01 : 2[2] -> 1[1] via SHM/direct/direct
31880c0fbd53:30381:31053 [2] NCCL INFO Channel 02 : 2[2] -> 1[1] via SHM/direct/direct
31880c0fbd53:30381:31053 [2] NCCL INFO Channel 03 : 2[2] -> 1[1] via SHM/direct/direct
And here is the error file:
Script started at Thu Sep 19 06:12:18 UTC 2024
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.18.1
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run colorful-lake-565
wandb: WARNING Calling wandb.login() after wandb.init() has no effect.
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:01<00:01, 1.33s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.13it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.05it/s]
/opt/conda/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:599: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
warnings.warn(
/opt/conda/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:604: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
warnings.warn(
/opt/conda/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:599: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
warnings.warn(
/opt/conda/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:604: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
warnings.warn(
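A side note on the pair of UserWarnings above: they mean the model's saved generation config has `do_sample=False` while `temperature` and `top_p` are set, so those sampling knobs are silently ignored. A minimal sketch of one way to resolve the mismatch, assuming the loaded model is in a variable called `model` (a placeholder name):

    # Placeholder: `model` is the loaded Transformers model instance.
    # Option 1: enable sampling so temperature/top_p actually take effect.
    model.generation_config.do_sample = True
    model.generation_config.temperature = 0.9
    model.generation_config.top_p = 0.6
    # Option 2: keep greedy decoding and unset the sampling-only settings.
    # model.generation_config.temperature = None
    # model.generation_config.top_p = None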
/opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1541: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
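These two warnings are both about `TrainingArguments`: `evaluation_strategy` was renamed to `eval_strategy`, and `run_name` defaults to `output_dir` unless set explicitly. A minimal sketch with placeholder values:

    from transformers import TrainingArguments

    # Placeholder values; only the two renamed/added fields matter here.
    args = TrainingArguments(
        output_dir="out",
        eval_strategy="steps",  # replaces the deprecated `evaluation_strategy`
        run_name="my-run",      # distinct from output_dir, silences the wandb warning
    )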
0%| | 0/7 [00:00<?, ?it/s]/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
ERROR:__main__:An error occurred during training: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 75.19 MiB is free. Process 1967628 has 508.00 MiB memory in use. Process 1602384 has 78.55 GiB memory in use. Of the allocated memory 77.24 GiB is allocated by PyTorch, and 695.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
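This is the actual failure: GPU 0 has 79.14 GiB total, PyTorch alone has allocated 77.24 GiB, and a 172 MiB allocation pushes it over. Note that the message also shows a second process (508 MiB) holding memory on the same GPU, which seems worth checking with nvidia-smi. The first things to try appear to be the allocator setting the message itself suggests, plus the usual memory levers; a minimal sketch assuming a Trainer-based script, all values placeholders:

    import os

    # Must be set before the first CUDA allocation, as the error message suggests.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    from transformers import TrainingArguments

    # Placeholder values: trade compute for memory.
    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=1,   # smallest batch that still trains
        gradient_accumulation_steps=8,   # keeps the effective batch size up
        gradient_checkpointing=True,     # recompute activations instead of storing them
        bf16=True,                       # half-precision compute if the GPU supports it
    )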
/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py:894: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
warnings.warn(
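This deprecation is just a rename: pass `token` instead of `use_auth_token` wherever the Hub is called. For example (placeholder model id):

    from transformers import AutoModelForCausalLM

    # Placeholder repo id; token=True reuses the cached login from the output file above.
    model = AutoModelForCausalLM.from_pretrained("user/my-model", token=True)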
/opt/conda/lib/python3.10/site-packages/peft/utils/save_and_load.py:227: UserWarning: Setting `save_embedding_layers` to `True` as embedding layers found in `target_modules`.
warnings.warn("Setting `save_embedding_layers` to `True` as embedding layers found in `target_modules`.")
No files have been modified since last commit. Skipping to prevent empty commit.
WARNING:huggingface_hub.hf_api:No files have been modified since last commit. Skipping to prevent empty commit.
INFO:__main__:Model and tokenizer have been saved.
/opt/conda/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:599: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
warnings.warn(
/opt/conda/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:604: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
warnings.warn(
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
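That last reminder means the prompt plus the requested generation exceeds the model's 4096-token context. Capping the number of new tokens avoids it; a minimal sketch with placeholder names:

    # Placeholders: `model` and `inputs` are the loaded model and tokenized prompt.
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,  # placeholder budget; keep prompt + generation under 4096
        do_sample=True,      # consistent with the temperature/top_p warnings above
        temperature=0.9,
        top_p=0.6,
    )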