It finally got past that stage, but now I am getting an OOM (out-of-memory) error.
Here is the content of the output file:
Script started at Thu Sep 19 06:12:18 UTC 2024
Start Hugging Face....................
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful
31880c0fbd53:30381:30381 [0] NCCL INFO cudaDriverVersion 12040
31880c0fbd53:30381:30381 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
31880c0fbd53:30381:30381 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
NCCL version 2.19.3+cuda12.1
31880c0fbd53:30381:31051 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
31880c0fbd53:30381:31051 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
31880c0fbd53:30381:31051 [0] NCCL INFO Using non-device net plugin version 0
31880c0fbd53:30381:31051 [0] NCCL INFO Using network Socket
31880c0fbd53:30381:31053 [2] NCCL INFO Using non-device net plugin version 0
31880c0fbd53:30381:31053 [2] NCCL INFO Using network Socket
31880c0fbd53:30381:31052 [1] NCCL INFO Using non-device net plugin version 0
31880c0fbd53:30381:31052 [1] NCCL INFO Using network Socket
31880c0fbd53:30381:31051 [0] NCCL INFO comm 0xcf27f60 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 1000 commId 0xbca86441730ea9b6 - Init START
31880c0fbd53:30381:31053 [2] NCCL INFO comm 0xcf2f460 rank 2 nranks 3 cudaDev 2 nvmlDev 2 busId 41000 commId 0xbca86441730ea9b6 - Init START
31880c0fbd53:30381:31052 [1] NCCL INFO comm 0xcf2b3d0 rank 1 nranks 3 cudaDev 1 nvmlDev 1 busId 22000 commId 0xbca86441730ea9b6 - Init START
31880c0fbd53:30381:31051 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
31880c0fbd53:30381:31051 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
31880c0fbd53:30381:31051 [0] NCCL INFO NVLS multicast support is not available on dev 0
31880c0fbd53:30381:31052 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
31880c0fbd53:30381:31052 [1] NCCL INFO NVLS multicast support is not available on dev 1
31880c0fbd53:30381:31053 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
31880c0fbd53:30381:31053 [2] NCCL INFO NVLS multicast support is not available on dev 2
31880c0fbd53:30381:31052 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
31880c0fbd53:30381:31051 [0] NCCL INFO Channel 00/04 : 0 1 2
31880c0fbd53:30381:31052 [1] NCCL INFO P2P Chunksize set to 131072
31880c0fbd53:30381:31051 [0] NCCL INFO Channel 01/04 : 0 1 2
31880c0fbd53:30381:31051 [0] NCCL INFO Channel 02/04 : 0 1 2
31880c0fbd53:30381:31051 [0] NCCL INFO Channel 03/04 : 0 1 2
31880c0fbd53:30381:31051 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
31880c0fbd53:30381:31051 [0] NCCL INFO P2P Chunksize set to 131072
31880c0fbd53:30381:31053 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1
31880c0fbd53:30381:31053 [2] NCCL INFO P2P Chunksize set to 131072
31880c0fbd53:30381:31051 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
31880c0fbd53:30381:31052 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct
31880c0fbd53:30381:31053 [2] NCCL INFO Channel 00 : 2[2] -> 0[0] via SHM/direct/direct
31880c0fbd53:30381:31051 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
31880c0fbd53:30381:31052 [1] NCCL INFO Channel 01 : 1[1] -> 2[2] via SHM/direct/direct
31880c0fbd53:30381:31053 [2] NCCL INFO Channel 01 : 2[2] -> 0[0] via SHM/direct/direct
31880c0fbd53:30381:31051 [0] NCCL INFO Channel 02 : 0[0] -> 1[1] via SHM/direct/direct
31880c0fbd53:30381:31052 [1] NCCL INFO Channel 02 : 1[1] -> 2[2] via SHM/direct/direct
31880c0fbd53:30381:31053 [2] NCCL INFO Channel 02 : 2[2] -> 0[0] via SHM/direct/direct
31880c0fbd53:30381:31051 [0] NCCL INFO Channel 03 : 0[0] -> 1[1] via SHM/direct/direct
31880c0fbd53:30381:31052 [1] NCCL INFO Channel 03 : 1[1] -> 2[2] via SHM/direct/direct
31880c0fbd53:30381:31053 [2] NCCL INFO Channel 03 : 2[2] -> 0[0] via SHM/direct/direct
31880c0fbd53:30381:31051 [0] NCCL INFO Connected all rings
31880c0fbd53:30381:31052 [1] NCCL INFO Connected all rings
31880c0fbd53:30381:31053 [2] NCCL INFO Connected all rings
31880c0fbd53:30381:31053 [2] NCCL INFO Channel 00 : 2[2] -> 1[1] via SHM/direct/direct
31880c0fbd53:30381:31053 [2] NCCL INFO Channel 01 : 2[2] -> 1[1] via SHM/direct/direct
31880c0fbd53:30381:31053 [2] NCCL INFO Channel 02 : 2[2] -> 1[1] via SHM/direct/direct
31880c0fbd53:30381:31053 [2] NCCL INFO Channel 03 : 2[2] -> 1[1] via SHM/direct/direct
And here is the error file:
Script started at Thu Sep 19 06:12:18 UTC 2024
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.18.1
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run colorful-lake-565
wandb: WARNING Calling wandb.login() after wandb.init() has no effect.
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:01<00:01, 1.33s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.13it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.05it/s]
/opt/conda/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:599: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
warnings.warn(
/opt/conda/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:604: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
warnings.warn(
/opt/conda/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:599: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
warnings.warn(
/opt/conda/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:604: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
warnings.warn(
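A side note on the pair of UserWarnings above: they mean the model's saved generation config has `do_sample=False` while `temperature` and `top_p` are set, so those sampling knobs are silently ignored. A minimal sketch of one way to resolve the mismatch, assuming the loaded model is in a variable called `model` (a placeholder name):

    # Placeholder: `model` is the loaded Transformers model instance.
    # Option 1: enable sampling so temperature/top_p actually take effect.
    model.generation_config.do_sample = True
    model.generation_config.temperature = 0.9
    model.generation_config.top_p = 0.6
    # Option 2: keep greedy decoding and unset the sampling-only settings.
    # model.generation_config.temperature = None
    # model.generation_config.top_p = None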
/opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1541: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
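These two warnings are both about `TrainingArguments`: `evaluation_strategy` was renamed to `eval_strategy`, and `run_name` defaults to `output_dir` unless set explicitly. A minimal sketch with placeholder values:

    from transformers import TrainingArguments

    # Placeholder values; only the two renamed/added fields matter here.
    args = TrainingArguments(
        output_dir="out",
        eval_strategy="steps",  # replaces the deprecated `evaluation_strategy`
        run_name="my-run",      # distinct from output_dir, silences the wandb warning
    )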
0%| | 0/7 [00:00<?, ?it/s]/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
ERROR:__main__:An error occurred during training: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 75.19 MiB is free. Process 1967628 has 508.00 MiB memory in use. Process 1602384 has 78.55 GiB memory in use. Of the allocated memory 77.24 GiB is allocated by PyTorch, and 695.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
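This is the actual failure: GPU 0 has 79.14 GiB total, PyTorch alone has allocated 77.24 GiB, and a 172 MiB allocation pushes it over. Note that the message also shows a second process (508 MiB) holding memory on the same GPU, which seems worth checking with nvidia-smi. The first things to try appear to be the allocator setting the message itself suggests, plus the usual memory levers; a minimal sketch assuming a Trainer-based script, all values placeholders:

    import os

    # Must be set before the first CUDA allocation, as the error message suggests.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    from transformers import TrainingArguments

    # Placeholder values: trade compute for memory.
    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=1,   # smallest batch that still trains
        gradient_accumulation_steps=8,   # keeps the effective batch size up
        gradient_checkpointing=True,     # recompute activations instead of storing them
        bf16=True,                       # half-precision compute if the GPU supports it
    )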
/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py:894: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
warnings.warn(
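This deprecation is just a rename: pass `token` instead of `use_auth_token` wherever the Hub is called. For example (placeholder model id):

    from transformers import AutoModelForCausalLM

    # Placeholder repo id; token=True reuses the cached login from the output file above.
    model = AutoModelForCausalLM.from_pretrained("user/my-model", token=True)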
/opt/conda/lib/python3.10/site-packages/peft/utils/save_and_load.py:227: UserWarning: Setting `save_embedding_layers` to `True` as embedding layers found in `target_modules`.
warnings.warn("Setting `save_embedding_layers` to `True` as embedding layers found in `target_modules`.")
No files have been modified since last commit. Skipping to prevent empty commit.
WARNING:huggingface_hub.hf_api:No files have been modified since last commit. Skipping to prevent empty commit.
INFO:__main__:Model and tokenizer have been saved.
/opt/conda/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:599: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
warnings.warn(
/opt/conda/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:604: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
warnings.warn(
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
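That last reminder means the prompt plus the requested generation exceeds the model's 4096-token context. Capping the number of new tokens avoids it; a minimal sketch with placeholder names:

    # Placeholders: `model` and `inputs` are the loaded model and tokenized prompt.
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,  # placeholder budget; keep prompt + generation under 4096
        do_sample=True,      # consistent with the temperature/top_p warnings above
        temperature=0.9,
        top_p=0.6,
    )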