I am trying to train my model and log both the train and eval loss (or perplexity, ppl). But when I run it, nothing shows up in the charts except the system stats.
Why?
Link to full code (should be fully contained): https://github.com/brando90/beyond-scale-language-data-diversity/blob/main/src/alignment/fine_tuning_with_aligned_data.py
Snapshot of the most important code:
# -- Training arguments and trainer instantiation ref: https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments
output_dir = Path(f'~/data/maf_data/results_{today}/').expanduser() if not debug else Path(f'~/data/maf_data/results/').expanduser()
print(f'{debug=} {output_dir=}')
training_args = TrainingArguments(
    output_dir=output_dir,  # The output directory where the model predictions and checkpoints will be written.
    # num_train_epochs=num_train_epochs,
    max_steps=max_steps,  # TODO: hard to fix, see above
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,  # based on alpaca https://github.com/tatsu-lab/stanford_alpaca, allows processing effective_batch_size = gradient_accumulation_steps * batch_size, i.e. the number of iterations to accumulate before an optimizer update step
    gradient_checkpointing=gradient_checkpointing,  # TODO: depending on hardware, set to True?
    optim="paged_adamw_32bit",  # David Hall says to keep the optimizer in 32 bit https://arxiv.org/pdf/2112.11446.pdf TODO: if we are using brain float 16 (bf16) should we be using 32 bit? are optimizers always fp32? https://discuss.huggingface.co/t/is-there-a-paged-adamw-16bf-opim-option/51284
    per_device_eval_batch_size=per_device_eval_batch_size,
    warmup_steps=500,  # TODO: once real training starts we can select this number for llama v2; what does llama v2 do to make it stable while v1 didn't?
    warmup_ratio=0.03,  # copying alpaca for now, number of steps for a linear warmup, TODO: once real training starts, change?
    # weight_decay=0.01,  # TODO: once real training starts, change?
    weight_decay=0.00,  # TODO: once real training starts, change?
    learning_rate=1e-5,  # TODO: once real training starts, change? anything larger than 1e-3 I've had terrible experiences with
    max_grad_norm=1.0,  # TODO: once real training starts, change?
    lr_scheduler_type="cosine",  # TODO: once real training starts, change? using what I've seen most in vision
    logging_dir=Path('~/data/maf/logs').expanduser(),
    save_steps=2000,  # alpaca does 2000, other defaults were 500
    # logging_steps=500,
    logging_steps=50,
    remove_unused_columns=False,  # TODO: don't get why https://stackoverflow.com/questions/76879872/how-to-use-huggingface-hf-trainer-train-with-custom-collate-function/76929999#76929999 , https://claude.ai/chat/475a4638-cee3-4ce0-af64-c8b8d1dc0d90
    report_to=report_to,  # change to wandb!
    fp16=False,  # never ever set to True
    bf16=torch.cuda.get_device_capability(torch.cuda.current_device())[0] >= 8,  # compute capability >= 8 ==> brain float 16 available; set to False if you always want fp32
)
# print(f'{training_args=}')
print(f'{pretrained_model_name_or_path=}')
# TODO: might be nice to figure out how llama v2 counts the number of tokens they've trained on
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=lambda data: custom_collate_fn(data, tokenizer=tokenizer),
)
# - TODO: below is for qlora from falcon; it has the same interface as Trainer, so later let's use: https://github.com/artidoro/qlora
# from trl import SFTTrainer
# peft_config = None
# trainer = SFTTrainer(
#     model=model,
#     train_dataset=trainset,
#     peft_config=peft_config,
#     dataset_text_field="text",
#     max_seq_length=max_seq_length,
#     tokenizer=tokenizer,
#     args=training_arguments,
# )
# TODO why this? https://discuss.huggingface.co/t/why-do-you-need-to-re-upcast-the-norm-layers-of-hf-falcon-to-fb32/46139
# for name, module in trainer.model.named_modules():
# if "norm" in name:
# module = module.to(torch.float32)
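
For reference, here is a minimal sketch of the kind of TrainingArguments I assumed would be enough to get both loss curves into the charts (placeholder values only, not my real config; assuming transformers v4.31 as in the docs link in the first code comment, and that wandb login has already been run). Is the missing evaluation_strategy/eval_steps in my actual arguments the reason the eval loss never shows up, and could report_to not actually being "wandb" explain the missing train loss?

from transformers import TrainingArguments

# hypothetical placeholder config, not my real run
expected_args = TrainingArguments(
    output_dir="./results",           # placeholder
    max_steps=300,                    # placeholder
    per_device_train_batch_size=2,    # placeholder
    per_device_eval_batch_size=2,     # placeholder
    logging_steps=50,                 # how often the running train loss gets logged
    evaluation_strategy="steps",      # default is "no"; I never set this in my args above
    eval_steps=50,                    # how often the eval loss gets computed and logged
    report_to="wandb",                # my snippet passes a report_to variable instead of the literal
)

(For perplexity I figured I can just exponentiate the logged cross-entropy loss, so I don't think I need a compute_metrics function for that.)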