Training plots disappear once the run is done

I have an issue similar to the one described in

Here is the run on Weights & Biases:

Any help is appreciated.
Thanks!

Hi @samarth93, could you please expand on what you mean by the charts disappearing? A chart with a single data point indicates that only one point was logged at that step. Are you expecting additional data to be logged? Could you also describe how you were making the wandb log calls? A reproduction script would help us understand the issue better.
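For reference, here is a minimal sketch of the two patterns (the project name is just a placeholder): logging a metric at many steps produces a line chart with many points, while logging it a single time produces a chart with one point.

import wandb

run = wandb.init(project="my-project")  # placeholder project name

# Logging a metric at many steps produces a line chart with many points.
for step in range(50):
    wandb.log({"train/loss": 4.0 - 0.01 * step}, step=step)

# Logging a metric only once (e.g. a final aggregate) produces a chart
# with a single data point.
wandb.log({"train/train_loss": 3.5})

run.finish()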

Hi @samarth93, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

Hi @mohammadbakir

I am using the Hugging Face Trainer.
If you look at the screenshot I attached, the “train/loss” chart from BEFORE training ended shows multiple logged points, while the one from AFTER shows only a single point under “train/train_loss”.

Here is a simple script that shows this.

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

eli5 = load_dataset("eli5_category", split="train[:5000]")
eli5 = eli5.train_test_split(test_size=0.2)

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

eli5 = eli5.flatten()


tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)

block_size = 128


def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could pad instead if the model supported it.
    # Customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

training_args = TrainingArguments(
    output_dir="my_awesome_eli5_clm-model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=1,
    logging_steps=10,
    report_to="wandb"
)
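# With report_to="wandb" and logging_steps=10 above, the Trainer logs "train/loss"
# every 10 steps while training runs; the final summary metrics
# (e.g. "train/train_loss") are only logged once when trainer.train() finishes.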

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

The run is LINKED HERE

You can see from the system log that multiple points were indeed logged:

{'loss': 4.0592, 'learning_rate': 1.879154078549849e-05, 'epoch': 0.06}
{'loss': 4.0272, 'learning_rate': 1.8187311178247734e-05, 'epoch': 0.09}
{'loss': 4.069, 'learning_rate': 1.758308157099698e-05, 'epoch': 0.12}
{'loss': 4.0059, 'learning_rate': 1.6978851963746227e-05, 'epoch': 0.15}
{'loss': 3.9932, 'learning_rate': 1.637462235649547e-05, 'epoch': 0.18}
{'loss': 3.9708, 'learning_rate': 1.5770392749244713e-05, 'epoch': 0.21}
{'loss': 3.9881, 'learning_rate': 1.516616314199396e-05, 'epoch': 0.24}
{'loss': 3.98, 'learning_rate': 1.4561933534743205e-05, 'epoch': 0.27}
{'loss': 3.9377, 'learning_rate': 1.3957703927492448e-05, 'epoch': 0.3}
{'loss': 3.9848, 'learning_rate': 1.3353474320241693e-05, 'epoch': 0.33}
{'loss': 3.9886, 'learning_rate': 1.2749244712990937e-05, 'epoch': 0.36}
{'loss': 3.9664, 'learning_rate': 1.2145015105740184e-05, 'epoch': 0.39}
{'loss': 3.9844, 'learning_rate': 1.1540785498489427e-05, 'epoch': 0.42}
{'loss': 3.9746, 'learning_rate': 1.0936555891238672e-05, 'epoch': 0.45}
{'loss': 3.9991, 'learning_rate': 1.0332326283987916e-05, 'epoch': 0.48}
{'loss': 3.9858, 'learning_rate': 9.728096676737161e-06, 'epoch': 0.51}
{'loss': 3.9614, 'learning_rate': 9.123867069486404e-06, 'epoch': 0.54}
{'loss': 3.9674, 'learning_rate': 8.51963746223565e-06, 'epoch': 0.57}
{'loss': 3.9554, 'learning_rate': 7.915407854984894e-06, 'epoch': 0.6}
{'loss': 3.9727, 'learning_rate': 7.3111782477341395e-06, 'epoch': 0.63}
{'loss': 3.9345, 'learning_rate': 6.706948640483384e-06, 'epoch': 0.66}
{'loss': 3.9536, 'learning_rate': 6.102719033232629e-06, 'epoch': 0.69}
{'loss': 3.9285, 'learning_rate': 5.498489425981873e-06, 'epoch': 0.73}
{'loss': 3.9787, 'learning_rate': 4.894259818731118e-06, 'epoch': 0.76}
{'loss': 3.9878, 'learning_rate': 4.2900302114803626e-06, 'epoch': 0.79}
{'loss': 3.9905, 'learning_rate': 3.6858006042296073e-06, 'epoch': 0.82}
{'loss': 3.9849, 'learning_rate': 3.081570996978852e-06, 'epoch': 0.85}
{'loss': 3.9528, 'learning_rate': 2.477341389728097e-06, 'epoch': 0.88}
{'loss': 3.933, 'learning_rate': 1.8731117824773415e-06, 'epoch': 0.91}
{'loss': 3.9686, 'learning_rate': 1.2688821752265863e-06, 'epoch': 0.94}
{'loss': 3.9153, 'learning_rate': 6.646525679758309e-07, 'epoch': 0.97}
{'loss': 3.963, 'learning_rate': 6.042296072507553e-08, 'epoch': 1.0}

{'eval_loss': 3.8490514755249023, 'eval_runtime': 15.8753, 'eval_samples_per_second': 158.8, 'eval_steps_per_second': 4.976, 'epoch': 1.0}
{'train_runtime': 135.198, 'train_samples_per_second': 78.115, 'train_steps_per_second': 2.448, 'train_loss': 3.9789878270417183, 'epoch': 1.0}
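
In case it helps, the same per-step history can also be pulled back with the W&B public API to confirm the points are stored in the run (the run path below is just a placeholder):

import wandb

api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")  # placeholder run path
history = run.history(keys=["train/loss"])    # DataFrame of the logged points
print(history)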

I have the same problem, exactly the same.