Training plots disappear once the run is done

I have an issue similar to the one described in

Here is the run on Weights & Biases:

Any help is appreciated.
Thanks!

Hi @samarth93, could you please expand on what you mean by the charts disappearing? A chart with a single data point indicates that only one point was logged at that step. Are you expecting additional data to be logged? Could you also describe how you were making the wandb log calls? A reproduction script would help us understand the issue better.
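For reference, here is a minimal sketch of the two patterns (the project name is just a placeholder): logging a metric at many steps produces a line chart with many points, while logging it a single time produces a chart with one point.

import wandb

run = wandb.init(project="my-project")  # placeholder project name

# Logging a metric at many steps produces a line chart with many points.
for step in range(50):
    wandb.log({"train/loss": 4.0 - 0.01 * step}, step=step)

# Logging a metric only once (e.g. a final aggregate) produces a chart
# with a single data point.
wandb.log({"train/train_loss": 3.5})

run.finish()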

Hi @samarth93, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

Hi @mohammadbakir

I am using the Hugging Face Trainer.
If you look at the screenshot I attached, the “train/loss” chart from BEFORE training ended shows multiple logged points, while the one from AFTER shows only a single point under “train/train_loss”.

Here is a simple script that shows this.

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

eli5 = load_dataset("eli5_category", split="train[:5000]")
eli5 = eli5.train_test_split(test_size=0.2)

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

eli5 = eli5.flatten()


tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)

block_size = 128


def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could pad instead if the model supported it.
    # Customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

training_args = TrainingArguments(
    output_dir="my_awesome_eli5_clm-model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=1,
    logging_steps=10,
    report_to="wandb"
)
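# With report_to="wandb" and logging_steps=10 above, the Trainer logs "train/loss"
# every 10 steps while training runs; the final summary metrics
# (e.g. "train/train_loss") are only logged once when trainer.train() finishes.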

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

The run is LINKED HERE

You can see from the system log that multiple points were indeed logged:

{'loss': 4.0592, 'learning_rate': 1.879154078549849e-05, 'epoch': 0.06}
{'loss': 4.0272, 'learning_rate': 1.8187311178247734e-05, 'epoch': 0.09}
{'loss': 4.069, 'learning_rate': 1.758308157099698e-05, 'epoch': 0.12}
{'loss': 4.0059, 'learning_rate': 1.6978851963746227e-05, 'epoch': 0.15}
{'loss': 3.9932, 'learning_rate': 1.637462235649547e-05, 'epoch': 0.18}
{'loss': 3.9708, 'learning_rate': 1.5770392749244713e-05, 'epoch': 0.21}
{'loss': 3.9881, 'learning_rate': 1.516616314199396e-05, 'epoch': 0.24}
{'loss': 3.98, 'learning_rate': 1.4561933534743205e-05, 'epoch': 0.27}
{'loss': 3.9377, 'learning_rate': 1.3957703927492448e-05, 'epoch': 0.3}
{'loss': 3.9848, 'learning_rate': 1.3353474320241693e-05, 'epoch': 0.33}
{'loss': 3.9886, 'learning_rate': 1.2749244712990937e-05, 'epoch': 0.36}
{'loss': 3.9664, 'learning_rate': 1.2145015105740184e-05, 'epoch': 0.39}
{'loss': 3.9844, 'learning_rate': 1.1540785498489427e-05, 'epoch': 0.42}
{'loss': 3.9746, 'learning_rate': 1.0936555891238672e-05, 'epoch': 0.45}
{'loss': 3.9991, 'learning_rate': 1.0332326283987916e-05, 'epoch': 0.48}
{'loss': 3.9858, 'learning_rate': 9.728096676737161e-06, 'epoch': 0.51}
{'loss': 3.9614, 'learning_rate': 9.123867069486404e-06, 'epoch': 0.54}
{'loss': 3.9674, 'learning_rate': 8.51963746223565e-06, 'epoch': 0.57}
{'loss': 3.9554, 'learning_rate': 7.915407854984894e-06, 'epoch': 0.6}
{'loss': 3.9727, 'learning_rate': 7.3111782477341395e-06, 'epoch': 0.63}
{'loss': 3.9345, 'learning_rate': 6.706948640483384e-06, 'epoch': 0.66}
{'loss': 3.9536, 'learning_rate': 6.102719033232629e-06, 'epoch': 0.69}
{'loss': 3.9285, 'learning_rate': 5.498489425981873e-06, 'epoch': 0.73}
{'loss': 3.9787, 'learning_rate': 4.894259818731118e-06, 'epoch': 0.76}
{'loss': 3.9878, 'learning_rate': 4.2900302114803626e-06, 'epoch': 0.79}
{'loss': 3.9905, 'learning_rate': 3.6858006042296073e-06, 'epoch': 0.82}
{'loss': 3.9849, 'learning_rate': 3.081570996978852e-06, 'epoch': 0.85}
{'loss': 3.9528, 'learning_rate': 2.477341389728097e-06, 'epoch': 0.88}
{'loss': 3.933, 'learning_rate': 1.8731117824773415e-06, 'epoch': 0.91}
{'loss': 3.9686, 'learning_rate': 1.2688821752265863e-06, 'epoch': 0.94}
{'loss': 3.9153, 'learning_rate': 6.646525679758309e-07, 'epoch': 0.97}
{'loss': 3.963, 'learning_rate': 6.042296072507553e-08, 'epoch': 1.0}

{'eval_loss': 3.8490514755249023, 'eval_runtime': 15.8753, 'eval_samples_per_second': 158.8, 'eval_steps_per_second': 4.976, 'epoch': 1.0}
{'train_runtime': 135.198, 'train_samples_per_second': 78.115, 'train_steps_per_second': 2.448, 'train_loss': 3.9789878270417183, 'epoch': 1.0}
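
In case it helps, the same per-step history can also be pulled back with the W&B public API to confirm the points are stored in the run (the run path below is just a placeholder):

import wandb

api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")  # placeholder run path
history = run.history(keys=["train/loss"])    # DataFrame of the logged points
print(history)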

I have the same problem, exactly the same.