Wandb sweep showing null for loss

I’m having trouble fixing wandb logging for sweeps. See the screenshot for the output.

I’ve looked at other topics about the same error, but I’m still unable to figure out how to make it work. Thanks in advance to whoever takes the time to help me out.

When I want to run sweeps, I set the global variable sweep to True, and the script runs the functions below.
Here’s my code snippet for the sweep train function:

def sweep_train():
    print('Setting up sweep...')
    sweep_iteration = 1

    def train(config=None):
        nonlocal sweep_iteration
        with wandb.init(config=config):
            config = wandb.config
            print('this is the config for this sweep')
            print(config)
            print(f"Sweep number: {sweep_iteration}")

            spxdata = pd.read_csv('spxdata.csv', parse_dates=['Date'], index_col='Date')
            print(f'Data shape: {spxdata.shape}')
            x = spxdata[features]
            y = spxdata[metric]

            x_train, y_train = x[:-test_size_days], y[:-test_size_days]
            x_test, y_test = x[-test_size_days:], y[-test_size_days:]

            feature_scaler = StandardScaler()
            x_train_scaled = feature_scaler.fit_transform(x_train)
            x_test_scaled = feature_scaler.transform(x_test)

            target_scaler = StandardScaler()
            y_train_scaled = target_scaler.fit_transform(y_train.values.reshape(-1, 1))
            y_test_scaled = target_scaler.transform(y_test.values.reshape(-1, 1))

            model, history = train_model(x_train_scaled, y_train_scaled, x_test_scaled, y_test_scaled)
            print(f'Loss: {history["loss"][-1]}, Val Loss: {history["val_loss"][-1]}')
            wandb.log({"loss": history['loss'][-1], "val_loss": history['val_loss'][-1]})
            print('im inside the train function now')

            sweep_iteration += 1

    with open('sweep.yaml') as file:
        sweep_config = yaml.safe_load(file)
    sweep_id = wandb.sweep(sweep_config)
    wandb.agent(sweep_id, function=train, count=sweep_count)
    print('im done now')

if __name__ == '__main__':
    if sweep:
        sweep_train()
    else:
        start_time = time.time()
        main()
        print(f'Time taken: {time.time() - start_time:.2f} seconds')

here’s my sweep.yaml file content
method: bayes
metric:
  name: val_loss
  goal: minimize
parameters:
  batch_size:
    values: [16, 32, 64]
  epochs:
    values: [10,]
  learning_rate:
    max: 0.01
    min: 0.0001
  dropout:
    min: 0.2
    max: 0.5

if you need to see any other code lmk I’m willing to share. but for now I hope this is sufficient for debugging

Hey @thilak-cm212, thank you for writing. Could you please point me to the workspace you are logging your sweeps to? What are your username, the project name, and the name of the sweep you are running?

username: thilak-cm212
project name: Neural Networks Hyperparameter Tuning

Funny thing is, when you replied, the issue seemed to have fixed itself. But now I’m getting null values again. Until I understand what the issue is, I don’t think ‘just hoping that the issue fixes itself’ like it did last time is a reliable strategy.
Anyway, sweep 2z5kq8wx shows null values; it’s my latest run. Sweep zaabf7f8, on the other hand, works just fine.
Here’s a link in case you can’t find the sweeps webpage

Hey @thilak-cm212! Thank you for sending over the link.

Looking at this sweep here:

I do see that all of your runs are currently marked as crashed and none of them have any validation loss recorded. Are you expecting to see something different on your side?

Yes, I see that these runs crashed, which explains the null values in the plot. However, on running my code again, I see that it works.

My question now is this: what are the potential reasons for run crashes (especially when it previously worked, since I haven’t changed the code between the crashed run and the one that works now)? Alternatively, do you have any suggestions on how to sort of reboot the system and check if it works? Anything helps. Thanks for the prompt replies @artsiom, I really appreciate it.

Hey @thilak-cm212! Unfortunately, we will not know what exactly crashed your run unless we do some investigation.

Could you please provide the debug.log and debug-internal.log files associated with the run where you are running into this issue? These files should be located in the wandb folder relative to your working directory.
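If it helps to locate those files, here is a minimal sketch that lists them per run. It assumes the usual local layout, where each run gets a directory named like run-<timestamp>-<id> inside the wandb folder and keeps its logs in a logs subfolder; if your layout differs, adjust the paths. The function name find_debug_logs is made up for this example.

```python
from pathlib import Path

# The two files requested above.
LOG_NAMES = ("debug.log", "debug-internal.log")


def find_debug_logs(wandb_dir="wandb"):
    """Return {run_directory_name: [paths of the debug logs found in it]}.

    Assumption: runs live in `<wandb_dir>/run-*/logs/`. Run directories that
    contain neither log file are skipped.
    """
    found = {}
    # Run directory names start with a timestamp, so sorting by name
    # lists the oldest run first and the newest last.
    for run_dir in sorted(Path(wandb_dir).glob("run-*")):
        candidates = [run_dir / "logs" / name for name in LOG_NAMES]
        existing = [path for path in candidates if path.is_file()]
        if existing:
            found[run_dir.name] = existing
    return found
```

Calling find_debug_logs() from the directory you launch the script in should return the log paths grouped by run, so you can pick the ones belonging to the crashed sweep runs.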

If you are able to see the graphs in the UI for the finished runs, they should not go away and go back to null. If they do become null, there is definitely something suspicious going on.

You can fully reset your workspace by clicking the three dots in the top right corner of the screen and selecting reset to default, but this will also discard any changes you have made to your workspace setup.

so I just noticed this:
when I run the cli command ‘wandb disabled’ in terminal, and then run my script the values go to null. but when I have wandb enabled (by running ‘wandb enabled’ in terminal), the sweep works just fine.
for now it seems that this is what is crashing the runs and not reflecting the loss value in the sweep graphs.
I’ll test it for a few days and see if I face any errors with ‘wandb enabled’.
do you know why ‘wandb disabled’ is causing this issue?

Thank you for testing it out. I am assuming this happens because once you disable wandb, none of the info gets sent to the wandb cloud, and therefore the runs are marked as crashed. It is the same reason we cannot run sweeps in wandb offline mode: the sweep controller lives in the cloud and cannot manage sweep configs while offline.

I will follow up with you on Wednesday to make sure that this was the issue and that you are unblocked.
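One way to catch this earlier is to fail fast before starting the sweep agent. The sketch below is only a partial guard: it checks the WANDB_MODE environment variable and nothing else. As far as I know, the `wandb disabled` CLI command stores its setting in a settings file rather than in the environment, and this example does not read that file, so running `wandb enabled` remains the actual fix. The function name assert_wandb_online is made up for this example.

```python
import os

# Modes in which nothing is uploaded, so the sweep page has no metric to plot.
# "dryrun" is included on the assumption that it is a legacy alias for offline.
NON_SYNCING_MODES = {"disabled", "offline", "dryrun"}


def assert_wandb_online(env=None):
    """Raise early if WANDB_MODE turns syncing off; return the mode otherwise.

    Assumption: only the WANDB_MODE environment variable is inspected.
    Settings persisted by the `wandb disabled` CLI command are not detected.
    """
    env = os.environ if env is None else env
    # Treat an unset or empty variable as the default online mode.
    mode = (env.get("WANDB_MODE") or "online").strip().lower()
    if mode in NON_SYNCING_MODES:
        raise RuntimeError(
            f"WANDB_MODE={mode!r}: sweep metrics will not reach the server. "
            "Unset it or set it to 'online' before running the sweep."
        )
    return mode
```

In the script above, a call to assert_wandb_online() at the top of sweep_train(), before wandb.sweep(...), would stop the script with a clear message instead of producing runs with null metrics.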

Hi there, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

hey, thanks a lot for your help. the issue is resolved.

Sounds great! I will go ahead and close this ticket out. If you would like to reopen the conversation, please let us know! Unfortunately, at the moment, we do not receive notifications if a thread reopens on Discourse. So, please feel free to create a new ticket regarding your concern if you’d like to continue the conversation.