Sweeps: Waiting for W&B process to finish... (failed 1)

Hi everyone,
I am trying to sweep my hyperparameters for my TensorFlow model. I am using Bayes as the sweeping method.

In my train() function, I have several .fit() methods as I am training a progressive GAN and I am required to call model.fit() several times.

After the first model.fit() completes successfully, I get this error:

Waiting for W&B process to finish... (failed 1)

What should I do?

I followed this tutorial to set up W&B sweeps: Google Colab

And here is a look at my sweep_train() function:

def sweep_train(config_defaults=None):
    # Initialize the W&B run; the sweep agent supplies the hyperparameters through wandb.config
    run = wandb.init(config=config_defaults, resume=True)

    pgan = PGAN(latent_dim=NOISE_DIM, d_steps=wandb.config.D_STEPS)

    cbk = GANMonitor(num_img=NUM_IMGS_GENERATE, latent_dim=NOISE_DIM)
    cbk.set_steps(steps_per_epoch=STEPS_PER_EPOCH, epochs=wandb.config.EPOCHS)  # 110, 6
    cbk.set_prefix(prefix='0_init')

    train(wandb.config.G_LR, wandb.config.D_LR, wandb.config.R_LR,
          wandb.config.EPOCHS, wandb.config.D_STEPS, cbk, pgan)

    run.finish()

Hello @aryamohan23!

Are there more details in the traceback, such as the lines right before Waiting for W&B process to finish... (failed 1)? There should be a traceback into the code showing which error is causing the run to fail. W&B also uploads all runs regardless of status (success, failed, crashed, etc.), so the (failed 1) message indicates the state of the run.
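If the traceback is getting lost in the sweep output, one way to surface it is to wrap the training call. This is only a minimal sketch: run_guarded is a hypothetical helper (not part of wandb), run is assumed to be the object returned by wandb.init(), and train_fn is assumed to be a zero-argument callable that runs all of the model.fit() stages.

```python
import traceback


def run_guarded(run, train_fn):
    """Run train_fn, print the full traceback on failure, then close the run.

    `run` is assumed to be the object returned by wandb.init();
    `train_fn` is a hypothetical zero-argument callable wrapping train(...).
    """
    try:
        train_fn()
    except Exception:
        # Print the Python traceback so the underlying error is visible
        # above the "Waiting for W&B process to finish..." line.
        traceback.print_exc()
        # Mark the run as failed, then re-raise so the sweep agent sees the error.
        run.finish(exit_code=1)
        raise
    # Training finished without an exception: close the run normally.
    run.finish()
```

Inside sweep_train() this would replace the direct train(...) and run.finish() calls, for example run_guarded(run, lambda: train(...)) with the same arguments as before.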

Hi there, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

Hello, sorry for the delay. I get this error:

wandb: ERROR Dropped streaming file chunk (see wandb/debug-internal.log)

I realized it was a network issue, so I set wandb to offline mode. But even when I try to sync the run with wandb sync, I get the same error.
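For anyone reproducing this setup, the offline workflow can be sketched roughly as follows. The WANDB_MODE environment variable and the wandb sync command are the standard W&B mechanisms. The sync_commands helper is hypothetical, and it assumes offline runs are stored in folders named offline-run-DATETIME-ID inside the local wandb directory.

```python
import os
from pathlib import Path

# Set this before wandb.init() is called so the run is only written to disk.
os.environ["WANDB_MODE"] = "offline"


def sync_commands(wandb_dir="wandb"):
    """Return one `wandb sync` command line per offline run folder found.

    Hypothetical helper: it only builds the command strings and does not
    run them, so they can be reviewed before syncing.
    """
    root = Path(wandb_dir)
    return [
        f"wandb sync {run_dir}"
        for run_dir in sorted(root.glob("offline-run-*"))
        if run_dir.is_dir()
    ]
```

Each returned string can then be run in a shell once the network connection is working again.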

Hello Arya!

Could you provide the debug.log and debug-internal.log for the run? They should be located in the wandb folder, in the same directory the script was run from. The wandb folder contains one folder per run, named run-DATETIME-ID. Could you retrieve the debug.log and debug-internal.log files from the folder of the run that is having issues?
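If there are many run folders, a short script can list where the log files are. This is a rough sketch: find_run_logs is a hypothetical helper, and it searches each run folder recursively because the exact subfolder holding the logs (often logs/) is an assumption here.

```python
from pathlib import Path

LOG_NAMES = ("debug.log", "debug-internal.log")


def find_run_logs(wandb_dir="wandb"):
    """Map each run folder name to the debug log files found inside it."""
    logs = {}
    # "*run-*" matches run-DATETIME-ID as well as offline-run-DATETIME-ID.
    for run_dir in sorted(Path(wandb_dir).glob("*run-*")):
        if not run_dir.is_dir():
            continue
        # Search recursively, since the logs may sit in a subfolder.
        found = sorted(p for p in run_dir.rglob("*.log") if p.name in LOG_NAMES)
        if found:
            logs[run_dir.name] = found
    return logs


if __name__ == "__main__":
    for run_name, paths in find_run_logs().items():
        print(run_name)
        for path in paths:
            print("   ", path)
```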

Hello, sorry for coming back to this so late.
I seem to have resolved that issue; it appears to have been just a VPN problem.
However, my logged images are now not showing up in the ‘Charts’ tab on my dashboard, even though all of them can be seen in the ‘Files’ tab.
The run also shows as ‘crashed’ even though the files are logged.

Attached the debug and debug-internal files:
debug-internal
debug

Thanks in advance
