InitStartError: Error communicating with wandb process

My code ran well on wandb 0.12.21, but after I upgraded to the latest version it fails with this error: InitStartError: Error communicating with wandb process. I tried the solution in the documentation, but it doesn’t work. The code is shown below.

def k_fold(config, log_folder=None, log_init_info=None):
    """
    Performs a k-fold cross validation.

    Args:
        config (Config): Parameters.
        log_folder (None or str, optional): Folder to log results to. Defaults to None.
        log_init_info (None or dict, optional): Dictionary to init wandb logging.
    """
    scores = []
    nb_folds = 5

    # Data preparation
    print("Creating in-memory dataset ...")

    start_time = time.time()

    in_mem_dataset = InMemoryTrainDataset(
        train_tile_size=config.tile_size,
        reduce_factor=config.reduce_factor,
        train_transfo=HE_preprocess(size=config.tile_size),
        valid_transfo=HE_preprocess(augment=False, size=config.tile_size),
        train_path=config.train_path,
        iter_per_epoch=config.iter_per_epoch,
        on_spot_sampling=config.on_spot_sampling,
        pl_path=config.pl_path,
        use_pl=config.use_pl,
        test_path=config.test_path,
    )
    print(f"Done in {time.time() - start_time :.0f} seconds.")

    for i in config.selected_folds:
        print(f"\n-------------   Fold {i + 1} / {nb_folds}  -------------\n")

        # Init logging
        if not DEBUG:
            print(f"    -> Init wandb logging with name {log_init_info['name_head']}_fold{i + 1} ...")
            wandb.init(
                project=PROJECT_NAME,
                name=f"{log_init_info['name_head']}_fold{i + 1}",
                config=log_init_info['config'],
            )

        meter, history, model = train(config, in_mem_dataset, i, log_folder=log_folder)

        print("\n    -> Validating \n")

        val_images = in_mem_dataset.valid_set
        scores += validate(model, config, val_images)

        if log_folder is not None:
            history.to_csv(log_folder + f"history_{i}.csv", index=False)

        if log_folder is None or len(config.selected_folds) == 1:
            return meter

        if not DEBUG:
            wandb.finish()

        # Garbage collect
        del meter
        del model
        torch.cuda.empty_cache()
        gc.collect()

    print(f"\n\n  ->  Dice CV : {np.mean(scores) :.3f}  +/- {np.std(scores) :.3f}")
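One detail in the function above: when log_folder is None or only one fold is selected, `return meter` runs before `wandb.finish()`, so that run is never closed explicitly. Nothing in this thread establishes that this causes the InitStartError, but a minimal sketch of a pattern that always closes the run could look like the following. `logged_run` is a hypothetical helper, not part of wandb; the init and finish callables are passed in as arguments so the sketch runs without wandb installed.

```python
from contextlib import contextmanager


@contextmanager
def logged_run(init_fn, finish_fn, enabled=True, **init_kwargs):
    """Start a logging run and always finish it, even on early return or error.

    `logged_run` is a hypothetical helper. With wandb, pass `wandb.init` and
    `wandb.finish` as `init_fn` and `finish_fn`.
    """
    run = init_fn(**init_kwargs) if enabled else None
    try:
        yield run
    finally:
        # Runs on normal exit, on `return` inside the `with` body, and on exceptions.
        if enabled:
            finish_fn()


if __name__ == "__main__":
    # Stand-ins for wandb.init / wandb.finish so the sketch is self-contained.
    events = []

    def fake_init(**kwargs):
        events.append(("init", kwargs["name"]))
        return object()

    def fake_finish():
        events.append(("finish",))

    def one_fold(i):
        with logged_run(fake_init, fake_finish, name=f"demo_fold{i + 1}"):
            return i  # early return still triggers fake_finish

    one_fold(0)
    print(events)
```

In k_fold, the body of the fold loop would sit inside `with logged_run(wandb.init, wandb.finish, enabled=not DEBUG, project=PROJECT_NAME, ...)`, so the single-fold `return meter` path also ends the run.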

Hi Tianyi, thanks for writing! Could you please confirm whether you are still seeing the same error? If so, are you using wandb version 0.13.2?

Could you specify if you are having this error in Colab?

Yes, the issue remains, and yes, I’m seeing it with version 0.13.2. It is not in Colab, though; I encountered it on my local JupyterLab server.

Hi Tianyi, thanks for clarifying! Could you please send me the piece of code where you are using wandb.login()?

This successful run uses the same code that failed, but with the older version of wandb mentioned before: https://gist.github.com/NPU-Franklin/f17e1c1875a127df3e82313049e12d92

Hi Tianyi, thanks for your answer! Could you please provide the debug log files, debug-internal.log and debug.log? You can find them in wandb/run-id/logs.
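For anyone collecting these files, a small sketch like the one below lists which of the two logs exist for each run. It assumes the layout described above (a `wandb` folder in the working directory containing `run-*` folders, each with a `logs` subfolder); `find_debug_logs` is a hypothetical helper name, not a wandb function.

```python
from pathlib import Path


def find_debug_logs(wandb_dir="wandb"):
    """Map each run directory name to which debug logs exist under it."""
    expected = ("debug.log", "debug-internal.log")
    report = {}
    for run_dir in sorted(Path(wandb_dir).glob("run-*")):
        if not run_dir.is_dir():
            continue
        logs_dir = run_dir / "logs"
        # True if the file is present, False if it was never written.
        report[run_dir.name] = {name: (logs_dir / name).is_file() for name in expected}
    return report


if __name__ == "__main__":
    for run_name, found in find_debug_logs().items():
        print(run_name, found)
```

A run whose debug-internal.log shows as missing is worth mentioning explicitly when sharing logs, since the file may simply never have been created.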

Currently, I’ve downgraded my wandb and deleted all failed runs’ logs. I will reproduce the error tomorrow. Thanks a lot.

run-20220902_211617-2saiqpw0/debug.log: Ubuntu Pastebin
run-20220902_211617-2saiqpw0/debug-internal.log: Ubuntu Pastebin

run-20220902_212941-tt7k9b3u/debug.log: Ubuntu Pastebin
run-20220902_212941-tt7k9b3u/debug-internal.log: not generated

The above are the two runs that triggered the error. The bug occurred in the second run (run-20220902_212941-tt7k9b3u), which started right after the first run (run-20220902_211617-2saiqpw0) finished.

Hi Tianyi! Thanks for the files! In the second run the debug-internal.log hasn’t been generated; could you try sending it again, please? Also, could you send me your Python file so I can try to reproduce this error on my end?

Thanks for the quick reply. The second run’s debug-internal.log file literally was never generated: wandb crashed before it could create the file. Sorry for the confusing wording. You can find my Python files on Google Drive. The main training notebook is at CSMMI/notebooks/Training.ipynb.

Hi Tianyi, thanks for the information and for your patience! I need to ask some questions internally and will get back to you afterwards.

Thank you for all the help, no need to hurry.

Hi Tianyi, thanks for your patience! I’ve been reviewing your code and have some questions:

  • Are you using wandb Local?
  • I assume that the solution in the documentation you are using is this: wandb.init(settings=wandb.Settings(start_method="fork")). If so, have you tried both methods, fork and thread? Are they raising the same error message?
  • You mentioned that you have downgraded wandb, was the error still occurring after this downgrade?
  • I see that, in your notebook, you first run wandb.login() in a cell, which seems to be working fine, and then in a following cell you call your k_fold function, where wandb.init() is used. Is the error raised in this cell? Could you send me a screenshot of this error? What happens if you don’t run the wandb.login() cell?
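To compare the two start methods from the second question side by side, something like the sketch below could be run in a fresh kernel. `try_start_methods` is a hypothetical helper, not a wandb function; it assumes the installed version accepts `wandb.Settings(start_method=...)`, as in the snippet quoted above. The init and settings callables are passed in so the function itself does not import wandb.

```python
def try_start_methods(init_fn, make_settings, methods=("fork", "thread")):
    """Try init once per start method and record the outcome of each attempt.

    Hypothetical helper. Intended call with wandb installed:
        try_start_methods(wandb.init, wandb.Settings)
    """
    results = {}
    for method in methods:
        try:
            run = init_fn(settings=make_settings(start_method=method))
        except Exception as exc:
            # Keep the error type and message so the attempts can be compared.
            results[method] = f"{type(exc).__name__}: {exc}"
        else:
            results[method] = "ok"
            run.finish()  # close the run before trying the next method
    return results
```

The returned dictionary has one entry per start method, which makes it easy to report whether both fail with the same message.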

Thank you for your reply. For all the questions:

  • No, I am using the online mode. I also tried running my code in offline mode, and the error still occurred.
  • I tried both solutions and they didn’t work.
  • After downgrading, my code worked fine. No error message.
  • Yes, the error occurred in the cell where I called the k_fold function. The error message should be stored in the notebook I sent you. I have already logged in to wandb through the console, so wandb.login() is only there to make sure that I am still logged in. If you want me to try running without the wandb.login() cell, please tell me and I will give it a try.

Hi Tianyi, thanks for your patience! I’ve escalated this issue to our Engineering Team, thanks for reporting it!
In the meantime, there are a few things that might solve it:

- Could you try setting `WANDB_DISABLE_SERVICE=True`? This would disable ‘service’, which uses TCP sockets instead of gRPC. Service is usually desirable because gRPC is not fork-safe and introduces overhead in many same-node cases.
- Could you try to reset the environment and run the code again and see if the error is still occurring?
- Could you try with a lower version than 0.13.2 (0.13.1 for example)?

Thanks for your help!

Happy to help! I will try these out and come back to you later.

  • First, after I set os.environ["WANDB_DISABLE_SETTING"] = 'True' (Am I doing this right?), it didn’t work.
  • Second, I don’t have enough space to create a new clean environment. Very sorry for that.
  • Third, I tried versions 0.13.3, 0.13.1 and 0.13.0, and they all failed.
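On the question of whether the variable was set correctly: as far as I know, the name wandb 0.13.x reads is `WANDB_DISABLE_SERVICE`, so treat that name as an assumption to verify against the installed version. The sketch below sets it and reports the installed version; in a notebook, running it in the first cell after a kernel restart, before wandb is imported, is the safest ordering. `disable_wandb_service` and `installed_version` are hypothetical helper names.

```python
import os
from importlib import metadata


def disable_wandb_service(env=os.environ):
    """Set the flag in `env`; call this before `import wandb` / `wandb.init()`."""
    # Assumption: 0.13.x reads WANDB_DISABLE_SERVICE; verify against your version.
    env["WANDB_DISABLE_SERVICE"] = "True"
    return env["WANDB_DISABLE_SERVICE"]


def installed_version(package="wandb"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    disable_wandb_service()
    print("wandb version:", installed_version())
```

Printing the version in the same cell also confirms which wandb release the kernel actually loads, which helps when several versions have been installed and removed in one environment.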