InitStartError: Error communicating with wandb process

franklin2001 · August 27, 2022, 12:03pm

My code ran well on wandb 0.12.21, but after I upgraded to the latest version it fails with InitStartError: Error communicating with wandb process. I tried the solution in the documentation, but it doesn’t work. The code is shown below.

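import gc
import time

import numpy as np
import torch
import wandb

# InMemoryTrainDataset, HE_preprocess, train, validate, DEBUG and PROJECT_NAME
# are defined elsewhere in the project.


def k_fold(config, log_folder=None, log_init_info=None):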
    """
    Performs a k-fold cross validation.

    Args:
        config (Config): Parameters.
        log_folder (None or str, optional): Folder to log results to. Defaults to None.
        log_init_info (None or dict, optional): Dictionary to init wandb logging.
    """
    scores = []
    nb_folds = 5

    # Data preparation
    print("Creating in-memory dataset ...")

    start_time = time.time()

    in_mem_dataset = InMemoryTrainDataset(
        train_tile_size=config.tile_size,
        reduce_factor=config.reduce_factor,
        train_transfo=HE_preprocess(size=config.tile_size),
        valid_transfo=HE_preprocess(augment=False, size=config.tile_size),
        train_path=config.train_path,
        iter_per_epoch=config.iter_per_epoch,
        on_spot_sampling=config.on_spot_sampling,
        pl_path=config.pl_path,
        use_pl=config.use_pl,
        test_path=config.test_path,
    )
    print(f"Done in {time.time() - start_time :.0f} seconds.")

    for i in config.selected_folds:
        print(f"\n-------------   Fold {i + 1} / {nb_folds}  -------------\n")

        # Init logging
        if not DEBUG:
            print(f"    -> Init wandb logging with name {log_init_info['name_head']}_fold{i + 1} ...")
            wandb.init(
                project=PROJECT_NAME,
                name=f"{log_init_info['name_head']}_fold{i + 1}",
                config=log_init_info['config'],
            )

        meter, history, model = train(config, in_mem_dataset, i, log_folder=log_folder)

        print("\n    -> Validating \n")

        val_images = in_mem_dataset.valid_set
        scores += validate(model, config, val_images)

        if log_folder is not None:
            history.to_csv(log_folder + f"history_{i}.csv", index=False)

        if log_folder is None or len(config.selected_folds) == 1:
            return meter

        if not DEBUG:
            wandb.finish()

        # Garbage collect
        del meter
        del model
        torch.cuda.empty_cache()
        gc.collect()

    print(f"\n\n  ->  Dice CV : {np.mean(scores) :.3f}  +/- {np.std(scores) :.3f}")
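
One detail in the function above: when log_folder is None or only one fold is selected, it returns before wandb.finish() is reached, so that run is left open. Whether this is related to the InitStartError is not established in this thread. A minimal sketch of a per-fold wrapper that always closes the run follows; `init_fn` and `finish_fn` are hypothetical stand-ins for `wandb.init` and `wandb.finish`, so the sketch runs without wandb installed, and `fold_run` and `run_folds` are illustrative names, not part of the original code.

```python
from contextlib import contextmanager


@contextmanager
def fold_run(init_fn, finish_fn, **init_kwargs):
    """Start a run for one fold and always close it, even on early return or error."""
    run = init_fn(**init_kwargs)
    try:
        yield run
    finally:
        finish_fn()


def run_folds(folds, init_fn, finish_fn, train_fn):
    """Train each fold inside its own run; return early when only one fold is selected."""
    results = []
    for i in folds:
        with fold_run(init_fn, finish_fn, name=f"fold{i + 1}"):
            result = train_fn(i)
            if len(folds) == 1:
                return result  # finish_fn still runs via the finally block
            results.append(result)
    return results
```

With wandb, this could be called as `run_folds(config.selected_folds, wandb.init, wandb.finish, train_one_fold)`, where `train_one_fold` is a hypothetical wrapper around the `train` call above.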

system · August 31, 2022, 9:35am

Hi Tianyi, thanks for writing! Could you please confirm whether you are still seeing the same error? If so, are you using wandb version 0.13.2?

Could you also specify whether you are seeing this error in Colab?
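
A quick way to answer the version question from a notebook cell is sketched below. It only uses the standard library; `parse_version`, `installed_version` and `LAST_WORKING` are illustrative names, and the comparison is a simple numeric one that ignores pre-release suffixes.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '0.13.2' into (0, 13, 2) so versions compare numerically."""
    parts = []
    for part in text.split(".")[:3]:
        match = re.match(r"\d+", part)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def installed_version(package="wandb"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


LAST_WORKING = "0.12.21"  # version reported as working earlier in this thread

current = installed_version()
if not current:
    print("wandb is not installed in this environment")
else:
    newer = parse_version(current) > parse_version(LAST_WORKING)
    relation = "newer than" if newer else "not newer than"
    print(f"wandb {current} ({relation} {LAST_WORKING})")
```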

franklin2001 · August 31, 2022, 9:49am

Yes, the issue remains, and yes, I’m seeing it with version 0.13.2. It is not in Colab, though; I encountered it on my local JupyterLab server.

system · September 1, 2022, 4:33pm

Hi Tianyi, thanks for clarifying! Could you please send me the piece of code where you are using wandb.login()?

franklin2001 · September 2, 2022, 2:33am

This successful run uses the same code that failed, but with the lower version of wandb mentioned before. https://gist.github.com/NPU-Franklin/f17e1c1875a127df3e82313049e12d92

system · September 2, 2022, 9:14am

Hi Tianyi, thanks for your answer! Could you please provide the debug log files, debug-internal.log and debug.log? You can find them in wandb/run-id/logs
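
A minimal sketch for locating those files, assuming the default layout in which each run gets a `run-*` directory under `wandb/` with a `logs/` subfolder (the helper name `find_debug_logs` is hypothetical):

```python
from pathlib import Path


def find_debug_logs(wandb_dir="wandb"):
    """Map each run directory name to the debug log files found in its logs/ folder."""
    wanted = ("debug.log", "debug-internal.log")
    found = {}
    for run_dir in sorted(Path(wandb_dir).glob("run-*")):
        logs_dir = run_dir / "logs"
        found[run_dir.name] = {
            name: (logs_dir / name) if (logs_dir / name).is_file() else None
            for name in wanted
        }
    return found


# Print one line per expected log file, marking the ones that do not exist.
for run_name, logs in find_debug_logs().items():
    for log_name, path in logs.items():
        print(f"{run_name}/{log_name}: {path if path else 'not generated'}")
```

A file reported as "not generated" simply does not exist on disk; the sketch does not say why.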

franklin2001 · September 2, 2022, 10:53am

Currently, I’ve downgraded my wandb and deleted all failed runs’ logs. I will reproduce the error tomorrow. Thanks a lot.

franklin2001 · September 2, 2022, 1:41pm

run-20220902_211617-2saiqpw0/debug.log: Ubuntu Pastebin
run-20220902_211617-2saiqpw0/debug-internal.log: Ubuntu Pastebin

franklin2001 · September 2, 2022, 1:42pm

run-20220902_212941-tt7k9b3u/debug.log: Ubuntu Pastebin
run-20220902_212941-tt7k9b3u/debug-internal.log: haven’t generated

franklin2001 · September 2, 2022, 1:47pm

The above are the two runs that triggered the error. The bug occurred in the second run (run-20220902_212941-tt7k9b3u), which started right after the first run (run-20220902_211617-2saiqpw0) finished.

system · September 2, 2022, 4:47pm

Hi Tianyi! Thanks for the files! In the second run the debug-internal.log hasn’t been generated. Could you try sending it again, please? Also, could you send me your Python file so that I can try to reproduce this error on my end?

franklin2001 · September 3, 2022, 1:57am

Thanks for the quick reply. The second run’s debug-internal.log file genuinely was never generated; wandb crashed before it could create the file. Sorry for the confusing wording. You can find my Python files on Google Drive. The main training notebook is at CSMMI/notebooks/Training.ipynb.

system · September 6, 2022, 3:52pm

Hi Tianyi, thanks for the information and for your patience! I need to ask some internal questions and will get back to you then

franklin2001 · September 7, 2022, 2:18pm

Thank you for all the help, no need to hurry.

system · September 7, 2022, 4:01pm

Hi Tianyi, thanks for your patience! I’ve been reviewing your code and have some questions:

1. Are you using wandb Local?
2. I assume that the solution from the documentation that you are using is this: wandb.init(settings=wandb.Settings(start_method="fork")). If so, have you tried both methods, fork and thread? Do they raise the same error message?
3. You mentioned that you downgraded wandb. Was the error still occurring after the downgrade?
4. I see that, in your notebook, you first run wandb.login() in a cell, which seems to work fine, and then in a following cell you call your k_fold function, where wandb.init() is used. Is the error raised in this cell? Could you send me a screenshot of the error? What happens if you don’t run the wandb.login() cell?
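
For the start-method question, a minimal sketch of trying both values and recording what each one raises. `try_start_methods` and `wandb_init_with` are hypothetical helper names; the only wandb call used is the `wandb.init(settings=wandb.Settings(start_method=...))` form quoted above.

```python
def try_start_methods(init_fn, methods=("fork", "thread")):
    """Call init_fn once per start method and record the error text, or None on success."""
    outcomes = {}
    for method in methods:
        try:
            init_fn(method)
            outcomes[method] = None
        except Exception as exc:  # keep going so every method gets tried
            outcomes[method] = f"{type(exc).__name__}: {exc}"
    return outcomes


def wandb_init_with(method):
    """Start and immediately close a run using the given start method."""
    import wandb  # imported here so try_start_methods works without wandb installed

    run = wandb.init(settings=wandb.Settings(start_method=method))
    run.finish()
```

Running `print(try_start_methods(wandb_init_with))` in a fresh kernel would then show, per method, either None or the exception that was raised.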

franklin2001 · September 8, 2022, 4:06am

Thank you for your reply. For all the questions:

1. No, I am using online mode. I also tried running my code in offline mode, and the error persisted.
2. I tried both solutions, and neither worked.
3. After downgrading, my code worked fine, with no error message.
4. Yes, the error occurred in the cell where I called the k_fold function. The error message should be stored in the notebook I sent you. I have already logged in to wandb through the console, so wandb.login() is only there to make sure that I am still logged in. If you want me to try running without the wandb.login() cell, please tell me and I will give it a try.

franklin2001 · September 8, 2022, 7:45am

system · September 9, 2022, 4:15pm

Hi Tianyi, thanks for your patience! I’ve escalated this issue to our Engineering Team, thanks for reporting it!
Even so, there are a few things that could solve it:

- Could you try setting `WANDB_DISABLE_SERVICE=True`? This disables ‘service’, which uses TCP sockets instead of gRPC. Service is usually desirable because gRPC is not fork-safe and introduces overhead in many same-node cases.
- Could you try to reset the environment and run the code again and see if the error is still occurring?
- Could you try with a lower version than 0.13.2 (0.13.1 for example)?

Thanks for your help!

franklin2001 · September 10, 2022, 1:27am

Happy to help! I will try these out and come back to you later.

franklin2001 · September 13, 2022, 7:08am

First, after I set os.environ["WANDB_DISABLE_SERVICE"] = 'True' (Am I doing this right?), it didn’t work.
Second, I don’t have enough space to create a new clean environment. Very sorry for that.
Third, I tried versions 0.13.3, 0.13.1 and 0.13.0, and they all failed.
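
For the first attempt, a minimal sketch of setting the variable from Python. It assumes the variable is `WANDB_DISABLE_SERVICE` and that wandb reads it when it is first set up, so the value is set before wandb is imported; both points are assumptions rather than something confirmed in this thread. The helper name `disable_wandb_service` is hypothetical.

```python
import os
import sys


def disable_wandb_service():
    """Set the environment variable and report whether wandb was already imported."""
    already_imported = "wandb" in sys.modules
    os.environ["WANDB_DISABLE_SERVICE"] = "True"  # assumed variable name
    return already_imported


if disable_wandb_service():
    print("wandb was imported before the variable was set; restart the kernel and set it first")
else:
    print("WANDB_DISABLE_SERVICE set; import wandb after this cell")
```

In a notebook this would go in the very first cell, before any cell that imports wandb or calls wandb.login().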
