wandb.errors.CommError: Run initialization has timed out after 60.0 sec

Hello,
I want to achieve the following behavior:

I have a yaml file containing all hyper-parameters for my experiment. One of the parameters is a list of values. I want to run a separate wandb run for each value in the list while all other hyper-parameters are the same.
For that I split up the config file (containing the hyper-parameters) into multiple config files, each with a different value from the aforementioned list.
Then I loop over the config files and initialize a wandb run. Once the run is over or I abort it, the next wandb run is started with another config file.
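
A minimal sketch of the splitting step described above, assuming the hyper-parameters are already loaded into a plain dict. The function name `split_config` and the example keys (`learning_rate`, `gamma`) are hypothetical and not taken from my actual code:

```python
import copy

def split_config(config, list_key):
    """Return one config per value in config[list_key]; all other entries are copied unchanged."""
    run_configs = []
    for value in config[list_key]:
        run_config = copy.deepcopy(config)  # keep the original config untouched
        run_config[list_key] = value        # replace the list with a single value
        run_configs.append(run_config)
    return run_configs

# Hypothetical hyper-parameters; in practice these would come from the yaml file.
base_config = {"learning_rate": [1e-3, 1e-4], "gamma": 0.99, "tag": ["sweep"]}
run_configs = split_config(base_config, "learning_rate")
```

Each entry of `run_configs` can then be passed to one wandb run, while `base_config` keeps the original list.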

Here is the loop over the config files:

for run_config in run_configs:
   self.create_dirs()  # this creates new directories for the run to save the logs and models to
   self.run_config = copy.deepcopy(run_config)  # populate the run configuration
   self.run_experiment(self.run_config["tag"])  # run the experiment with the current run configuration

Here is the code inside self.run_experiment(tag):

run = wandb.init(project="MyProjectName", name="unique name", sync_tensorboard=True, save_code=True,
                 dir=unique_directory, config=self.run_config, notes="some notes", tags=tag, reinit=True,
                 id=wandb.util.generate_id(), entity="MyUserName",
                 settings=wandb.Settings(start_method="fork"))
# I found settings=wandb.Settings(start_method="fork") in the wandb documentation, but it did not solve my issue

try:
    # here I train my agent and log stuff to wandb
    ...
except KeyboardInterrupt:
    # this allows saving the model when training is interrupted
    pass

finally:
    # Release resources
    try:
        self.save_everything(agent)
        run.finish()
        env.close()
        eval_env.close()
        del agent
        del env
        del eval_env
    except EOFError:
        pass
return

So once I manually interrupt the execution my model is saved, the wandb run is finished, everything is deleted and the next iteration of the for-loop begins.

The issue:

However, instead of the next wandb run starting after the first one is over or was interrupted, I get the following error:
Error communicating with wandb process, exiting
wandb Exception: problem

After I updated to wandb version 0.15, I get the following error instead:

wandb: ERROR Run initialization has timed out after 60.0 sec.

EDIT:
I have solved the issue. The problem seems to be connected to the fact that I am using PyCharm.
When I interrupt my script in PyCharm using the red stop button in the top right corner, wandb triggers a KeyboardInterrupt internally and finishes the run. Afterwards, however, I cannot initialize a new run with wandb.init().
When I go to Run/Edit Configurations inside PyCharm and toggle “Emulate terminal in output console”, I can send the KeyboardInterrupt using Ctrl+C inside PyCharm’s output console. In that case wandb allows me to initialize a new run after the KeyboardInterrupt was caught.
So somehow wandb behaves differently when the script receives the KeyboardInterrupt from PyCharm’s red stop button than when it receives it from the output console with Ctrl+C.

Even after restructuring the code I am running into the same issue. I cannot run one wandb run after the other with different configurations. Every time I interrupt one run to start the next one (with a new config), wandb fails with the error message

wandb.errors.CommError: Run initialization has timed out after 60.0 sec.

This time I tried the following init call:


wandb.init(project="SemesterThesis_restructured", name=self.wandb_name, sync_tensorboard=True, save_code=True,
                             dir=self.wandb_dir, config=self.run_config, notes="", tags=tag)

and to stop the run I do

        except KeyboardInterrupt:
            # this allows saving the model when training is interrupted
            pass

        finally:
            # Release resources
            try:
                self.save_everything(agent)
                print("everything saved")
                wandb.finish()
                print("wandb finished")
                # time.sleep(2)
                env.close()
                eval_env.close()
                del agent
                del env
                del eval_env
                print("everything closed and deleted")
            except EOFError:
                pass
        return

Inside my main I now have the following loop:

def main() -> int:

    parser = ExperimentParser()
    for i in [0, 1]: # loop over different yaml files. One file per experiment
        configs = parser.parse_experiment_config(exp_num=i)
        for config_dict in configs:
            experiment = Experiment(config=config_dict, use_wandb=True, record_eval=False)
            experiment.run_experiment()  # in here I call wandb.init and wandb.finish
            del experiment

    return 0

After debugging some more I found out that the problem only occurs when I interrupt the process manually. If the learning process (inside the try statement) ends naturally (meaning the agent is done learning) everything works as intended. Only when I manually interrupt the learning, wandb fails to restart the next run with the new config.

How can I solve this issue? I want to be able to interrupt the learning manually and have a new run with a new config start automatically afterwards.
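
The control flow I am after can be sketched as follows. This is only an illustration with stand-in functions: `run_all`, `fake_train` and `fake_finish` are hypothetical names, and no wandb calls are involved, so it shows the intended behavior but does not by itself address the timeout:

```python
def run_all(configs, train, finish):
    """Run train(config) for every config; an interrupt skips to the next config."""
    results = []
    for config in configs:
        try:
            train(config)
            results.append((config, "done"))
        except KeyboardInterrupt:
            # Manual interruption: record it and continue with the next config.
            results.append((config, "interrupted"))
        finally:
            # Always close the current run, whether it ended naturally or not.
            finish()
    return results

finished = []

def fake_train(config):
    # Stand-in for training; simulates Ctrl+C during the first run only.
    if config["lr"] == 0.1:
        raise KeyboardInterrupt

def fake_finish():
    # Stand-in for wandb.finish() and closing the environments.
    finished.append(True)

results = run_all([{"lr": 0.1}, {"lr": 0.01}], fake_train, fake_finish)
```

Note that a second interrupt arriving while `finish()` is running would still propagate out of the loop.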

Hi @erikk, thank you for confirming you were able to successfully resolve your issue.

When wandb catches a KeyboardInterrupt exception, it tries to properly finish and close the current run before exiting. However, in the case where the KeyboardInterrupt is caused by a SIGINT signal from PyCharm, it seems that wandb was not able to properly finish the current run, which prevented you from initializing a new run.

This is difficult to trace, but my best guess is that when the stop button is pressed, PyCharm intercepts the signal and handles it internally before forwarding it to the Python interpreter process.

By toggling “Emulate terminal in output console”, you were able to send a Ctrl+C keystroke directly to the Python interpreter process from the output console.

This slight difference in how the signal is delivered is most likely the contributing factor.
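
One way to check which signal actually reaches the script is to record it before the interrupt is raised. The sketch below is a diagnostic aid only, not a fix. It assumes the stop button delivers either SIGINT or SIGTERM, which has not been verified here; a signal that cannot be caught (such as SIGKILL) would never show up. The names `received` and `record_and_interrupt` are hypothetical:

```python
import signal

received = []  # names of the signals seen so far

def record_and_interrupt(signum, frame):
    # Remember which signal arrived, then behave like a normal Ctrl+C.
    received.append(signal.Signals(signum).name)
    raise KeyboardInterrupt

# Handlers must be installed from the main thread of the script.
signal.signal(signal.SIGINT, record_and_interrupt)
signal.signal(signal.SIGTERM, record_and_interrupt)
```

Printing `received` in the `except KeyboardInterrupt` block of the training code would then show whether the stop button and Ctrl+C arrive as the same signal.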

How can the PyCharm problem be solved? I have the same problem and hope to get your help.

