Hello,
I want to achieve the following behavior:
I have a YAML file containing all hyper-parameters for my experiment. One of the parameters is a list of values. I want to start a separate wandb run for each value in the list while all other hyper-parameters stay the same.
To do that, I split the config file (containing the hyper-parameters) into multiple config files, each with a different value from that list.
Then I loop over the config files and initialize a wandb run for each one. Once a run is over or I abort it, the next wandb run is started with the next config file.
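The splitting step described above could look roughly like the sketch below. It assumes the YAML file has already been loaded into a plain dict; the function name split_config and the keys "seed" and "learning_rate" are made up for illustration and are not from my actual project.

```python
import copy


def split_config(base_config, list_key):
    """Return one config per value stored under base_config[list_key].

    All other entries are copied unchanged, so the resulting configs differ
    only in the value of list_key.
    """
    run_configs = []
    for value in base_config[list_key]:
        cfg = copy.deepcopy(base_config)  # keep the base config untouched
        cfg[list_key] = value             # replace the list with a single value
        run_configs.append(cfg)
    return run_configs


# Hypothetical hyper-parameters; "seed" is the list-valued parameter here.
base = {"learning_rate": 0.001, "seed": [0, 1, 2], "tag": "baseline"}
run_configs = split_config(base, "seed")
```

Each entry of run_configs can then be passed to the loop as one run configuration.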
Here is the loop over the config files:
for run_config in run_configs:
    self.create_dirs()  # creates new directories for the run to save the logs and models to
    self.run_config = copy.deepcopy(run_config)  # populate the run configuration
    self.run_experiment(self.run_config["tag"])  # run the experiment with the current run configuration
Here is the code inside self.run_experiment(tag):
run = wandb.init(project="MyProjectName", name="unique name", sync_tensorboard=True, save_code=True,
                 dir=unique_directory, config=self.run_config, notes="some notes", tags=tag, reinit=True, id=wandb.util.generate_id(),
                 entity="MyUserName", settings=wandb.Settings(start_method="fork"))
# I found settings=wandb.Settings(start_method="fork") in the wandb documentation, but it did not solve my issue
try:
    # here I train my agent and log stuff to wandb
    ...
except KeyboardInterrupt:
    # this allows saving the model when training is interrupted
    pass
finally:
    # release resources
    try:
        self.save_everything(agent)
        run.finish()
        env.close()
        eval_env.close()
        del agent
        del env
        del eval_env
    except EOFError:
        pass
return
So once I manually interrupt the execution, my model is saved, the wandb run is finished, everything is deleted, and the next iteration of the for-loop begins.
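To make the intended control flow explicit, here is a minimal sketch without wandb. The names run_all, fake_train and cleaned are placeholders I made up; train stands in for the training code and cleanup for the save/finish/close steps. The interrupt is simulated by raising KeyboardInterrupt directly.

```python
def run_all(run_configs, train, cleanup):
    """Run train(cfg) once per config; an interrupt only ends the current run."""
    outcomes = []
    for cfg in run_configs:
        status = "completed"
        try:
            train(cfg)
        except KeyboardInterrupt:
            # Swallow the interrupt so the loop can move on to the next config.
            status = "interrupted"
        finally:
            # Always runs, whether training finished or was interrupted.
            cleanup(cfg)
        outcomes.append(status)
    return outcomes


def fake_train(cfg):
    # Simulate a manual interrupt during the first run only.
    if cfg["seed"] == 0:
        raise KeyboardInterrupt


cleaned = []
outcomes = run_all([{"seed": 0}, {"seed": 1}], fake_train, cleaned.append)
```

In this sketch the second run still starts after the first one is interrupted, which is the behavior I expect from the real loop.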
The issue:
However, instead of the next wandb run starting after the first one is over or was interrupted, I get the following error:
Error communicating with wandb process, exiting
wandb Exception: problem
After I updated to wandb version 0.15, I get the following error instead:
wandb: ERROR Run initialization has timed out after 60.0 sec.
EDIT:
I have solved the issue. The problem seems to be connected to the fact that I am using PyCharm.
When I interrupt my script in PyCharm using the red Stop button in the top right corner, wandb triggers a KeyboardInterrupt internally and finishes the run. Afterwards, however, I cannot initialize a new run with wandb.init().
When I go to Run/Edit Configurations inside PyCharm and enable “Emulate terminal in output console”, I can send the KeyboardInterrupt using Ctrl+C inside the output console of PyCharm. In that case wandb allows me to initialize a new run after the KeyboardInterrupt was caught.
So somehow wandb behaves differently when the script receives the KeyboardInterrupt from PyCharm’s red Stop button than when it receives the KeyboardInterrupt from the output console via Ctrl+C.
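I have not verified what PyCharm actually sends in each case. One way to check might be a small diagnostic like the sketch below, which records the names of the signals the process receives. The function name install_signal_logger is made up. Note that it replaces the default SIGINT handler, so no KeyboardInterrupt is raised while it is installed; it is meant for debugging only and has to be called from the main thread.

```python
import signal


def install_signal_logger(received):
    """Append the name of every SIGINT/SIGTERM the process receives to `received`."""

    def handler(signum, frame):
        # signal.Signals maps the numeric value to its enum member, e.g. SIGINT.
        received.append(signal.Signals(signum).name)

    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, handler)
```

Calling install_signal_logger(some_list) at the start of the script and then stopping it once with the Stop button and once with Ctrl+C should show whether the two paths deliver the same signal.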