ValueError('signal only works in main thread')

Has anyone else run into this error:
ValueError('signal only works in main thread')

I’m running a hyperparameter sweep using PL (PyTorch Lightning) and Weights & Biases’ sweep framework.

Running on a GPU on Google Colab causes all launched runs to fail. Running it locally (macOS) prints ‘signal only works in main thread’ to stdout (which also happens on Colab), but it doesn’t crash.

When I train the model with just PL outside of a W&B sweep, it works fine.

Any ideas? It seems people using Ray with PL have come across this. The hacky solution presented there ( os.environ['SLURM_JOB_NAME'] = 'bash' ) doesn’t work in my case (on either macOS or Colab).

Linking the GitHub Discussion where the conversation was continued

Hi @max_wasserman , did your issue get resolved?
If not, could you please share the debug log bundle? Also, which wandb version are you on? Is it v0.12.4?
I’m assuming this is related to this thread as well.

  1. A minimalist script would help us reproduce and pinpoint the issue. If you could provide us with one, that would be very helpful.
  2. Also, how are you creating and starting the sweep? How many sweep parameters do you have? (I’ve noticed that fewer parameters sometimes resolve the issue, hence my asking. If that turns out to be the case, we’ll file an internal bug report.)
  3. Did you try setting the WANDB_CONSOLE env var to “off” to see if the error still occurs? This disables capturing the stdout / stderr of the process and would help us figure out what might be blocked. Maybe our console logging got in the way of telling you why it crashed, so you might want to run some runs with: WANDB_CONSOLE=off
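For point 3, a quick sketch of setting the variable in Python before wandb initializes (only the standard library is used here; the wandb call is shown commented out):

```python
import os

# Disable wandb's console capture; this must be set before wandb.init() runs.
os.environ["WANDB_CONSOLE"] = "off"

# With wandb installed, a run started afterwards would skip stdout/stderr capture:
# import wandb
# wandb.init()
```

Equivalently, you can export WANDB_CONSOLE=off in the shell before launching the script.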

We intend that the library never gets into this state and we’re working hard to make it more robust.


I’m traveling and away from my work station right now, so an incomplete answer is below.

I believe the version was the latest as of the date of posting, and yes, same issue. I created and started the sweep all inside a Python script, as Charles showed in the sweep tutorials on YouTube, which utilized Colab. No YAML files were used.
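For context, the in-script pattern looked roughly like this (a sketch with placeholder parameter names and project name, not my actual config):

```python
# Hypothetical sweep configuration defined inside the training script itself.
sweep_config = {
    "method": "random",  # search strategy
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"values": [1e-2, 1e-3, 1e-4]},
        "batch_size": {"value": 32},  # single-value entry, kept for documentation
    },
}

# With wandb installed, the sweep would be created and started in the same file:
# import wandb
# sweep_id = wandb.sweep(sweep_config, project="my-project")
# wandb.agent(sweep_id, function=train)
```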

The issue was totally resolved when I fully ported to YAML files and only had the Python file define the train() function and call wandb.init(), as in the docs.
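A minimal sketch of the YAML layout that worked for me (parameter names are placeholders):

```yaml
program: train.py        # script that defines train() and calls wandb.init()
method: random
metric:
  name: val_loss
  goal: minimize
parameters:
  lr:
    values: [0.01, 0.001]
  batch_size:
    value: 32            # single-value parameter, kept for documentation
```

The sweep is then created with "wandb sweep sweep.yaml" and launched with "wandb agent" plus the sweep ID it prints, so the Python file never touches the sweep setup.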

As for sweep parameters, anywhere from 10 to 25. Most only have one value and are simply there for documentation purposes.

I would recommend setting up the simplest PyTorch Lightning model you can think of while logging some scalars and some dicts of data (I didn’t log anything fancy), setting up a hyperparameter sweep inside the Python file, and using the wandb/PL callback for logging.

If you can’t reproduce with that setup, let me know, and I’d be happy to try the aforementioned fixes when I return next week.

Thanks for the updates @max_wasserman, this was very helpful. We were able to reproduce this bug on our side without using the PTL lib. I’ve filed an internal bug report and our eng team is looking into it. We’ll follow up with you once we have an update on this issue.
Meanwhile, could you try turning off console logging and re-running your script? For instance: wandb.init(settings=wandb.Settings(console='off')), or just set the environment variable WANDB_CONSOLE=off.

Please let us know if you require any further assistance.