WandB login problem when using Hugging Face Accelerate on a TPU runtime

Hey Guys,

I have been using WandB for a while now and everything was working fine, but since I switched from a GPU runtime to a TPU runtime (8 cores, using Hugging Face Accelerate on Google Colab), it’s not working anymore.

The main process just never gets past the line with the WandB login, while the other cores continue to work normally, so I don’t get any error message…

If I switch back to a GPU runtime afterwards, the same code runs without any problems.

It is possible to log in before using the notebook_launcher, but then of course I get an error message about different PIDs.

I’m using wandb 0.12.10, torch-xla 1.9, and accelerate 0.5.1.

Hi Chris,

Sorry you’re having issues with this. Could you tell me a little more about how you are running your code so I can try to replicate your issue? I’m connected to a TPU right now, and running !wandb login in its own cell is working for me. How are you integrating Accelerate into your training process, and where is your wandb login?

Thank you,

Hey Nate,

Sure, here are some screenshots of my heavily shortened code:

And this is what the output of the code looks like when it is run on a TPU:

(Sorry, I’m only allowed to send one image per message… -.-)

And this is what the output of the code looks like when it is run on a GPU:

Thank you :slightly_smiling_face:

By the way, if I comment out the wandb part, the code also runs without problems in a TPU runtime.

Furthermore, I also tried putting the wandb code directly in the train_pipeline, but then, logically, I get the following error message:

So nothing is logged, but everything else runs as desired.

Chris, thank you for the detailed response. Could you try using the CLI command !wandb login directly after you pip install wandb, and then commenting wandb.login() out of your training script? The login will persist across any scripts run on the Colab instance. It’s interesting that this works on the GPU but not the TPU. I’m working on replicating it now, as this may be a bug in the way wandb.login, notebook_launcher, and the Colab TPU all interact.

Thank you,

Hey Nate,

Thank you for your fast reply.

I have tried your suggestion, running the CLI login right after installing:

wandb login

but unfortunately without success.

I am also surprised that it runs with a GPU but not with a TPU.
Possibly the reason for this is the process ID (PID):
when I use a GPU, the PID does not change, but when I use a TPU, I get 8 new PIDs.
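The PID behaviour can be reproduced with a toy stdlib sketch (no wandb or Accelerate involved; collect_child_pids is just an illustrative helper name): forking worker processes, roughly the way notebook_launcher does for the 8 TPU cores, gives every worker its own new PID, while a single-process GPU run keeps the original one:

```python
import multiprocessing as mp
import os


def _report_pid(queue):
    # Each worker process reports its own process ID.
    queue.put(os.getpid())


def collect_child_pids(num_workers):
    """Fork num_workers processes (as notebook_launcher does for TPU cores)
    and return the set of their PIDs."""
    queue = mp.Queue()
    procs = [mp.Process(target=_report_pid, args=(queue,)) for _ in range(num_workers)]
    for p in procs:
        p.start()
    pids = {queue.get() for _ in range(num_workers)}
    for p in procs:
        p.join()
    return pids


if __name__ == "__main__":
    workers = collect_child_pids(8)
    # 8 workers -> 8 new PIDs, none of them equal to the parent's,
    # so any state tied to the parent PID (like a login) no longer matches.
    print(len(workers), os.getpid() in workers)
```

This matches what you observed: on the TPU runtime the training code runs in 8 freshly created processes, so a check against the original PID fails.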

One more difference between the GPU and TPU runtimes is that I don’t execute the following line when I use a GPU:

!pip install accelerate cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl

Instead I just install accelerate (!pip install accelerate),
which means I am using a newer torch version (torch-1.10.0+cu111) on the GPU runtime.

Hi Chris, I’m able to replicate the issue with some minimal code. Great observation about the PIDs; I think it is related to how multiprocessing works with notebook_launcher. Taking notebook_launcher out and running the training code directly works fine. Also, if you set num_processes=1 in notebook_launcher, it runs correctly. The documentation on notebook_launcher is a little vague as to whether this means it only trains on one core, but it looks like that is the case, so this isn’t really a helpful solution. I’ll escalate this as a bug to our engineering team to see if they have any more insight.

Thank you,

Hey Nate,
I am glad to hear that.
Do you know whether your engineering team has already been able to fix the bug, or can you at least estimate how long a fix will take?

Thanks. :slight_smile:

Hi Chris,

Unfortunately I don’t have a timeline for when this may be fixed. I can let you know when I see any movement on this and keep you up to date on progress being made on the issue, though.

Thank you,



That would be very nice.

Thank you very much! :slight_smile:
