WandB login problem when I use the Hugging Face Accelerate library on a TPU runtime

Hey Guys,

I have been using WandB for a while now and everything was working fine, but since I switched from a GPU runtime to a TPU runtime (8 cores, using Hugging Face Accelerate on Google Colab) it's not working anymore.

The main process just never gets past the line with the WandB login, while the other cores continue to work normally, so I don't get any error message…

If I switch back to a GPU runtime afterwards, the same code runs without any problems.

It is possible to log in before calling notebook_launcher, but then of course I get an error message about different PIDs.

I'm using wandb 0.12.10, torch-xla 1.9, and accelerate 0.5.1.

Hi Chris,

Sorry you’re having issues with this. Could you tell me a little bit more about how you are running your code so I can try to replicate your issue? I’m connected to a TPU right now and running !wandb login in its own cell is working for me. How are you integrating Accelerate into your training process, and where does your wandb login happen?

Thank you,
Nate

Hey Nate,

Sure, here are some screenshots of my heavily shortened code:
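In short, the structure boils down to something like this (a rough sketch, not the actual code; the project name and the training-loop details are placeholders):

from accelerate import Accelerator, notebook_launcher
import wandb

def train_pipeline():
    accelerator = Accelerator()
    # Only the main process talks to WandB; on the TPU runtime this is the
    # line the main process never gets past.
    if accelerator.is_main_process:
        wandb.login()
        wandb.init(project="my-project")  # placeholder project name
    # ... build model and dataloaders, accelerator.prepare(...), run the training loop ...

notebook_launcher(train_pipeline, args=(), num_processes=8)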

And this is what the output of the code looks like when it is run on a TPU:

(Sorry, I'm only allowed to send one image per message… -.-)

And this is what the output of the code looks like when it is run on a GPU:

Thank you :slightly_smiling_face:

By the way, if I comment out the wandb part, the code also runs without problems in a TPU runtime.

Furthermore, I also tried putting the wandb code inside train_pipeline, but then, logically, I get the following error message:

So nothing gets logged to WandB, but everything else runs as desired.

Chris, thank you for the detailed response. Could you try running the CLI command !wandb login directly after you pip install wandb, and then commenting wandb.login() out of your training script? The login will persist across any scripts run on the Colab instance. It’s interesting that this works on GPU but not on the TPU. I’m working on replicating it now, as this may be a bug in the way wandb.login, notebook_launcher, and the Colab TPU all interact.
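In other words, something like this in two Colab cells, run before anything that calls notebook_launcher (rough sketch):

!pip install wandb
!wandb login  # paste your API key when prompted

With that, wandb.login() can stay commented out inside the training function; the CLI credentials are picked up automatically for the rest of the session.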

Thank you,
Nate

Hey Nate,

Thank you for your fast reply.

I have tried your suggestion (!wandb login in its own cell), but unfortunately without success.

I am also surprised that it runs on a GPU but not on a TPU.
Possibly the reason for this is the process ID (PID).
When I use a GPU, the PID does not change, but I get 8 new PIDs when I use a TPU.

One more difference between the GPU and TPU runtimes is that I don't execute the following line when I use a GPU:

!pip install accelerate cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl

Instead, I just install accelerate (!pip install accelerate),
which means I am using a newer torch version (torch-1.10.0+cu111) on the GPU runtime.

Hi Chris, I’m able to replicate the issue with some minimal code. Great suggestion about the PIDs; I think it is related to how multiprocessing works with notebook_launcher. Taking notebook_launcher out and running the training code directly works fine. Also, if you set num_processes=1 in notebook_launcher, it runs correctly. The documentation on notebook_launcher is a little vague as to whether this means it only trains on one core, but it looks like that is the case, so this isn’t really a helpful solution. I’ll escalate this as a bug to our engineering team to see if they have any more insight.
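For reference, the single-core workaround is just this (a sketch, using your train_pipeline function from the screenshots):

notebook_launcher(train_pipeline, args=(), num_processes=1)  # runs, but only on one TPU core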

Thank you,
Nate

Hey Nate,
I am glad to hear that.
Do you know whether your engineering team has already been able to fix the bug, or can you at least estimate how long it will take?

Thanks. :slight_smile:

Hi Chris,

Unfortunately, I don’t have a timeline for when this may be fixed. I can let you know when I see any movement on this and keep you up to date on progress on the issue, though.

Thank you,
Nate


Hey,

That would be very nice.

Thank you very much! :slight_smile:
