WandB login problem when using Hugging Face Accelerate on a TPU runtime

Hey Guys,

I have been using WandB for a while now and everything was working fine, but since I switched from a GPU runtime to a TPU runtime (8 cores, using Hugging Face Accelerate on Google Colab), it’s not working anymore.

The main process just never gets past the line with the WandB login, while the other cores continue to work normally, so I don’t get any error message…

If I switch back to a GPU runtime afterwards, the same code runs without any problems.

It is possible to log in before using the notebook_launcher, but then of course I get an error message about different PIDs.

I’m using wandb 0.12.10, torch-xla 1.9, and accelerate 0.5.1.

Hi Chris,

Sorry you’re having issues with this. Could you tell me a little more about how you are running your code so I can try to replicate your issue? I’m connected to a TPU right now, and running !wandb login in its own cell is working for me. How are you integrating Accelerate into your training process, and where is your wandb login?

Thank you,

Hey Nate,

Sure, here are some screenshots of my heavily shortened code:

And this is what the output of the code looks like when it is run on a TPU:

(Sorry, I’m only allowed to send one image per message… -.-)

And this is what the output of the code looks like when it is run on a GPU:

Thank you :slightly_smiling_face:

By the way, if I comment out the wandb part, the code also runs without problems in a TPU runtime.

Furthermore, I also tried putting the wandb code directly in the train_pipeline, but then, logically, I get the following error message:

So nothing is logged, but everything else runs as desired.

Chris, thank you for the detailed response. Could you try using the CLI command !wandb login directly after you pip install wandb, and then commenting wandb.login() out of your training script? The login will persist across any scripts run on the Colab instance. It’s interesting that this works on the GPU but not the TPU. I’m working on replicating it now, as this may be a bug in the way wandb.login, notebook_launcher, and the Colab TPU all interact.

Thank you,

Hey Nate,

Thank you for your fast reply.

I have tried your suggestion, running the CLI login right after installing:

wandb login

but unfortunately without success.

I am also surprised that it runs with a GPU but not with a TPU.
Possibly the reason for this is the process ID (PID):
when I use a GPU, the PID does not change, but when I use a TPU, I get 8 new PIDs.
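The PID behaviour can be reproduced with a toy stdlib sketch (no wandb or Accelerate involved; collect_child_pids is just an illustrative helper name): forking worker processes, roughly the way notebook_launcher does for the 8 TPU cores, gives every worker its own new PID, while a single-process GPU run keeps the original one:

```python
import multiprocessing as mp
import os


def _report_pid(queue):
    # Each worker process reports its own process ID.
    queue.put(os.getpid())


def collect_child_pids(num_workers):
    """Fork num_workers processes (as notebook_launcher does for TPU cores)
    and return the set of their PIDs."""
    queue = mp.Queue()
    procs = [mp.Process(target=_report_pid, args=(queue,)) for _ in range(num_workers)]
    for p in procs:
        p.start()
    pids = {queue.get() for _ in range(num_workers)}
    for p in procs:
        p.join()
    return pids


if __name__ == "__main__":
    workers = collect_child_pids(8)
    # 8 workers -> 8 new PIDs, none of them equal to the parent's,
    # so any state tied to the parent PID (like a login) no longer matches.
    print(len(workers), os.getpid() in workers)
```

This matches what you observed: on the TPU runtime the training code runs in 8 freshly created processes, so a check against the original PID fails.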

One more difference between the GPU and TPU runtimes is that I don’t execute the following line when I use a GPU:

!pip install accelerate cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl

Instead I just install accelerate (!pip install accelerate),
which means I am using a newer torch version (torch-1.10.0+cu111) on the GPU runtime.

Hi Chris, I’m able to replicate the issue with some minimal code. Great observation about the PIDs; I think it is related to how multiprocessing works with notebook_launcher. Taking notebook_launcher out and running the training code directly works fine. Also, if you set num_processes=1 in notebook_launcher, it runs correctly. The documentation on notebook_launcher is a little vague as to whether this means it only trains on one core, but it looks like that is the case, so this isn’t really a helpful solution. I’ll escalate this as a bug to our engineering team to see if they have any more insight.

Thank you,

Hey Nate,
I am glad to hear that.
Do you know whether your engineering team has already been able to fix the bug, or can you at least estimate how long a fix will take?

Thanks. :slight_smile:

Hi Chris,

Unfortunately I don’t have a timeline for when this may be fixed. I can let you know when I see any movement on this and keep you up to date on progress being made on the issue, though.

Thank you,



That would be very nice.

Thank you very much! :slight_smile:
