OSError: Could not find a suitable TLS CA certificate

I am using wandb with the transformers library in a conda environment created on an HPC machine. When I create the environment and install the libraries, the certificate exists at the path /conda_environments//lib/python3.8/site-packages/certifi/cacert.pem
Then, after about 10 hours of training and monitoring, the certificate disappears and I get the following error, after which both the training and the monitoring stop.

OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /scratch/hpc//conda_environments//lib/python3.8/site-packages/certifi/cacert.pem
wandb: ERROR Internal wandb error: file data was not synced
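
For anyone hitting the same error, one possible workaround is to keep a copy of the CA bundle outside the environment folder and point the TLS clients at that copy. The sketch below is only an assumption about what might help, not a confirmed fix for this thread: the helper name `pin_ca_bundle` and the backup directory are hypothetical, and it relies on `requests` honoring the `REQUESTS_CA_BUNDLE` environment variable and OpenSSL honoring `SSL_CERT_FILE`.

```python
import os
import shutil
from pathlib import Path


def pin_ca_bundle(bundle_path, backup_dir):
    """Copy the CA bundle to backup_dir and point TLS clients at the copy.

    Hypothetical helper. Returns the path of the copy, or None if neither
    the original bundle nor an earlier copy can be found.
    """
    src = Path(bundle_path)
    dst = Path(backup_dir) / src.name

    if src.is_file():
        # Original still exists: refresh the copy.
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
    elif not dst.is_file():
        # Original is gone and there is no earlier copy to fall back on.
        return None

    # requests reads REQUESTS_CA_BUNDLE; OpenSSL-based clients read SSL_CERT_FILE.
    os.environ["REQUESTS_CA_BUNDLE"] = str(dst)
    os.environ["SSL_CERT_FILE"] = str(dst)
    return dst
```

It would be called once at the top of the training script, before wandb is initialized, for example with `certifi.where()` as `bundle_path` and a directory that is not subject to cleanup (such as a folder under your home directory) as `backup_dir`. Whether this avoids the crash depends on why the file disappears on your cluster, which is not established here.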

A similar issue has been reported on the wandb GitHub repo.

I reproduced this issue several times with different models and different pipelines to check that it was not related to my code. The common pattern is that it happens only on long training runs (more than 10 hours).

Thank you

Hello Nadhem!

Just some questions to get a better understanding of the issue:

  • Are you on Public Cloud or a Local Instance?
    • If you are on a Local Instance, could you send the Debug Bundle? An admin of the instance can get it from the /system-admin page → top right corner W&B icon → Debug Bundle.
    • If you are on the Public Cloud, could you send your debug.log and debug-internal.log for the run? To get the debug.log and debug-internal.log files, go to the wandb folder in your computer’s working directory. The folder has subfolders named run-DATETIME-ID, which correspond to specific runs. Could you retrieve the debug logs for a run that stopped?
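
If it helps, the log files described above can be collected with a short script. This is a minimal sketch that assumes the default `wandb` folder in the working directory and searches each `run-DATETIME-ID` subfolder recursively, since the exact nesting of the log files may differ between wandb versions; the function name `find_debug_logs` is hypothetical.

```python
from pathlib import Path

LOG_NAMES = ("debug.log", "debug-internal.log")


def find_debug_logs(wandb_dir="wandb"):
    """Map each run folder name to the debug log files found under it."""
    logs = {}
    for run_dir in sorted(Path(wandb_dir).glob("run-*")):
        if not run_dir.is_dir():
            continue
        # Search recursively so the files are found at any depth in the run folder.
        found = sorted(p for p in run_dir.rglob("debug*.log") if p.name in LOG_NAMES)
        if found:
            logs[run_dir.name] = found
    return logs


if __name__ == "__main__":
    for run_name, paths in find_debug_logs().items():
        print(run_name)
        for path in paths:
            print("   ", path)
```

Running it from the directory where the training was launched prints the log paths grouped by run, so the files for the run that stopped can be picked out by its timestamp.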

Hi there, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.
