I am using wandb with the transformers library in a conda environment created on an HPC machine. When I create the environment and install the libraries, the certificate bundle exists at the path /conda_environments//lib/python3.8/site-packages/certifi/cacert.pem.
After about 10 hours of training and monitoring, the certificate disappears and I get the following error, after which both the training and the monitoring stop:
OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /scratch/hpc//conda_environments//lib/python3.8/site-packages/certifi/cacert.pem
wandb: ERROR Internal wandb error: file data was not synced
There is a similar issue on the wandb GitHub repo.
I reproduced this issue several times with different models and different pipelines to check that it was not related to my code. The common pattern is that it happens only for long training runs (more than 10 hours).
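To pin down when the bundle goes missing, it may help to log its state from inside the training job at regular intervals. The following is a minimal sketch, not part of wandb: `describe_bundle` is a hypothetical helper name, and the script assumes that `certifi.where()` returns the same cacert.pem path that appears in the error.

```python
import os
import time


def describe_bundle(path):
    """Return a small status dict for the CA bundle file at `path`."""
    try:
        st = os.stat(path)
    except OSError:
        # Covers a missing file as well as an unreadable or empty path.
        return {"path": path, "exists": False, "size": None, "mtime": None}
    return {"path": path, "exists": True, "size": st.st_size, "mtime": st.st_mtime}


if __name__ == "__main__":
    try:
        import certifi  # third-party; installed alongside requests
        bundle = certifi.where()
    except ImportError:
        # Fallback assumption: a bundle path exported for requests.
        bundle = os.environ.get("REQUESTS_CA_BUNDLE", "")
    print(time.strftime("%Y-%m-%d %H:%M:%S"), describe_bundle(bundle))
```

Calling `describe_bundle` every few minutes from the training loop (or from a separate shell loop) would show the last time the file was still present, which can then be compared against any cleanup jobs running on the cluster.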
Just some questions to get a better understanding of the issue:
Are you on Public Cloud or a Local Instance?
If you are on a Local Instance, could you send the Debug Bundle? An admin of the instance can get it from the /system-admin page → top right corner W&B icon → Debug Bundle.
If you are on the Public Cloud, could you send your debug.log and debug-internal.log for the run? To get the debug.log and debug-internal.log files, go to the wandb folder in your computer’s working directory. The folder has subfolders named run-DATETIME-ID, which correspond to specific runs. Could you retrieve the debug logs for a run that stopped?
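If it helps, the log files can be located with a short script rather than by browsing the folders. This is only a sketch: `find_debug_logs` is a hypothetical helper, and it assumes the script is run from the working directory that contains the wandb folder. It searches each run-* subfolder recursively, so it does not depend on exactly where inside the run folder the logs are stored.

```python
from pathlib import Path


def find_debug_logs(wandb_dir="wandb"):
    """Map each run-* folder name to the debug log files found under it."""
    found = {}
    for run_dir in sorted(Path(wandb_dir).glob("run-*")):
        if not run_dir.is_dir():
            continue
        # Matches both debug.log and debug-internal.log at any depth.
        logs = sorted(p for p in run_dir.rglob("debug*.log") if p.is_file())
        found[run_dir.name] = logs
    return found


if __name__ == "__main__":
    for run, logs in find_debug_logs().items():
        print(run)
        for log in logs:
            print("  ", log)
```

The run folder whose timestamp matches the training job that stopped is the one whose logs are needed.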