Wandb network connection error

Hi, I am running into the Network error / ConnectionError issue I posted about previously. It was resolved for a while, but it has now returned.

The issue takes the form of a connection error like:

wandb: Network error (ConnectionError), entering retry loop.
2024-03-22 12:26:51 - ERROR - Error on attempt 1 for run 84pul6a3: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x2beee3e30>: Failed to resolve 'api.wandb.ai' ([Errno 8] nodename nor servname provided, or not known)"))
2024-03-22 12:26:51 - INFO - Attempting to download table for run og7hmi2r
2024-03-22 12:26:53 - ERROR - Error on attempt 1 for run mjjj169q: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /wandb-production.appspot.com/bkaplowitz/consumption-savings-rl/mjjj169q/artifact/762588934/wandb_manifest.json?Expires=1711128413&GoogleAccessId=gorilla-files-url-signer-man%40wandb-production.iam.gserviceaccount.com&Signature=dI4LMkTnnKvOkMNZuBeUxzzkS%2BrN3%2FrFioQ6aJP0E735ZV%2FPTpoFt5LBq86fZsgBobh4YMtm5UjzIiWqts8YWUqO%2BKh%2BFXKqgr862IwQDG6iCqQqwzHndlMvCaGx5ikjeY7R%2Bhu17bjCU4bOT8on14J0CFvyXgdzfX%2FuIrCGYYb4QVKjozHTLVskZioMmVIM6EFXrQmY1pUP6RM6klYIgXm9hD2PXEVljTgFDbpTRn0O2PpZ9P1%2FwV1yFTDX2H1A2xmT5C13cTddDJtaOJ1gMGMfqAh8%2B%2BHb4vrRfpUuXfot4n5VHiCrMpximA0xb41U75ajDw4%2BrdBhQ9jsCNGzzw%3D%3D (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x2a0623b60>: Failed to resolve 'storage.googleapis.com' ([Errno 8] nodename nor servname provided, or not known)"))
2024-03-22 12:26:53 - WARNING - Failed to download table for run mjjj169q after 3 retries.
2024-03-22 12:26:53 - INFO - Attempting to download table for run 36ilco9p
2024-03-22 12:26:53 - ERROR - Error on attempt 1 for run gl48xm96: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x2a03e7470>: Failed to resolve 'api.wandb.ai' ([Errno 8] nodename nor servname provided, or not known)"))
2024-03-22 12:26:53 - INFO - Attempting to download table for run df7x71rx
wandb: Network error (ConnectionError), entering retry loop.
wandb: Network error (ConnectionError), entering retry loop.
wandb: Network error (ConnectionError), entering retry loop.
wandb: Network error (ConnectionError), entering retry loop.
wandb: Network error (ConnectionError), entering retry loop.
wandb: Network error (ConnectionError), entering retry loop.
2024-03-22 12:27:27 - ERROR - Error on attempt 1 for run t1djbem8: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x2c14054c0>: Failed to resolve 'api.wandb.ai' ([Errno 8] nodename nor servname provided, or not known)"))
2024-03-22 12:27:27 - INFO - Attempting to download table for run 5f2uzj5c
wandb: Network error (ConnectionError), entering retry loop.

This happens when I try to download many tables (about 3000) that I generated via a sweep of runs. The first 600 or so download fine before the errors start appearing.

Eventually, it gets stuck in the retry loop entirely.
My code for the download is:

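import json
import logging
import time
from pathlib import Path

import pandas as pd
import wandb

# ENTITY, PROJECT, TABLE_NAME, TAG, CHART_DIR, TABLE_NAME_NO_DIR and
# current_analysis (the object providing use_artifact) are defined elsewhere.


def download_table(run_id, max_retries=3):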
    retries = 0
    delay = 1
    while retries < max_retries:
        try:
            logging.info(f"Attempting to download table for run {run_id}")
            run_name = f"run-{run_id}"
            table_remote = current_analysis.use_artifact(
                f"{ENTITY}/{PROJECT}/{run_name}-{TABLE_NAME}:{TAG}",
                type="run_table",
            )
            table_dir = table_remote.download()
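            # Build the path explicitly and close the file once it is parsed
            table_path = Path(table_dir) / CHART_DIR / f"{TABLE_NAME_NO_DIR}.table.json"
            with table_path.open() as f:
                table_json = json.load(f)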
            table_wandb = wandb.Table.from_json(table_json, table_remote)
            table_local = pd.DataFrame(
                data=table_wandb.data,
                columns=table_wandb.columns,
            ).copy()
            table_local["run_id"] = run_id
            logging.info(f"Successfully downloaded table for run {run_id}")
            return table_local
        except wandb.errors.Error as e:  # Retry only on W&B-specific errors
            logging.error(f"Error on attempt {retries + 1} for run {run_id}: {e}")
            time.sleep(delay)  # Wait before retrying
            retries += 1
            delay *= 2  # Exponential backoff
        except Exception as e:  # Any other exception aborts without retrying
            logging.error(f"Error on attempt {retries + 1} for run {run_id}: {e}")
            break
    logging.warning(
        f"Failed to download table for run {run_id} after {max_retries} retries.",
    )

    return None  # Return None if all retries fail

It seems like it may be related to failing to find a table (some runs don’t succeed due to connection issues etc. during the sweep), but I think that should already be handled. Any ideas?
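
One observation from the log above: every failure is reported as "attempt 1", and the "after 3 retries" warning follows immediately. That suggests the connection error is not a wandb.errors.Error, so it lands in the generic except branch, which breaks out without retrying. Below is a minimal sketch of a retry helper that treats network failures as retryable. It assumes the underlying exception is the requests library's ConnectionError (which derives from OSError); retry_with_backoff and its parameters are hypothetical names for illustration, not part of wandb.

```python
import logging
import time


def retry_with_backoff(func, retryable=(OSError,), max_retries=3, delay=1.0,
                       sleep=time.sleep):
    """Call func() until it succeeds or max_retries attempts have failed.

    Only exceptions listed in `retryable` trigger another attempt; anything
    else propagates to the caller. Returns None if every attempt fails.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except retryable as exc:
            logging.error("Attempt %d of %d failed: %s", attempt, max_retries, exc)
            if attempt < max_retries:
                sleep(delay)  # wait before the next attempt
                delay *= 2    # exponential backoff
    return None
```

The download steps for a single run could then be passed in as a zero-argument callable, with wandb.errors.Error added to the retryable tuple if W&B-specific errors should be retried as well.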

Hi @bkaplowitz, Thank you for providing the details and the code you are running when encountering this issue. I will work on reproducing the error you are seeing on our side and will get back to you with additional information.

Hi @bkaplowitz - apologies for my late reply on this one; it slipped through the net, and I am truly sorry about that.

With wandb 0.17.0 we have added a new API function, artifact_exists, which lets you check whether an artifact exists before trying to download it (see the PR here), with something like:

import wandb

api = wandb.Api()
if api.artifact_exists('entity/project/artifact_name:latest'):
    artifact = api.artifact('entity/project/artifact_name:latest')

This should prevent errors when the run_table artifact doesn’t exist.
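
As a rough sketch of how this check could be applied to a whole sweep, the helper below partitions a list of artifact names into those that exist and those that do not. split_by_existence is a hypothetical helper name, and the only W&B call it relies on is artifact_exists on the object passed in as api. Note that the check itself is a network request, so it filters out missing artifacts but would not by itself avoid the name-resolution failures shown in the log.

```python
def split_by_existence(api, artifact_names):
    """Partition artifact names into (existing, missing) lists.

    `api` is expected to be a wandb.Api() instance; only its
    artifact_exists(name) method is used here.
    """
    existing, missing = [], []
    for name in artifact_names:
        if api.artifact_exists(name):
            existing.append(name)
        else:
            missing.append(name)
    return existing, missing


# Usage (needs wandb >= 0.17.0 and network access; run_ids is a
# hypothetical list of run IDs from the sweep):
#   import wandb
#   api = wandb.Api()
#   names = [f"{ENTITY}/{PROJECT}/run-{r}-{TABLE_NAME}:{TAG}" for r in run_ids]
#   existing, missing = split_by_existence(api, names)
```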

Let me know if you have any questions on this, and once again, apologies for the slow reply.

Hi @bkaplowitz, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved with the new function.

Hi @bkaplowitz, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!