Hi, I am running into an issue I previously posted about: network errors / ConnectionError. It had resolved for a while, but has now returned.
The issue takes the form of a connection error like:
wandb: Network error (ConnectionError), entering retry loop.
2024-03-22 12:26:51 - ERROR - Error on attempt 1 for run 84pul6a3: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x2beee3e30>: Failed to resolve 'api.wandb.ai' ([Errno 8] nodename nor servname provided, or not known)"))
2024-03-22 12:26:51 - INFO - Attempting to download table for run og7hmi2r
2024-03-22 12:26:53 - ERROR - Error on attempt 1 for run mjjj169q: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /wandb-production.appspot.com/bkaplowitz/consumption-savings-rl/mjjj169q/artifact/762588934/wandb_manifest.json?Expires=1711128413&GoogleAccessId=gorilla-files-url-signer-man%40wandb-production.iam.gserviceaccount.com&Signature=dI4LMkTnnKvOkMNZuBeUxzzkS%2BrN3%2FrFioQ6aJP0E735ZV%2FPTpoFt5LBq86fZsgBobh4YMtm5UjzIiWqts8YWUqO%2BKh%2BFXKqgr862IwQDG6iCqQqwzHndlMvCaGx5ikjeY7R%2Bhu17bjCU4bOT8on14J0CFvyXgdzfX%2FuIrCGYYb4QVKjozHTLVskZioMmVIM6EFXrQmY1pUP6RM6klYIgXm9hD2PXEVljTgFDbpTRn0O2PpZ9P1%2FwV1yFTDX2H1A2xmT5C13cTddDJtaOJ1gMGMfqAh8%2B%2BHb4vrRfpUuXfot4n5VHiCrMpximA0xb41U75ajDw4%2BrdBhQ9jsCNGzzw%3D%3D (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x2a0623b60>: Failed to resolve 'storage.googleapis.com' ([Errno 8] nodename nor servname provided, or not known)"))
2024-03-22 12:26:53 - WARNING - Failed to download table for run mjjj169q after 3 retries.
2024-03-22 12:26:53 - INFO - Attempting to download table for run 36ilco9p
2024-03-22 12:26:53 - ERROR - Error on attempt 1 for run gl48xm96: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x2a03e7470>: Failed to resolve 'api.wandb.ai' ([Errno 8] nodename nor servname provided, or not known)"))
2024-03-22 12:26:53 - INFO - Attempting to download table for run df7x71rx
wandb: Network error (ConnectionError), entering retry loop.
wandb: Network error (ConnectionError), entering retry loop.
wandb: Network error (ConnectionError), entering retry loop.
wandb: Network error (ConnectionError), entering retry loop.
wandb: Network error (ConnectionError), entering retry loop.
wandb: Network error (ConnectionError), entering retry loop.
2024-03-22 12:27:27 - ERROR - Error on attempt 1 for run t1djbem8: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x2c14054c0>: Failed to resolve 'api.wandb.ai' ([Errno 8] nodename nor servname provided, or not known)"))
2024-03-22 12:27:27 - INFO - Attempting to download table for run 5f2uzj5c
wandb: Network error (ConnectionError), entering retry loop.
This happens when trying to download a large number of tables (about 3,000) that I generated via a sweep of runs. The first 600 or so download fine before the errors start appearing.
Eventually it gets completely stuck in a retry loop.
My code for the download is:
def download_table(run_id, max_retries=3):
    retries = 0
    delay = 1
    while retries < max_retries:
        try:
            logging.info(f"Attempting to download table for run {run_id}")
            run_name = f"run-{run_id}"
            table_remote = current_analysis.use_artifact(
                f"{ENTITY}/{PROJECT}/{run_name}-{TABLE_NAME}:{TAG}",
                type="run_table",
            )
            table_dir = table_remote.download()
            table_json = json.load(
                Path.open(f"{table_dir}/{CHART_DIR}/{TABLE_NAME_NO_DIR}.table.json"),
            )
            table_wandb = wandb.Table.from_json(table_json, table_remote)
            table_local = pd.DataFrame(
                data=table_wandb.data,
                columns=table_wandb.columns,
            ).copy()
            table_local["run_id"] = run_id
            logging.info(f"Successfully downloaded table for run {run_id}")
            return table_local
        except wandb.errors.Error as e:  # catch W&B API specific errors and retry
            logging.error(f"Error on attempt {retries + 1} for run {run_id}: {e}")
            time.sleep(delay)  # Wait before retrying
            retries += 1
            delay *= 2  # Exponential backoff
        except Exception as e:  # any other error: log it and give up on this run
            logging.error(f"Error on attempt {retries + 1} for run {run_id}: {e}")
            break
    logging.warning(
        f"Failed to download table for run {run_id} after {max_retries} retries.",
    )
    return None  # Return None if all retries fail
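For context, download_table is called in a loop over the runs of the sweep, roughly like the simplified sketch below (not my exact code; the wandb.Api query, SWEEP_ID, and the concatenation at the end are stand-ins for what my script does):

import pandas as pd
import wandb

api = wandb.Api()
sweep = api.sweep(f"{ENTITY}/{PROJECT}/{SWEEP_ID}")  # SWEEP_ID is a placeholder here

tables = []
for run in sweep.runs:  # roughly 3,000 runs in total
    table = download_table(run.id)
    if table is not None:  # skip runs whose table could not be downloaded
        tables.append(table)

all_tables = pd.concat(tables, ignore_index=True)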
It seems like it may be related to failing to find a table for some runs (since some runs in the sweep didn't succeed due to connection issues, etc.), but I think that case should be handled by the except clauses. Any ideas?
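In case it helps, this is the kind of check I had in mind for the missing-table case, a minimal sketch of a hypothetical helper that is not part of my script, assuming a missing artifact surfaces as wandb.errors.CommError:

import wandb

api = wandb.Api()

def table_exists(run_id):
    # Hypothetical helper (not in my script): check whether the table artifact
    # exists for a run, assuming a missing artifact raises wandb.errors.CommError.
    try:
        api.artifact(
            f"{ENTITY}/{PROJECT}/run-{run_id}-{TABLE_NAME}:{TAG}",
            type="run_table",
        )
        return True
    except wandb.errors.CommError:
        return False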