Wandb port unreachable from SLURM

Hi,

I was logging to the Wandb server from a SLURM cluster, and it worked without any issues until yesterday, but the connection has suddenly started failing. I tried logging in again, but that doesn't fix the issue. I also tried pinging the server with “ping api.wandb.ai”. I can see the cluster trying to send data to the remote wandb server, but I get a “Destination Port Unreachable” message.

PING api.wandb.ai (35.186.228.49) 56(84) bytes of data.
From xxxxxxx.xxxxx.xxxx icmp_seq=1 Destination Port Unreachable
ping: sendmsg: Operation not permitted

I checked the debug logs and I see a connection error from urllib3. Could you please help me solve the issue?

Error Trace:

Traceback (most recent call last):
  File "/xxxx/xxxx/anaconda3/envs/xxxxx/lib/python3.8/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/xxxx/xxxx/anaconda3/envs/xxxxx/lib/python3.8/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/xxxx/xxxx/anaconda3/envs/xxxxx/lib/python3.8/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/xxxx/xxxx/anaconda3/envs/xxxxx/thesis/lib/python3.8/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/xxxx/xxxx/anaconda3/envs/xxxxx/lib/python3.8/site-packages/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/xxxx/xxxx/anaconda3/envs/xxxxx/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/xxxx/xxxx/anaconda3/envs/xxxxx/lib/python3.8/site-packages/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
  File "/xxxx/xxxx/anaconda3/envs/xxxxx/lib/python3.8/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f19b4219580>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/xxxx/xxxx/anaconda3/envs/xxxxx/lib/python3.8/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/xxxx/xxxx/anaconda3/envs/xxxxx/lib/python3.8/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/xxxx/xxxx/anaconda3/envs/xxxxx/lib/python3.8/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f19b4219580>: Failed to establish a new connection: [Errno 111] Connection refused'))

Hello, this is a reply from our support bot designed to assist you with your Weights & Biases related queries. To reach a human please reply to this message.

It seems like you’re encountering a network connectivity issue that prevents your SLURM cluster from communicating with the Weights & Biases server. This could be due to several reasons, including network configuration changes, firewall rules, or issues with the Weights & Biases server itself. Here are some steps to troubleshoot and potentially resolve the issue:

  1. Verify Network Configuration: Ensure that the network configuration on your SLURM cluster hasn’t changed. Sometimes, network policies or firewall settings are updated, which can block outgoing connections to specific ports or addresses.

  2. Check Firewall Rules: Verify that your firewall rules allow outgoing connections to api.wandb.ai on port 443. You might need to work with your network administrator to ensure that these connections are not being blocked.

  3. Test Connectivity: Use tools like ping and traceroute (or tracert on Windows) to test connectivity to api.wandb.ai. Note that ping uses ICMP rather than TCP, so a failed ping does not by itself show that port 443 is blocked, and some servers do not answer ping at all. traceroute can help identify where the connection is being blocked.

  4. Use curl or wget: Try using curl or wget from your SLURM cluster to access https://api.wandb.ai. This can help confirm if the HTTPS connection can be established. For example:

    curl -I https://api.wandb.ai
    

    This command should return HTTP headers if the connection is successful.

  5. Review W&B Server Logs: If you’re using a self-hosted W&B server, check the server logs for any errors or warnings that might indicate why connections are being refused.

  6. Check W&B Status: Visit the Weights & Biases status page to ensure there are no ongoing issues with the W&B service that could be affecting connectivity.

  7. SLURM Configuration: Ensure that your SLURM job scripts do not contain any settings that could restrict network access. Some clusters have specific partitions or configurations for jobs requiring external network access.

  8. Contact W&B Support: If you’ve gone through these steps and are still facing issues, it might be helpful to contact Weights & Biases support for further assistance. You can reach out to them at support@wandb.com.
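If curl or wget are not available on the node, the connectivity test in step 4 can be approximated from Python. The following is a minimal sketch, not part of the W&B client: the helper names `check_tcp` and `describe_node` are made up for illustration, and it assumes only the Python standard library. It attempts a plain TCP connection and reports whether the failure is a DNS problem, an active refusal (the `[Errno 111]` in the traceback), or a timeout.

```python
import os
import socket


def check_tcp(host, port=443, timeout=5.0):
    """Try a plain TCP connection and classify the outcome as a short string."""
    try:
        # create_connection resolves the host name and connects; it raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.gaierror:
        return "dns_error"  # the host name could not be resolved
    except ConnectionRefusedError:
        return "refused"    # Errno 111, the same error as in the traceback
    except socket.timeout:
        return "timeout"    # no answer within the timeout; packets may be dropped
    except OSError as exc:
        return f"os_error: {exc}"  # anything else, e.g. network unreachable


def describe_node():
    """Report which machine ran the check and whether it ran inside a SLURM job."""
    job_id = os.environ.get("SLURM_JOB_ID", "none (not inside a SLURM job)")
    return f"host={socket.gethostname()} slurm_job={job_id}"
```

Running `print(describe_node(), check_tcp("api.wandb.ai"))` once on a login node and once inside a job can show whether only the compute nodes are blocked. A result of "refused" would suggest that something is actively rejecting the connection, for example a firewall rule, while "timeout" would suggest that packets are being silently dropped.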

Given the error trace you provided, it’s clear that the issue lies with establishing a connection to the W&B API server. This is typically a network-level issue rather than a problem with your code or the W&B client library itself. Therefore, focusing on network troubleshooting steps is likely to be the most effective approach to resolving this issue.

-WandBot :robot: