Run timed out with nequip

Hello,
I am trying to run nequip with wandb from a job file. I get a runtime error in the job, but if I run the same nequip command on the login node it works without any error. Any suggestions would be helpful.

Hi @mmou - thank you for reaching out and sorry to hear you are experiencing this.

On the compute node used to run the job, how are you logged into wandb? Are you using wandb.login() or are you setting a WANDB_API_KEY environment variable?

It would also be useful to troubleshoot if you could:

  • Run curl -v https://api.wandb.ai on the compute node to test whether it can connect to the W&B cloud.
  • Share the debug.log and debug-internal.log files, which you should find in the ./wandb/run-<date_time>-<runid>/logs folder on the machine logging the run.
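
For anyone who needs to collect those two log files from inside a job, here is a minimal Python sketch using only the standard library. The function name find_wandb_logs and the default root of "." are illustrative assumptions; the folder pattern is the one quoted above.

```python
from pathlib import Path


def find_wandb_logs(root="."):
    """Return debug.log and debug-internal.log paths under <root>/wandb/run-*/logs.

    `root` is assumed to be the directory the run was launched from.
    """
    base = Path(root) / "wandb"
    found = []
    for name in ("debug.log", "debug-internal.log"):
        # One match per run directory; sorted so the output is stable.
        found.extend(sorted(base.glob(f"run-*/logs/{name}")))
    return found


# Example use (prints nothing if no run directories exist yet):
# for path in find_wandb_logs():
#     print(path)
```

If uploading files is not possible, pasting the last lines of each file as text usually carries the relevant error.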

Hi @mmou, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

It’s not resolved. Unfortunately I cannot upload any files.

I just set this in the config.yaml file: wandb: true and wandb_project: BaO. How do I use wandb offline? I previously put wandb offline in config.yaml and it did not run.

You can use wandb in offline mode by exporting the environment variable WANDB_MODE='offline'. In this mode the data is not automatically synced to the W&B server, so you will have to sync it manually afterwards with wandb sync. See our docs on environment variables.
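
As a rough illustration of the offline workflow, the sketch below builds a child-process environment with WANDB_MODE set to offline and then lists the wandb sync commands to run later from a node with network access. The helper names offline_env and sync_commands are hypothetical, and the offline-run-* directory pattern is an assumption about how wandb names offline run folders; check your own ./wandb directory before relying on it.

```python
import os
from pathlib import Path


def offline_env(base=None):
    """Return a copy of the environment with WANDB_MODE=offline added."""
    env = dict(os.environ if base is None else base)
    env["WANDB_MODE"] = "offline"
    return env


def sync_commands(root="."):
    """Build one `wandb sync <dir>` command per offline run directory.

    Assumes offline runs are stored as <root>/wandb/offline-run-*.
    """
    runs = sorted((Path(root) / "wandb").glob("offline-run-*"))
    return [["wandb", "sync", str(run)] for run in runs]


# Example use, where `train_cmd` is a placeholder for your own training command:
# import subprocess
# subprocess.run(train_cmd, env=offline_env(), check=True)
# for cmd in sync_commands():
#     subprocess.run(cmd, check=True)  # run this part where the network is available
```

Setting the variable in the job's environment, rather than in config.yaml, matches the advice above: WANDB_MODE is read from the environment, not from the nequip config.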

Can you run the command curl -v https://api.wandb.ai on the compute node where you are seeing the timeout error? This would help us understand whether that node can contact the W&B server.

  • Trying 35.186.228.49:443…
  • Connected to api.wandb.ai (35.186.228.49) port 443 (#0)
  • ALPN, offering h2
  • ALPN, offering http/1.1
  • CAfile: /etc/pki/tls/certs/ca-bundle.crt
  • TLSv1.0 (OUT), TLS header, Certificate Status (22):
  • TLSv1.3 (OUT), TLS handshake, Client hello (1):
  • TLSv1.2 (IN), TLS header, Certificate Status (22):
  • TLSv1.3 (IN), TLS handshake, Server hello (2):
  • TLSv1.2 (IN), TLS header, Finished (20):
  • TLSv1.2 (IN), TLS header, Unknown (23):
  • TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
  • TLSv1.3 (IN), TLS handshake, Certificate (11):
  • TLSv1.3 (IN), TLS handshake, CERT verify (15):
  • TLSv1.3 (IN), TLS handshake, Finished (20):
  • TLSv1.2 (OUT), TLS header, Finished (20):
  • TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
  • TLSv1.2 (OUT), TLS header, Unknown (23):
  • TLSv1.3 (OUT), TLS handshake, Finished (20):
  • SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
  • ALPN, server accepted to use h2
  • Server certificate:
  • subject: CN=api.wandb.ai
  • start date: Jun 29 14:34:11 2024 GMT
  • expire date: Sep 27 15:26:45 2024 GMT
  • subjectAltName: host "api.wandb.ai" matched cert's "api.wandb.ai"
  • issuer: C=US; O=Google Trust Services; CN=WR3
  • SSL certificate verify ok.
  • Using HTTP2, server supports multi-use
  • Connection state changed (HTTP/2 confirmed)
  • Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
  • TLSv1.2 (OUT), TLS header, Unknown (23):
  • TLSv1.2 (OUT), TLS header, Unknown (23):
  • TLSv1.2 (OUT), TLS header, Unknown (23):
  • Using Stream ID: 1 (easy handle 0x55761967ed20)
  • TLSv1.2 (OUT), TLS header, Unknown (23):

GET / HTTP/2
Host: api.wandb.ai
user-agent: curl/7.76.1
accept: */*

  • TLSv1.2 (IN), TLS header, Unknown (23):
  • TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
  • TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
  • old SSL session ID is stale, removing
  • TLSv1.2 (IN), TLS header, Unknown (23):
  • TLSv1.2 (OUT), TLS header, Unknown (23):
  • TLSv1.2 (IN), TLS header, Unknown (23):
  • TLSv1.2 (IN), TLS header, Unknown (23):
    < HTTP/2 404
    < content-type: text/plain; charset=utf-8
    < vary: Origin
    < x-content-type-options: nosniff
    < date: Fri, 02 Aug 2024 02:55:32 GMT
    < content-length: 19
    < via: 1.1 google
    < alt-svc: h3=“:443”; ma=2592000,h3-29=“:443”; ma=2592000
    <
  • TLSv1.2 (IN), TLS header, Unknown (23):
    404 page not found
  • TLSv1.2 (IN), TLS header, Unknown (23):
  • TLSv1.2 (OUT), TLS header, Unknown (23)
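
In the output above the TLS handshake completes and the server answers with HTTP/2 404, so the machine that ran curl did reach api.wandb.ai. The same check can be sketched in Python and run from inside the job script, which helps confirm the result really comes from the compute node. The function name probe, the 10 second timeout, and the injectable opener parameter are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def probe(url="https://api.wandb.ai", timeout=10, opener=urllib.request.urlopen):
    """Return (reachable, detail) for `url`.

    Any HTTP response counts as reachable, including the 404 that the
    root path returns in the curl output above. Only connection-level
    failures (DNS, refused connection, timeout) count as unreachable.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        # The server answered, just not with a 2xx status.
        return True, f"HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        return False, f"{type(err).__name__}: {err}"


# Example use inside the job:
# reachable, detail = probe()
# print("api.wandb.ai reachable:", reachable, "-", detail)
```

If this reports unreachable on the compute node but reachable on the login node, the compute node likely has no outbound access, and offline mode is the usual workaround.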

Hi @mmou, apologies for the slow reply here.

Did you get this output both on the login and the compute node?
How do you authenticate on the compute node?
Would you be able to set WANDB_ENTITY (to your team or personal entity name) and WANDB_API_KEY on the compute node, to ensure it is logging data with your user correctly?
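
A quick way to confirm that the two variables actually reach the job is to check for them at the start of the job. This is a minimal sketch; the helper name missing_wandb_vars is hypothetical, and it deliberately reports only the variable names, never the key itself.

```python
import os

# The two variables named in the reply above.
REQUIRED = ("WANDB_API_KEY", "WANDB_ENTITY")


def missing_wandb_vars(env=None):
    """Return the names of required W&B variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


# Example use at the top of the job:
# missing = missing_wandb_vars()
# if missing:
#     print("Not set in this job's environment:", ", ".join(missing))
```

Schedulers do not always forward the login shell's environment to the job, so a variable exported on the login node may still be missing here.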

Hi @mmou, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

Hey @mmou, hope all is well. As we have not heard back, we're going to close this off for you, but please do reach out to us on this or anything else in the future.