Run.finish() doesn't finish the run

When I call run.finish() in a Jupyter notebook, it doesn’t finish the run that was initialized by run = wandb.init(project="myproject", resume=True). Instead, it hangs indefinitely until I interrupt it with Ctrl+C. After that, the run cannot be finished correctly in any way.

I have found several similar issues here in the forums, but it always looked like some big artifacts were uploading. It’s not my case. I don’t have any big artifacts. Instead, I just did a few simple calls using LangChain to GPT.

Very rarely, it works ok. For example just now, I’ve upgraded from wandb 0.15.7 to 0.15.8, and suddenly the first run that I tried could be finished OK. But then any further run cannot be finished.

These are the last lines in debug.log:

2023-08-03 16:14:23,573 INFO    MainThread:3124 [wandb_init.py:_pause_backend():418] pausing backend
2023-08-03 16:14:26,417 INFO    MainThread:3124 [wandb_init.py:_resume_backend():423] resuming backend
2023-08-03 16:27:28,618 INFO    MainThread:3124 [jupyter.py:save_ipynb():373] not saving jupyter notebook
2023-08-03 16:27:28,618 INFO    MainThread:3124 [wandb_init.py:_pause_backend():418] pausing backend
2023-08-03 16:27:49,824 INFO    MainThread:3124 [wandb_init.py:_resume_backend():423] resuming backend
2023-08-03 16:27:49,825 INFO    MainThread:3124 [wandb_run.py:_finish():1894] finishing run [the name of my run]
2023-08-03 16:27:49,825 INFO    MainThread:3124 [jupyter.py:save_history():445] not saving jupyter history
2023-08-03 16:27:49,825 INFO    MainThread:3124 [jupyter.py:save_ipynb():373] not saving jupyter notebook
2023-08-03 16:27:49,826 INFO    MainThread:3124 [wandb_init.py:_jupyter_teardown():435] cleaning up jupyter logic
2023-08-03 16:27:49,826 INFO    MainThread:3124 [wandb_run.py:_atexit_cleanup():2128] got exitcode: 0
2023-08-03 16:27:49,826 INFO    MainThread:3124 [wandb_run.py:_restore():2111] restore
2023-08-03 16:27:49,827 INFO    MainThread:3124 [wandb_run.py:_restore():2117] restore done

and this is in debug-internal.log:

2023-08-03 16:36:30,058 DEBUG   SenderThread:3149 [sender.py:send_request():406] send_request: poll_exit
2023-08-03 16:36:31,058 DEBUG   HandlerThread:3149 [handler.py:handle_request():144] handle_request: poll_exit
2023-08-03 16:36:31,058 DEBUG   SenderThread:3149 [sender.py:send_request():406] send_request: poll_exit
2023-08-03 16:36:32,058 DEBUG   HandlerThread:3149 [handler.py:handle_request():144] handle_request: poll_exit

Apparently it’s just growing with the same lines - now some 5 minutes after the run.finish() call which can be seen from debug.log.
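To put a number on how long the process keeps polling after the finish call, the two excerpts above can be compared directly. The following is a minimal sketch that assumes both logs use exactly the line layout shown in this post; the helper names `parse_line` and `seconds_polling_after_finish` are made up for illustration and are not part of wandb.

```python
import re
from datetime import datetime

# Line layout assumed from the excerpts above, e.g.
# 2023-08-03 16:36:30,058 DEBUG   SenderThread:3149 [sender.py:send_request():406] send_request: poll_exit
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"\w+\s+\S+:\d+\s+\[[^\]]+\]\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, message) for a log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
    return ts, match.group("msg")


def seconds_polling_after_finish(debug_lines, internal_lines):
    """Seconds between the last 'finishing run' entry in debug.log and the
    last poll_exit entry in debug-internal.log, or None if either is missing."""
    finish_ts = None
    for line in debug_lines:
        parsed = parse_line(line)
        if parsed and parsed[1].startswith("finishing run"):
            finish_ts = parsed[0]  # keep the most recent finish call
    last_poll = None
    for line in internal_lines:
        parsed = parse_line(line)
        if parsed and parsed[1].endswith("poll_exit"):
            last_poll = parsed[0]  # keep the most recent poll
    if finish_ts is None or last_poll is None:
        return None
    return (last_poll - finish_ts).total_seconds()
```

Both arguments are any iterable of lines, so open file objects for debug.log and debug-internal.log can be passed in directly. Running this a few times while finish() is still hanging shows whether the polling is still going on.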

I’m running it on a simple default Azure VM with Ubuntu and Python 3.11.4.

Could you please give me some tips on where the problem might be? It’s a very annoying issue, almost blocking me from using WandB at all, because I then have to delete every run that I create :frowning: Otherwise, it remains in the “Running” state (and there is no way to finish it in the web interface).

Thank you very much

Hello @jan-romportl !

Could you send the full debug logs and a code snippet for your run? It looks like this may be an issue with Jupyter Notebook competing with wandb.finish() to finish the run, but having the full stack trace will help in determining the true origin.

Hi @raphael-sanandres, I’ll be happy to send them. What exactly do you mean by “debug logs”? All the log files stored locally for the given run? Or maybe some other stuff from there too? And which parts of the code do you mean by “code snippet”? The LangChain calls etc.? (That’s more complex than just a simple snippet.) Or just the part where I initialize wandb?

If I should send full logs and more detailed code, isn’t it better if I send it somewhere by mail?

Hi @raphael-sanandres once more, today I realized that if I let wandb.finish() or run.finish() run for almost 16 minutes, then it eventually finishes the run. Otherwise, if I press Ctrl+C sooner than 16 minutes in, the run will eventually appear as Crashed after some time.

I’m attaching the full debug-internal.log.

As for the code snippets, I initially set these env variables:

os.environ["LANGCHAIN_WANDB_TRACING"] = "true"
os.environ["WANDB_PROJECT"] = "myprojectname"
os.environ["WANDB_NOTEBOOK_NAME"] = "10_user_model_init.ipynb"

then call

run = wandb.init(project="myprojectname", resume=True)

and after all LangChain calls, I’ll attempt to finish by

run.finish()

or

wandb.finish()

(both with the same results)

It’s harder to show the exact LangChain calls because they’re wrapped in some of our classes. But all our other team members who use exactly the same classes have absolutely no problems when finishing their runs in WandB. It’s just me.

I use Jupyter Notebook inside VS Code (but so do my teammates).

The full debug log is here: debug-internal.log.zip (Dropbox)

(it cannot be uploaded as a file here)

Thanks for sending the debug logs as well as the explanation that it eventually closes! I am seeing this in the log:

wandb-upload_0:3402 [internal_api.py:upload_file():2496] upload_file exception

This suggests that the run is losing its connection to wandb (or that something is disrupting the connection). Do you happen to be behind a VPN/proxy/load balancer that could be affecting your connection? Or perhaps on a weak connection? It looks like your run is attempting to upload output.log to wandb and the connection is getting lost in the process.

Last thing, what wandb version are you currently using?
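To see how often the upload failure quoted above occurs, and what follows it in the log, debug-internal.log can be scanned for that message. This is a minimal sketch: `find_upload_exceptions` is a hypothetical helper, and it assumes only that the literal text "upload_file exception" appears on the failing line, as in the excerpt.

```python
def find_upload_exceptions(lines, context=0):
    """Return (line_number, [line, ...following context lines]) for every
    log line containing 'upload_file exception'. Line numbers are 1-based."""
    lines = list(lines)
    hits = []
    for index, line in enumerate(lines):
        if "upload_file exception" in line:
            # Include a few following lines, where a traceback would appear if one was logged.
            chunk = [l.rstrip("\n") for l in lines[index:index + 1 + context]]
            hits.append((index + 1, chunk))
    return hits
```

Passing an open file object for debug-internal.log with, say, context=10 prints enough of the surrounding lines to see which file was being uploaded and which error was raised.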

@raphael-sanandres thanks a lot for the reply. I’m using WandB 0.15.8. As for the connection: I’m running it on a typical Azure Ubuntu Linux VM with no specific firewall set up, basically everything in the default setup. So I guess there cannot be a connectivity problem between the Azure VM and the wandb server. Could it only be some ports? Is there anything special needed in the firewall setup for wandb?

Locally on my computer, I’m using VS Code to connect remotely (via ssh) to that Azure VM. So the user interface of VS Code runs locally on my MacBook, which means there could be some connectivity issues (and I’m using ProtonVPN). But the respective Python notebook kernel runs remotely on the Azure VM. Or do you think some connectivity issues can happen because of the local VS Code? Both output.log and debug-internal.log are stored remotely on the Azure VM, so the upload attempts are probably happening between the Azure VM and the wandb server, without involving my local connection?

This is likely the culprit. There appears to be some sort of connection issue between the wandb processes on your VM and the wandb server, so I would suggest looking at the setup there to make sure there is nothing slowing down the process.
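One quick way to start checking the setup on the VM is to test whether an outbound TCP connection can be opened from the same machine that runs the notebook kernel. This is a minimal sketch using only the Python standard library; `can_connect` is a hypothetical helper, not a wandb function. It only shows that a port is reachable and says nothing about whether a longer upload would survive, so a passing check does not rule out the problem.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, can_connect("api.wandb.ai") run on the Azure VM checks the HTTPS port of the public W&B API host. That hostname is an assumption that applies to the default cloud setup; file uploads may go to additional storage hosts that this check does not cover.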

Hi Jan,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

Hi @uma-wandb, unfortunately this issue still hasn’t been resolved. After consulting our WandB account executive (i.e. the person who sold us the WandB licenses for our team), I submitted the issue to support@wandb.com under ticket number #50157. Unfortunately, it has already been 4 days and I haven’t heard anything from them.

I’ve consulted our CTO, who is responsible for setting up our platform on Azure, and he isn’t aware of any special networking setup that would prevent this Azure VM instance from communicating with WandB servers.

Moreover, it’s strange that all other artifacts and logs are always uploaded without any problem during the run itself. Only uploading the file output.log at the end, after calling wandb.finish(), causes trouble.

I guess if there were a network issue between the Azure VM and the WandB server, it would also affect uploading other artifacts, right? Now it’s just the final log. So it probably doesn’t look like a typical networking issue?

We have no idea where to look and what possible network settings to check, that’s the reason why I created the official support ticket. But as I said, no response from them :frowning:

Hey @jan-romportl, sorry to hear about this and apologies for any delay in solving this as well. Thank you for checking with your CTO about your networking setup. Could you try one last potential workaround for me?

Prior to running your script, could you try setting the env var WANDB_DISABLE_SERVICE=True? Please let me know if you reach the same behavior after running this, and I will escalate this to our SDK team.

Best,

Uma
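The workaround suggested above could be tried in the notebook roughly as follows. This is a sketch under one assumption: the variable has to be set in the kernel process before wandb is imported, so the kernel should be restarted first. The commented lines repeat the calls from the earlier posts and are not new API.

```python
import os

# Set this in a fresh kernel, before wandb is imported or a run is started.
os.environ["WANDB_DISABLE_SERVICE"] = "True"

# The rest of the notebook stays as in the earlier posts:
# import wandb
# run = wandb.init(project="myprojectname", resume=True)
# ... LangChain calls ...
# run.finish()
```

If run.finish() still hangs with this set, that result, together with the wandb version, is the information requested for escalation.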

Hi Jan, since we have not heard back from you on this thread, we are going to close this request. I sent a follow-up via email, and would love to continue troubleshooting there. If you would like to re-open the conversation here, please let us know!
