Taking forever to finish after Waiting for W&B process to finish... (success)

First-time user here. My runs take forever to finish up. It looks like the run is stuck, but it just takes a really long time to finish after the message mentioned in the title of this post appears. It can take up to 20 minutes.

I’m performing several runs where the script looks something like the following. Because I’m still getting things working, I often start a run but interrupt/crash it halfway, so sometimes wandb.finish() isn’t called for a test run.

I’ve seen this problem reported loads of times, but haven’t really seen many answers. Just devs closing the threads because of inactivity :confused:
I hope the below information is enough to pinpoint the problem. I can’t upload all my code unfortunately.


import wandb

wandb.init(project="mytest", name='test_run', entity='mycomp')

train(num_epochs)  # which calls wandb.log({"some_var": some_val})

wandb.finish()
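
As a side note, since I often interrupt runs halfway, here is a minimal sketch of how I could guard against never calling wandb.finish() (train and num_epochs are placeholders for my own code, not a W&B API):

import wandb

run = wandb.init(project="mytest", name='test_run', entity='mycomp')
try:
    train(num_epochs)  # placeholder: my training loop, which calls wandb.log({...})
finally:
    # Ensure the run is marked as finished even if the cell is interrupted.
    wandb.finish()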

My debug.log ends in:

2023-06-28 16:01:43,966 INFO    MainThread:345959 [wandb_run.py:_config_callback():1283] config_cb ('_wandb', 'visualize', 'batch confusion matrix') {'panel_type': 'Vega2', 'panel_config': {'panelDefId': 'wandb/confusion_matrix/v1', 'fieldSettings': {'Actual': 'Actual', 'Predicted': 'Predicted', 'nPredictions': 'nPredictions'}, 'stringSettings': {'title': ''}, 'transform': {'name': 'tableWithLeafColNames'}, 'userQuery': {'queryFields': [{'name': 'runSets', 'args': [{'name': 'runSets', 'value': '${runSets}'}], 'fields': [{'name': 'id', 'fields': []}, {'name': 'name', 'fields': []}, {'name': '_defaultColorIndex', 'fields': []}, {'name': 'summaryTable', 'args': [{'name': 'tableKey', 'value': 'batch confusion matrix_table'}], 'fields': []}]}]}}} None
2023-06-28 16:02:04,250 INFO    MainThread:345959 [wandb_run.py:_finish():1890] finishing run robin-radar/IrisTorch_test/o6snnjiv
2023-06-28 16:02:04,250 INFO    MainThread:345959 [jupyter.py:save_history():445] not saving jupyter history
2023-06-28 16:02:04,250 INFO    MainThread:345959 [jupyter.py:save_ipynb():373] not saving jupyter notebook
2023-06-28 16:02:04,250 INFO    MainThread:345959 [wandb_init.py:_jupyter_teardown():435] cleaning up jupyter logic
2023-06-28 16:02:04,250 INFO    MainThread:345959 [wandb_run.py:_atexit_cleanup():2124] got exitcode: 0
2023-06-28 16:02:04,250 INFO    MainThread:345959 [wandb_run.py:_restore():2107] restore
2023-06-28 16:02:04,250 INFO    MainThread:345959 [wandb_run.py:_restore():2113] restore done
2023-06-28 16:23:37,067 INFO    MainThread:345959 [wandb_run.py:_footer_history_summary_info():3467] rendering history
2023-06-28 16:23:37,067 INFO    MainThread:345959 [wandb_run.py:_footer_history_summary_info():3499] rendering summary
2023-06-28 16:23:37,069 INFO    MainThread:345959 [wandb_run.py:_footer_sync_info():3426] logging synced files

(The first line you see above is printed a load of times, so I didn’t copy the whole log)

debug-cli.log has a bunch of messages like this in it:


2023-06-28 16:01:44 WARNING Connection pool is full, discarding connection: api.wandb.ai. Connection pool size: 10
2023-06-28 16:01:44 WARNING Connection pool is full, discarding connection: storage.googleapis.com. Connection pool size: 10

My debug-internal.log tail looks like:

2023-06-28 16:23:35,928 DEBUG   SenderThread:349532 [sender.py:send_request():396] send_request: poll_exit
2023-06-28 16:23:35,928 DEBUG   HandlerThread:349532 [handler.py:handle_request():144] handle_request: sampled_history
2023-06-28 16:23:35,929 DEBUG   SenderThread:349532 [sender.py:send_request():396] send_request: server_info
2023-06-28 16:23:36,065 DEBUG   HandlerThread:349532 [handler.py:handle_request():144] handle_request: shutdown
2023-06-28 16:23:36,065 INFO    HandlerThread:349532 [handler.py:finish():854] shutting down handler
2023-06-28 16:23:36,928 INFO    WriterThread:349532 [datastore.py:close():298] close: /home/tim.kuipers/dev/deeplearning/sandbox/tims_iris_drone/wandb/run-20230628_155311-o6snnjiv/run-o6snnjiv.wandb
2023-06-28 16:23:37,065 INFO    SenderThread:349532 [sender.py:finish():1526] shutting down sender
2023-06-28 16:23:37,065 INFO    SenderThread:349532 [file_pusher.py:finish():159] shutting down file pusher
2023-06-28 16:23:37,065 INFO    SenderThread:349532 [file_pusher.py:join():164] waiting for file pusher

Please note that at around 16:40 the Jupyter cell was still running.

Ubuntu 20.04, Jupyter notebook in VS Code, wandb version 0.15.4

Please advise.

Hi @tim-kuipers, thanks for reporting this! This can be caused by several things. One factor may be wandb-related processes running in the background, so would it be possible to check for those while your run is stuck in the finish process? Another potential cause is logging a lot of data to wandb, so I was wondering whether you see all your metrics and steps in the UI while this is happening, and whether you could share a link to the run so I can have a look at it. Would you also mind sharing the whole debug.log and debug-internal.log files so I can take a deeper look? Feel free to send them via email to luis.bergua@wandb.com.

Please be more precise about the steps you ask me to perform.
What steps would I need to execute to check whether I have related processes running?
I was running two different models in two Jupyter notebooks at the same time, so maybe that’s relevant.

The folder of my run was about 4MB on my computer, so I don’t think that was the cause.

I’m currently debugging some stuff. I’ll pay closer attention during the next real run and report its statistics more precisely when I do. I’ll get back to you soon.

@luis_bergua1 I’ve been running the same training model/script for the past few days without any issues, but since this morning I’ve been having the same problem as Tim.

I’m getting the same messages in debug-cli.log, and nothing is getting logged to W&B:

2023-06-28 16:01:44 WARNING Connection pool is full, discarding connection: api.wandb.ai. Connection pool size: 10
2023-06-28 16:01:44 WARNING Connection pool is full, discarding connection: storage.googleapis.com. Connection pool size: 1

Any tips on how I can proceed with debugging?

I’m replying here in the hope that solving this problem will help other people as well.

Just had the problem again: I’ve now been waiting 10 minutes since W&B reported that it has finished.

My program creates 500 batch confusion matrix artifact outputs. Those files are just 80 kB, though.

There are two wandb-service processes running when I look at my system monitor, perhaps because I am alternating between two notebooks to do experiments. Is this bad practice?
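
For anyone wanting to check this from Python rather than a system monitor, here is a minimal sketch using psutil (matching on the string 'wandb' is my own heuristic, not an official W&B check; ps aux | grep wandb in a terminal gives roughly the same information):

import psutil

# List processes whose name or command line mentions "wandb".
for proc in psutil.process_iter(['pid', 'name', 'cmdline']):
    cmdline = ' '.join(proc.info['cmdline'] or [])
    if 'wandb' in (proc.info['name'] or '') or 'wandb' in cmdline:
        print(proc.info['pid'], proc.info['name'], cmdline)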

During the wait, data is still being uploaded and I can see updates in all metrics in the UI web interface.

Here’s the project:

Hope you can help to solve the issue with all the information I have provided here!

debug-internal.log:

debug.log: (Doesn’t show anything happening between 11:08 and 11:20)

I notice now that I forgot to call wandb.finish() in one of the two notebooks. Might that cause these issues?

Hi @tim-kuipers and @schopra-linum-ai, thanks for providing all this information and for your patience! I took a look at the files you provided:

  • wandb.finish() taking several minutes: this happens because files are still being saved (I checked debug-internal.log). We’re working on improving artifacts with many small files, since uploading them currently takes a long time, and that is what’s affecting you here. It’s also why data keeps updating on the UI: the process is still running. (See the sketch after this list for one way to reduce the number of small files.)
  • From the logs I didn’t see any issues related to the notebooks, so this shouldn’t be a problem.
  • I’d recommend always calling wandb.finish() for your runs to ensure they are marked as finished properly, but this isn’t the cause in this specific case.
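
As an illustration of the first point, one way to reduce the number of small files is to log the confusion matrix once per epoch rather than once per batch. A minimal sketch, where validate, model, num_epochs, and class_names are placeholders for your own code:

import wandb

run = wandb.init(project="mytest", entity='mycomp')

for epoch in range(num_epochs):           # num_epochs is a placeholder
    y_true, y_pred = validate(model)      # placeholder validation loop
    # Logging one confusion matrix per epoch instead of per batch greatly
    # reduces the number of small files the file pusher has to upload.
    run.log({"epoch confusion matrix": wandb.plot.confusion_matrix(
        y_true=y_true, preds=y_pred, class_names=class_names)})

run.finish()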

Hope that this information is useful, please do let me know if something isn’t clear or if you need further assistance here!
