Taking forever to finish after Waiting for W&B process to finish... (success)

First-time user here. My runs take forever to finish up. It looks like the run is stuck, but it just takes a really long time to finish after the message mentioned in the title of this post appears. It can take up to 20 minutes.

I’m performing several runs where the script looks something like the following. Because I’m still getting things working, I often start a run but interrupt/crash it halfway, so sometimes wandb.finish() isn’t called for a test run.

I’ve seen this problem reported loads of times, but haven’t really seen many answers. Just devs closing the threads because of inactivity :confused:
I hope the below information is enough to pinpoint the problem. I can’t upload all my code unfortunately.


import wandb

wandb.init(project="mytest", name='test_run', entity='mycomp')

train(num_epochs)  # which calls wandb.log({"some_var": some_val})

wandb.finish()
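
As a side note, since I often interrupt runs halfway, here is a minimal sketch of how I could guard against never calling wandb.finish() (train and num_epochs are placeholders for my own code, not a W&B API):

import wandb

run = wandb.init(project="mytest", name='test_run', entity='mycomp')
try:
    train(num_epochs)  # placeholder: my training loop, which calls wandb.log({...})
finally:
    # Ensure the run is marked as finished even if the cell is interrupted.
    wandb.finish()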

My debug.log ends in:

2023-06-28 16:01:43,966 INFO    MainThread:345959 [wandb_run.py:_config_callback():1283] config_cb ('_wandb', 'visualize', 'batch confusion matrix') {'panel_type': 'Vega2', 'panel_config': {'panelDefId': 'wandb/confusion_matrix/v1', 'fieldSettings': {'Actual': 'Actual', 'Predicted': 'Predicted', 'nPredictions': 'nPredictions'}, 'stringSettings': {'title': ''}, 'transform': {'name': 'tableWithLeafColNames'}, 'userQuery': {'queryFields': [{'name': 'runSets', 'args': [{'name': 'runSets', 'value': '${runSets}'}], 'fields': [{'name': 'id', 'fields': []}, {'name': 'name', 'fields': []}, {'name': '_defaultColorIndex', 'fields': []}, {'name': 'summaryTable', 'args': [{'name': 'tableKey', 'value': 'batch confusion matrix_table'}], 'fields': []}]}]}}} None
2023-06-28 16:02:04,250 INFO    MainThread:345959 [wandb_run.py:_finish():1890] finishing run robin-radar/IrisTorch_test/o6snnjiv
2023-06-28 16:02:04,250 INFO    MainThread:345959 [jupyter.py:save_history():445] not saving jupyter history
2023-06-28 16:02:04,250 INFO    MainThread:345959 [jupyter.py:save_ipynb():373] not saving jupyter notebook
2023-06-28 16:02:04,250 INFO    MainThread:345959 [wandb_init.py:_jupyter_teardown():435] cleaning up jupyter logic
2023-06-28 16:02:04,250 INFO    MainThread:345959 [wandb_run.py:_atexit_cleanup():2124] got exitcode: 0
2023-06-28 16:02:04,250 INFO    MainThread:345959 [wandb_run.py:_restore():2107] restore
2023-06-28 16:02:04,250 INFO    MainThread:345959 [wandb_run.py:_restore():2113] restore done
2023-06-28 16:23:37,067 INFO    MainThread:345959 [wandb_run.py:_footer_history_summary_info():3467] rendering history
2023-06-28 16:23:37,067 INFO    MainThread:345959 [wandb_run.py:_footer_history_summary_info():3499] rendering summary
2023-06-28 16:23:37,069 INFO    MainThread:345959 [wandb_run.py:_footer_sync_info():3426] logging synced files

(The first line you see above is printed a load of times, so I didn’t copy the whole log)

debug-cli.log has a bunch of messages like this in it:


2023-06-28 16:01:44 WARNING Connection pool is full, discarding connection: api.wandb.ai. Connection pool size: 10
2023-06-28 16:01:44 WARNING Connection pool is full, discarding connection: storage.googleapis.com. Connection pool size: 10

My debug-internal.log tail looks like:

2023-06-28 16:23:35,928 DEBUG   SenderThread:349532 [sender.py:send_request():396] send_request: poll_exit
2023-06-28 16:23:35,928 DEBUG   HandlerThread:349532 [handler.py:handle_request():144] handle_request: sampled_history
2023-06-28 16:23:35,929 DEBUG   SenderThread:349532 [sender.py:send_request():396] send_request: server_info
2023-06-28 16:23:36,065 DEBUG   HandlerThread:349532 [handler.py:handle_request():144] handle_request: shutdown
2023-06-28 16:23:36,065 INFO    HandlerThread:349532 [handler.py:finish():854] shutting down handler
2023-06-28 16:23:36,928 INFO    WriterThread:349532 [datastore.py:close():298] close: /home/tim.kuipers/dev/deeplearning/sandbox/tims_iris_drone/wandb/run-20230628_155311-o6snnjiv/run-o6snnjiv.wandb
2023-06-28 16:23:37,065 INFO    SenderThread:349532 [sender.py:finish():1526] shutting down sender
2023-06-28 16:23:37,065 INFO    SenderThread:349532 [file_pusher.py:finish():159] shutting down file pusher
2023-06-28 16:23:37,065 INFO    SenderThread:349532 [file_pusher.py:join():164] waiting for file pusher

Please note that at around 16:40 the Jupyter cell was still running.

Ubuntu 20.04, Jupyter notebook in VS Code, wandb version 0.15.4

Please advise.

Hi @tim-kuipers, thanks for reporting this! This can be caused by several things. One factor may be wandb-related processes running in the background, so would it be possible to check for those while your run is stuck in the finish process? Another potential cause is logging a lot of data to wandb, so I was wondering whether you see all your metrics and steps in the UI while this is happening, and whether you could share a link to the run so I can have a look at it. Would you also mind sharing the whole debug.log and debug-internal.log files so I can take a deeper look? Feel free to send them via email to luis.bergua@wandb.com.

Please be more precise about the steps you ask me to perform.
What steps would I need to execute to check whether I have related processes running?
I was running two different models in two Jupyter notebooks at the same time, so maybe that’s relevant.

The folder of my run was about 4MB on my computer, so I don’t think that was the cause.

I’m currently debugging some stuff. I’ll pay closer attention during the next real run and report its statistics more precisely when I do. I’ll get back to you soon.

@luis_bergua1 I’ve been running the same training model/script for the past few days without any issues, but since this morning I’ve been having the same problem as Tim.

I’m getting the same messages in debug-cli.log, and nothing is getting logged to W&B:

2023-06-28 16:01:44 WARNING Connection pool is full, discarding connection: api.wandb.ai. Connection pool size: 10
2023-06-28 16:01:44 WARNING Connection pool is full, discarding connection: storage.googleapis.com. Connection pool size: 1

Any tips on how I can proceed with debugging?

I’m replying here in the hope that solving this problem will help other people as well.

Just had the problem again: I’ve now been waiting 10 minutes since W&B reported that it has finished.

My program creates 500 batch confusion matrix artifact outputs. Those files are just 80 kB, though.

There are two wandb-service processes running when I look at my system monitor, perhaps because I am alternating between two notebooks to do experiments. Is this bad practice?
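
For anyone wanting to check this from Python rather than a system monitor, here is a minimal sketch using psutil (matching on the string 'wandb' is my own heuristic, not an official W&B check; ps aux | grep wandb in a terminal gives roughly the same information):

import psutil

# List processes whose name or command line mentions "wandb".
for proc in psutil.process_iter(['pid', 'name', 'cmdline']):
    cmdline = ' '.join(proc.info['cmdline'] or [])
    if 'wandb' in (proc.info['name'] or '') or 'wandb' in cmdline:
        print(proc.info['pid'], proc.info['name'], cmdline)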

During the wait, data is still being uploaded and I can see updates in all metrics in the UI web interface.

Here’s the project:

Hope you can help to solve the issue with all the information I have provided here!

debug-internal.log:

debug.log: (Doesn’t show anything happening between 11:08 and 11:20)

I notice now that I forgot to call wandb.finish() in one of the two notebooks. Might that cause these issues?

Hi @tim-kuipers and @schopra-linum-ai, thanks for providing all this information and for your patience! I took a look at the files you provided:

  • wandb.finish() taking several minutes: this happens because files are still being saved (I checked debug-internal.log). We’re working on improving artifacts with many small files, since uploading them currently takes a long time, and that is what’s affecting you here. It’s also why data keeps updating on the UI: the process is still running. (See the sketch after this list for one way to reduce the number of small files.)
  • From the logs I didn’t see any issues related to the notebooks, so this shouldn’t be a problem.
  • I’d recommend always calling wandb.finish() for your runs to ensure they are marked as finished properly, but this isn’t the cause in this specific case.
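
As an illustration of the first point, one way to reduce the number of small files is to log the confusion matrix once per epoch rather than once per batch. A minimal sketch, where validate, model, num_epochs, and class_names are placeholders for your own code:

import wandb

run = wandb.init(project="mytest", entity='mycomp')

for epoch in range(num_epochs):           # num_epochs is a placeholder
    y_true, y_pred = validate(model)      # placeholder validation loop
    # Logging one confusion matrix per epoch instead of per batch greatly
    # reduces the number of small files the file pusher has to upload.
    run.log({"epoch confusion matrix": wandb.plot.confusion_matrix(
        y_true=y_true, preds=y_pred, class_names=class_names)})

run.finish()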

Hope that this information is useful, please do let me know if something isn’t clear or if you need further assistance here!
