Wandb stops uploading data

Why does wandb’s visualization stop at a specific training step and never update, even though training has not stopped? Even after I reduced memory usage, the problem persists. The logs show that no exceptions were thrown, and the project status is still “running”.

Hi @202163002, happy to help. To assist further, I’d like to ask the following:

Just to confirm what you described: the run is still active but the metrics aren’t updating. Did you see any of the metrics update once the run completed?

How many metrics are being logged?

Are you running locally or on the public cloud (wandb.ai)? If it’s the public cloud, would you mind sharing a link to the run?

The metrics were not updated even after the training process finished. I log twenty-five metrics in the program, including multiple partial losses and various validation metrics. I also tried the “offline” setting after discovering the problem, but the complete training history still doesn’t appear after synchronizing the data.
Here is a screenshot of the workspace interface for the project. Unfortunately I can’t share a public link: the run is hosted at Weights & Biases (wandb.ai), but the project is private. I hope the screenshot is still helpful.
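What I mean by the “offline” setting is roughly the following pattern (a minimal sketch, not my exact code; in my script the mode is passed through the Lightning logger):

import wandb

# record everything locally; nothing is streamed to the server while training runs
run = wandb.init(project='ligand-pocket-ddpm', mode='offline')
run.log({'loss/train': 0.5})
run.finish()

Afterwards the recorded directory under wandb/offline-run-* was uploaded with the wandb sync command.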

Thank you for the details @202163002, the run seems to have failed. May we request the SDK debug logs so we can investigate further?

Thank you very much for your response. The task shown in the image is actually completed, but the logging stops at around 10,000 steps, and the same value comes up on repeated training runs. I’m sorry, I couldn’t find the SDK logs you mentioned, so I’ve attached some of the logs that wandb keeps locally (debug and debug-internal); hopefully they’ll be useful for your assessment. In addition, I changed the project from private to public (Weights & Biases).

2024-01-24 15:03:34,429 WARNING MsgRouterThr:2341252 [router.py:message_loop():76] message_loop has been closed
2024-01-24 15:03:30,034 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: poll_exit
2024-01-24 15:03:30,135 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: poll_exit
2024-01-24 15:03:30,136 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: poll_exit
2024-01-24 15:03:30,211 INFO    Thread-20 :2342544 [upload_job.py:push():138] Uploaded file /mnt/data1/home/xjru/Desktop/sbdd-main/logdir/wandb/run-20240124_093535-SE3-joint-fullAtom/files/output.log
2024-01-24 15:03:30,237 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: poll_exit
2024-01-24 15:03:30,237 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: poll_exit
2024-01-24 15:03:30,339 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: poll_exit
2024-01-24 15:03:30,339 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: poll_exit
2024-01-24 15:03:30,411 INFO    Thread-11 (_thread_body):2342544 [sender.py:transition_state():459] send defer: 9
2024-01-24 15:03:30,412 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: defer
2024-01-24 15:03:30,412 INFO    HandlerThread:2342544 [handler.py:handle_request_defer():164] handle defer: 9
2024-01-24 15:03:30,412 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: defer
2024-01-24 15:03:30,413 INFO    SenderThread:2342544 [sender.py:send_request_defer():455] handle sender defer: 9
2024-01-24 15:03:30,441 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: poll_exit
2024-01-24 15:03:31,724 INFO    SenderThread:2342544 [sender.py:transition_state():459] send defer: 10
2024-01-24 15:03:31,724 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: poll_exit
2024-01-24 15:03:31,725 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: defer
2024-01-24 15:03:31,725 INFO    HandlerThread:2342544 [handler.py:handle_request_defer():164] handle defer: 10
2024-01-24 15:03:31,726 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: defer
2024-01-24 15:03:31,726 INFO    SenderThread:2342544 [sender.py:send_request_defer():455] handle sender defer: 10
2024-01-24 15:03:31,727 INFO    SenderThread:2342544 [sender.py:transition_state():459] send defer: 11
2024-01-24 15:03:31,727 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: defer
2024-01-24 15:03:31,728 DEBUG   SenderThread:2342544 [sender.py:send():302] send: final
2024-01-24 15:03:31,728 INFO    HandlerThread:2342544 [handler.py:handle_request_defer():164] handle defer: 11
2024-01-24 15:03:31,728 DEBUG   SenderThread:2342544 [sender.py:send():302] send: footer
2024-01-24 15:03:31,728 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: defer
2024-01-24 15:03:31,728 INFO    SenderThread:2342544 [sender.py:send_request_defer():455] handle sender defer: 11
2024-01-24 15:03:31,827 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: poll_exit
2024-01-24 15:03:31,828 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: poll_exit
2024-01-24 15:03:31,828 INFO    SenderThread:2342544 [file_pusher.py:join():176] waiting for file pusher
2024-01-24 15:03:33,408 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: sampled_history
2024-01-24 15:03:33,418 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: get_summary
2024-01-24 15:03:33,419 INFO    MainThread:2342544 [wandb_run.py:_footer_history_summary_info():3305] rendering history
2024-01-24 15:03:33,424 INFO    MainThread:2342544 [wandb_run.py:_footer_history_summary_info():3337] rendering summary
2024-01-24 15:03:33,425 INFO    MainThread:2342544 [wandb_run.py:_footer_sync_info():3261] logging synced files
2024-01-24 15:03:33,426 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: shutdown
2024-01-24 15:03:33,426 INFO    HandlerThread:2342544 [handler.py:finish():810] shutting down handler
2024-01-24 15:03:33,728 INFO    WriterThread:2342544 [datastore.py:close():279] close: /mnt/data1/home/xjru/Desktop/sbdd-main/logdir/wandb/run-20240124_093535-SE3-joint-fullAtom/run-SE3-joint-fullAtom.wandb
2024-01-24 15:03:34,306 INFO    SenderThread:2342544 [sender.py:finish():1312] shutting down sender
2024-01-24 15:03:34,307 INFO    SenderThread:2342544 [file_pusher.py:finish():171] shutting down file pusher
2024-01-24 15:03:34,308 INFO    SenderThread:2342544 [file_pusher.py:join():176] waiting for file pusher
2024-01-24 15:03:34,429 INFO    MainThread:2342544 [internal.py:handle_exit():80] Internal process exited
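(For reference, the excerpt above comes from the log files wandb keeps next to the run. Assuming the default directory layout, they can be located with a snippet like this; the paths below are from my setup and may differ:)

from pathlib import Path

# default layout: <save_dir>/wandb/debug.log and debug-internal.log track the latest run,
# and each run directory keeps its own copies under logs/
wandb_dir = Path('/mnt/data1/home/xjru/Desktop/sbdd-main/logdir/wandb')
run_dir = wandb_dir / 'run-20240124_093535-SE3-joint-fullAtom'
for f in [wandb_dir / 'debug.log',
          wandb_dir / 'debug-internal.log',
          run_dir / 'logs' / 'debug.log',
          run_dir / 'logs' / 'debug-internal.log']:
    print(f, '->', 'found' if f.exists() else 'missing')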

Hi @202163002, appreciate the details you provided. May I ask for a code snippet showing how you are logging, and the total number of data points that were supposed to be logged?

I run the program with the PyTorch Lightning framework; this is the wandb initialization part:

logger = pl.loggers.WandbLogger(
        save_dir=args.logdir,
        project='ligand-pocket-ddpm',
        group=args.wandb_params.group,
        name=args.run_name,
        id=args.run_name,
        resume='must' if args.resume is not None else False,
        entity=args.wandb_params.entity,
        mode=args.wandb_params.mode,
    )

This function is then used to log data at each training and validation step:

def log_metrics(self, metrics_dict, split, batch_size=None, **kwargs):
        # log every metric under a '<name>/<split>' key, e.g. 'loss/train' or 'loss/val'
        for m, value in metrics_dict.items():
            self.log(f'{m}/{split}', value, batch_size=batch_size, **kwargs)

I have set 100 epochs, but currently only 4 epochs are recorded; the total number of logged steps should be around 300k.
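As a rough sanity check on that figure (assuming about 3,000 optimizer steps per epoch, which is what ~300k points over 100 epochs implies):

epochs = 100
steps_per_epoch = 3000            # approximate: dataset size / batch size
expected_points = epochs * steps_per_epoch
print(expected_points)            # ~300,000 logged training steps expected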
Also, my trainer contains a checkpoint callback that may help with your assessment:

checkpoint_callback = pl.callbacks.ModelCheckpoint(
        # keep the single best checkpoint by validation loss, plus the most recent one
        dirpath=Path(out_dir, 'checkpoints'),
        filename="best-model-epoch={epoch:02d}",
        monitor="loss/val",
        save_top_k=1,
        save_last=True,
        mode="min",
    )
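For completeness, the logger and callback are then handed to the Trainer roughly like this (a sketch of my setup; the exact Trainer arguments, model, and datamodule are abbreviated here):

trainer = pl.Trainer(
        max_epochs=100,
        logger=logger,
        callbacks=[checkpoint_callback],
    )
trainer.fit(model, datamodule=datamodule)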

Hi @202163002, thank you for these details. We would like the whole folder containing the debug and debug-internal logs; please attach it here so we can review further.

Sure, thanks for your help. Due to space constraints, I’m sharing the full file contents via OneDrive.
debug
debug-internal

Hi @202163002, thank you for providing the logs, though we don’t have a way to access them on OneDrive. Would it be possible to upload them to Google Drive, please?

I have uploaded the logs to Google Drive, thanks for your help. Please let me know if there are any remaining permission issues and I’ll fix them as soon as possible. Thanks again.
log

Appreciate this @202163002, we’ll review the logs and get back to you.

Hi @202163002, I’m assisting my colleague Joana further on this. We reviewed your logs and there isn’t anything definitive pointing to why data might fail to log. For the kwargs being passed to self.log, do you pass step=? I also see that your run is being marked as failed in the UI, so if a process crashed, the remaining summary data may not have uploaded.
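The reason I ask about step= is that wandb expects the step passed to log() to only ever increase; if a smaller step value comes in later, that row is dropped with a warning. A minimal illustration (not your code):

import wandb

run = wandb.init(project="step-demo", mode="offline")  # offline so this sketch runs anywhere
run.log({"loss": 0.9}, step=100)
run.log({"loss": 0.8}, step=50)    # step went backwards: wandb warns and drops this row
run.log({"loss": 0.7}, step=101)   # logged normally again
run.finish()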

Additionally, I noticed you are using client version 13.1, an outdated version of wandb.

Over the past year and a half we have made significant improvements to the reliability and performance of data logging, including a fix for a specific problem with filestreams timing out.

Could you update your client to the latest release (pip install wandb -U) and run a new experiment? Does the issue persist?
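For example, after upgrading you can confirm which version your training environment actually picks up with a quick check:

import wandb
print(wandb.__version__)   # should show the latest release after the upgrade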

Hi @202163002, checking in to see if upgrading to the latest version of the SDK resolved this issue for you.