Wandb stops uploading data

Why does wandb’s visualization stop at a specific training step and no longer update, even though training has not stopped? I have already turned down the memory usage, but it still doesn’t work. Also, the logs show that no exceptions were thrown and the project status is still “running”.

Hi @202163002, happy to help. To assist further, I would like to ask the following:

To confirm what you said: the run is still active but the metrics aren’t updating. Did you see any of the metrics update when the run completed?

How many metrics are being logged?

Are you running locally or on the public cloud (wandb.ai)? If it is on the public cloud, would you mind sharing a run link for this?

The metrics were not updated, even though the training process finished. I have twenty-five metrics logged in the program, including multiple partial losses and various validation metrics. I also switched to the “offline” setting after discovering the problem, but the run still doesn’t show the complete training process after synchronizing the data.
Here is a screenshot of the workspace interface for the project. Unfortunately I didn’t make the project public; it runs on Weights & Biases (wandb.ai). Hopefully this can be helpful to you.

Thank you for the details @202163002. The run seems to have failed. May we request the SDK debug logs so we can investigate further?

Thank you very much for your response. The task in the image actually completed, but the logged data stops at around 10,000 steps, and the cut-off point is the same across repeated training runs. I’m sorry, I couldn’t find the SDK debug logs you mentioned, so I’ve provided some of the logs that wandb keeps locally (debug and debug-internal); hopefully they’ll be useful for your diagnosis. In addition, I changed the project from private to public (Weights & Biases).

2024-01-24 15:03:34,429 WARNING MsgRouterThr:2341252 [router.py:message_loop():76] message_loop has been closed
2024-01-24 15:03:30,034 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: poll_exit
2024-01-24 15:03:30,135 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: poll_exit
2024-01-24 15:03:30,136 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: poll_exit
2024-01-24 15:03:30,211 INFO    Thread-20 :2342544 [upload_job.py:push():138] Uploaded file /mnt/data1/home/xjru/Desktop/sbdd-main/logdir/wandb/run-20240124_093535-SE3-joint-fullAtom/files/output.log
2024-01-24 15:03:30,237 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: poll_exit
2024-01-24 15:03:30,237 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: poll_exit
2024-01-24 15:03:30,339 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: poll_exit
2024-01-24 15:03:30,339 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: poll_exit
2024-01-24 15:03:30,411 INFO    Thread-11 (_thread_body):2342544 [sender.py:transition_state():459] send defer: 9
2024-01-24 15:03:30,412 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: defer
2024-01-24 15:03:30,412 INFO    HandlerThread:2342544 [handler.py:handle_request_defer():164] handle defer: 9
2024-01-24 15:03:30,412 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: defer
2024-01-24 15:03:30,413 INFO    SenderThread:2342544 [sender.py:send_request_defer():455] handle sender defer: 9
2024-01-24 15:03:30,441 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: poll_exit
2024-01-24 15:03:31,724 INFO    SenderThread:2342544 [sender.py:transition_state():459] send defer: 10
2024-01-24 15:03:31,724 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: poll_exit
2024-01-24 15:03:31,725 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: defer
2024-01-24 15:03:31,725 INFO    HandlerThread:2342544 [handler.py:handle_request_defer():164] handle defer: 10
2024-01-24 15:03:31,726 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: defer
2024-01-24 15:03:31,726 INFO    SenderThread:2342544 [sender.py:send_request_defer():455] handle sender defer: 10
2024-01-24 15:03:31,727 INFO    SenderThread:2342544 [sender.py:transition_state():459] send defer: 11
2024-01-24 15:03:31,727 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: defer
2024-01-24 15:03:31,728 DEBUG   SenderThread:2342544 [sender.py:send():302] send: final
2024-01-24 15:03:31,728 INFO    HandlerThread:2342544 [handler.py:handle_request_defer():164] handle defer: 11
2024-01-24 15:03:31,728 DEBUG   SenderThread:2342544 [sender.py:send():302] send: footer
2024-01-24 15:03:31,728 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: defer
2024-01-24 15:03:31,728 INFO    SenderThread:2342544 [sender.py:send_request_defer():455] handle sender defer: 11
2024-01-24 15:03:31,827 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: poll_exit
2024-01-24 15:03:31,828 DEBUG   SenderThread:2342544 [sender.py:send_request():316] send_request: poll_exit
2024-01-24 15:03:31,828 INFO    SenderThread:2342544 [file_pusher.py:join():176] waiting for file pusher
2024-01-24 15:03:33,408 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: sampled_history
2024-01-24 15:03:33,418 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: get_summary
2024-01-24 15:03:33,419 INFO    MainThread:2342544 [wandb_run.py:_footer_history_summary_info():3305] rendering history
2024-01-24 15:03:33,424 INFO    MainThread:2342544 [wandb_run.py:_footer_history_summary_info():3337] rendering summary
2024-01-24 15:03:33,425 INFO    MainThread:2342544 [wandb_run.py:_footer_sync_info():3261] logging synced files
2024-01-24 15:03:33,426 DEBUG   HandlerThread:2342544 [handler.py:handle_request():141] handle_request: shutdown
2024-01-24 15:03:33,426 INFO    HandlerThread:2342544 [handler.py:finish():810] shutting down handler
2024-01-24 15:03:33,728 INFO    WriterThread:2342544 [datastore.py:close():279] close: /mnt/data1/home/xjru/Desktop/sbdd-main/logdir/wandb/run-20240124_093535-SE3-joint-fullAtom/run-SE3-joint-fullAtom.wandb
2024-01-24 15:03:34,306 INFO    SenderThread:2342544 [sender.py:finish():1312] shutting down sender
2024-01-24 15:03:34,307 INFO    SenderThread:2342544 [file_pusher.py:finish():171] shutting down file pusher
2024-01-24 15:03:34,308 INFO    SenderThread:2342544 [file_pusher.py:join():176] waiting for file pusher
2024-01-24 15:03:34,429 INFO    MainThread:2342544 [internal.py:handle_exit():80] Internal process exited

Hi @202163002, appreciate the details you provided. May I ask for a code snippet showing how you are logging, and the total number of data points that were supposed to be logged?

I ran the program with the pytorch_lightning framework. Here is the wandb initialization:

logger = pl.loggers.WandbLogger(
    save_dir=args.logdir,
    project='ligand-pocket-ddpm',
    group=args.wandb_params.group,
    name=args.run_name,
    id=args.run_name,
    resume='must' if args.resume is not None else False,
    entity=args.wandb_params.entity,
    mode=args.wandb_params.mode,
)

This function is then used to log data at each training and validation step:

def log_metrics(self, metrics_dict, split, batch_size=None, **kwargs):
    for m, value in metrics_dict.items():
        self.log(f'{m}/{split}', value, batch_size=batch_size, **kwargs)

I set 100 epochs, but currently only 4 epochs are recorded; the actual number of steps that should be displayed is around 300k.
Also, my trainer contains a checkpoint callback that may help with your diagnosis.

checkpoint_callback = pl.callbacks.ModelCheckpoint(
    dirpath=Path(out_dir, 'checkpoints'),
    filename="best-model-epoch={epoch:02d}",
    monitor="loss/val",
    save_top_k=1,
    save_last=True,
    mode="min",
)

Hi @202163002, thank you for these details. We would like the whole folder of debug and debug-internal logs; please attach it here so we can review further.

Sure, thanks for your help. Due to space constraints, I’m sharing the full file contents via OneDrive.
debug
debug-internal

Hi @202163002, thank you for providing the logs. Unfortunately we don’t have a way to access them on OneDrive; could you please upload them to Google Drive instead?

I have uploaded the logs to Google Drive, thanks for your help. Please let me know if there are any more permission issues and I’ll fix them as soon as possible. Thanks again.
log

Appreciate this @202163002 , we’ll review the logs and get back to you again.

Hi @202163002, I’m assisting my colleague Joana further on this. We reviewed your logs and there isn’t anything definitive pointing to why data might fail to log. For the kwargs being passed to self.log, do you use step=? I also see that your run is being marked as failed in the UI, so if a process crashed, the remaining summary data may not have uploaded.

Additionally, I noticed you are using client version 13.1, an outdated version of wandb.

Over the past year and a half we have made significant improvements to the reliability and performance of data logging, including addressing a specific problem with filestreams timing out.

Could you update your client to the latest release (pip install wandb -U) and run a new experiment? Does the issue persist?
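If helpful, a quick check (just a sketch) to confirm the environment that launches training actually picked up the new release:

import wandb

# Should print the newly installed client version after running
# `pip install wandb -U` in the same environment.
print(wandb.__version__)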

Hi @202163002, checking in to see if upgrading to the latest version of the SDK resolved this issue for you?

Unfortunately, I updated the wandb version to 16.3 and the above problem is still not solved. Also, the step parameter is not passed to self.log anywhere in the program; here are the parts of the program that call the log_metrics function:

self.log_metrics(info, 'train', batch_size=len(data['num_lig_atoms']))

self.log_metrics(info, prefix, batch_size=len(data['num_lig_atoms']),
                 sync_dist=True)

self.log_metrics(sampling_results, 'val')

While experimenting, I also ran into another problem: after changing parameters such as the learning rate or batch size, the loss/train curve shown in the wandb workspace does not change, even though the actual loss does change during training. I’m wondering whether this indicates that my wandb setup is incorrect.

I recently tried training in offline mode and then synchronizing the data, but got the following warning, and the resulting graph is still wrong.

Syncing: Weights & Biases … wandb: WARNING No requirements.txt found, not creating job artifact.
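For reference, this is roughly the setup I used, simplified (the sync path below is just an example):

# Same WandbLogger as before, but with offline mode selected.
logger = pl.loggers.WandbLogger(
    save_dir=args.logdir,
    project='ligand-pocket-ddpm',
    mode='offline',
)

# After training, the local run directory is pushed with the CLI, e.g.:
#   wandb sync logdir/wandb/offline-run-<timestamp>-<id>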

Hi @202163002 thank you for the additional details. Unfortunately I have been unable to reproduce this behavior.

To best troubleshoot this could you provide us with a toy example that mimics this?

Here is some insight into our step logic that may be contributing, depending on your experiment setup:

Step logic rules:

  • We keep an internal step that is the largest step we have seen for the run so far; let’s call it main_step. After every successful commit we increment it by 1.
  • Steps have to be monotonically non-decreasing (a requirement imposed by the backend), so providing a step that is smaller than main_step results in that entry being dropped (I don’t believe this is what you are doing here, but I wanted to flag it, since wandb will silently ignore such log calls).
  • log({...}) is the same as doing log({...}, step=main_step, commit=True), which means we are going to log it right away.
  • log({...}, step=step) is the same as doing log({...}, step=step, commit=False): it will be accumulated internally until a call log({...}, step=step1, ...) with step1 > step arrives, and then we will commit the data.

However, if you use log(..., step=step, commit=True), then wandb will compare the reported step with main_step and immediately log the data. Depending on how step is used, this could be the reason some values are missing until the next evaluation period. The sketch below illustrates these rules.
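A minimal standalone sketch (hypothetical project and metric names, run offline so nothing is uploaded) of how these rules play out:

import wandb

# Toy run illustrating the step/commit rules above.
run = wandb.init(project='step-logic-demo', mode='offline')

for step in range(100):
    # Explicit step without commit: accumulated internally until a
    # log call with a larger step arrives.
    wandb.log({'loss/train': 1.0 / (step + 1)}, step=step, commit=False)

    if step % 10 == 0:
        # Same step with commit=True: the accumulated data for this
        # step is written out immediately.
        wandb.log({'lr': 1e-3}, step=step, commit=True)

# A step smaller than the largest step seen so far is dropped by wandb.
wandb.log({'late_metric': 0.0}, step=5)

run.finish()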

Thanks to your help, I think I’ve solved the problem. After I changed the initialization to give each run a unique id, the new run in wandb was able to display the uploaded data properly.

logger = pl.loggers.WandbLogger(
    save_dir=args.logdir,
    project='ligand-pocket-ddpm',
    name=f'{args.run_name}-{current_time}',
    id=f'{args.run_name}-{current_time}',
    resume='must' if args.resume is not None else False,
    entity=args.wandb_params.entity,
    mode=args.wandb_params.mode,
)
I suspect the previous problem arose because repeated runs reused the same id and kept loading data into the first run. Thank you very much for your help.
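For completeness, current_time is just a timestamp string generated at launch (the exact format doesn’t matter), along the lines of:

from datetime import datetime

# Hypothetical timestamp used to make the run name and id unique per launch.
current_time = datetime.now().strftime('%Y%m%d_%H%M%S')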


Thank you very much for letting me know the issue is now resolved for you. I will mark this resolved.