This happens to multiple users on other projects as well, not just me. If you look at the graphs here, as an example (Weights & Biases), you can see that if you move your mouse to the right side of each graph, they show different steps, so they’re not in sync with each other. This makes it really hard to figure out what’s happening with a run, and leaves some graphs out of date relative to others. Occasionally the screen will refresh and some graphs will become more up to date while others fall behind; it seems random.
Is there something I can do about this?
Hi @kaiyotech ,
I’m not sure I completely understand the issue here, could you help me understand a little bit further? A screen recording might be a little more helpful to understand this.
Do you, by any chance, log all these metrics in different wandb.log calls? Every time you call this function the step counter is updated. If you were to call a single wandb.log(...) with all your metrics, it would solve your issue.
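One way to picture the single-call approach is the sketch below. It assumes a hypothetical `MetricBuffer` helper (not part of wandb) that collects values and sends them in one call. The `log_fn` argument stands in for `wandb.log` so the snippet runs without a W&B account, and `stat/average_speed` is a made-up metric name used only for illustration.

```python
from typing import Any, Callable, Dict, Optional


class MetricBuffer:
    """Hypothetical helper: collect metrics, then send them in one log call."""

    def __init__(self, log_fn: Callable[..., None]) -> None:
        self._log_fn = log_fn  # e.g. wandb.log in a real run
        self._pending: Dict[str, Any] = {}

    def add(self, name: str, value: Any) -> None:
        # Nothing is sent yet; the value just waits in the buffer.
        self._pending[name] = value

    def flush(self, step: Optional[int] = None) -> None:
        # One call carries every metric, so they all land on the same step.
        if not self._pending:
            return
        if step is None:
            self._log_fn(dict(self._pending))
        else:
            self._log_fn(dict(self._pending), step=step)
        self._pending.clear()


# Stand-in for wandb.log that records what would have been sent.
sent = []


def fake_log(data, step=None):
    sent.append((step, data))


buffer = MetricBuffer(fake_log)
buffer.add("stat/average_boost", 41.5)
buffer.add("stat/average_speed", 1210.0)  # made-up metric name
buffer.flush(step=9790)
```

In a real run you would pass `wandb.log` as `log_fn` after `wandb.init(...)`; the values above are placeholders.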
Here’s a screenshot that shows it. The only thing a video would show is that sometimes the screen refreshes and these numbers change, but they’re often not all in sync (sometimes they are though, especially early on in a run when steps is lower). You can see that 9790 is my actual latest step, but all my graphs are various amounts behind that.
The code is done with multiple logger.log() calls but all with commit=False until the last one so it commits all at once.
Am I correct in assuming that these charts do eventually update to the correct step? There are a few things that can cause a problem here:
- Each call to wandb.log counts as one step, so the graphs will not show the exact same step value unless the metrics are logged together as:
stat/average_boost : <VALUE>,
- Additionally, I see that you are using commit=False. With commit=False the step value is not incremented when log is called, so the step counter has to be managed manually; otherwise there is a chance that some of your data is being overwritten. This might also be making your graphs look like they are out of sync.
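The step behaviour described in the two points above can be pictured with a toy model. This is only a simplified sketch of the counter as described in this reply, not wandb's implementation; `StepCounterModel` and the `stat/average_speed` metric are hypothetical names.

```python
from typing import Any, Dict


class StepCounterModel:
    """Toy model of the step counter described above; not wandb code."""

    def __init__(self) -> None:
        self.step = 0
        self.history: Dict[int, Dict[str, Any]] = {}

    def log(self, data: Dict[str, Any], commit: bool = True) -> None:
        # Attach the data to the current step.
        self.history.setdefault(self.step, {}).update(data)
        # Only a committing call moves the counter forward.
        if commit:
            self.step += 1


# Two committing calls: the metrics end up on different steps.
separate = StepCounterModel()
separate.log({"stat/average_boost": 41.5})
separate.log({"stat/average_speed": 1210.0})

# commit=False on the first call: both metrics share one step.
together = StepCounterModel()
together.log({"stat/average_boost": 41.5}, commit=False)
together.log({"stat/average_speed": 1210.0})

print(separate.history)
print(together.history)
```

In this model, metrics only line up on the same step when they are sent in one call, or when every call before the last one uses commit=False.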
I think eventually they’ll sync up maybe? I’m honestly not sure. I just went and looked and they’re still out of sync by about 10 steps or so. They do all keep moving, even if some are behind, if that makes sense.
All calls that have commit=False have the step step=iteration, and then the final one without commit doesn’t have the step either, so it’s automatic.
There is a good chance that your code is working correctly then, and just has some lag. This could be for multiple reasons:
- If you are logging a lot of metrics in quick succession, we usually store them in a queue and bundle them up before sending them over to our servers. This is to reduce the number of inbound network requests to the server and ensure server health.
- Network lag usually plays a role in how metrics are shown.
- Since you are using commit=False, this would also add some delay to your metrics.
I would suggest ensuring that all your metrics are being logged by printing them to the console as well, since there is a good chance there is no bug here.
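A rough sketch of that console check, assuming a hypothetical `log_and_print` wrapper (not a wandb function). The `log_fn` parameter stands in for `wandb.log`, so the example runs on its own, and the metric values are placeholders.

```python
from typing import Any, Callable, Dict


def log_and_print(
    metrics: Dict[str, Any], step: int, log_fn: Callable[..., None]
) -> str:
    """Print the metrics for a step, then pass the same dict to the logger."""
    # Sort by name so the console output is easy to compare between steps.
    parts = " ".join(f"{name}={value}" for name, value in sorted(metrics.items()))
    line = f"step={step} {parts}"
    print(line)
    # The logger receives exactly what was printed, with the same step.
    log_fn(metrics, step=step)
    return line


# Stand-in for wandb.log that records what would have been sent.
calls = []


def fake_log(data, step=None):
    calls.append((step, data))


line = log_and_print({"stat/average_boost": 41.5}, 9790, fake_log)
```

Comparing the printed step with the step shown on each chart should tell you whether the charts are only lagging or whether data is actually missing.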