wandb.Table does not update properly

I am trying to upload examples of my LLM during training to understand how well it is doing.

Here is an overview of what I am doing:

import wandb

data = [] # create fake data per epoch
for i in range(5):
    data.append([[f"in1-{i}", f"out1-{i}", f"{i}"], [f"in2-{i}", f"out2-{i}", f"{i}"]])

wandb.init(project="Learn-Table", job_type="train", config={"seed": 1})

table = wandb.Table(columns=["Input", "Output", "Index"])

for i, d in enumerate(data):
    for x in d:
        table.add_data(*x)
    wandb.log({"table": table})

wandb.finish()

However, this does not work. In the local wandb logging folder, the file run/files/media/table/table_0_.....table.json contains only the rows from the first epoch. So the table does not seem to be re-logged when data is added:

{"columns": ["Input", "Output", "Index"], "data": [["in1-0", "out1-0", "0"], ["in2-0", "out2-0", "0"]]}

Online, the same table sometimes shows up twice, sometimes three times. It contains only the data from the first table.

Alternative Idea

Recreate a new table every time. This is NOT what I want. I want one table that gets appended to every epoch.

import wandb

data = [] # create fake data per epoch
for i in range(5):
    data.append([[f"in1-{i}", f"out1-{i}", f"{i}"], [f"in2-{i}", f"out2-{i}", f"{i}"]])

wandb.init(project="Learn-Table", job_type="train", config={"seed": 1})

for i, d in enumerate(data):
    table = wandb.Table(data=d, columns=["Input", "Output", "Index"])
    wandb.log({"table": table})

wandb.finish()

This doesn’t work either. Locally, I now see 5 different tables in the logs. Online, I again see the same table two or three times. This time, however, it shows the last entry.

Conclusion

This all seems odd. Why is the same table shown multiple times? Why are tables in local logs that are not online?

Ideally, I would like to have one table that gets appended regularly. How do I do that?

Hi @cemde , thank you for reaching out with your questions and example.

Whenever wandb.log({"table": table}) is called, a new version of the table is logged, and this version only contains the data that was passed to the table when it was initialised and logged.

If you want to log the table containing all the data while the training is running, you will have to keep storing all the data separately and log all of it every time the table is logged, for example:

import wandb

data = [] # create an empty list to store fake data
wandb.init(project="Learn-Table-test", job_type="train", config={"seed": 1})


for i in range(5):
    # At each step append new fake data, then log the table with all the data
    data.extend([[f"in1-{i}", f"out1-{i}", f"{i}"],[f"in2-{i}", f"out2-{i}", f"{i}"]])
    table = wandb.Table(columns=["Input", "Output", "Index"], data=data)
    wandb.log({"table": table})

wandb.finish()
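The same pattern can be wrapped in a small helper. This is only a minimal sketch: CumulativeTableLogger and its make_table / log parameters are hypothetical names introduced for illustration, not part of the wandb API. The table constructor and the log function are passed in, so the helper does not import wandb itself. It assumes make_table accepts columns= and data= keyword arguments, as wandb.Table does.

```python
from typing import Any, Callable, Dict, List, Sequence


class CumulativeTableLogger:
    """Accumulate rows in a plain list and log a brand-new table on every call."""

    def __init__(
        self,
        columns: Sequence[str],
        make_table: Callable[..., Any],
        log: Callable[[Dict[str, Any]], Any],
        key: str = "table",
    ) -> None:
        self.columns = list(columns)
        self.make_table = make_table  # e.g. wandb.Table
        self.log = log                # e.g. wandb.log or run.log
        self.key = key
        self.rows: List[List[Any]] = []

    def log_rows(self, new_rows: Sequence[Sequence[Any]]) -> Any:
        """Append new_rows to the history and log a fresh table holding every row so far."""
        self.rows.extend(list(row) for row in new_rows)
        # Copy the rows so that later appends never touch a table that was already logged.
        table = self.make_table(
            columns=list(self.columns),
            data=[list(row) for row in self.rows],
        )
        self.log({self.key: table})
        return table


# Hypothetical usage with W&B (requires `import wandb`):
#   wandb.init(project="Learn-Table-test", job_type="train")
#   logger = CumulativeTableLogger(["Input", "Output", "Index"], wandb.Table, wandb.log)
#   for i in range(5):
#       logger.log_rows([[f"in1-{i}", f"out1-{i}", f"{i}"], [f"in2-{i}", f"out2-{i}", f"{i}"]])
#   wandb.finish()
```

Each call builds a new table object from the full row history, so every logged version contains all rows collected so far instead of relying on add_data after the first log call.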

Regarding the table being duplicated and showing the same data:

  • When calling wandb.log({"table": table}), a Query Panel with the query runs.summary["table"] is created in the workspace. Calling it again for the same table should not generate a new panel (I can see it does with the second code snippet you shared, so I will have to look into it).
  • This type of panel displays the latest version of the Table, which is stored as an Artifact. If you go to the Artifacts section, you will see that each run has a run-<id>-table artifact collection with one version per .log call. In your case, each version only contains two rows. Each runs.summary["table"] panel shows the latest version only.

This Workspace is an example of how the table would look with the code snippet I shared above:

and you can also see here how the table is stored as artifact with different versions logged through the experiment:

Please let me know if you have any further questions on this.

hi @fmamberti-wandb thank you for the detailed response.

Whenever wandb.log({"table": table}) is called, a new version of the table is logged, and this version only contains the data that was passed to the table when it was initialised and logged.

Thank you for clarifying this.

Do I understand it correctly that once run.log is called on an object of class wandb.Table, it will not be logged again, even though its content changes? For example,

# wrong
table = wandb.Table()
for i in range(5):
    table.add_data(...)
    run.log({"table": table})

is not expected to work. But,

# right
data = []
for i in range(5):
    data.append(...)
    table = wandb.Table(data=data)  # pass rows by keyword; the first positional argument is columns
    run.log({"table": table})

is expected to work because the table object is new?!

Your example

With your code example, I get exactly what I expect in the local log directory and in the artefacts section of my run on wandb.ai (5 artefacts with an increasing number of rows). However, the main W&B Workspace for this particular run still shows me the exact same table 6 times. Each table contains the exact same content, which, however, is not up to date with the latest artefact. It only shows the first 8 rows instead of all 10.

This surely is a bug, isn’t it?

Hi @cemde , happy to be of help!

the main W&B Workspace for this particular run still shows me the exact same table 6 times

This may be because the additional panels were incorrectly created during previous Runs and have been added to the workspace. Could you try to delete the duplicate ones and leave just one of them to see if they get duplicated again once you log new runs?

Each table contains the exact same content, which, however, is not up to date with the latest artefact. It only shows the first 8 rows instead of all 10

Do you have a URL you can share for this? Could it be that only the first few rows are shown and you have to click next_page at the bottom of the table to see the remaining two?
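To rule out a logging problem, the row count of each locally stored table file can also be checked directly. The following is a rough sketch, assuming the files follow the {"columns": [...], "data": [...]} layout shown earlier in this thread and end in .table.json; count_table_rows is a hypothetical helper name, and the directory in the usage comment is an assumption about where the run stores its media files.

```python
import json
from pathlib import Path
from typing import Dict


def count_table_rows(table_dir: str) -> Dict[str, int]:
    """Map each *.table.json file in table_dir to the number of rows it contains."""
    counts: Dict[str, int] = {}
    for path in sorted(Path(table_dir).glob("*.table.json")):
        with path.open(encoding="utf-8") as handle:
            payload = json.load(handle)
        # Each file is expected to look like {"columns": [...], "data": [[...], ...]}.
        counts[path.name] = len(payload.get("data", []))
    return counts


# Hypothetical usage (the path is an assumption, adjust it to your run directory):
#   for name, n_rows in count_table_rows("wandb/latest-run/files/media/table").items():
#       print(name, n_rows)
```

If the last file holds all 10 rows, the data was logged correctly and the difference lies in how the workspace panel displays it.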

Regarding the wrong/right examples you shared: that is correct. Once the table has been logged, it won’t be logged again if you add data to it, whereas if you create a new table object containing all the data, that table will be logged.