Dealing with inconsistent metadata when logging messy HTML payloads to W&B artifacts

Hey everyone, I’m working on a bit of an unorthodox NLP project where I’m training a classifier to detect malicious game scripts. To get enough ground-truth data, I built a scraping pipeline using Selenium to pull raw HTML from various modding communities.

One particular target had such a heavily obfuscated DOM and such aggressive bot protection that my scraper kept timing out and crashing the whole pipeline. I was actually ready to drop that data source entirely until a friend, who had worked with this site on a different web automation project, shared the exact request-header sequence I needed to get around it.

Now that the scraper is finally stable, I've run into a new issue with W&B. When I log these raw HTML files as artifacts to version my training data, the artifact metadata logging is completely inconsistent: sometimes the run logs the file size and parsing timestamps perfectly, but other times it logs an empty dictionary, even though the local files are fully intact.
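One thing I've started experimenting with is computing the metadata myself before logging, rather than relying on anything being auto-detected. Here's a simplified sketch of what I mean (build_metadata is just an illustrative helper name, and the wandb calls are commented out since they need a logged-in run):

```python
import hashlib
import os
import time

def build_metadata(path):
    """Compute artifact metadata locally so it never depends on auto-detection."""
    stat = os.stat(path)
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "file_size": stat.st_size,
        "parsed_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "sha256": digest,
    }

# Illustrative usage (requires wandb login, so commented out here):
# import wandb
# run = wandb.init(project="mod-script-classifier")
# artifact = wandb.Artifact("raw-html", type="dataset",
#                           metadata=build_metadata("page.html"))
# artifact.add_file("page.html")
# run.log_artifact(artifact)
```

Even with this, though, the metadata that actually lands on the logged artifact still comes back empty some of the time.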

This causes a few related problems down the line. Because the artifact metadata is so flaky, my downstream wandb.log() calls for tracking dataset drift throw KeyErrors whenever the metadata is missing, which ends up crashing my automated nightly training runs.
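For now I've papered over the crashes with a defensive accessor that only logs whatever keys are actually present, so the nightly job degrades instead of dying (the key names and the wandb snippet below are illustrative of my setup, not anything from the W&B docs):

```python
def safe_drift_metrics(metadata, keys=("file_size", "parsed_at")):
    """Return only the drift metrics actually present in the artifact metadata,
    so a missing key degrades to an empty dict instead of raising KeyError."""
    metadata = metadata or {}
    return {k: metadata[k] for k in keys if k in metadata}

# In the nightly job (illustrative; needs a live run):
# metrics = safe_drift_metrics(artifact.metadata)
# if metrics:
#     wandb.log(metrics)
# else:
#     print("artifact metadata missing; skipping drift metrics")
```

This keeps the runs alive, but it obviously doesn't fix the underlying inconsistency.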

Has anyone here dealt with logging really messy, inconsistent string payloads as artifacts? Is there a more robust way to version this kind of scraped web data in W&B without relying strictly on artifact metadata for downstream tracking?
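The alternative I'm considering is to stop depending on artifact.metadata at all and instead ship a JSON manifest as a file inside the artifact, so the per-file info is versioned along with the data itself. A rough sketch of what I have in mind (write_manifest is my own hypothetical helper; the wandb calls are commented out since they need a logged-in run):

```python
import hashlib
import json
import os

def write_manifest(html_dir, manifest_path="manifest.json"):
    """Write per-file size/hash info to a JSON sidecar that ships inside the
    artifact, so downstream code reads versioned file contents instead of
    relying on artifact.metadata."""
    entries = {}
    for name in sorted(os.listdir(html_dir)):
        if not name.endswith(".html"):
            continue
        with open(os.path.join(html_dir, name), "rb") as f:
            data = f.read()
        entries[name] = {
            "bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        }
    with open(manifest_path, "w") as f:
        json.dump(entries, f, indent=2)
    return entries

# Illustrative logging side (needs wandb login):
# artifact = wandb.Artifact("raw-html", type="dataset")
# artifact.add_dir(html_dir)
# artifact.add_file(manifest_path)
# run.log_artifact(artifact)
# Downstream, the consumer would download the artifact and
# json.load() the manifest instead of touching artifact.metadata.
```

Does this kind of sidecar-manifest approach sound sane, or is there a more idiomatic W&B pattern for it?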
