Additional System Metrics From e.g., `dstat`

Hi W&B Community,

Is it possible to get additional live system metrics such as network read/write rates, disk read/write rates, virtual memory major/minor page faults, filesystem inodes, and system context switches?

Basically, most of the metrics that dstat provides with the following flags:

  • --disk
  • --mem (memory)
  • --net (network)
  • --sys (system)
  • --fs (filesystem)
  • --vm (virtual memory)

I’m deep into pipeline profiling and have found that these metrics help a lot when looking for performance tuning opportunities. More generally, allowing users to add their own entries to the system metric log would keep everything related to the actual ML in one log and everything related to system metrics in another.

I saw that the current documentation suggests using this script - github(.)com/nicolargo/nvidia-ml-py3/blob/master/ - to get the GPU metrics; however, I did not find the system metrics there.

The first workaround for me would be to run dstat in parallel with the process, save the profiling log, download your system metrics, and join the two on _timestamp. This, however, would lose your wonderful automatic visualization.
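A minimal sketch of that join, assuming both logs have already been parsed into lists of dicts and that the dstat log includes an epoch-seconds column. The `epoch` key, the metric names, and the sample values are made up for illustration; only `_timestamp` comes from the description above.

```python
from bisect import bisect_left


def join_nearest(left, right, left_key="_timestamp", right_key="epoch", tolerance_s=1.0):
    """For each row in `left`, attach the `right` row with the closest timestamp.

    Both inputs are lists of dicts; `right` must be sorted by `right_key`.
    Rows without a match within `tolerance_s` seconds are kept unjoined.
    """
    right_times = [row[right_key] for row in right]
    joined = []
    for row in left:
        t = row[left_key]
        i = bisect_left(right_times, t)
        # Candidates: the neighbour just before and just after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(right)]
        best = min(candidates, key=lambda j: abs(right_times[j] - t), default=None)
        merged = dict(row)
        if best is not None and abs(right_times[best] - t) <= tolerance_s:
            merged.update({k: v for k, v in right[best].items() if k != right_key})
        joined.append(merged)
    return joined


# Hypothetical sample rows standing in for the two downloaded logs.
wandb_rows = [
    {"_timestamp": 100.2, "cpu_percent": 41.0},
    {"_timestamp": 102.1, "cpu_percent": 43.5},
    {"_timestamp": 110.0, "cpu_percent": 40.0},
]
dstat_rows = [
    {"epoch": 100.0, "net_recv": 1200},
    {"epoch": 101.0, "net_recv": 1500},
    {"epoch": 102.0, "net_recv": 900},
]
joined_rows = join_nearest(wandb_rows, dstat_rows)
```

The nearest-neighbour match with a tolerance is a design choice for two logs sampled at different rates; an exact-equality join would rarely find matching timestamps.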

The other solution would be to use a system monitoring library and add the values manually via wandb.log({'my_metric': x}) to the “ML” log. This would show the metric in your visualization, but not in the right place, and it could not easily be compared with the other system metrics. I do not know how well this would work in practice, as entries would ideally need to be added every few seconds from an asynchronous thread running outside the training loop. The solution proposed here seems like it could work if I use “timestamps” as the X-axis? It still does not seem like a clean solution.
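A rough sketch of such an asynchronous thread, assuming psutil as the monitoring library (my choice for illustration, not something W&B prescribes). `MetricSampler`, `psutil_counters`, `to_rates`, and the metric names are hypothetical; the logging function is passed in, so `wandb.log` is only one possible target.

```python
import threading
import time


def psutil_counters():
    """Read cumulative network/disk byte counters via psutil (assumed installed)."""
    import psutil  # imported lazily so the sampler itself has no hard dependency

    net = psutil.net_io_counters()
    disk = psutil.disk_io_counters()  # may be None on systems without disks
    return {
        "net_recv_bytes": net.bytes_recv,
        "net_sent_bytes": net.bytes_sent,
        "disk_read_bytes": disk.read_bytes,
        "disk_write_bytes": disk.write_bytes,
    }


def to_rates(prev, curr, elapsed_s):
    """Turn two cumulative counter snapshots into MB/s rates."""
    elapsed_s = max(elapsed_s, 1e-9)  # guard against a zero-length interval
    return {
        name.replace("_bytes", "_mb_s"): (curr[name] - prev[name]) / elapsed_s / 1e6
        for name in curr
    }


class MetricSampler(threading.Thread):
    """Samples counters every `interval_s` seconds and hands rates to `log_fn`."""

    def __init__(self, log_fn, read_fn=psutil_counters, interval_s=1.0):
        super().__init__(daemon=True)
        self.log_fn = log_fn
        self.read_fn = read_fn
        self.interval_s = interval_s
        self._stop_event = threading.Event()

    def run(self):
        prev, prev_t = self.read_fn(), time.time()
        # Event.wait doubles as an interruptible sleep.
        while not self._stop_event.wait(self.interval_s):
            curr, curr_t = self.read_fn(), time.time()
            rates = to_rates(prev, curr, curr_t - prev_t)
            rates["timestamp"] = curr_t  # candidate custom x-axis
            self.log_fn(rates)
            prev, prev_t = curr, curr_t

    def stop(self):
        self._stop_event.set()
        self.join()
```

Usage would look roughly like `sampler = MetricSampler(log_fn=wandb.log); sampler.start()` before training and `sampler.stop()` afterwards. How these calls interleave with the step counter of `wandb.log` calls made inside the training loop is something I have not verified.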

What are your thoughts on this proposed feature? I’m very much a novice regarding your service, so I might not know the ins and outs; maybe I have overlooked some trivial solution.

Hi @cirquit!

Thank you for your feature request! I’m curious about your use case here - I definitely understand how network read/writes and disk read/writes could help, but I’d like to know how page faults, inodes and context switches fit into your workflow.

They sound a little too low level for a typical ML workflow, and I would love to hear how you use them. This will help us create a feature that better fits your needs.


Hi Ramit,

Thanks for the fast response! My use case is actually part of my research for my PhD, where I was lucky enough to publish a paper about preprocessing pipeline optimizations at SIGMOD '22. You’re very welcome to read it, but I will summarize some insights throughout this answer as well.

I’m generally focused on optimizing DL training pipelines, as we have seen multiple instances of underutilized hardware, be it due to inefficient user-level code, inefficient placement of jobs, or naive assumptions about performance that often do not hold (such as preprocessing being trivially parallelizable).

To directly answer your question, page faults and inodes were helpful to me to double-check how application-level caching was implemented in TF and to be maximally sure where data is read from, as debugging remote storage performance is not always deterministic. It is basically just a helpful tool to be sure of the current status of the pipeline.

Tracking context switches was the only way to get a glimpse as to why tiny sample sizes (<0.01MB) were deserialized very slowly with the TFRecord format, and why the speedup going from 1 to 8 threads was missing entirely.

The main reason why I think it might be helpful to include these system metrics is for debugging and performance tuning. My first impression of W&B was a one-stop solution for tracking experiments and iterating on their performance, be it an ML model’s accuracy or the training time. Since many people rent accelerators to run ML experiments, I think it would be very nice to have a way to allocate them efficiently by knowing how well specific system resources are used, or what the current bottleneck is (e.g., renting a better CPU node vs. one with more memory if a deserialization bottleneck removes the effects of caching).

The potential of adding these metrics would allow for a better comparison between heterogeneous hardware or even different DL software stacks, e.g., webdataset+Torch, datasets+Torch, DALI+TF, and their mixes. Right now it is not very easy to estimate if, for example, decoding JPEGs on the GPU is actually favorable for actual training throughput and accuracy as you reduce the GPU memory.

With additional system metrics, this becomes an easier task, especially if you can hand off these logs from the ML “training-person” to the ML “systems-optimizer-we-should-spend-less-money-on-resources-person” :slight_smile: Automatically analyzing these logs also seems like quite a good business opportunity :innocent:

But I would also be very glad if you would add at least the disk/network read/writes for the time being.
Creating a system to add specific system metrics to your profiling runs is not an easy task to do, especially if somebody like me comes around and might demand even more esoteric things like TCP or UDP stats.

I figure this might slightly shift your platform’s focus from ML practitioners alone to ML-SysOps people as well, but I very much hope that you are interested in that!

And sorry for this wall of text!


Hi Alex,

Thank you for that very detailed and insightful response regarding your request! I definitely see now why this could be useful for optimization. I had never considered how the rate of context switches could affect the performance of an ML pipeline.

I’ll definitely go ahead and make a feature request for this, and I’ll keep you updated on the status of this request.


Thank you a lot, Ramit!

Maybe a short follow-up from my side - I decided to implement the system metric tracking manually with wandb.log() for now, as the essential condition for my profiling is to use the system metrics in a sweep and optimize for different system-level targets. The default frequency at which system metrics are recorded right now is unfortunately not high enough for quick experiments that run < 1min (e.g., an epoch on CIFAR, which encompasses data loading, data preprocessing, inference and backpropagation).

One easy example would be having batch_size and network_down_mb_s and being able to compare the increase of the network bandwidth with the batch size.
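To make that comparison concrete, a sampled metric could be reduced to one value per run and logged next to batch_size. This is only a sketch of the idea: `summarize_samples`, the `_mean`/`_peak` suffixes, and the sample values are hypothetical.

```python
from statistics import mean


def summarize_samples(samples, metric):
    """Reduce per-second samples of one metric to per-run aggregates."""
    values = [s[metric] for s in samples if metric in s]
    if not values:
        return {}
    return {f"{metric}_mean": mean(values), f"{metric}_peak": max(values)}


# Made-up samples, as they might be collected during one short run.
samples = [
    {"network_down_mb_s": 10.0},
    {"network_down_mb_s": 30.0},
    {"network_down_mb_s": 20.0},
]
run_summary = summarize_samples(samples, "network_down_mb_s")
```

The resulting dict could then be logged once per run, for example with wandb.log(run_summary), so that each run contributes a single point per batch_size.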

While my high-frequency needs might be too specific, having the system metrics available to pick in the hyperparameter parallel coordinates chart would be awesome in the long run.
