Hi W&B Community,
Is there a way to get additional live system metrics, such as network read/write rates, disk read/write rates, virtual memory major/minor page faults, filesystem inodes, and system context switches?
Basically, most of the metrics that dstat provides with the following flags:
- --disk
- --mem (memory)
- --net (network)
- --sys (system)
- --fs (filesystem)
- --vm (virtual memory)
I’m deep into pipeline profiling and have found that these metrics help a lot when looking for performance tuning opportunities. More generally, being able to add custom entries to the system metrics log would be helpful, so that everything related to the actual ML is in one log and everything related to system metrics is in another.
I saw that the current documentation suggests that you use this script - github(.)com/nicolargo/nvidia-ml-py3/blob/master/pynvml.py - to get the GPU metrics. However, I did not find the system metrics listed above there.
The first workaround for me would be to run dstat in parallel to the process, save the profiling log, download your system metrics, and join the two over the _timestamp. This, however, would negate your wonderful automatic visualization.
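A minimal sketch of the join step in this first workaround, using only the standard library. It assumes both logs have already been parsed into lists of dicts that carry an epoch-seconds _timestamp field; the function name join_nearest, the tolerance parameter, and the column names in the usage example are hypothetical, not part of dstat or W&B.

```python
import bisect


def join_nearest(left_rows, right_rows, key="_timestamp", tolerance=1.0):
    """Attach to each left row the right row with the closest timestamp.

    Rows are dicts and `key` holds seconds since the epoch. A right row
    farther than `tolerance` seconds away is not merged.
    """
    right_sorted = sorted(right_rows, key=lambda r: r[key])
    right_times = [r[key] for r in right_sorted]
    joined = []
    for row in left_rows:
        t = row[key]
        i = bisect.bisect_left(right_times, t)
        # Only the neighbours just before and just after t can be closest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(right_times)]
        best = min(candidates, key=lambda j: abs(right_times[j] - t), default=None)
        merged = dict(row)
        if best is not None and abs(right_times[best] - t) <= tolerance:
            for name, value in right_sorted[best].items():
                if name != key:
                    merged[name] = value
        joined.append(merged)
    return joined


# Hypothetical rows: the column names are made up for illustration.
wandb_rows = [{"_timestamp": 100.0, "system.cpu": 12.0},
              {"_timestamp": 102.0, "system.cpu": 30.0}]
dstat_rows = [{"_timestamp": 100.2, "net_recv": 500},
              {"_timestamp": 105.0, "net_recv": 900}]
joined = join_nearest(wandb_rows, dstat_rows, tolerance=1.0)
```

The tolerance is there because the two samplers will not tick at exactly the same instants; rows without a close enough partner are kept but left unmerged.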
The other solution would be to use some system monitoring library and add the values manually via wandb.log({'my_metric': x}) to the “ML”-log. This would show the metric in your visualization, but not in the correct place, and it could not easily be compared to the other system metrics. I do not know how well this would work in practice, as the log would ideally need a new entry every few seconds. That requires an asynchronous thread running outside of the training loop. The solution proposed here seems like it could work if I use “timestamps” as the X-axis? This still does not seem like a clean solution.
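A rough sketch of what such a thread outside the training loop could look like. It is written against two injected callables so that it does not depend on any particular library: read_counters returns a dict of cumulative counters and log_fn receives a dict of per-second rates. The names start_sampler, rates, and the "_per_s" suffix are hypothetical.

```python
import threading
import time


def rates(prev, cur, elapsed):
    """Convert two snapshots of cumulative counters into per-second rates."""
    return {f"{name}_per_s": (cur[name] - prev[name]) / elapsed
            for name in cur if name in prev}


def start_sampler(read_counters, log_fn, interval=2.0):
    """Sample counters every `interval` seconds in a daemon thread.

    Returns a function that stops the thread and waits for it to exit.
    """
    stop = threading.Event()

    def loop():
        prev, prev_t = read_counters(), time.monotonic()
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval):
            cur, cur_t = read_counters(), time.monotonic()
            if cur_t > prev_t:
                log_fn(rates(prev, cur, cur_t - prev_t))
            prev, prev_t = cur, cur_t

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()

    def stop_sampler():
        stop.set()
        thread.join()

    return stop_sampler
```

In practice, read_counters could be something like lambda: psutil.net_io_counters()._asdict() (assuming psutil is installed) and log_fn could be wandb.log. How those extra log calls interact with the step counter used by the training loop is not addressed by this sketch, which is where the “timestamps” X-axis question comes in.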
What are your thoughts on this proposed feature? I’m very much a novice regarding your service, so I might not know the ins and outs; maybe I have overlooked some trivial solution.