Control knobs for sending commands back to the running job / controlling live variables from the das

:rocket: Feature

There should be a way to attach variables to the logger that you can modify live from controls in the dashboard.

Motivation

When you have a running job and are monitoring its progress, you sometimes want to adjust the learning rate or another hyperparameter (should we switch to fine-tuning mode, etc.).

Pitch

This is a bit of an unspoken black-magic deep learning technique. However, if you read papers from Meta, etc. or talk to hardcore old-school practitioners, they have these super long-running, difficult optimization problems, and say something like: “Well, we trained the generator for X thousand epochs, then we enabled the discriminator, then Y thousand epochs later we dropped the learning rate, etc.” This is ideally done by monitoring a live, running job and modifying the variables in situ.

Alternatives

  • The non-agile way to do this is to let your run go for a while, decide afterwards that you should have changed something at some point in time, code that, run it again and cross your fingers. This is obviously pretty slow and requires luck.
  • A hacky way to do this is to create a DSL with sentinel files that the running job reads and applies. However, the workflow is useful enough that there should be a common way to do this.
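For illustration, the sentinel-file workaround might look roughly like the sketch below. Everything in it is hypothetical: the JSON control file, the parameter names `lr` and `fine_tune`, and the helpers `load_overrides` and `train` are assumptions made up for this example, not part of any existing logging library.

```python
import json


def load_overrides(path, current):
    """Return a copy of `current` updated with values from a JSON control file.

    Hypothetical helper: the control file is a flat JSON object such as
    {"lr": 0.01, "fine_tune": true} that someone edits while the job runs.
    """
    try:
        with open(path) as f:
            overrides = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        # No file yet, or it is half-written: keep the current values.
        return dict(current)
    if not isinstance(overrides, dict):
        return dict(current)
    updated = dict(current)
    # Only accept keys the job already knows about, so typos are ignored.
    updated.update({k: v for k, v in overrides.items() if k in current})
    return updated


def train(control_path, num_steps, poll_every=10):
    """Stand-in training loop that re-reads the control file periodically."""
    params = {"lr": 0.1, "fine_tune": False}  # assumed starting values
    history = []
    for step in range(num_steps):
        if step % poll_every == 0:
            params = load_overrides(control_path, params)
        # A real job would run one optimization step with params["lr"] here.
        history.append((step, params["lr"], params["fine_tune"]))
    return history
```

Every job ends up reinventing something like this (file format, polling interval, validation), which is why a common dashboard-driven mechanism would be preferable.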

Additional context

I’m not aware of any logging library that does this. So it would make great blog posts to show off and attract more users.

Hi @turian, thank you for the feature request as well as the use-case this would unlock! I will go ahead and submit this to the engineering team and follow up once they have had a chance to look into it.

Thank you,
Nate


@nathank Thanks. Here is another simple example, which would make a nice demonstration for a blog post:

I was recently doing a randomized grid search, running 8 jobs simultaneously. After looking at a few training runs, it was clear that any model that did not reach a loss of 0.1 by batch 1000 should be stopped and restarted with new hyperparameters. So this is something that would be useful to control from the dashboard.
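As a rough sketch, the rule itself is tiny. The function name `should_stop` and its arguments are hypothetical, and the thresholds are just the ones from this example; the point is that `target_loss` and `deadline_batch` are exactly the kind of values I would want to set or change from the dashboard rather than hard-code.

```python
def should_stop(losses, target_loss=0.1, deadline_batch=1000):
    """Decide whether a run should be killed and restarted.

    `losses` is the list of per-batch losses seen so far (index 0 is batch 1).
    The run is stopped only once the deadline has passed without the loss
    ever having reached `target_loss`.
    """
    if len(losses) < deadline_batch:
        # Deadline not reached yet: keep training.
        return False
    # Stop only if no batch up to the deadline got down to the target.
    return min(losses[:deadline_batch]) > target_loss
```

Each of the 8 jobs would call this after every batch and exit (or request new hyperparameters) when it returns `True`.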

(Another similar example is that I would then adjust the grid script by hand to remove learning rates that were too high or too low. I would prefer to do that from the dashboard.)
