Forcing Pre-emption in a sweep

This is a bit of a weird one…
My lab has a cluster that usually runs on Slurm, but Slurm is down (and may not be back up for a while). We would like to keep using wandb and still maintain the priority levels that Slurm gives us. That is, if there are critical jobs that need to run, we want to be able to pre-empt jobs that are currently running in sweeps and have them requeued once the critical jobs are done. Essentially, we are trying to do manual pre-emption.

I have been trying to set this up with a dummy sweep that just sends a single “magic number” to each run. However, no matter what I do, I cannot get the run to be pre-empted. If I kill it with ctrl+C, the wandb process shuts down normally and marks the run as finished. Is there any way I can get around this to force the pre-emption? Thank you so much for your help!

import wandb
import time
from random import randint
import os
import sys
from tqdm.auto import tqdm
wandb.init()
magic_number = wandb.config.magic_number
try:
    print(f'Hello, world! Magic number is {magic_number}')
    print('My PID is', os.getpid())
    size = 1_000_000_000
    for count in tqdm(range(size)):
        if count % (size // 10) == 0:
            print(f'On count {count}')
except (Exception, KeyboardInterrupt, SystemExit) as e:
    print('Keyboard interrupt!')
    # I cannot reach this piece of code no matter what I do.
    # I have tried ctrl+c, killing the process corresponding to this python script, and killing the wandb agent process.
    wandb.mark_preempting()
    print('Preempted!')
    sys.exit(999)
print('Done!')

Hi @evanv, happy to help. As a point of clarification, are you trying to pre-empt a run launched by a sweep?

I’m not quite sure what could be happening on your end. I ran your exact code with one minor tweak: passing a config that includes a ‘magic_number’ to wandb.init(). The example ran successfully; I manually killed the run via ctrl+c and was able to place the run in a preempted state.

  config = {"magic_number":10}
  wandb.init(config=config)
  magic_number = wandb.config.magic_number
  try:
      print(f'Hello, world! Magic number is {magic_number}')
      print('My PID is', os.getpid())
      size = 1_000_000_000
      for count in tqdm(range(size)):
          if count % (size // 10) == 0:
              print(f'On count {count}')
  except (Exception, KeyboardInterrupt, SystemExit) as e:
      #I was successful in triggering this except block with ctrl+c
      print('Keyboard interrupt!')
      wandb.mark_preempting()
      print('Preempted!')
      sys.exit(999)
  print('Done!')

The resulting run can be viewed here. Could you provide a link to your workspace where the runs are showing up as finished?

Hey @mohammadbakir,

Yes, I am trying to pre-empt a run that is launched by a sweep. I may indeed have had some kind of strange issue on my end, because your code also works for me. However, one thing I noticed is that the pre-empted runs do not seem to be immediately requeued by the sweep. Interestingly, it looks like you were able to use wandb.mark_preempting without a sweep in progress. Could you give a bit more insight into how this command works and what its implications are for the run queue? The documentation seems to say the runs should be immediately requeued, but I have not observed that to be the case. Is there any way I can check the queue to see what is happening?

Hi @evanv, wandb.mark_preempting() will mark a run as preempting, but the run is not requeued until its status is preempted. The status change preempting → preempted happens either when the run exits with a non-zero status (maybe your signal handler is preventing this), or after the run spends 5 minutes in the preempting state and our backend receives no heartbeats from it. If the run exits successfully (with zero status) after being put into the preempting state, we assume the run finished before it was actually preempted, and the run state is set to finished. In this case the run will not be requeued.
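To make the rule above easier to reason about, here is a minimal sketch that restates it as a pure function. This is only an illustrative model of the behaviour described in this reply, not W&B code; the function name `state_after_preempting`, its parameters, and the constant `HEARTBEAT_TIMEOUT_S` are all hypothetical.

```python
from typing import Optional

# The 5-minute window mentioned above (name is hypothetical).
HEARTBEAT_TIMEOUT_S = 5 * 60


def state_after_preempting(exit_code: Optional[int],
                           seconds_without_heartbeat: float = 0.0) -> str:
    """Model the state a run ends up in after wandb.mark_preempting().

    exit_code is None while the process has not reported an exit status.
    Illustrative only: it mirrors the description in this thread.
    """
    if exit_code is not None:
        # A zero exit is treated as a normal finish, so no requeue happens.
        return "finished" if exit_code == 0 else "preempted"
    if seconds_without_heartbeat >= HEARTBEAT_TIMEOUT_S:
        # No exit status was seen, but the run went silent for 5 minutes.
        return "preempted"
    return "preempting"
```

Under this model, a run that is marked preempting and then shuts down cleanly ends up finished, which matches the symptom of runs not being requeued.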

There are likely two ways to control your exit status and force the preempting → preempted transition:

  1. Call wandb.finish(exit_code=1) after you mark the run as preempting
  2. Make your process exit with a non-zero status, e.g. exit(1)
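The two options above can be combined in a signal handler. The following is a minimal sketch, assuming the training script itself receives SIGINT or SIGTERM; whether the sweep agent forwards signals to the run is not confirmed in this thread. The helper names `make_preempt_handler`, `install_preempt_handler`, and `PREEMPT_EXIT_CODE` are hypothetical. The wandb functions are passed in as callables so the helper has no import-time dependency on wandb.

```python
import signal
import sys

# Any non-zero status should do; the name is hypothetical.
PREEMPT_EXIT_CODE = 1


def make_preempt_handler(mark_preempting, finish, exit_code=PREEMPT_EXIT_CODE):
    """Build a signal handler that marks the run as preempting and exits non-zero.

    In a real script, mark_preempting and finish would be
    wandb.mark_preempting and wandb.finish.
    """
    def handler(signum, frame):
        mark_preempting()            # put the run into the preempting state
        finish(exit_code=exit_code)  # option 1: report a non-zero exit code
        sys.exit(exit_code)          # option 2: exit the process non-zero
    return handler


def install_preempt_handler(mark_preempting, finish,
                            signals=(signal.SIGINT, signal.SIGTERM)):
    """Register the handler for the given signals (main thread only)."""
    handler = make_preempt_handler(mark_preempting, finish)
    for sig in signals:
        signal.signal(sig, handler)
    return handler


# Intended usage inside the sweep's training script (assumes wandb is installed):
#   import wandb
#   wandb.init()
#   install_preempt_handler(wandb.mark_preempting, wandb.finish)
```

With a handler like this in place, sending ctrl+C or `kill <pid>` to the script should leave it with a non-zero exit status rather than a clean shutdown, which is the condition described above for the run to move to preempted.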

Hi @evanv since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!