I'm trying to run wandb on Azure ML and am running into the following error:

Traceback (most recent call last):
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1133, in init
    run = wi.init()
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 787, in init
    run_start_result = run_start_handle.wait(timeout=30)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/lib/mailbox.py", line 271, in wait
    raise MailboxError("transport failed")
wandb.errors.MailboxError: transport failed
wandb: ERROR Abnormal program exit
2023-02-13 22:32:43,972 - mmseg - INFO - Loaded 20000 images
/mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/mmseg/models/losses/cross_entropy_loss.py:235: UserWarning: Default ``avg_non_ignore`` is False, if you would like to ignore the certain label and average loss over non-ignore labels, which is the same with PyTorch official cross_entropy, set ``avg_non_ignore=True``.
  warnings.warn(
2023-02-13 22:32:52,439 - mmseg - INFO - Loaded 2500 images
2023-02-13 22:32:52,458 - mmseg - INFO - Start running, host: azureuser@vardhan-cvml, work_dir: /mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/work_dirs/logs/deeplabv3plus
2023-02-13 22:32:52,459 - mmseg - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) PolyLrUpdaterHook                  
(NORMAL      ) CheckpointHook                     
(LOW         ) EvalHook                           
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
(VERY_LOW    ) MMSegWandbHook                     
 -------------------- 
before_train_epoch:
(VERY_HIGH   ) PolyLrUpdaterHook                  
(LOW         ) IterTimerHook                      
(LOW         ) EvalHook                           
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
(VERY_LOW    ) MMSegWandbHook                     
 -------------------- 
before_train_iter:
(VERY_HIGH   ) PolyLrUpdaterHook                  
(LOW         ) IterTimerHook                      
(LOW         ) EvalHook                           
 -------------------- 
after_train_iter:
(ABOVE_NORMAL) OptimizerHook                      
(NORMAL      ) CheckpointHook                     
(LOW         ) IterTimerHook                      
(LOW         ) EvalHook                           
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
(VERY_LOW    ) MMSegWandbHook                     
 -------------------- 
after_train_epoch:
(NORMAL      ) CheckpointHook                     
(LOW         ) EvalHook                           
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
(VERY_LOW    ) MMSegWandbHook                     
 -------------------- 
before_val_epoch:
(LOW         ) IterTimerHook                      
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
(VERY_LOW    ) MMSegWandbHook                     
 -------------------- 
before_val_iter:
(LOW         ) IterTimerHook                      
 -------------------- 
after_val_iter:
(LOW         ) IterTimerHook                      
 -------------------- 
after_val_epoch:
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
(VERY_LOW    ) MMSegWandbHook                     
 -------------------- 
after_run:
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
(VERY_LOW    ) MMSegWandbHook                     
 -------------------- 
2023-02-13 22:32:52,460 - mmseg - INFO - workflow: [('train', 1)], max: 50000 iters
2023-02-13 22:32:52,460 - mmseg - INFO - Checkpoints will be saved to /mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/work_dirs/logs/deeplabv3plus by HardDiskBackend.
2023-02-13 22:32:52.816987: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-13 22:32:59.646354: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/cv2/../../lib64:
2023-02-13 22:32:59.646501: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/cv2/../../lib64:
2023-02-13 22:32:59.646517: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
wandb: Currently logged in as: don_v. Use `wandb login --relogin` to force relogin
Thread HandlerThread:
Traceback (most recent call last):
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 49, in run
    self._run()
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 100, in _run
    self._process(record)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/internal.py", line 280, in _process
    self._hm.handle(record)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 136, in handle
    handler(record)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 146, in handle_request
    handler(record)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 695, in handle_request_run_start
    self._system_monitor.probe(publish=True)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/system/system_monitor.py", line 186, in probe
    self.system_info.publish(system_info)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/system/system_info.py", line 252, in publish
    self._save_patches()
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/system/system_info.py", line 134, in _save_patches
    if self.git.dirty:
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/lib/git.py", line 76, in dirty
    return self.repo.is_dirty()
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/git/repo/base.py", line 795, in is_dirty
    if osp.isfile(self.index.path) and len(self.git.diff("--cached", *default_args)):
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/git/cmd.py", line 696, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/git/cmd.py", line 1270, in _call_process
    return self.execute(call, **exec_kwargs)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/git/cmd.py", line 1064, in execute
    raise GitCommandError(redacted_command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(129)
  cmdline: git diff --cached --abbrev=40 --full-index --raw
  stderr: 'error: unknown option `cached'
usage: git diff --no-index [<options>] <path> <path>

Diff output format options
    -p, --patch           generate patch
    -s, --no-patch        suppress diff output
    -u                    generate patch
    -U, --unified[=<n>]   generate diffs with <n> lines context
    -W, --function-context
                          generate diffs with <n> lines context
    --raw                 generate the diff in raw format
    --patch-with-raw      synonym for '-p --raw'
    --patch-with-stat     synonym for '-p --stat'
    --numstat             machine friendly --stat
    --shortstat           output only the last line of --stat
    -X, --dirstat[=<param1,param2>...]
                          output the distribution of relative amount of changes for each sub-directory
    --cumulative          synonym for --dirstat=cumulative
    --dirstat-by-file[=<param1,param2>...]
                          synonym for --dirstat=files,param1,param2...
    --check               warn if changes introduce conflict markers or whitespace errors
    --summary             condensed summary such as creations, renames and mode changes
    --name-only           show only names of changed files
    --name-status         show only names and status of changed files
    --stat[=<width>[,<name-width>[,<count>]]]
                          generate diffstat
    --stat-width <width>  generate diffstat with a given width
    --stat-name-width <width>
                          generate diffstat with a given name width
    --stat-graph-width <width>
                          generate diffstat with a given graph width
    --stat-count <count>  generate diffstat with limited lines
    --compact-summary     generate compact summary in diffstat
    --binary              output a binary diff that can be applied
    --full-index          show full pre- and post-image object names on the "index" lines
    --color[=<when>]      show colored diff
    --ws-error-highlight <kind>
                          highlight whitespace errors in the 'context', 'old' or 'new' lines in the diff
    -z                    do not munge pathnames and use NULs as output field terminators in --raw or --numstat
    --abbrev[=<n>]        use <n> digits to display object names
    --src-prefix <prefix>
                          show the given source prefix instead of "a/"
    --dst-prefix <prefix>
                          show the given destination prefix instead of "b/"
    --line-prefix <prefix>
                          prepend an additional prefix to every line of output
    --no-prefix           do not show any source or destination prefix
    --inter-hunk-context <n>
                          show context between diff hunks up to the specified number of lines
    --output-indicator-new <char>
                          specify the character to indicate a new line instead of '+'
    --output-indicator-old <char>
                          specify the character to indicate an old line instead of '-'
    --output-indicator-context <char>
                          specify the character to indicate a context instead of ' '

Diff rename options
    -B, --break-rewrites[=<n>[/<m>]]
                          break complete rewrite changes into pairs of delete and create
    -M, --find-renames[=<n>]
                          detect renames
    -D, --irreversible-delete
                          omit the preimage for deletes
    -C, --find-copies[=<n>]
                          detect copies
    --find-copies-harder  use unmodified files as source to find copies
    --no-renames          disable rename detection
    --rename-empty        use empty blobs as rename source
    --follow              continue listing the history of a file beyond renames
    -l <n>                prevent rename/copy detection if the number of rename/copy targets exceeds given limit

Diff algorithm options
    --minimal             produce the smallest possible diff
    -w, --ignore-all-space
                          ignore whitespace when comparing lines
    -b, --ignore-space-change
                          ignore changes in amount of whitespace
    --ignore-space-at-eol
                          ignore changes in whitespace at EOL
    --ignore-cr-at-eol    ignore carrier-return at the end of line
    --ignore-blank-lines  ignore changes whose lines are all blank
    -I, --ignore-matching-lines <regex>
                          ignore changes whose all lines match <regex>
    --indent-heuristic    heuristic to shift diff hunk boundaries for easy reading
    --patience            generate diff using the "patience diff" algorithm
    --histogram           generate diff using the "histogram diff" algorithm
    --diff-algorithm <algorithm>
                          choose a diff algorithm
    --anchored <text>     generate diff using the "anchored diff" algorithm
    --word-diff[=<mode>]  show word diff, using <mode> to delimit changed words
    --word-diff-regex <regex>
                          use <regex> to decide what a word is
    --color-words[=<regex>]
                          equivalent to --word-diff=color --word-diff-regex=<regex>
    --color-moved[=<mode>]
                          moved lines of code are colored differently
    --color-moved-ws <mode>
                          how white spaces are ignored in --color-moved

Other diff options
    --relative[=<prefix>]
                          when run from subdir, exclude changes outside and show relative paths
    -a, --text            treat all files as text
    -R                    swap two inputs, reverse the diff
    --exit-code           exit with 1 if there were differences, 0 otherwise
    --quiet               disable all output of the program
    --ext-diff            allow an external diff helper to be executed
    --textconv            run external text conversion filters when comparing binary files
    --ignore-submodules[=<when>]
                          ignore changes to submodules in the diff generation
    --submodule[=<format>]
                          specify how differences in submodules are shown
    --ita-invisible-in-index
                          hide 'git add -N' entries from the index
    --ita-visible-in-index
                          treat 'git add -N' entries as real in the index
    -S <string>           look for differences that change the number of occurrences of the specified string
    -G <regex>            look for differences that change the number of occurrences of the specified regex
    --pickaxe-all         show all changes in the changeset with -S or -G
    --pickaxe-regex       treat <string> in -S as extended POSIX regular expression
    -O <file>             control the order in which files appear in the output
    --rotate-to <path>    show the change in the specified path first
    --skip-to <path>      skip the output to the specified path
    --find-object <object-id>
                          look for differences that change the number of occurrences of the specified object
    --diff-filter [(A|C|D|M|R|T|U|X|B)...[*]]
                          select files by diff type
    --output <file>       output to a specific file
'
wandb: ERROR Internal wandb error: file data was not synced
Problem at: /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/hooks/logger/wandb.py 83 before_run
---------------------------------------------------------------------------
MailboxError                              Traceback (most recent call last)
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/wandb_init.py:1133, in init(job_type, dir, config, project, entity, reinit, tags, group, name, notes, magic, config_exclude_keys, config_include_keys, anonymous, mode, allow_val_change, resume, force, tensorboard, sync_tensorboard, monitor_gym, save_code, id, settings)
   1132 try:
-> 1133     run = wi.init()
   1134     except_exit = wi.settings._except_exit

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/wandb_init.py:787, in _WandbInit.init(self)
    786 # TODO: add progress to let user know we are doing something
--> 787 run_start_result = run_start_handle.wait(timeout=30)
    788 if run_start_result is None:

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/lib/mailbox.py:271, in MailboxHandle.wait(self, timeout, on_probe, on_progress, release, cancel)
    270     if self._interface._transport_keepalive_failed():
--> 271         raise MailboxError("transport failed")
    273 found, abandoned = self._slot._get_and_clear(timeout=wait_timeout)

MailboxError: transport failed

The above exception was the direct cause of the following exception:

Exception                                 Traceback (most recent call last)
Input In [8], in <cell line: 20>()
     14 model.CLASSES = datasets[0].CLASSES
     16 # Create work_dir
     17 # mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
---> 20 train_segmentor(model, datasets, cfg, distributed=False, validate=True, 
     21                 meta=dict())

File /mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/mmseg/apis/train.py:194, in train_segmentor(model, dataset, cfg, distributed, validate, timestamp, meta)
    192 elif cfg.load_from:
    193     runner.load_checkpoint(cfg.load_from)
--> 194 runner.run(data_loaders, cfg.workflow)

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/iter_based_runner.py:126, in IterBasedRunner.run(self, data_loaders, workflow, max_iters, **kwargs)
    122 self.logger.info('Hooks will be executed in the following order:\n%s',
    123                  self.get_hook_info())
    124 self.logger.info('workflow: %s, max: %d iters', workflow,
    125                  self._max_iters)
--> 126 self.call_hook('before_run')
    128 iter_loaders = [IterLoader(x) for x in data_loaders]
    130 self.call_hook('before_epoch')

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/base_runner.py:317, in BaseRunner.call_hook(self, fn_name)
    310 """Call all hooks.
    311 
    312 Args:
    313     fn_name (str): The function name in each hook to be called, such as
    314         "before_train_epoch".
    315 """
    316 for hook in self._hooks:
--> 317     getattr(hook, fn_name)(self)

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/dist_utils.py:135, in master_only.<locals>.wrapper(*args, **kwargs)
    133 rank, _ = get_dist_info()
    134 if rank == 0:
--> 135     return func(*args, **kwargs)

File /mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/mmseg/core/hook/wandblogger_hook.py:106, in MMSegWandbHook.before_run(self, runner)
    104 @master_only
    105 def before_run(self, runner):
--> 106     super(MMSegWandbHook, self).before_run(runner)
    108     # Check if EvalHook and CheckpointHook are available.
    109     for hook in runner.hooks:

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/dist_utils.py:135, in master_only.<locals>.wrapper(*args, **kwargs)
    133 rank, _ = get_dist_info()
    134 if rank == 0:
--> 135     return func(*args, **kwargs)

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/hooks/logger/wandb.py:83, in WandbLoggerHook.before_run(self, runner)
     81     self.import_wandb()
     82 if self.init_kwargs:
---> 83     self.wandb.init(**self.init_kwargs)  # type: ignore
     84 else:
     85     self.wandb.init()

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/wandb_init.py:1170, in init(job_type, dir, config, project, entity, reinit, tags, group, name, notes, magic, config_exclude_keys, config_include_keys, anonymous, mode, allow_val_change, resume, force, tensorboard, sync_tensorboard, monitor_gym, save_code, id, settings)
   1168         if except_exit:
   1169             os._exit(1)
-> 1170         raise Exception("problem") from error_seen
   1171 return run

Exception: problem

Hello Vardhan!

To get an idea of what the issue may be, could you provide your debug.log and debug-internal.log for this specific run? They should be located in the wandb folder in your computer's working directory. That folder contains subfolders named run-DATETIME-ID, each of which is associated with an individual run.
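
If it helps, here is a minimal sketch for locating those two files for the most recent run. It makes a few assumptions: the wandb folder sits in the current working directory, the DATETIME part of each run folder name sorts chronologically as plain text, and the log files may live in a subfolder of the run folder (so the search is recursive). The function name `find_latest_run_logs` is only an illustrative helper, not part of wandb.

```python
from pathlib import Path

# File names requested in the reply above.
LOG_NAMES = ("debug.log", "debug-internal.log")


def find_latest_run_logs(wandb_dir: Path = Path("wandb")) -> dict:
    """Return {log name: path} for the newest run-* folder under wandb_dir.

    Assumption: folder names look like run-DATETIME-ID and the DATETIME
    part sorts chronologically when compared as a string.
    """
    run_dirs = sorted(
        (p for p in wandb_dir.glob("run-*") if p.is_dir()),
        key=lambda p: p.name,
    )
    if not run_dirs:
        return {}

    latest = run_dirs[-1]
    found = {}
    for name in LOG_NAMES:
        # Search recursively so the exact subfolder layout does not matter.
        matches = sorted(latest.rglob(name))
        if matches:
            found[name] = matches[0]
    return found


if __name__ == "__main__":
    for log_name, log_path in find_latest_run_logs().items():
        print(f"{log_name}: {log_path}")
```

Run it from the directory where training was launched. If nothing is printed, the wandb folder is probably somewhere else, and the path can be passed explicitly, for example `find_latest_run_logs(Path("/path/to/wandb"))`.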

Hi there, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

Hi Vardhan, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!
