Trying to run wandb on Azure ML, running into issues

don_v · February 13, 2023, 10:36pm

Traceback (most recent call last):
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1133, in init
    run = wi.init()
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 787, in init
    run_start_result = run_start_handle.wait(timeout=30)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/lib/mailbox.py", line 271, in wait
    raise MailboxError("transport failed")
wandb.errors.MailboxError: transport failed
wandb: ERROR Abnormal program exit
2023-02-13 22:32:43,972 - mmseg - INFO - Loaded 20000 images
/mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/mmseg/models/losses/cross_entropy_loss.py:235: UserWarning: Default ``avg_non_ignore`` is False, if you would like to ignore the certain label and average loss over non-ignore labels, which is the same with PyTorch official cross_entropy, set ``avg_non_ignore=True``.
  warnings.warn(
2023-02-13 22:32:52,439 - mmseg - INFO - Loaded 2500 images
2023-02-13 22:32:52,458 - mmseg - INFO - Start running, host: azureuser@vardhan-cvml, work_dir: /mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/work_dirs/logs/deeplabv3plus
2023-02-13 22:32:52,459 - mmseg - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) PolyLrUpdaterHook                  
(NORMAL      ) CheckpointHook                     
(LOW         ) EvalHook                           
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
(VERY_LOW    ) MMSegWandbHook                     
 -------------------- 
before_train_epoch:
(VERY_HIGH   ) PolyLrUpdaterHook                  
(LOW         ) IterTimerHook                      
(LOW         ) EvalHook                           
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
(VERY_LOW    ) MMSegWandbHook                     
 -------------------- 
before_train_iter:
(VERY_HIGH   ) PolyLrUpdaterHook                  
(LOW         ) IterTimerHook                      
(LOW         ) EvalHook                           
 -------------------- 
after_train_iter:
(ABOVE_NORMAL) OptimizerHook                      
(NORMAL      ) CheckpointHook                     
(LOW         ) IterTimerHook                      
(LOW         ) EvalHook                           
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
(VERY_LOW    ) MMSegWandbHook                     
 -------------------- 
after_train_epoch:
(NORMAL      ) CheckpointHook                     
(LOW         ) EvalHook                           
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
(VERY_LOW    ) MMSegWandbHook                     
 -------------------- 
before_val_epoch:
(LOW         ) IterTimerHook                      
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
(VERY_LOW    ) MMSegWandbHook                     
 -------------------- 
before_val_iter:
(LOW         ) IterTimerHook                      
 -------------------- 
after_val_iter:
(LOW         ) IterTimerHook                      
 -------------------- 
after_val_epoch:
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
(VERY_LOW    ) MMSegWandbHook                     
 -------------------- 
after_run:
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
(VERY_LOW    ) MMSegWandbHook                     
 -------------------- 
2023-02-13 22:32:52,460 - mmseg - INFO - workflow: [('train', 1)], max: 50000 iters
2023-02-13 22:32:52,460 - mmseg - INFO - Checkpoints will be saved to /mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/work_dirs/logs/deeplabv3plus by HardDiskBackend.
2023-02-13 22:32:52.816987: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-13 22:32:59.646354: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/cv2/../../lib64:
2023-02-13 22:32:59.646501: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/cv2/../../lib64:
2023-02-13 22:32:59.646517: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
wandb: Currently logged in as: don_v. Use `wandb login --relogin` to force relogin
Thread HandlerThread:
Traceback (most recent call last):
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 49, in run
    self._run()
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 100, in _run
    self._process(record)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/internal.py", line 280, in _process
    self._hm.handle(record)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 136, in handle
    handler(record)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 146, in handle_request
    handler(record)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 695, in handle_request_run_start
    self._system_monitor.probe(publish=True)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/system/system_monitor.py", line 186, in probe
    self.system_info.publish(system_info)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/system/system_info.py", line 252, in publish
    self._save_patches()
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/system/system_info.py", line 134, in _save_patches
    if self.git.dirty:
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/lib/git.py", line 76, in dirty
    return self.repo.is_dirty()
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/git/repo/base.py", line 795, in is_dirty
    if osp.isfile(self.index.path) and len(self.git.diff("--cached", *default_args)):
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/git/cmd.py", line 696, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/git/cmd.py", line 1270, in _call_process
    return self.execute(call, **exec_kwargs)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/git/cmd.py", line 1064, in execute
    raise GitCommandError(redacted_command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(129)
  cmdline: git diff --cached --abbrev=40 --full-index --raw
  stderr: 'error: unknown option `cached'
usage: git diff --no-index [<options>] <path> <path>

Diff output format options
    -p, --patch           generate patch
    -s, --no-patch        suppress diff output
    -u                    generate patch
    -U, --unified[=<n>]   generate diffs with <n> lines context
    -W, --function-context
                          generate diffs with <n> lines context
    --raw                 generate the diff in raw format
    --patch-with-raw      synonym for '-p --raw'
    --patch-with-stat     synonym for '-p --stat'
    --numstat             machine friendly --stat
    --shortstat           output only the last line of --stat
    -X, --dirstat[=<param1,param2>...]
                          output the distribution of relative amount of changes for each sub-directory
    --cumulative          synonym for --dirstat=cumulative
    --dirstat-by-file[=<param1,param2>...]
                          synonym for --dirstat=files,param1,param2...
    --check               warn if changes introduce conflict markers or whitespace errors
    --summary             condensed summary such as creations, renames and mode changes
    --name-only           show only names of changed files
    --name-status         show only names and status of changed files
    --stat[=<width>[,<name-width>[,<count>]]]
                          generate diffstat
    --stat-width <width>  generate diffstat with a given width
    --stat-name-width <width>
                          generate diffstat with a given name width
    --stat-graph-width <width>
                          generate diffstat with a given graph width
    --stat-count <count>  generate diffstat with limited lines
    --compact-summary     generate compact summary in diffstat
    --binary              output a binary diff that can be applied
    --full-index          show full pre- and post-image object names on the "index" lines
    --color[=<when>]      show colored diff
    --ws-error-highlight <kind>
                          highlight whitespace errors in the 'context', 'old' or 'new' lines in the diff
    -z                    do not munge pathnames and use NULs as output field terminators in --raw or --numstat
    --abbrev[=<n>]        use <n> digits to display object names
    --src-prefix <prefix>
                          show the given source prefix instead of "a/"
    --dst-prefix <prefix>
                          show the given destination prefix instead of "b/"
    --line-prefix <prefix>
                          prepend an additional prefix to every line of output
    --no-prefix           do not show any source or destination prefix
    --inter-hunk-context <n>
                          show context between diff hunks up to the specified number of lines
    --output-indicator-new <char>
                          specify the character to indicate a new line instead of '+'
    --output-indicator-old <char>
                          specify the character to indicate an old line instead of '-'
    --output-indicator-context <char>
                          specify the character to indicate a context instead of ' '

Diff rename options
    -B, --break-rewrites[=<n>[/<m>]]
                          break complete rewrite changes into pairs of delete and create
    -M, --find-renames[=<n>]
                          detect renames
    -D, --irreversible-delete
                          omit the preimage for deletes
    -C, --find-copies[=<n>]
                          detect copies
    --find-copies-harder  use unmodified files as source to find copies
    --no-renames          disable rename detection
    --rename-empty        use empty blobs as rename source
    --follow              continue listing the history of a file beyond renames
    -l <n>                prevent rename/copy detection if the number of rename/copy targets exceeds given limit

Diff algorithm options
    --minimal             produce the smallest possible diff
    -w, --ignore-all-space
                          ignore whitespace when comparing lines
    -b, --ignore-space-change
                          ignore changes in amount of whitespace
    --ignore-space-at-eol
                          ignore changes in whitespace at EOL
    --ignore-cr-at-eol    ignore carrier-return at the end of line
    --ignore-blank-lines  ignore changes whose lines are all blank
    -I, --ignore-matching-lines <regex>
                          ignore changes whose all lines match <regex>
    --indent-heuristic    heuristic to shift diff hunk boundaries for easy reading
    --patience            generate diff using the "patience diff" algorithm
    --histogram           generate diff using the "histogram diff" algorithm
    --diff-algorithm <algorithm>
                          choose a diff algorithm
    --anchored <text>     generate diff using the "anchored diff" algorithm
    --word-diff[=<mode>]  show word diff, using <mode> to delimit changed words
    --word-diff-regex <regex>
                          use <regex> to decide what a word is
    --color-words[=<regex>]
                          equivalent to --word-diff=color --word-diff-regex=<regex>
    --color-moved[=<mode>]
                          moved lines of code are colored differently
    --color-moved-ws <mode>
                          how white spaces are ignored in --color-moved

Other diff options
    --relative[=<prefix>]
                          when run from subdir, exclude changes outside and show relative paths
    -a, --text            treat all files as text
    -R                    swap two inputs, reverse the diff
    --exit-code           exit with 1 if there were differences, 0 otherwise
    --quiet               disable all output of the program
    --ext-diff            allow an external diff helper to be executed
    --textconv            run external text conversion filters when comparing binary files
    --ignore-submodules[=<when>]
                          ignore changes to submodules in the diff generation
    --submodule[=<format>]
                          specify how differences in submodules are shown
    --ita-invisible-in-index
                          hide 'git add -N' entries from the index
    --ita-visible-in-index
                          treat 'git add -N' entries as real in the index
    -S <string>           look for differences that change the number of occurrences of the specified string
    -G <regex>            look for differences that change the number of occurrences of the specified regex
    --pickaxe-all         show all changes in the changeset with -S or -G
    --pickaxe-regex       treat <string> in -S as extended POSIX regular expression
    -O <file>             control the order in which files appear in the output
    --rotate-to <path>    show the change in the specified path first
    --skip-to <path>      skip the output to the specified path
    --find-object <object-id>
                          look for differences that change the number of occurrences of the specified object
    --diff-filter [(A|C|D|M|R|T|U|X|B)...[*]]
                          select files by diff type
    --output <file>       output to a specific file
'
wandb: ERROR Internal wandb error: file data was not synced
Problem at: /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/hooks/logger/wandb.py 83 before_run
---------------------------------------------------------------------------
MailboxError                              Traceback (most recent call last)
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/wandb_init.py:1133, in init(job_type, dir, config, project, entity, reinit, tags, group, name, notes, magic, config_exclude_keys, config_include_keys, anonymous, mode, allow_val_change, resume, force, tensorboard, sync_tensorboard, monitor_gym, save_code, id, settings)
   1132 try:
-> 1133     run = wi.init()
   1134     except_exit = wi.settings._except_exit

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/wandb_init.py:787, in _WandbInit.init(self)
    786 # TODO: add progress to let user know we are doing something
--> 787 run_start_result = run_start_handle.wait(timeout=30)
    788 if run_start_result is None:

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/lib/mailbox.py:271, in MailboxHandle.wait(self, timeout, on_probe, on_progress, release, cancel)
    270     if self._interface._transport_keepalive_failed():
--> 271         raise MailboxError("transport failed")
    273 found, abandoned = self._slot._get_and_clear(timeout=wait_timeout)

MailboxError: transport failed

The above exception was the direct cause of the following exception:

Exception                                 Traceback (most recent call last)
Input In [8], in <cell line: 20>()
     14 model.CLASSES = datasets[0].CLASSES
     16 # Create work_dir
     17 # mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
---> 20 train_segmentor(model, datasets, cfg, distributed=False, validate=True, 
     21                 meta=dict())

File /mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/mmseg/apis/train.py:194, in train_segmentor(model, dataset, cfg, distributed, validate, timestamp, meta)
    192 elif cfg.load_from:
    193     runner.load_checkpoint(cfg.load_from)
--> 194 runner.run(data_loaders, cfg.workflow)

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/iter_based_runner.py:126, in IterBasedRunner.run(self, data_loaders, workflow, max_iters, **kwargs)
    122 self.logger.info('Hooks will be executed in the following order:\n%s',
    123                  self.get_hook_info())
    124 self.logger.info('workflow: %s, max: %d iters', workflow,
    125                  self._max_iters)
--> 126 self.call_hook('before_run')
    128 iter_loaders = [IterLoader(x) for x in data_loaders]
    130 self.call_hook('before_epoch')

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/base_runner.py:317, in BaseRunner.call_hook(self, fn_name)
    310 """Call all hooks.
    311 
    312 Args:
    313     fn_name (str): The function name in each hook to be called, such as
    314         "before_train_epoch".
    315 """
    316 for hook in self._hooks:
--> 317     getattr(hook, fn_name)(self)

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/dist_utils.py:135, in master_only.<locals>.wrapper(*args, **kwargs)
    133 rank, _ = get_dist_info()
    134 if rank == 0:
--> 135     return func(*args, **kwargs)

File /mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/mmseg/core/hook/wandblogger_hook.py:106, in MMSegWandbHook.before_run(self, runner)
    104 @master_only
    105 def before_run(self, runner):
--> 106     super(MMSegWandbHook, self).before_run(runner)
    108     # Check if EvalHook and CheckpointHook are available.
    109     for hook in runner.hooks:

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/dist_utils.py:135, in master_only.<locals>.wrapper(*args, **kwargs)
    133 rank, _ = get_dist_info()
    134 if rank == 0:
--> 135     return func(*args, **kwargs)

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/hooks/logger/wandb.py:83, in WandbLoggerHook.before_run(self, runner)
     81     self.import_wandb()
     82 if self.init_kwargs:
---> 83     self.wandb.init(**self.init_kwargs)  # type: ignore
     84 else:
     85     self.wandb.init()

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/wandb_init.py:1170, in init(job_type, dir, config, project, entity, reinit, tags, group, name, notes, magic, config_exclude_keys, config_include_keys, anonymous, mode, allow_val_change, resume, force, tensorboard, sync_tensorboard, monitor_gym, save_code, id, settings)
   1168         if except_exit:
   1169             os._exit(1)
-> 1170         raise Exception("problem") from error_seen
   1171 return run

Exception: problem

raphael-sanandres · February 16, 2023, 12:41am

Hello Vardhan!

To get an idea of what the issue may be, could you share the debug.log and debug-internal.log files for this specific run? They should be in the wandb folder in your working directory. That folder contains subfolders named run-DATETIME-ID, each of which is associated with an individual run.
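If it helps, here is a minimal sketch for locating those files. It assumes the layout described above (a wandb folder in the current working directory containing run-DATETIME-ID subfolders) and searches each run folder recursively, since the exact nesting of the log files may vary. The function name `find_latest_run_logs` is hypothetical, not part of wandb.

```python
from pathlib import Path
from typing import List


def find_latest_run_logs(wandb_dir: Path) -> List[Path]:
    """Return the debug*.log files from the most recently modified run-* folder.

    Assumption: wandb_dir contains subfolders named run-DATETIME-ID,
    as described in the reply above. Returns an empty list if none exist.
    """
    if not wandb_dir.is_dir():
        return []
    run_dirs = [p for p in wandb_dir.glob("run-*") if p.is_dir()]
    if not run_dirs:
        return []
    # Pick the run folder that was modified last.
    latest = max(run_dirs, key=lambda p: p.stat().st_mtime)
    # Search recursively so the files are found even if they sit in a subfolder.
    return sorted(latest.rglob("debug*.log"))


if __name__ == "__main__":
    for log_path in find_latest_run_logs(Path.cwd() / "wandb"):
        print(log_path)
```

Run it from the same directory the training notebook or script was started in; the printed paths are the files to attach.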

raphael-sanandres · February 22, 2023, 5:13pm

Hi there, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

raphael-sanandres · February 27, 2023, 11:01pm

Hi Vardhan, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!

system · April 14, 2023, 10:37pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
