Traceback (most recent call last):
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1133, in init
    run = wi.init()
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 787, in init
    run_start_result = run_start_handle.wait(timeout=30)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/lib/mailbox.py", line 271, in wait
    raise MailboxError("transport failed")
wandb.errors.MailboxError: transport failed
wandb: ERROR Abnormal program exit
2023-02-13 22:32:43,972 - mmseg - INFO - Loaded 20000 images
/mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/mmseg/models/losses/cross_entropy_loss.py:235: UserWarning: Default ``avg_non_ignore`` is False, if you would like to ignore the certain label and average loss over non-ignore labels, which is the same with PyTorch official cross_entropy, set ``avg_non_ignore=True``.
warnings.warn(
2023-02-13 22:32:52,439 - mmseg - INFO - Loaded 2500 images
2023-02-13 22:32:52,458 - mmseg - INFO - Start running, host: azureuser@vardhan-cvml, work_dir: /mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/work_dirs/logs/deeplabv3plus
2023-02-13 22:32:52,459 - mmseg - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) PolyLrUpdaterHook
(NORMAL ) CheckpointHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) TensorboardLoggerHook
(VERY_LOW ) MMSegWandbHook
--------------------
before_train_epoch:
(VERY_HIGH ) PolyLrUpdaterHook
(LOW ) IterTimerHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) TensorboardLoggerHook
(VERY_LOW ) MMSegWandbHook
--------------------
before_train_iter:
(VERY_HIGH ) PolyLrUpdaterHook
(LOW ) IterTimerHook
(LOW ) EvalHook
--------------------
after_train_iter:
(ABOVE_NORMAL) OptimizerHook
(NORMAL ) CheckpointHook
(LOW ) IterTimerHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) TensorboardLoggerHook
(VERY_LOW ) MMSegWandbHook
--------------------
after_train_epoch:
(NORMAL ) CheckpointHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) TensorboardLoggerHook
(VERY_LOW ) MMSegWandbHook
--------------------
before_val_epoch:
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) TensorboardLoggerHook
(VERY_LOW ) MMSegWandbHook
--------------------
before_val_iter:
(LOW ) IterTimerHook
--------------------
after_val_iter:
(LOW ) IterTimerHook
--------------------
after_val_epoch:
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) TensorboardLoggerHook
(VERY_LOW ) MMSegWandbHook
--------------------
after_run:
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) TensorboardLoggerHook
(VERY_LOW ) MMSegWandbHook
--------------------
2023-02-13 22:32:52,460 - mmseg - INFO - workflow: [('train', 1)], max: 50000 iters
2023-02-13 22:32:52,460 - mmseg - INFO - Checkpoints will be saved to /mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/work_dirs/logs/deeplabv3plus by HardDiskBackend.
2023-02-13 22:32:52.816987: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-13 22:32:59.646354: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/cv2/../../lib64:
2023-02-13 22:32:59.646501: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/cv2/../../lib64:
2023-02-13 22:32:59.646517: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
wandb: Currently logged in as: don_v. Use `wandb login --relogin` to force relogin
Thread HandlerThread:
Traceback (most recent call last):
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 49, in run
    self._run()
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 100, in _run
    self._process(record)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/internal.py", line 280, in _process
    self._hm.handle(record)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 136, in handle
    handler(record)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 146, in handle_request
    handler(record)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 695, in handle_request_run_start
    self._system_monitor.probe(publish=True)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/system/system_monitor.py", line 186, in probe
    self.system_info.publish(system_info)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/system/system_info.py", line 252, in publish
    self._save_patches()
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/internal/system/system_info.py", line 134, in _save_patches
    if self.git.dirty:
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/lib/git.py", line 76, in dirty
    return self.repo.is_dirty()
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/git/repo/base.py", line 795, in is_dirty
    if osp.isfile(self.index.path) and len(self.git.diff("--cached", *default_args)):
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/git/cmd.py", line 696, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/git/cmd.py", line 1270, in _call_process
    return self.execute(call, **exec_kwargs)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/git/cmd.py", line 1064, in execute
    raise GitCommandError(redacted_command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(129)
cmdline: git diff --cached --abbrev=40 --full-index --raw
stderr: 'error: unknown option `cached'
usage: git diff --no-index [<options>] <path> <path>
[... remaining `git diff --no-index` usage/options text omitted for brevity ...]
'
wandb: ERROR Internal wandb error: file data was not synced
Problem at: /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/hooks/logger/wandb.py 83 before_run
---------------------------------------------------------------------------
MailboxError Traceback (most recent call last)
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/wandb_init.py:1133, in init(job_type, dir, config, project, entity, reinit, tags, group, name, notes, magic, config_exclude_keys, config_include_keys, anonymous, mode, allow_val_change, resume, force, tensorboard, sync_tensorboard, monitor_gym, save_code, id, settings)
1132 try:
-> 1133 run = wi.init()
1134 except_exit = wi.settings._except_exit
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/wandb_init.py:787, in _WandbInit.init(self)
786 # TODO: add progress to let user know we are doing something
--> 787 run_start_result = run_start_handle.wait(timeout=30)
788 if run_start_result is None:
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/lib/mailbox.py:271, in MailboxHandle.wait(self, timeout, on_probe, on_progress, release, cancel)
270 if self._interface._transport_keepalive_failed():
--> 271 raise MailboxError("transport failed")
273 found, abandoned = self._slot._get_and_clear(timeout=wait_timeout)
MailboxError: transport failed
The above exception was the direct cause of the following exception:
Exception Traceback (most recent call last)
Input In [8], in <cell line: 20>()
14 model.CLASSES = datasets[0].CLASSES
16 # Create work_dir
17 # mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
---> 20 train_segmentor(model, datasets, cfg, distributed=False, validate=True,
21 meta=dict())
File /mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/mmseg/apis/train.py:194, in train_segmentor(model, dataset, cfg, distributed, validate, timestamp, meta)
192 elif cfg.load_from:
193 runner.load_checkpoint(cfg.load_from)
--> 194 runner.run(data_loaders, cfg.workflow)
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/iter_based_runner.py:126, in IterBasedRunner.run(self, data_loaders, workflow, max_iters, **kwargs)
122 self.logger.info('Hooks will be executed in the following order:\n%s',
123 self.get_hook_info())
124 self.logger.info('workflow: %s, max: %d iters', workflow,
125 self._max_iters)
--> 126 self.call_hook('before_run')
128 iter_loaders = [IterLoader(x) for x in data_loaders]
130 self.call_hook('before_epoch')
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/base_runner.py:317, in BaseRunner.call_hook(self, fn_name)
310 """Call all hooks.
311
312 Args:
313 fn_name (str): The function name in each hook to be called, such as
314 "before_train_epoch".
315 """
316 for hook in self._hooks:
--> 317 getattr(hook, fn_name)(self)
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/dist_utils.py:135, in master_only.<locals>.wrapper(*args, **kwargs)
133 rank, _ = get_dist_info()
134 if rank == 0:
--> 135 return func(*args, **kwargs)
File /mnt/batch/tasks/shared/LS_root/mounts/clusters/vardhan-cvml/code/Users/Vardhan.Dongre/mmsegmentation/mmseg/core/hook/wandblogger_hook.py:106, in MMSegWandbHook.before_run(self, runner)
104 @master_only
105 def before_run(self, runner):
--> 106 super(MMSegWandbHook, self).before_run(runner)
108 # Check if EvalHook and CheckpointHook are available.
109 for hook in runner.hooks:
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/dist_utils.py:135, in master_only.<locals>.wrapper(*args, **kwargs)
133 rank, _ = get_dist_info()
134 if rank == 0:
--> 135 return func(*args, **kwargs)
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mmcv/runner/hooks/logger/wandb.py:83, in WandbLoggerHook.before_run(self, runner)
81 self.import_wandb()
82 if self.init_kwargs:
---> 83 self.wandb.init(**self.init_kwargs) # type: ignore
84 else:
85 self.wandb.init()
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/wandb/sdk/wandb_init.py:1170, in init(job_type, dir, config, project, entity, reinit, tags, group, name, notes, magic, config_exclude_keys, config_include_keys, anonymous, mode, allow_val_change, resume, force, tensorboard, sync_tensorboard, monitor_gym, save_code, id, settings)
1168 if except_exit:
1169 os._exit(1)
-> 1170 raise Exception("problem") from error_seen
1171 return run
Exception: problem
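For context, the root cause seems to be the GitPython call visible in the handler traceback: GitPython runs `git diff --cached ...`, but git answers with the `git diff --no-index` usage text. Git falls back to `--no-index` mode when it does not treat the current directory as a repository, which seems plausible on this Azure ML mount (ownership checks such as `safe.directory` can block repo detection, for example). Below is a minimal diagnostic sketch to check this outside of wandb; the `git` helper name is mine, not part of any library, and it assumes it is run from the notebook's working directory:

```python
# Minimal sketch: check whether the git CLI recognises this directory as a
# repository. When it does not, `git diff` silently falls back to --no-index
# mode, which rejects --cached exactly as in the stderr above.
import subprocess

def git(*args: str) -> subprocess.CompletedProcess:
    return subprocess.run(["git", *args], capture_output=True, text=True)

inside = git("rev-parse", "--is-inside-work-tree")
print(inside.stdout.strip() or inside.stderr.strip())  # "true" in a healthy checkout

staged = git("diff", "--cached", "--quiet")
print(staged.stderr.strip() or "git diff --cached ran cleanly")
```

If git indeed fails to detect the repository, one possible workaround (an assumption on my part, not verified against this exact wandb version) is to skip wandb's git probing entirely, e.g. by exporting the `WANDB_DISABLE_GIT` environment variable before `wandb.init` is called.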
Hello Vardhan!
In order to get an idea of what the issue may be, could you provide me with the `debug.log` and `debug-internal.log` for this specific run? They should be located in the `wandb` folder in your working directory. That folder contains subfolders named `run-DATETIME-ID`, each of which is associated with an individual run.
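You can usually locate the newest run's logs with something like the sketch below (assuming the default `./wandb` layout described above; the exact subfolder may differ between SDK versions):

```python
# Sketch: print debug.log / debug-internal.log for the most recent run,
# assuming the default ./wandb/run-DATETIME-ID layout described above.
from pathlib import Path

runs = sorted(Path("wandb").glob("run-*"))  # name order == datetime order
if runs:
    for log in sorted(runs[-1].rglob("debug*.log")):
        print(log)
```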
Hi there, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.
Hi Vardhan, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!