PyTorch Tensorboard Sync in distributed training experiments

Hi there,

I am trying to log my PyTorch training with W&B in an environment that uses the TensorBoardX integration.

The training runs on the Pointcept codebase, which already has a TensorBoard integration. To get W&B to log the training, I followed the Quickstart guide and put wandb.init() at the beginning of the training script (code below).

My Issue:

If I run the training on a single GPU, W&B syncs the TensorBoard logs to the W&B dashboard without problems. If I train on more than one GPU, the run is created on the W&B dashboard, but its charts are empty.

  • In the System tab, some system information is detected, e.g. the GPU utilization of all GPUs.
  • In the Logs tab, no logs are recognized (this usually works with one GPU).
  • If I try to spin up the TensorBoard instance, I get: “No dashboards are active for the current data set.”
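
The TensorBoard message above generally means that no event data was found in the log directory it was pointed at. A quick way to check whether the multi-GPU run wrote any event files at all is sketched below. The function name find_event_files is hypothetical, and the example path is an assumption that has to be replaced with the save_path of the run in question:

```python
from pathlib import Path


def find_event_files(logdir):
    """List TensorBoard event files under `logdir` with their sizes in bytes."""
    root = Path(logdir)
    if not root.is_dir():
        # Nothing to search: the directory does not exist.
        return []
    return sorted(
        (str(path), path.stat().st_size)
        for path in root.rglob("events.out.tfevents.*")
        if path.is_file()
    )


if __name__ == "__main__":
    # Hypothetical path: replace with the save_path of the run being checked.
    for name, size in find_event_files("exp/rohbau3d/multi-r3-s3"):
        print(f"{size:>12d}  {name}")
```

If the list is empty or the files stay very small during training, the writer itself produced no data, which points at the training side rather than at the W&B sync.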

My Code:

adapted from the Pointcept/tools/train.py script:

"""
Main Training Script

Author: Xiaoyang Wu (xiaoyang.wu.cs@gmail.com)
Please cite our work if the code is helpful to you.
"""

from pointcept.engines.defaults import (
    default_argument_parser,
    default_config_parser,
    default_setup,
)
from pointcept.engines.train import TRAINERS
from pointcept.engines.launch import launch

import wandb


def main_worker(cfg):
    cfg = default_setup(cfg)
    trainer = TRAINERS.build(dict(type=cfg.train.type, cfg=cfg))
    trainer.train()


def main():
    args = default_argument_parser().parse_args()
    cfg = default_config_parser(args.config_file, args.options)
    
    wandb_cfg = cfg.pop("wandb", None)
    if wandb_cfg: 
        if wandb_cfg.track: 
            import wandb
            settings = wandb.Settings(disable_git=True)
            wandb.tensorboard.patch(root_logdir=cfg.save_path, save=True, tensorboard_x=True)

            wandb.init(
                project=wandb_cfg.project,
                notes=wandb_cfg.notes,
                tags=wandb_cfg.tags,
                config=cfg,
                sync_tensorboard=True,
                settings=settings
            )

    launch(
        main_worker,
        num_gpus_per_machine=args.num_gpus,
        num_machines=args.num_machines,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        cfg=(cfg,),
    )

    wandb.finish()

if __name__ == "__main__":
    main()
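
One hypothesis that would fit these symptoms, stated here as an assumption rather than a confirmed diagnosis: with more than one GPU, launch() starts separate worker processes, and the TensorBoard patch applied in the parent process by wandb.tensorboard.patch() may not be active in the worker that actually creates the TensorBoardX writer. A minimal sketch of the alternative is to run the same W&B setup inside the worker, on rank 0 only. The helper names setup_wandb and wandb_init_kwargs are hypothetical, and wandb_cfg is assumed to be dict-like:

```python
def wandb_init_kwargs(wandb_cfg, run_config):
    """Build the keyword arguments for wandb.init() from the `wandb` config entry."""
    return dict(
        project=wandb_cfg.get("project"),
        notes=wandb_cfg.get("notes"),
        tags=wandb_cfg.get("tags"),
        config=run_config,
        sync_tensorboard=True,
    )


def setup_wandb(rank, wandb_cfg, run_config, save_path):
    """Start W&B in the calling process, but only on rank 0 with tracking enabled.

    Returns the W&B run, or None when this process should not log.
    """
    if rank != 0 or not wandb_cfg or not wandb_cfg.get("track"):
        return None

    # Imported lazily so that the other ranks never touch W&B.
    import wandb

    # Same calls as in the script above, but executed in the worker process,
    # so the patch is in place before the TensorBoardX writer is created here.
    wandb.tensorboard.patch(root_logdir=save_path, save=True, tensorboard_x=True)
    return wandb.init(
        settings=wandb.Settings(disable_git=True),
        **wandb_init_kwargs(wandb_cfg, run_config),
    )
```

Inside main_worker, this would have to be called before the trainer (and its writer) is built, for example with the rank taken from torch.distributed.get_rank() when torch.distributed.is_initialized() returns True and 0 otherwise. The wandb settings would then need to be passed to the worker instead of only being consumed in main(). Whether this resolves the empty charts is untested here.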

Hi @rauch - Thanks for reaching out with your question!

Would you mind sharing some additional information on your training environment:

  • Are you running this locally or on a cloud platform (if so, which one)? Which GPUs are you using for the training?
  • What version of wandb SDK have you got installed? What other libraries and frameworks do you also have installed?
  • Are you running it through a Jupyter Notebook?
  • Could you share the debug.log and debug-internal.log for the run training on multiple GPUs? These should be in the ./wandb/run-date_time-runid/logs/ folder

This information will help us investigate what could be preventing the data from being logged properly in your case.
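
In case it helps to locate the two files, here is a small sketch that returns the debug logs of the most recent run. It assumes the default layout mentioned above (a wandb directory in the working directory containing run-* folders whose names start with a timestamp). The function name latest_run_logs is hypothetical:

```python
from pathlib import Path


def latest_run_logs(wandb_dir="wandb"):
    """Return the existing debug log paths of the newest run-* directory."""
    # Run folders are named run-<date>_<time>-<id>, so sorting by name
    # also sorts them by start time.
    run_dirs = sorted(p for p in Path(wandb_dir).glob("run-*") if p.is_dir())
    if not run_dirs:
        return []
    logs_dir = run_dirs[-1] / "logs"
    names = ("debug.log", "debug-internal.log")
    return [logs_dir / name for name in names if (logs_dir / name).is_file()]


if __name__ == "__main__":
    for path in latest_run_logs():
        print(path)
```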

Thanks!
Francesco

Sorry for leaving out these important details.

I run it on a private GPU cluster with up to 16 GPUs. The GPUs are NVIDIA V100 SXM3 with 32 GB HBM2 memory.

The training is executed from the terminal inside a Docker container, not through a Jupyter Notebook. The Python environment is listed below.

As I said,

Driver Version: 470.161.03
CUDA Version: 11.4

python pip list:

Package                   Version
------------------------- -----------------------
absl-py                   1.4.0
addict                    2.4.0
ansi2html                 1.8.0
anyio                     4.1.0
appdirs                   1.4.4
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 2.0.5
astunparse                1.6.3
async-lru                 2.0.4
attrs                     23.1.0
Babel                     2.13.1
backcall                  0.2.0
beautifulsoup4            4.12.2
bleach                    6.1.0
blinker                   1.7.0
boltons                   23.0.0
brotlipy                  0.7.0
cachetools                5.3.1
ccimport                  0.4.2
certifi                   2023.7.22
cffi                      1.15.1
chardet                   4.0.0
charset-normalizer        2.0.4
click                     8.1.7
clip                      1.0
comm                      0.2.0
conda                     23.7.3
conda-build               3.24.0
conda-content-trust       0.1.3
conda-package-handling    2.0.2
conda_package_streaming   0.7.0
ConfigArgParse            1.7
contourpy                 1.2.0
cryptography              39.0.1
cumm-cu117                0.4.11
cycler                    0.12.1
dash                      2.14.2
dash-core-components      2.0.0
dash-html-components      2.0.0
dash-table                5.0.0
debugpy                   1.8.0
decorator                 5.1.1
defusedxml                0.7.1
dnspython                 2.3.0
docker-pycreds            0.4.0
einops                    0.6.1
exceptiongroup            1.1.1
executing                 0.8.3
expecttest                0.1.4
fastjsonschema            2.19.0
filelock                  3.9.0
fire                      0.5.0
flash-attn                2.5.3
Flask                     3.0.0
fonttools                 4.45.1
fqdn                      1.5.1
fsspec                    2023.6.0
ftfy                      6.1.1
gitdb                     4.0.11
GitPython                 3.1.40
glob2                     0.7
gmpy2                     2.1.2
google-auth               2.22.0
google-auth-oauthlib      1.0.0
grpcio                    1.57.0
h5py                      3.9.0
huggingface-hub           0.16.4
hypothesis                6.75.2
idna                      3.4
importlib-metadata        6.8.0
ipykernel                 6.27.1
ipython                   8.12.0
ipywidgets                8.1.1
isoduration               20.11.0
itsdangerous              2.1.2
jedi                      0.18.1
Jinja2                    3.1.2
joblib                    1.3.2
json5                     0.9.14
jsonpatch                 1.32
jsonpointer               2.1
jsonschema                4.20.0
jsonschema-specifications 2023.11.2
jupyter_client            8.6.0
jupyter_core              5.5.0
jupyter-events            0.9.0
jupyter-lsp               2.2.1
jupyter_server            2.11.1
jupyter_server_terminals  0.4.4
jupyterlab                4.0.9
jupyterlab_pygments       0.3.0
jupyterlab_server         2.25.2
jupyterlab-widgets        3.0.9
kiwisolver                1.4.5
lark                      1.1.7
libarchive-c              2.9
lightning-utilities       0.10.0
Markdown                  3.4.4
MarkupSafe                2.1.1
matplotlib                3.8.2
matplotlib-inline         0.1.6
mistune                   3.0.2
mkl-fft                   1.3.6
mkl-random                1.2.2
mkl-service               2.4.0
mpmath                    1.3.0
nbclient                  0.9.0
nbconvert                 7.11.0
nbformat                  5.7.0
nest-asyncio              1.5.8
networkx                  2.8.4
ninja                     1.11.1
notebook_shim             0.2.3
numpy                     1.24.3
oauthlib                  3.2.2
open3d                    0.17.0
overrides                 7.4.0
packaging                 23.0
pandas                    2.1.3
pandocfilters             1.5.0
parso                     0.8.3
pccm                      0.4.8
pexpect                   4.8.0
pickleshare               0.7.5
Pillow                    9.4.0
pip                       24.0
pkginfo                   1.9.6
platformdirs              3.10.0
plotly                    5.18.0
pluggy                    1.0.0
plyfile                   1.0.1
pointgroup-ops            0.0.0
pointops                  1.0
portalocker               2.7.0
prometheus-client         0.19.0
prompt-toolkit            3.0.36
protobuf                  4.24.1
psutil                    5.9.0
ptyprocess                0.7.0
pure-eval                 0.2.2
pyasn1                    0.5.0
pyasn1-modules            0.3.0
pybind11                  2.11.1
pycosat                   0.6.4
pycparser                 2.21
Pygments                  2.15.1
pyOpenSSL                 23.0.0
pyparsing                 3.1.1
pyquaternion              0.9.9
PySocks                   1.7.1
python-dateutil           2.8.2
python-etcd               0.4.5
python-json-logger        2.0.7
pytz                      2022.7
PyYAML                    6.0
pyzmq                     25.1.1
referencing               0.31.1
regex                     2023.8.8
requests                  2.31.0
requests-oauthlib         1.3.1
retrying                  1.3.4
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rpds-py                   0.13.2
rsa                       4.9
ruamel.yaml               0.17.21
ruamel.yaml.clib          0.2.6
safetensors               0.3.3
scikit-learn              1.3.0
scipy                     1.11.1
seaborn                   0.13.0
Send2Trash                1.8.2
sentry-sdk                1.38.0
setproctitle              1.3.3
setuptools                65.6.3
SharedArray               3.2.3
six                       1.16.0
smmap                     5.0.1
sniffio                   1.3.0
sortedcontainers          2.4.0
soupsieve                 2.4
spconv-cu117              2.3.6
stack-data                0.2.0
Swin3D                    0.0.0
sympy                     1.11.1
tenacity                  8.2.3
tensorboard               2.14.0
tensorboard-data-server   0.7.1
tensorboardX              2.6.2.2
termcolor                 2.3.0
terminado                 0.18.0
threadpoolctl             3.2.0
timm                      0.9.5
tinycss2                  1.2.1
tomli                     2.0.1
toolz                     0.12.0
torch                     2.0.1
torch-cluster             1.6.1
torch-geometric           2.3.1
torch-scatter             2.1.1
torch-sparse              0.6.17
torchaudio                2.0.2
torchdata                 0.6.1
torchelastic              0.2.2
torchinfo                 1.8.0
torchmetrics              1.2.1
torchtext                 0.15.2
torchvision               0.15.2
tornado                   6.4
tqdm                      4.65.0
traitlets                 5.7.1
triton                    2.0.0
triton-nightly            2.1.0.dev20230822000928
types-dataclasses         0.6.6
types-python-dateutil     2.8.19.14
typing_extensions         4.5.0
tzdata                    2023.3
uri-template              1.3.0
urllib3                   1.26.15
wandb                     0.16.0
wcwidth                   0.2.5
webcolors                 1.13
webencodings              0.5.1
websocket-client          1.6.4
Werkzeug                  3.0.1
wheel                     0.38.4
widgetsnbextension        4.0.9
yapf                      0.40.1
zipp                      3.16.2
zstandard                 0.19.0

W&B Console Output

wandb: Currently logged in as: r**h. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.16.3 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.16.0
wandb: Run data is saved locally in /workspace/Pointcept/wandb/run-20240222_151348-v7bgm46o
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run luminous-orchid-11
wandb: ⭐️ View project at https://wandb.ai/r**h/RB3D%20multi
wandb: 🚀 View run at https://wandb.ai/r**h/RB3D%20multi/runs/v7bgm46o

debug.log

2024-02-22 15:13:48,582 INFO    MainThread:39 [wandb_setup.py:_flush():76] Current SDK version is 0.16.0
2024-02-22 15:13:48,582 INFO    MainThread:39 [wandb_setup.py:_flush():76] Configure stats pid to 39
2024-02-22 15:13:48,582 INFO    MainThread:39 [wandb_setup.py:_flush():76] Loading settings from /home//a21blura/.config/wandb/settings
2024-02-22 15:13:48,582 INFO    MainThread:39 [wandb_setup.py:_flush():76] Loading settings from /workspace/Pointcept/wandb/settings
2024-02-22 15:13:48,582 INFO    MainThread:39 [wandb_setup.py:_flush():76] Loading settings from environment variables: {}
2024-02-22 15:13:48,582 INFO    MainThread:39 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2024-02-22 15:13:48,582 INFO    MainThread:39 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'exp/rohbau3d/multi-r3-s3/code/tools/train.py', 'program_abspath': '/workspace/Pointcept/exp/rohbau3d/multi-r3-s3/code/tools/train.py', 'program': '/workspace/Pointcept/exp/rohbau3d/multi-r3-s3/code/tools/train.py'}
2024-02-22 15:13:48,582 INFO    MainThread:39 [wandb_init.py:_log_setup():524] Logging user logs to /workspace/Pointcept/wandb/run-20240222_151348-v7bgm46o/logs/debug.log
2024-02-22 15:13:48,583 INFO    MainThread:39 [wandb_init.py:_log_setup():525] Logging internal logs to /workspace/Pointcept/wandb/run-20240222_151348-v7bgm46o/logs/debug-internal.log
2024-02-22 15:13:48,583 INFO    MainThread:39 [wandb_init.py:init():564] calling init triggers
2024-02-22 15:13:48,583 INFO    MainThread:39 [wandb_init.py:init():571] wandb.init called with sweep_config: {}
config: {'_cfg_dict': {'weight': None, 'resume': False, 'evaluate': True, 'test_only': False, 'seed': 45251091, 'save_path': 'exp/rohbau3d/multi-r3-s3', 'num_worker': 24, 'batch_size': 16, 'batch_size_val': None, 'batch_size_test': None, 'epoch': 20, 'eval_epoch': 10, 'sync_bn': False, 'enable_amp': True, 'empty_cache': False, 'find_unused_parameters': True, 'mix_prob': 0.8, 'param_dicts': None, 'hooks': [{'type': 'CheckpointLoader'}, {'type': 'IterationTimer', 'warmup_iter': 2}, {'type': 'InformationWriter'}, {'type': 'SemSegEvaluator'}, {'type': 'CheckpointSaver', 'save_freq': None}, {'type': 'PreciseEvaluator', 'test_last': False}], 'train': {'type': 'MultiDatasetTrainer'}, 'test': {'type': 'SemSegTester', 'verbose': True}, 'model': {'type': 'PPT-v1m1', 'backbone': {'type': 'SpUNet-v1m3', 'in_channels': 6, 'num_classes': 0, 'base_channels': 32, 'context_channels': 256, 'channels': (32, 64, 128, 256, 256, 128, 96, 96), 'layers': (2, 3, 4, 6, 2, 2, 2, 2), 'cls_mode': False, 'conditions': ('Rohbau3D', 'S3DIS'), 'zero_init': False, 'norm_decouple': True, 'norm_adaptive': True, 'norm_affine': True}, 'criteria': [{'type': 'CrossEntropyLoss', 'loss_weight': 1.0, 'ignore_index': -1}], 'backbone_out_channels': 96, 'context_channels': 256, 'conditions': ('S3DIS', 'Rohbau3D'), 'template': '[x]', 'clip_model': 'ViT-B/16', 'class_name': ('ceiling', 'floor', 'wall', 'beam', 'column', 'window', 'door', 'table', 'chair', 'sofa', 'bookcase', 'board', 'clutter', 'stairs', 'equipment', 'installation'), 'valid_index': ((0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), (0, 1, 2, 3, 4, 5, 6, 12, 13, 14, 15)), 'backbone_mode': False}, 'optimizer': {'type': 'SGD', 'lr': 0.05, 'momentum': 0.9, 'weight_decay': 0.0001, 'nesterov': True}, 'scheduler': {'type': 'OneCycleLR', 'max_lr': 0.05, 'pct_start': 0.05, 'anneal_strategy': 'cos', 'div_factor': 10.0, 'final_div_factor': 10000.0}, 'data': {'num_classes': 11, 'ignore_index': -1, 'names': ['clutter', 'ceiling', 'floor', 'wall', 'beam', 'column', 
'window', 'door', 'stairs', 'equipment', 'installation'], 'train': {'type': 'ConcatDataset', 'datasets': [{'type': 'S3DISDataset', 'split': ('Area_1', 'Area_2', 'Area_3', 'Area_4', 'Area_5', 'Area_6'), 'data_root': '../data/s3dis', 'transform': [{'type': 'CenterShift', 'apply_z': True}, {'type': 'RandomScale', 'scale': [0.9, 1.1]}, {'type': 'RandomFlip', 'p': 0.5}, {'type': 'RandomJitter', 'sigma': 0.005, 'clip': 0.02}, {'type': 'ChromaticAutoContrast', 'p': 0.2, 'blend_factor': None}, {'type': 'ChromaticTranslation', 'p': 0.95, 'ratio': 0.05}, {'type': 'ChromaticJitter', 'p': 0.95, 'std': 0.05}, {'type': 'GridSample', 'grid_size': 0.04, 'hash_type': 'fnv', 'mode': 'train', 'keys': ('coord', 'color', 'segment'), 'return_grid_coord': True}, {'type': 'SphereCrop', 'point_max': 80000, 'mode': 'random'}, {'type': 'CenterShift', 'apply_z': False}, {'type': 'NormalizeColor'}, {'type': 'ShufflePoint'}, {'type': 'Add', 'keys_dict': {'condition': 'S3DIS'}}, {'type': 'ToTensor'}, {'type': 'Collect', 'keys': ('coord', 'grid_coord', 'segment', 'condition'), 'feat_keys': ('coord', 'color')}], 'test_mode': False, 'loop': 1}, {'type': 'Rohbau3DDataset', 'split': 'train', 'data_root': '../data/rohbau3d', 'transform': [{'type': 'CenterShift', 'apply_z': True}, {'type': 'RandomScale', 'scale': [0.9, 1.1]}, {'type': 'RandomFlip', 'p': 0.5}, {'type': 'RandomJitter', 'sigma': 0.005, 'clip': 0.02}, {'type': 'ChromaticAutoContrast', 'p': 0.2, 'blend_factor': None}, {'type': 'ChromaticTranslation', 'p': 0.95, 'ratio': 0.05}, {'type': 'ChromaticJitter', 'p': 0.95, 'std': 0.05}, {'type': 'GridSample', 'grid_size': 0.04, 'hash_type': 'fnv', 'mode': 'train', 'keys': ('coord', 'color', 'segment'), 'return_grid_coord': True}, {'type': 'SphereCrop', 'point_max': 80000, 'mode': 'random'}, {'type': 'CenterShift', 'apply_z': False}, {'type': 'NormalizeColor'}, {'type': 'ShufflePoint'}, {'type': 'Add', 'keys_dict': {'condition': 'Rohbau3D'}}, {'type': 'ToTensor'}, {'type': 'Collect', 'keys': 
('coord', 'grid_coord', 'segment', 'condition'), 'feat_keys': ('coord', 'color')}], 'test_mode': False, 'loop': 1}], 'loop': 2}, 'val': {'type': 'Rohbau3DDataset', 'split': 'val', 'data_root': '../data/rohbau3d', 'transform': [{'type': 'CenterShift', 'apply_z': True}, {'type': 'GridSample', 'grid_size': 0.0333, 'hash_type': 'fnv', 'mode': 'train', 'keys': ('coord', 'color', 'segment'), 'return_grid_coord': True}, {'type': 'CenterShift', 'apply_z': False}, {'type': 'NormalizeColor'}, {'type': 'ToTensor'}, {'type': 'Add', 'keys_dict': {'condition': 'Rohbau3D'}}, {'type': 'Collect', 'keys': ('coord', 'grid_coord', 'segment', 'condition'), 'feat_keys': ('coord', 'color')}], 'test_mode': False}, 'test': {'type': 'Rohbau3DDataset', 'split': 'test', 'data_root': '../data/rohbau3d', 'transform': [{'type': 'CenterShift', 'apply_z': True}, {'type': 'NormalizeColor'}], 'test_mode': True, 'test_cfg': {'voxelize': {'type': 'GridSample', 'grid_size': 0.0333, 'hash_type': 'fnv', 'mode': 'test', 'keys': ('coord', 'color', 'segment'), 'return_grid_coord': True}, 'crop': None, 'post_transform': [{'type': 'CenterShift', 'apply_z': False}, {'type': 'Add', 'keys_dict': {'condition': 'Rohbau3D'}}, {'type': 'ToTensor'}, {'type': 'Collect', 'keys': ('coord', 'grid_coord', 'index', 'condition'), 'feat_keys': ('coord', 'color')}], 'aug_transform': [[{'type': 'RandomRotateTargetAngle', 'angle': [0], 'axis': 'z', 'center': [0, 0, 0], 'p': 1}], [{'type': 'RandomRotateTargetAngle', 'angle': [0.5], 'axis': 'z', 'center': [0, 0, 0], 'p': 1}], [{'type': 'RandomRotateTargetAngle', 'angle': [1], 'axis': 'z', 'center': [0, 0, 0], 'p': 1}], [{'type': 'RandomRotateTargetAngle', 'angle': [1.5], 'axis': 'z', 'center': [0, 0, 0], 'p': 1}], [{'type': 'RandomRotateTargetAngle', 'angle': [0], 'axis': 'z', 'center': [0, 0, 0], 'p': 1}, {'type': 'RandomScale', 'scale': [0.95, 0.95]}], [{'type': 'RandomRotateTargetAngle', 'angle': [0.5], 'axis': 'z', 'center': [0, 0, 0], 'p': 1}, {'type': 'RandomScale', 
'scale': [0.95, 0.95]}], [{'type': 'RandomRotateTargetAngle', 'angle': [1], 'axis': 'z', 'center': [0, 0, 0], 'p': 1}, {'type': 'RandomScale', 'scale': [0.95, 0.95]}], [{'type': 'RandomRotateTargetAngle', 'angle': [1.5], 'axis': 'z', 'center': [0, 0, 0], 'p': 1}, {'type': 'RandomScale', 'scale': [0.95, 0.95]}], [{'type': 'RandomRotateTargetAngle', 'angle': [0], 'axis': 'z', 'center': [0, 0, 0], 'p': 1}, {'type': 'RandomScale', 'scale': [1.05, 1.05]}], [{'type': 'RandomRotateTargetAngle', 'angle': [0.5], 'axis': 'z', 'center': [0, 0, 0], 'p': 1}, {'type': 'RandomScale', 'scale': [1.05, 1.05]}], [{'type': 'RandomRotateTargetAngle', 'angle': [1], 'axis': 'z', 'center': [0, 0, 0], 'p': 1}, {'type': 'RandomScale', 'scale': [1.05, 1.05]}], [{'type': 'RandomRotateTargetAngle', 'angle': [1.5], 'axis': 'z', 'center': [0, 0, 0], 'p': 1}, {'type': 'RandomScale', 'scale': [1.05, 1.05]}], [{'type': 'RandomFlip', 'p': 1}]]}}}}, '_filename': 'configs/rohbau3d/semseg-multi-r3-s3.py', '_text': '/workspace/Pointcept/configs/_base_/default_runtime.py\nweight = None  # path to model weight\nresume = False  # whether to resume training process\nevaluate = True  # evaluate after each epoch training process\ntest_only = False  # test process\n\nseed = None  # train process will init a random seed and record\nsave_path = "exp/default"\nnum_worker = 16  # total worker in all gpu\nbatch_size = 16  # total batch size in all gpu\nbatch_size_val = None  # auto adapt to bs 1 for each gpu\nbatch_size_test = None  # auto adapt to bs 1 for each gpu\nepoch = 100  # total epoch, data loop = epoch // eval_epoch\neval_epoch = 100  # sche total eval & checkpoint epoch\n\nsync_bn = False\nenable_amp = False\nempty_cache = False\nfind_unused_parameters = False\n\nmix_prob = 0\nparam_dicts = None  # example: param_dicts = [dict(keyword="block", lr_scale=0.1)]\n\n# hook\nhooks = [\n    dict(type="CheckpointLoader"),\n    dict(type="IterationTimer", warmup_iter=2),\n    dict(type="InformationWriter"),\n    
dict(type="SemSegEvaluator"),\n    dict(type="CheckpointSaver", save_freq=None),\n    dict(type="PreciseEvaluator", test_last=False),\n]\n\n# Trainer\ntrain = dict(type="DefaultTrainer")\n\n# Tester\ntest = dict(type="SemSegTester", verbose=True)\n\n/workspace/Pointcept/configs/rohbau3d/semseg-multi-r3-s3.py\n_base_ = ["../_base_/default_runtime.py"]\n\n# wandb \nwandb = dict(\n    track = True,\n    project = "RB3D multi",\n    notes = "RUN XXXXX",\n    tags = [],\n)\n\n# misc custom setting\nbatch_size = 16  # bs: total bs in all gpus\nnum_worker = 24\nmix_prob = 0.8\nempty_cache = False\nenable_amp = True\nfind_unused_parameters = True\n\n# trainer\ntrain = dict(\n    type="MultiDatasetTrainer",\n)\n\n# model settings\nmodel = dict(\n    type="PPT-v1m1",\n    backbone=dict(\n        type="SpUNet-v1m3",\n        in_channels=6,\n        num_classes=0,\n        base_channels=32,\n        context_channels=256,\n        channels=(32, 64, 128, 256, 256, 128, 96, 96),\n        layers=(2, 3, 4, 6, 2, 2, 2, 2),\n        cls_mode=False,\n        conditions=( "Rohbau3D", "S3DIS"),\n        zero_init=False,\n        norm_decouple=True,\n        norm_adaptive=True,\n        norm_affine=True,\n    ),\n    criteria=[dict(type="CrossEntropyLoss", loss_weight=1.0, ignore_index=-1)],\n    backbone_out_channels=96,\n    context_channels=256,\n    conditions=("S3DIS", "Rohbau3D"),\n    template="[x]",\n    clip_model="ViT-B/16",\n    class_name=(\'ceiling\', \'floor\', \'wall\', \'beam\', \'column\', \'window\', \'door\', \'table\', \'chair\', \'sofa\', \'bookcase\', \'board\', \'clutter\', \'stairs\', \'equipment\', \'installation\'),\n    valid_index=(\n            (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),\n            (0, 1, 2, 3, 4, 5, 6, 12, 13, 14, 15)\n    ),\n    backbone_mode=False,\n)\n\n# scheduler settings\nepoch = 20\neval_epoch = 10\noptimizer = dict(type="SGD", lr=0.05, momentum=0.9, weight_decay=0.0001, nesterov=True)\nscheduler = dict(\n    type="OneCycleLR",\n   
 max_lr=optimizer["lr"],\n    pct_start=0.05,\n    anneal_strategy="cos",\n    div_factor=10.0,\n    final_div_factor=10000.0,\n)\n# param_dicts = [dict(keyword="modulation", lr=0.005)]\n\n\n# dataset settings\ndata = dict(\n    num_classes=11,\n    ignore_index=-1,\n    names=[\n        \'clutter\',\n        \'ceiling\',\n        \'floor\',\n        \'wall\',\n        \'beam\',\n        \'column\',\n        \'window\',\n        \'door\',\n        \'stairs\',\n        \'equipment\',\n        \'installation\',\n    ],\n    train=dict(\n        type="ConcatDataset",\n        datasets=[\n            # S3DIS\n            dict(\n                type="S3DISDataset",\n                split=("Area_1", "Area_2", "Area_3", "Area_4", "Area_5", "Area_6"),\n                data_root="../data/s3dis",\n                transform=[\n                    dict(type="CenterShift", apply_z=True),\n                    # dict(type="RandomDropout", dropout_ratio=0.2, dropout_application_ratio=0.2),\n                    # dict(type="RandomRotateTargetAngle", angle=(1/2, 1, 3/2), center=[0, 0, 0], axis="z", p=0.75),\n                    # dict(type="RandomRotate", angle=[-1, 1], axis="z", center=[0, 0, 0], p=0.5),\n                    # dict(type="RandomRotate", angle=[-1 / 64, 1 / 64], axis="x", p=0.5),\n                    # dict(type="RandomRotate", angle=[-1 / 64, 1 / 64], axis="y", p=0.5),\n                    dict(type="RandomScale", scale=[0.9, 1.1]),\n                    # dict(type="RandomShift", shift=[0.2, 0.2, 0.2]),\n                    dict(type="RandomFlip", p=0.5),\n                    dict(type="RandomJitter", sigma=0.005, clip=0.02),\n                    # dict(type="ElasticDistortion", distortion_params=[[0.2, 0.4], [0.8, 1.6]]),\n                    dict(type="ChromaticAutoContrast", p=0.2, blend_factor=None),\n                    dict(type="ChromaticTranslation", p=0.95, ratio=0.05),\n                    dict(type="ChromaticJitter", p=0.95, std=0.05),\n                   
 # dict(type="HueSaturationTranslation", hue_max=0.2, saturation_max=0.2),\n                    # dict(type="RandomColorDrop", p=0.2, color_augment=0.0),\n                    dict(\n                        type="GridSample",\n                        grid_size=0.04,\n                        hash_type="fnv",\n                        mode="train",\n                        keys=("coord", "color", "segment"),\n                        return_grid_coord=True,\n                    ),\n                    dict(type="SphereCrop", point_max=80000, mode="random"),\n                    dict(type="CenterShift", apply_z=False),\n                    dict(type="NormalizeColor"),\n                    dict(type="ShufflePoint"),\n                    dict(type="Add", keys_dict={"condition": "S3DIS"}),\n                    dict(type="ToTensor"),\n                    dict(\n                        type="Collect",\n                        keys=("coord", "grid_coord", "segment", "condition"),\n                        feat_keys=("coord", "color"),\n                    ),\n                ],\n                test_mode=False,\n                loop=1,  # sampling weight\n            ),\n            # Rohbau3D\n            dict(\n                type="Rohbau3DDataset",\n                split="train",\n                data_root="../data/rohbau3d",\n                transform=[\n                    dict(type="CenterShift", apply_z=True),\n                    # dict(type="RandomDropout", dropout_ratio=0.2, dropout_application_ratio=0.2),\n                    # dict(type="RandomRotateTargetAngle", angle=(1/2, 1, 3/2), center=[0, 0, 0], axis="z", p=0.75),\n                    # dict(type="RandomRotate", angle=[-1, 1], axis="z", center=[0, 0, 0], p=0.5),\n                    # dict(type="RandomRotate", angle=[-1 / 64, 1 / 64], axis="x", p=0.5),\n                    # dict(type="RandomRotate", angle=[-1 / 64, 1 / 64], axis="y", p=0.5),\n                    dict(type="RandomScale", scale=[0.9, 1.1]),\n  
                  # dict(type="RandomShift", shift=[0.2, 0.2, 0.2]),\n                    dict(type="RandomFlip", p=0.5),\n                    dict(type="RandomJitter", sigma=0.005, clip=0.02),\n                    # dict(type="ElasticDistortion", distortion_params=[[0.2, 0.4], [0.8, 1.6]]),\n                    dict(type="ChromaticAutoContrast", p=0.2, blend_factor=None),\n                    dict(type="ChromaticTranslation", p=0.95, ratio=0.05),\n                    dict(type="ChromaticJitter", p=0.95, std=0.05),\n                    # dict(type="HueSaturationTranslation", hue_max=0.2, saturation_max=0.2),\n                    # dict(type="RandomColorDrop", p=0.2, color_augment=0.0),\n                    dict(\n                        type="GridSample",\n                        grid_size=0.04,\n                        hash_type="fnv",\n                        mode="train",\n                        keys=("coord", "color", "segment"),\n                        return_grid_coord=True,\n                    ),\n                    dict(type="SphereCrop", point_max=80000, mode="random"),\n                    dict(type="CenterShift", apply_z=False),\n                    dict(type="NormalizeColor"),\n                    dict(type="ShufflePoint"),\n                    dict(type="Add", keys_dict={"condition": "Rohbau3D"}),\n                    dict(type="ToTensor"),\n                    dict(\n                        type="Collect",\n                        keys=("coord", "grid_coord", "segment", "condition"),\n                        feat_keys=("coord", "color"),\n                    ),\n                ],\n                test_mode=False,\n                loop=1,  # sampling weight\n            ),\n        ],\n    ),\n    val=dict(\n        type="Rohbau3DDataset",\n        split="val",\n        data_root="../data/rohbau3d",\n        transform=[\n            dict(type="CenterShift", apply_z=True),\n            dict(\n                type="GridSample",\n                
grid_size=0.0333,\n                hash_type="fnv",\n                mode="train",\n                keys=("coord", "color", "segment"),\n                return_grid_coord=True,\n            ),\n            # dict(type="SphereCrop", point_max=1000000, mode="center"),\n            dict(type="CenterShift", apply_z=False),\n            dict(type="NormalizeColor"),\n            dict(type="ToTensor"),\n            dict(type="Add", keys_dict={"condition": "Rohbau3D"}),\n            dict(\n                type="Collect",\n                keys=("coord", "grid_coord", "segment", "condition"),\n                feat_keys=("coord", "color"),\n            ),\n        ],\n        test_mode=False,\n    ),\n    test=dict(\n        type="Rohbau3DDataset",\n        split="test",\n        data_root="../data/rohbau3d",\n        transform=[\n            dict(type="CenterShift", apply_z=True),\n            dict(type="NormalizeColor"),\n        ],\n        test_mode=True,\n        test_cfg=dict(\n            voxelize=dict(\n                type="GridSample",\n                grid_size=0.0333,\n                hash_type="fnv",\n                mode="test",\n                keys=("coord", "color", "segment"),\n                return_grid_coord=True,\n            ),\n            crop=None,\n            post_transform=[\n                dict(type="CenterShift", apply_z=False),\n                dict(type="Add", keys_dict={"condition": "Rohbau3D"}),\n                dict(type="ToTensor"),\n                dict(\n                    type="Collect",\n                    keys=("coord", "grid_coord", "index", "condition"),\n                    feat_keys=("coord", "color"),\n                ),\n            ],\n            aug_transform=[\n                [\n                    dict(\n                        type="RandomRotateTargetAngle",\n                        angle=[0],\n                        axis="z",\n                        center=[0, 0, 0],\n                        p=1,\n                   
 )\n                ],\n                [\n                    dict(\n                        type="RandomRotateTargetAngle",\n                        angle=[1 / 2],\n                        axis="z",\n                        center=[0, 0, 0],\n                        p=1,\n                    )\n                ],\n                [\n                    dict(\n                        type="RandomRotateTargetAngle",\n                        angle=[1],\n                        axis="z",\n                        center=[0, 0, 0],\n                        p=1,\n                    )\n                ],\n                [\n                    dict(\n                        type="RandomRotateTargetAngle",\n                        angle=[3 / 2],\n                        axis="z",\n                        center=[0, 0, 0],\n                        p=1,\n                    )\n                ],\n                [\n                    dict(\n                        type="RandomRotateTargetAngle",\n                        angle=[0],\n                        axis="z",\n                        center=[0, 0, 0],\n                        p=1,\n                    ),\n                    dict(type="RandomScale", scale=[0.95, 0.95]),\n                ],\n                [\n                    dict(\n                        type="RandomRotateTargetAngle",\n                        angle=[1 / 2],\n                        axis="z",\n                        center=[0, 0, 0],\n                        p=1,\n                    ),\n                    dict(type="RandomScale", scale=[0.95, 0.95]),\n                ],\n                [\n                    dict(\n                        type="RandomRotateTargetAngle",\n                        angle=[1],\n                        axis="z",\n                        center=[0, 0, 0],\n                        p=1,\n                    ),\n                    dict(type="RandomScale", scale=[0.95, 0.95]),\n                ],\n                [\n 
                   dict(\n                        type="RandomRotateTargetAngle",\n                        angle=[3 / 2],\n                        axis="z",\n                        center=[0, 0, 0],\n                        p=1,\n                    ),\n                    dict(type="RandomScale", scale=[0.95, 0.95]),\n                ],\n                [\n                    dict(\n                        type="RandomRotateTargetAngle",\n                        angle=[0],\n                        axis="z",\n                        center=[0, 0, 0],\n                        p=1,\n                    ),\n                    dict(type="RandomScale", scale=[1.05, 1.05]),\n                ],\n                [\n                    dict(\n                        type="RandomRotateTargetAngle",\n                        angle=[1 / 2],\n                        axis="z",\n                        center=[0, 0, 0],\n                        p=1,\n                    ),\n                    dict(type="RandomScale", scale=[1.05, 1.05]),\n                ],\n                [\n                    dict(\n                        type="RandomRotateTargetAngle",\n                        angle=[1],\n                        axis="z",\n                        center=[0, 0, 0],\n                        p=1,\n                    ),\n                    dict(type="RandomScale", scale=[1.05, 1.05]),\n                ],\n                [\n                    dict(\n                        type="RandomRotateTargetAngle",\n                        angle=[3 / 2],\n                        axis="z",\n                        center=[0, 0, 0],\n                        p=1,\n                    ),\n                    dict(type="RandomScale", scale=[1.05, 1.05]),\n                ],\n                [dict(type="RandomFlip", p=1)],\n            ],\n        ),\n    ),\n)\n'}
2024-02-22 15:13:48,587 INFO    MainThread:39 [wandb_init.py:init():614] starting backend
2024-02-22 15:13:48,587 INFO    MainThread:39 [wandb_init.py:init():618] setting up manager
2024-02-22 15:13:48,589 INFO    MainThread:39 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-02-22 15:13:48,590 INFO    MainThread:39 [wandb_init.py:init():624] backend started and connected
2024-02-22 15:13:48,598 INFO    MainThread:39 [wandb_init.py:init():716] updated telemetry
2024-02-22 15:13:48,598 INFO    MainThread:39 [wandb_init.py:init():749] communicating run to backend with 90.0 second timeout
2024-02-22 15:13:49,187 INFO    MainThread:39 [wandb_run.py:_on_init():2254] communicating current version
2024-02-22 15:13:49,267 INFO    MainThread:39 [wandb_run.py:_on_init():2263] got version response upgrade_message: "wandb version 0.16.3 is available!  To upgrade, please run:\n $ pip install wandb --upgrade"

2024-02-22 15:13:49,267 INFO    MainThread:39 [wandb_init.py:init():800] starting run threads in backend
2024-02-22 15:14:07,905 INFO    MainThread:39 [wandb_run.py:_console_start():2233] atexit reg
2024-02-22 15:14:07,905 INFO    MainThread:39 [wandb_run.py:_redirect():2088] redirect: wrap_raw
2024-02-22 15:14:07,906 INFO    MainThread:39 [wandb_run.py:_redirect():2153] Wrapping output streams.
2024-02-22 15:14:07,906 INFO    MainThread:39 [wandb_run.py:_redirect():2178] Redirects installed.
2024-02-22 15:14:07,906 INFO    MainThread:39 [wandb_init.py:init():841] run started, returning control to user process

debug-internal.log

Only the first 100 lines are shown, due to the character limit…

2024-02-22 15:13:48,592 INFO    StreamThr :48 [internal.py:wandb_internal():86] W&B internal server running at pid: 48, started at: 2024-02-22 15:13:48.590647
2024-02-22 15:13:48,595 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status
2024-02-22 15:13:48,599 INFO    WriterThread:48 [datastore.py:open_for_write():85] open: /workspace/Pointcept/wandb/run-20240222_151348-v7bgm46o/run-v7bgm46o.wandb
2024-02-22 15:13:48,633 DEBUG   SenderThread:48 [sender.py:send():380] send: header
2024-02-22 15:13:48,633 DEBUG   SenderThread:48 [sender.py:send():380] send: run
2024-02-22 15:13:49,169 INFO    SenderThread:48 [dir_watcher.py:__init__():211] watching files in: /workspace/Pointcept/wandb/run-20240222_151348-v7bgm46o/files
2024-02-22 15:13:49,169 INFO    SenderThread:48 [sender.py:_start_run_threads():1124] run started: v7bgm46o with start time 1708611228.590524
2024-02-22 15:13:49,187 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: check_version
2024-02-22 15:13:49,188 DEBUG   SenderThread:48 [sender.py:send_request():407] send_request: check_version
2024-02-22 15:13:49,273 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: run_start
2024-02-22 15:13:49,347 DEBUG   HandlerThread:48 [system_info.py:__init__():32] System info init
2024-02-22 15:13:49,347 DEBUG   HandlerThread:48 [system_info.py:__init__():47] System info init done
2024-02-22 15:13:49,347 INFO    HandlerThread:48 [system_monitor.py:start():194] Starting system monitor
2024-02-22 15:13:49,347 INFO    SystemMonitor:48 [system_monitor.py:_start():158] Starting system asset monitoring threads
2024-02-22 15:13:49,347 INFO    HandlerThread:48 [system_monitor.py:probe():214] Collecting system info
2024-02-22 15:13:49,347 INFO    SystemMonitor:48 [interfaces.py:start():190] Started cpu monitoring
2024-02-22 15:13:49,348 INFO    SystemMonitor:48 [interfaces.py:start():190] Started disk monitoring
2024-02-22 15:13:49,349 INFO    SystemMonitor:48 [interfaces.py:start():190] Started gpu monitoring
2024-02-22 15:13:49,349 INFO    SystemMonitor:48 [interfaces.py:start():190] Started memory monitoring
2024-02-22 15:13:49,350 INFO    SystemMonitor:48 [interfaces.py:start():190] Started network monitoring
2024-02-22 15:13:49,438 DEBUG   HandlerThread:48 [system_info.py:probe():196] Probing system
2024-02-22 15:13:49,438 DEBUG   HandlerThread:48 [system_info.py:probe():244] Probing system done
2024-02-22 15:13:49,438 DEBUG   HandlerThread:48 [system_monitor.py:probe():223] {'os': 'Linux-5.4.0-137-generic-x86_64-with-glibc2.31', 'python': '3.10.11', 'heartbeatAt': '2024-02-22T14:13:49.438588', 'startedAt': '2024-02-22T14:13:48.579654', 'docker': None, 'cuda': None, 'args': ('--config-file', 'configs/rohbau3d/semseg-multi-r3-s3.py', '--num-gpus', '2', '--options', 'save_path=exp/rohbau3d/multi-r3-s3'), 'state': 'running', 'program': '/workspace/Pointcept/exp/rohbau3d/multi-r3-s3/code/tools/train.py', 'codePathLocal': 'exp/rohbau3d/multi-r3-s3/code/tools/train.py', 'codePath': 'exp/rohbau3d/multi-r3-s3/code/tools/train.py', 'host': '18b28e29f4d7', 'username': 'a21blura', 'executable': '/opt/conda/bin/python', 'cpu_count': 48, 'cpu_count_logical': 96, 'cpu_freq': {'current': 3.3853020833333325, 'min': 1200.0, 'max': 3700.0}, 'cpu_freq_per_core': [{'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, 
{'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.179, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 2.999, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 
1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.155, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.0, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.4, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}, {'current': 3.399, 'min': 1200.0, 'max': 3700.0}], 'disk': {'/': {'total': 878.4978866577148, 'used': 686.9209938049316}}, 'gpu': 'Tesla V100-SXM3-32GB', 'gpu_count': 2, 'gpu_devices': [{'name': 'Tesla V100-SXM3-32GB', 'memory_total': 34089730048}, {'name': 'Tesla V100-SXM3-32GB', 'memory_total': 34089730048}], 'memory': {'total': 1510.5630111694336}}
2024-02-22 15:13:49,442 INFO    HandlerThread:48 [system_monitor.py:probe():224] Finished collecting system info
2024-02-22 15:13:49,442 INFO    HandlerThread:48 [system_monitor.py:probe():227] Publishing system info
2024-02-22 15:13:49,442 DEBUG   HandlerThread:48 [system_info.py:_save_pip():52] Saving list of pip packages installed into the current environment
2024-02-22 15:13:49,443 DEBUG   HandlerThread:48 [system_info.py:_save_pip():68] Saving pip packages done
2024-02-22 15:13:49,443 DEBUG   HandlerThread:48 [system_info.py:_save_conda():75] Saving list of conda packages installed into the current environment
2024-02-22 15:13:50,171 INFO    Thread-12 :48 [dir_watcher.py:_on_file_created():271] file/dir created: /workspace/Pointcept/wandb/run-20240222_151348-v7bgm46o/files/conda-environment.yaml
2024-02-22 15:13:50,171 INFO    Thread-12 :48 [dir_watcher.py:_on_file_created():271] file/dir created: /workspace/Pointcept/wandb/run-20240222_151348-v7bgm46o/files/requirements.txt
2024-02-22 15:14:07,889 DEBUG   HandlerThread:48 [system_info.py:_save_conda():87] Saving conda packages done
2024-02-22 15:14:07,891 INFO    HandlerThread:48 [system_monitor.py:probe():229] Finished publishing system info
2024-02-22 15:14:07,897 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:14:07,898 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: keepalive
2024-02-22 15:14:07,898 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:14:07,898 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: keepalive
2024-02-22 15:14:07,898 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:14:07,898 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: keepalive
2024-02-22 15:14:07,899 DEBUG   SenderThread:48 [sender.py:send():380] send: files
2024-02-22 15:14:07,899 INFO    SenderThread:48 [sender.py:_save_file():1380] saving file wandb-metadata.json with policy now
2024-02-22 15:14:07,906 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: stop_status
2024-02-22 15:14:07,906 DEBUG   SenderThread:48 [sender.py:send_request():407] send_request: stop_status
2024-02-22 15:14:07,908 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: internal_messages
2024-02-22 15:14:08,175 INFO    Thread-12 :48 [dir_watcher.py:_on_file_modified():288] file/dir modified: /workspace/Pointcept/wandb/run-20240222_151348-v7bgm46o/files/conda-environment.yaml
2024-02-22 15:14:08,175 INFO    Thread-12 :48 [dir_watcher.py:_on_file_created():271] file/dir created: /workspace/Pointcept/wandb/run-20240222_151348-v7bgm46o/files/wandb-metadata.json
2024-02-22 15:14:08,179 DEBUG   SenderThread:48 [sender.py:send():380] send: telemetry
2024-02-22 15:14:08,442 INFO    wandb-upload_0:48 [upload_job.py:push():131] Uploaded file /tmp/tmp6i4ogckzwandb/gf2omtuz-wandb-metadata.json
2024-02-22 15:14:10,180 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:14:15,180 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:14:20,186 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:14:21,179 INFO    Thread-12 :48 [dir_watcher.py:_on_file_modified():288] file/dir modified: /workspace/Pointcept/wandb/run-20240222_151348-v7bgm46o/files/config.yaml
2024-02-22 15:14:22,905 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: stop_status
2024-02-22 15:14:22,906 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: internal_messages
2024-02-22 15:14:22,906 DEBUG   SenderThread:48 [sender.py:send_request():407] send_request: stop_status
2024-02-22 15:14:26,115 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:14:31,116 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:14:36,116 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:14:37,905 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: stop_status
2024-02-22 15:14:37,906 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: internal_messages
2024-02-22 15:14:37,906 DEBUG   SenderThread:48 [sender.py:send_request():407] send_request: stop_status
2024-02-22 15:14:42,108 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:14:47,108 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:14:49,350 DEBUG   SystemMonitor:48 [system_monitor.py:_start():172] Starting system metrics aggregation loop
2024-02-22 15:14:49,352 DEBUG   SenderThread:48 [sender.py:send():380] send: stats
2024-02-22 15:14:52,354 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:14:52,905 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: stop_status
2024-02-22 15:14:52,906 DEBUG   SenderThread:48 [sender.py:send_request():407] send_request: stop_status
2024-02-22 15:14:52,953 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: internal_messages
2024-02-22 15:14:58,103 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:15:03,104 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:15:07,906 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: stop_status
2024-02-22 15:15:07,906 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: internal_messages
2024-02-22 15:15:07,906 DEBUG   SenderThread:48 [sender.py:send_request():407] send_request: stop_status
2024-02-22 15:15:08,138 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:15:13,139 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:15:18,140 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:15:19,354 DEBUG   SenderThread:48 [sender.py:send():380] send: stats
2024-02-22 15:15:22,906 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: stop_status
2024-02-22 15:15:22,906 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: internal_messages
2024-02-22 15:15:22,907 DEBUG   SenderThread:48 [sender.py:send_request():407] send_request: stop_status
2024-02-22 15:15:23,154 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:15:28,154 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:15:33,159 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:15:37,906 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: stop_status
2024-02-22 15:15:37,906 DEBUG   SenderThread:48 [sender.py:send_request():407] send_request: stop_status
2024-02-22 15:15:37,949 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: internal_messages
2024-02-22 15:15:38,161 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:15:43,162 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:15:48,163 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:15:49,358 DEBUG   SenderThread:48 [sender.py:send():380] send: stats
2024-02-22 15:15:52,906 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: stop_status
2024-02-22 15:15:52,906 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: internal_messages
2024-02-22 15:15:52,907 DEBUG   SenderThread:48 [sender.py:send_request():407] send_request: stop_status
2024-02-22 15:15:54,070 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:15:59,071 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:16:04,072 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report
2024-02-22 15:16:07,906 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: stop_status
2024-02-22 15:16:07,906 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: internal_messages
2024-02-22 15:16:07,907 DEBUG   SenderThread:48 [sender.py:send_request():407] send_request: stop_status
2024-02-22 15:16:09,076 DEBUG   HandlerThread:48 [handler.py:handle_request():146] handle_request: status_report

Hi @rauch - thank you for your patience while we investigate this. I have now escalated this internally and will keep you posted with any updates.

One more piece of information would be useful to know: did you try running the same training job on multiple GPUs without logging to W&B, and if so, were you able to visualise the logged data in TensorBoard?
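One quick way to check this after a multi-GPU run without W&B is to look for TensorBoard event files under the run's output directory. The snippet below is only a minimal sketch: it assumes the event files use the usual `events.out.tfevents.*` naming and that they are written somewhere under the `save_path` from the log above (`exp/rohbau3d/multi-r3-s3`). `find_event_files` is a hypothetical helper, not part of Pointcept or wandb.

```python
from pathlib import Path


def find_event_files(logdir):
    """Return (path, size_in_bytes) for every TensorBoard event file under logdir.

    Assumption: event files follow the usual 'events.out.tfevents.*' naming.
    """
    root = Path(logdir)
    if not root.is_dir():
        # Nothing to scan if the directory does not exist.
        return []
    return sorted(
        (str(p), p.stat().st_size)
        for p in root.rglob("events.out.tfevents.*")
        if p.is_file()
    )


# Example usage (path taken from the save_path option in the log above):
# for path, size in find_event_files("exp/rohbau3d/multi-r3-s3"):
#     print(f"{size:>10d}  {path}")
```

If the list is empty, or the files do not grow during training, the multi-GPU run is probably not writing TensorBoard data in the first place, so the problem would sit upstream of the W&B sync.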