Wandb: ERROR Internal wandb error: file data was not synced wandb: ERROR transport failed

Hello there,

I am trying to experiment with wandb local on an AWS EC2 instance, but every experiment fails immediately with:

wandb: ERROR Internal wandb error: file data was not synced
 wandb: ERROR transport failed
...
File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/lib/mailbox.py", line 281, in wait
 raise MailboxError("transport failed")
 wandb.sdk.lib.mailbox.MailboxError: transport failed
 wandb: While tearing down the service manager. The following error has occurred: [Errno 32] Broken pipe

My code is a simple toy PyTorch Lightning example in which I log to both MLflow and wandb local (the goal is really to compare the two experiences), but I get the error as soon as I activate the wandb logger. I tried several things, such as downgrading protobuf as suggested in another post, but nothing works. It is a complete blocker for me. Does anyone have an idea?

Note that the run is correctly created in my wandb local instance, but no metrics are logged.

wandb version: 0.16.1

Hi @thomas-ricatte ,

Thank you for reaching out for support. I’ll be happy to assist you with this. Aside from the partial traceback, can you also provide the following, please?

  • Full Traceback Error you are seeing
  • debug.log and debug-internal.log files. These files are under your local folder wandb/run-<date>_<time>-<run-id>/logs, in the same directory where you’re running your code.
  • A reproducible toy code to mimic the behavior that you are seeing on your end
  • WandB Client Version you are using
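
A minimal sketch for locating those two log files, assuming the default layout described above (a `wandb` folder next to the training script). The helper name `find_wandb_debug_logs` is hypothetical, not part of the wandb SDK.

```python
from pathlib import Path


def find_wandb_debug_logs(root="."):
    """Return debug.log and debug-internal.log paths under <root>/wandb/run-*/logs.

    Hypothetical helper: it only globs the folder layout described in the
    reply above and does not call into the wandb SDK.
    """
    base = Path(root) / "wandb"
    return sorted(base.glob("run-*/logs/debug*.log"))


if __name__ == "__main__":
    # Run from the directory where the training script was launched.
    for path in find_wandb_debug_logs():
        print(path)
```

If the script prints nothing, the run folder is probably somewhere else, for example because `WANDB_DIR` points to another location.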

Regards,
Carlo Argel

Hello @thomas-ricatte ,
I would like to follow up on my colleague's request so that we can investigate the error you encountered further.

We still need the debug files and the full traceback (details are in the previous reply).
Also, can you please check the file permissions? Make sure that the wandb process has the permissions it needs to read and write to the local directory where it’s trying to sync files. If the process is running as a user with restricted permissions, it may not be able to access the files it needs.
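
One quick way to run that permission check from inside the training process is sketched below. It assumes the directory of interest is either the one named by the `WANDB_DIR` environment variable or the current working directory; `check_dir_access` is a hypothetical helper name.

```python
import os
from pathlib import Path


def check_dir_access(directory):
    """Report whether the current process can read, write and traverse a directory."""
    path = Path(directory)
    return {
        "exists": path.is_dir(),
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
        "traversable": os.access(path, os.X_OK),
    }


if __name__ == "__main__":
    # Assumption: wandb writes under WANDB_DIR if it is set, else under the cwd.
    target = os.environ.get("WANDB_DIR", os.getcwd())
    print(target, check_dir_access(target))
```

Running it as the same user and in the same container as the training job matters, since the result depends on the effective user of the process.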

Thanks

Hello, thanks for the answer. Let me try to gather the files you requested.

The run is happening in a SageMaker training job, so I need to figure out how to export the internal logs from the run. I will look into this at the beginning of next week.

In the meantime, here’s a full traceback from CloudWatch:
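
One possible way to get the logs out of the job, sketched under assumptions: SageMaker training jobs archive whatever is written to the output data directory (exposed as `SM_OUTPUT_DATA_DIR`, usually `/opt/ml/output/data`) and upload it to the job's S3 output location. The helper name `export_wandb_logs` and the `wandb-logs` subfolder are hypothetical.

```python
import os
import shutil
from pathlib import Path


def export_wandb_logs(wandb_root="wandb", output_dir=None):
    """Copy every wandb/run-*/logs folder into the job's output data directory.

    Assumption: files placed in SM_OUTPUT_DATA_DIR are uploaded to S3 when the
    training job ends, so the copied logs can be downloaded afterwards.
    """
    if output_dir is None:
        output_dir = os.environ.get("SM_OUTPUT_DATA_DIR", "/opt/ml/output/data")
    copied = []
    for log_dir in Path(wandb_root).glob("run-*/logs"):
        # Keep one subfolder per run so several runs do not overwrite each other.
        dest = Path(output_dir) / "wandb-logs" / log_dir.parent.name
        shutil.copytree(log_dir, dest, dirs_exist_ok=True)
        copied.append(dest)
    return copied
```

Calling it from a `finally:` block around the training call should export the logs even when the run crashes, which is the case of interest here.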

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 49, in run
    self._run()
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 100, in _run
    self._process(record)
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/internal/internal.py", line 279, in _process
    self._hm.handle(record)
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 138, in handle
    handler(record)
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 148, in handle_request
    handler(record)
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 683, in handle_request_run_start
    self._system_monitor.probe(publish=True)
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/internal/system/system_monitor.py", line 228, in probe
    self.system_info.publish(system_info)
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/internal/system/system_info.py", line 255, in publish
    self._save_patches()
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/internal/system/system_info.py", line 146, in _save_patches
    upstream_commit = self.git.get_upstream_fork_point()
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/lib/gitlib.py", line 200, in get_upstream_fork_point
    possible_relatives.append(tracking_branch.commit)
  File "/opt/conda/lib/python3.10/site-packages/git/refs/symbolic.py", line 274, in _get_commit
    obj = self._get_object()
  File "/opt/conda/lib/python3.10/site-packages/git/refs/symbolic.py", line 267, in _get_object
    return Object.new_from_sha(self.repo, hex_to_bin(self.dereference_recursive(self.repo, self.path)))
  File "/opt/conda/lib/python3.10/site-packages/git/objects/base.py", line 94, in new_from_sha
    oinfo = repo.odb.info(sha1)
  File "/opt/conda/lib/python3.10/site-packages/git/db.py", line 40, in info
    hexsha, typename, size = self._git.get_object_header(bin_to_hex(binsha))
  File "/opt/conda/lib/python3.10/site-packages/git/cmd.py", line 1384, in get_object_header
    return self.__get_object_header(cmd, ref)
  File "/opt/conda/lib/python3.10/site-packages/git/cmd.py", line 1370, in __get_object_header
    cmd.stdin.flush()
	2024-01-05T17:32:46.077+01:00	BrokenPipeError: [Errno 32] Broken pipe
	2024-01-05T17:32:46.077+01:00	wandb: ERROR Internal wandb error: file data was not synced
	2024-01-05T17:32:55.079+01:00	Problem at: /opt/conda/lib/python3.10/site-packages/pytorch_lightning/loggers/wandb.py 399 experiment
	2024-01-05T17:32:55.079+01:00	wandb: ERROR transport failed
	2024-01-05T17:32:55.079+01:00	Traceback (most recent call last):
	2024-01-05T17:32:55.080+01:00
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/ml/code/train.py", line 92, in <module>
    mnistTrainer.fit(model)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 950, in _run
    call._call_setup_hook(self)  # allow user to setup lightning_module in accelerator environment
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 86, in _call_setup_hook
    if hasattr(logger, "experiment"):
  File "/opt/conda/lib/python3.10/site-packages/lightning_fabric/loggers/logger.py", line 118, in experiment
    return fn(self)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loggers/wandb.py", line 399, in experiment
    self._experiment = wandb.init(**self._wandb_init)
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1189, in init
    raise e
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1170, in init
    run = wi.init()
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 811, in init
    run_start_result = run_start_handle.wait(timeout=30)
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/lib/mailbox.py", line 281, in wait
    raise MailboxError("transport failed")
wandb.sdk.lib.mailbox.MailboxError: transport failed
	2024-01-05T17:32:55.081+01:00
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/ml/code/train.py", line 92, in <module>
    mnistTrainer.fit(model)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 950, in _run
    call._call_setup_hook(self)  # allow user to setup lightning_module in accelerator environment
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 86, in _call_setup_hook
    if hasattr(logger, "experiment"):
  File "/opt/conda/lib/python3.10/site-packages/lightning_fabric/loggers/logger.py", line 118, in experiment
    return fn(self)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loggers/wandb.py", line 399, in experiment
    self._experiment = wandb.init(**self._wandb_init)
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1189, in init
    raise e
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1170, in init
    run = wi.init()
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 811, in init
    run_start_result = run_start_handle.wait(timeout=30)
  File "/opt/conda/lib/python3.10/site-packages/wandb/sdk/lib/mailbox.py", line 281, in wait
    raise MailboxError("transport failed")
	wandb.sdk.lib.mailbox.MailboxError: transport failed
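
The first traceback in the log above ends inside GitPython (`git/cmd.py`), reached from wandb's `_save_patches` step, and the `BrokenPipeError` is raised right after it. One experiment that might help narrow this down is to tell wandb to skip git probing before the logger is created. This is a sketch of a diagnostic step, not a confirmed fix; it relies on the `WANDB_DISABLE_GIT` environment variable, and `disable_wandb_git_probe` is a hypothetical helper name.

```python
import os


def disable_wandb_git_probe():
    """Ask wandb to skip git repository probing for this process.

    Assumption: the variable must be set before wandb.init() runs, which with
    PyTorch Lightning happens lazily when the WandbLogger is first used.
    """
    os.environ["WANDB_DISABLE_GIT"] = "true"


# Call this at the top of the training script, before building the logger.
disable_wandb_git_probe()
```

If the run then starts normally, the git checkout inside the training container is the likely culprit; if the error stays the same, the cause lies elsewhere.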

Hello @thomas-ricatte , based on the provided logs, it seems that the client is having an issue communicating with wandb, but we will not have more clues about the cause until we get the debug logs. You also mentioned that you are trying to double log. Have you tried logging to each backend separately, with one run using only MLflow and another using only wandb local? Would it also be possible to share your workspace link?
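
A minimal sketch of how the two loggers could be toggled independently for that test, assuming a PyTorch Lightning setup like the one described earlier in the thread. The function name `build_loggers` and the `toy-mnist` project and experiment names are hypothetical; the logger classes are imported lazily so that only the selected backend needs to be installed.

```python
def build_loggers(backend):
    """Return the Lightning loggers for 'mlflow', 'wandb', or 'both'."""
    if backend not in {"mlflow", "wandb", "both"}:
        raise ValueError(f"unknown backend: {backend!r}")
    loggers = []
    if backend in {"mlflow", "both"}:
        from pytorch_lightning.loggers import MLFlowLogger

        # Placeholder experiment name; replace with the real one.
        loggers.append(MLFlowLogger(experiment_name="toy-mnist"))
    if backend in {"wandb", "both"}:
        from pytorch_lightning.loggers import WandbLogger

        # Placeholder project name; replace with the real one.
        loggers.append(WandbLogger(project="toy-mnist"))
    return loggers
```

The returned list can be passed as `Trainer(logger=build_loggers("wandb"))`. Running once per backend shows whether the failure depends on both loggers being active together.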

Hi @thomas-ricatte , following up on our inquiry above.

Hi @thomas-ricatte , since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!