Saving error stopping sweep

Hi, I am trying to run a sweep with TensorFlow, but it keeps stopping with a broken pipe error. When I check the logs, I get this traceback:

Traceback (most recent call last):
  File "/nobackup/eeerog/./general_training_model_5classes.py", line 713, in <module>
    wandb.agent(sweep_id = sweep_id, function=train(class_weightsh, class_weightsv, class_weightsc, train_datagen, valid_datagen, nom_classes1 = 5, nom_classes2 = 5))
  File "/nobackup/eeerog/./general_training_model_5classes.py", line 708, in train
    model.fit(train_datagen,epochs=3,steps_per_epoch=860, callbacks = [early_stopping,reduce_lr, model_checkpoint_callback, WandbCallback()], validation_data=valid_datagen, validation_steps = 10)
  File "/home/home02/eeerog/.conda/envs/deep_learning/lib/python3.9/site-packages/wandb/integration/keras/keras.py", line 174, in new_v2
    return old_v2(*args, **kwargs)
  File "/home/home02/eeerog/.conda/envs/deep_learning/lib/python3.9/site-packages/wandb/integration/keras/keras.py", line 174, in new_v2
    return old_v2(*args, **kwargs)
  File "/home/home02/eeerog/.conda/envs/deep_learning/lib/python3.9/site-packages/wandb/integration/keras/keras.py", line 174, in new_v2
    return old_v2(*args, **kwargs)
  File "/home/home02/eeerog/.conda/envs/deep_learning/lib/python3.9/site-packages/tensorflow/python/keras/engine/training.py", line 1145, in fit
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/home/home02/eeerog/.conda/envs/deep_learning/lib/python3.9/site-packages/tensorflow/python/keras/callbacks.py", line 428, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/home/home02/eeerog/.conda/envs/deep_learning/lib/python3.9/site-packages/tensorflow/python/keras/callbacks.py", line 1344, in on_epoch_end
    self._save_model(epoch=epoch, logs=logs)
  File "/home/home02/eeerog/.conda/envs/deep_learning/lib/python3.9/site-packages/tensorflow/python/keras/callbacks.py", line 1393, in _save_model
    self.model.save_weights(
  File "/home/home02/eeerog/.conda/envs/deep_learning/lib/python3.9/site-packages/tensorflow/python/keras/engine/training.py", line 2124, in save_weights
    self._trackable_saver.save(filepath, session=session, options=options)
  File "/home/home02/eeerog/.conda/envs/deep_learning/lib/python3.9/site-packages/tensorflow/python/training/tracking/util.py", line 1215, in save
    file_io.recursive_create_dir(os.path.dirname(file_prefix))
  File "/home/home02/eeerog/.conda/envs/deep_learning/lib/python3.9/site-packages/tensorflow/python/lib/io/file_io.py", line 468, in recursive_create_dir
    recursive_create_dir_v2(dirname)
  File "/home/home02/eeerog/.conda/envs/deep_learning/lib/python3.9/site-packages/tensorflow/python/lib/io/file_io.py", line 483, in recursive_create_dir_v2
    _pywrap_file_io.RecursivelyCreateDir(compat.path_to_bytes(path))
tensorflow.python.framework.errors_impl.PermissionDeniedError: /home/eeerog; Permission denied
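
The last frames show the checkpoint callback failing while trying to create a directory, and the path in the error (/home/eeerog) differs from the home directory in the other frames (/home/home02/eeerog), so the checkpoint filepath may point somewhere that cannot be created. A minimal sketch for checking this before training, using only the standard library; the helper names and the example path are hypothetical and not taken from the original script:

```python
import os
from pathlib import Path


def first_existing_parent(path: Path) -> Path:
    """Walk upward from path until something that exists on disk is found."""
    current = path
    while not current.exists() and current.parent != current:
        current = current.parent
    return current


def can_create_checkpoint_dir(checkpoint_dir: str) -> bool:
    """Return True if checkpoint_dir is writable, or could be created.

    Creating missing subdirectories needs write and execute permission on
    the nearest ancestor that already exists, so that is what gets checked.
    """
    target = Path(checkpoint_dir).expanduser().resolve()
    anchor = first_existing_parent(target)
    return anchor.is_dir() and os.access(anchor, os.W_OK | os.X_OK)


if __name__ == "__main__":
    # Hypothetical checkpoint location; substitute the directory part of
    # the filepath given to the checkpoint callback.
    candidate = "~/checkpoints/sweep"
    print(candidate, "is usable:", can_create_checkpoint_dir(candidate))
```

If the check returns False for the directory used by the checkpoint callback, pointing the callback at a directory under a writable location (for example the /nobackup area the script runs from, if that is writable) would be one thing to try.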

Hello! This looks like it may be a TensorFlow permissions issue (the traceback ends with a PermissionDeniedError on /home/eeerog), but your debug logs would help with debugging. They should be in the wandb folder in the directory where the script was run. That folder contains subfolders named run-DATETIME-ID, one per run. Could you retrieve the debug.log and debug-internal.log files from the folder for the run that is having issues?
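
If it helps, here is a small sketch for collecting those files. It assumes the wandb folder sits in the current working directory and searches each run-* folder recursively, since the exact location of the logs inside a run folder is an assumption here; `find_run_logs` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Dict, List

# File names mentioned in the reply above.
LOG_NAMES = ("debug.log", "debug-internal.log")


def find_run_logs(wandb_dir: str = "wandb") -> Dict[str, List[Path]]:
    """Map each run-* folder name to the debug log files found inside it."""
    root = Path(wandb_dir)
    found: Dict[str, List[Path]] = {}
    if not root.is_dir():
        return found
    for run_dir in sorted(root.glob("run-*")):
        if not run_dir.is_dir():
            continue
        # Search recursively in case the logs live in a subfolder.
        logs = sorted(
            p for p in run_dir.rglob("*")
            if p.is_file() and p.name in LOG_NAMES
        )
        if logs:
            found[run_dir.name] = logs
    return found


if __name__ == "__main__":
    for run_name, logs in find_run_logs().items():
        print(run_name)
        for log_path in logs:
            print("  ", log_path)
```

Run it from the directory where the training script was launched; the run folder whose timestamp matches the failing run is the one to look at.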
