### Describe the bug
I use sweeps with tf.keras multiprocessing option. Aft…er few runs (about 7-15) the run crash.
I get broken pipe error.
It seems that I get OOM on the CPU and each run add more memory usage to the CPU, so after some runs it floods.
```python
import wandb
wandb.login()
os.environ['CUDA_VISIBLE_DEVICES'] = "0"
def train_standard():
tqdm_callback = tfa.callbacks.TQDMProgressBar()
callbacks = [keras.callbacks.ModelCheckpoint(filepath=model_path + "/save_at_{epoch}.h5"),
tqdm_callback, WandbMetricsLogger()]
model.fit(
x=data_generator.train_generator(batch_size),
steps_per_epoch=math.ceil(data_generator.get_train_size() / batch_size),
epochs=epochs,
shuffle=True,
initial_epoch=0,
verbose=2,
validation_data=data_generator.valid_generator(batch_size),
validation_steps=math.ceil(data_generator.get_valid_size() / batch_size),
callbacks=callbacks,
use_multiprocessing=True,
workers=6
def main():
wandb.init(name=DATASET)
wandb.require("service")
config = wandb.config
preprocessing(smooth_factor, sample_factor)
train_standard(batch_size, epochs, learning_rate, dropout, dense_layer, activation, hyperparameter_search=True)
wandb.finish()
sweep_config = {
'method': 'grid',
'metric': {
'goal': 'maximize',
'name': 'weighted avg.f1-score'
},
'parameters': {
'batch_size': {'values': [8,16,32]},
'epochs': {'values': [3,6,9]},
'lr': {'values': [0.001, 0.0001]},
'dropout': {'values': [0.1, 0.2, 0.5]},
'dense_layer' : {'values': [16, 32, 64]},
'activation': {'values': ['relu', 'tanh', 'sigmoid']},
'smooth_factor' : {'values': [10]},
'sample_factor' : {'values': [3, 5, 10]},
}
}
if __name__ == '__main__':
sweep_id = wandb.sweep(sweep_config, project="shaoxing-hyperparameter-tuning1")
wandb.agent(sweep_id, function=main)
```
```shell
Training: 0%| 0/6 ETA: ?s, ?epochs/sEpoch 1/6
Epoch 1/6 ETA: ?s -
6805/6805█ ETA: 00:00s - loss: 0.4434 - categorical_accuracy: 0.8402 - val_loss: 0.5995 - val_categorical_accuracy: 0
Training: 17%|███████████▌ 1/6 ETA: 20:51s, 250.28s/epochs6805/6805 - 250s - loss: 0.4434 - categorical_accuracy: 0.8402 - val_loss: 0.5995 - val_categorical_accuracy: 0.7694 - 250s/epoch - 37ms/step
Epoch 2/6
Epoch 2/6 ETA: ?s -
6805/6805█ ETA: 00:00s - loss: 0.2890 - categorical_accuracy: 0.9054 - val_loss: 0.5200 - val_categorical_accuracy: 0
Training: 33%|███████████████████████ 2/6 ETA: 16:36s, 249.10s/epochs6805/6805 - 248s - loss: 0.2890 - categorical_accuracy: 0.9054 - val_loss: 0.5200 - val_categorical_accuracy: 0.7900 - 248s/epoch - 36ms/step
Epoch 3/6
Epoch 3/6 ETA: ?s -
6805/6805█ ETA: 00:00s - loss: 0.2802 - categorical_accuracy: 0.9076 - val_loss: 0.5400 - val_categorical_accuracy: 0
Training: 50%|██████████████████████████████████▌ 3/6 ETA: 12:28s, 249.48s/epochs6805/6805 - 250s - loss: 0.2802 - categorical_accuracy: 0.9076 - val_loss: 0.5400 - val_categorical_accuracy: 0.8057 - 250s/epoch - 37ms/step
Epoch 4/6
Epoch 4/6 ETA: ?s -
6805/6805█ ETA: 00:00s - loss: 0.2605 - categorical_accuracy: 0.9134 - val_loss: 0.4429 - val_categorical_accuracy: 0
Training: 67%|██████████████████████████████████████████████ 4/6 ETA: 08:20s, 250.47s/epochs6805/6805 - 252s - loss: 0.2605 - categorical_accuracy: 0.9134 - val_loss: 0.4429 - val_categorical_accuracy: 0.8262 - 252s/epoch - 37ms/step
Epoch 5/6
Epoch 5/6 ETA: ?s -
6805/6805█ ETA: 00:00s - loss: 0.2627 - categorical_accuracy: 0.9078 - val_loss: 0.5205 - val_categorical_accuracy: 0
Training: 83%|█████████████████████████████████████████████████████████▌ 5/6 ETA: 04:10s, 250.99s/epochs6805/6805 - 252s - loss: 0.2627 - categorical_accuracy: 0.9078 - val_loss: 0.5205 - val_categorical_accuracy: 0.7949 - 252s/epoch - 37ms/step
Epoch 6/6
Epoch 6/6 ETA: ?s -
Process Keras_worker_ForkPoolWorker-149:███████████████████▉ ETA: 00:00s - loss: 0.2400 - categorical_accuracy: 0.9175
Process Keras_worker_ForkPoolWorker-152:
Process Keras_worker_ForkPoolWorker-151:
[3]+ Killed python train.py
Killed
Process Keras_worker_ForkPoolWorker-150:
(tensorflow) ubuntu@ip-172-31-92-126:~/walkky-ML/classifier/ml/heart_beat_cnn$ git pull && python train.pyTraceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 131, in worker
put((job, i, result))
File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 131, in worker
put((job, i, result))
File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 131, in worker
put((job, i, result))
File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 131, in worker
put((job, i, result))
File "/usr/local/lib/python3.10/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/usr/local/lib/python3.10/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/usr/local/lib/python3.10/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/usr/local/lib/python3.10/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 205, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 205, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 205, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 205, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 409, in _send_bytes
self._send(header)
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 410, in _send_bytes
self._send(buf)
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 409, in _send_bytes
self._send(header)
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 409, in _send_bytes
self._send(header)
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
BrokenPipeError: [Errno 32] Broken pipe
BrokenPipeError: [Errno 32] Broken pipe
BrokenPipeError: [Errno 32] Broken pipe
During handling of the above exception, another exception occurred:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
During handling of the above exception, another exception occurred:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
Traceback (most recent call last):
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 136, in worker
put((job, i, (False, wrapped)))
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 205, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 136, in worker
put((job, i, (False, wrapped)))
File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 136, in worker
put((job, i, (False, wrapped)))
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 416, in _send_bytes
self._send(header + buf)
File "/usr/local/lib/python3.10/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/usr/local/lib/python3.10/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 205, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 205, in send_bytes
self._send_bytes(m[offset:offset + size])
BrokenPipeError: [Errno 32] Broken pipe
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 416, in _send_bytes
self._send(header + buf)
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 416, in _send_bytes
self._send(header + buf)
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
BrokenPipeError: [Errno 32] Broken pipe
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 136, in worker
put((job, i, (False, wrapped)))
File "/usr/local/lib/python3.10/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 205, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 416, in _send_bytes
self._send(header + buf)
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
```
### Additional Files
[https://wandb.ai/walkky/shaoxing-hyperparameter-tuning1/sweeps/u1jduer9/overview?workspace=user-srulikbd](url)
### Environment
WandB version:
0.13.9
OS:
Ubuntu 20.04.5 LTS (GNU/Linux 5.15.0-1026-aws x86_64)
with
Deep Learning AMI GPU TensorFlow 2.11.0
Python version:
3.10.7
Versions of relevant libraries:
NVIDIA driver version: 515.65.01
CUDA version: 11.2
TensorFlow 2.11.0
### Additional Context
it seems that there is a lot of python processes created during running:
![image](https://user-images.githubusercontent.com/35503583/217542844-faa69a1f-0ce4-4d1b-9b1d-bf401a25947e.png)
maybe it's related to the error?
![image](https://user-images.githubusercontent.com/35503583/217543017-f74f8456-ce17-4d5a-ac3e-cdb33901edfd.png)