Hello,
I have changed the command in my sweep YAML file from "python3" to "accelerate launch", and I am trying to use this in conjunction with wandb agent <username/proj_name/sweep_id> on a SLURM compute cluster.
The error is that accelerate launch runs my main function 4 times (once per process), which instantiates the arguments 4 times; the extra processes then fail because they try to create a wandb run file that was already created by the first process, and the whole run inevitably dies.
I should mention that the script works fine with plain python3, so this is only a matter of getting "accelerate launch" to work so I can take advantage of my multiple GPUs.
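For reference, my understanding is that the usual pattern when combining Accelerate with a logger is to create the tracker only on the main process, roughly like the sketch below. This is just an illustration of the kind of guard I have in mind, not what my script currently does (my script calls wandb.init() unconditionally, see the bottom of this post), and I am not sure how the sweep config handed out by wandb agent would reach the non-main processes with this approach:

import wandb
from accelerate import Accelerator

accelerator = Accelerator()

run = None
if accelerator.is_main_process:
    # Only the main process creates the wandb run; the other ranks skip logging.
    run = wandb.init(project="MLP_test_text", entity="ddi")

# ... model / data / training loop would go here ...

if run is not None:
    run.finish()

My SLURM submission script: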
#!/bin/bash
#SBATCH --job-name=tav_mae
# Give job a name
#SBATCH --time 02-20:00 # time (DD-HH:MM)
#SBATCH --nodes=1
#SBATCH --gpus-per-node=v100l:4 # request GPU
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.
#SBATCH --mem=150G # memory per node
#SBATCH --account=ctb-whkchun # Runs it on the dedicated nodes we have
#SBATCH --output=/scratch/prsood/tav_mae/logs/%N-%j.out # %N for node name, %j for jobID # Remember to make the logs dir
module load StdEnv/2020
module load cuda
module load cudnn/8.0.3
wandb agent ddi/TAVFormer2/ncdfi75j --count 20
YAML configuration file:
program: ../tav_nn.py
command:
  - ${env}
  - accelerate
  - launch
  - ${program}
  - "--dataset"
method: bayes
metric:
  goal: minimize
  name: train/train_loss
parameters:
  epoch:
    values: [5, 7, 9]
  learning_rate:
    distribution: uniform
    min: 0.000001
    max: 0.0001
  batch_size:
    values: [2, 4, 8, 1]
  weight_decay:
    values: [0.0001, 0.00001, 0.000001, 0.0000001, 0.00000001]
  seed:
    values: [32, 64, 96]
  dropout:
    values: [0.0, 0.1, 0.2]
  early_div:
    values: [True, False]
  patience:
    values: [10]
  clip:
    values: [1]
  T_max:
    values: [5, 10]
  hidden_layers:
    values: ["300"]
Error output
wandb: Starting wandb agent 🕵️
2023-02-03 00:32:20,126 - wandb.wandb_agent - INFO - Running runs: []
2023-02-03 00:32:21,726 - wandb.wandb_agent - INFO - Agent received command: run
2023-02-03 00:32:21,728 - wandb.wandb_agent - INFO - Agent starting run with config:
T_max: 10
batch_size: 2
clip: 1
dropout: 0.1
early_div: True
epoch: 7
hidden_layers: 300
label_task: emotion
learning_rate: 3.736221739657802e-05
model: MAE_encoder
patience: 10
seed: 96
weight_decay: 0.0001
2023-02-03 00:32:21,736 - wandb.wandb_agent - INFO - About to run command: /usr/bin/env accelerate launch ../tav_nn.py --dataset ../../data/IEMOCAP_df
2023-02-03 00:32:26,772 - wandb.wandb_agent - INFO - Running runs: ['6fnujhey']
wandb: Currently logged in as: prsood (ddi). Use `wandb login --relogin` to force relogin
wandb: Currently logged in as: prsood (ddi). Use `wandb login --relogin` to force relogin
wandb: Currently logged in as: prsood (ddi). Use `wandb login --relogin` to force relogin
wandb: Currently logged in as: prsood (ddi). Use `wandb login --relogin` to force relogin
wandb: WARNING Ignored wandb.init() arg project when running a sweep.
wandb: WARNING Ignored wandb.init() arg entity when running a sweep.
wandb: WARNING Ignored wandb.init() arg project when running a sweep.
wandb: WARNING Ignored wandb.init() arg entity when running a sweep.
wandb: WARNING Ignored wandb.init() arg project when running a sweep.
wandb: WARNING Ignored wandb.init() arg entity when running a sweep.
wandb: WARNING Ignored wandb.init() arg project when running a sweep.
wandb: WARNING Ignored wandb.init() arg entity when running a sweep.
Thread WriterThread: wandb.init()...
Traceback (most recent call last):
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/internal_util.py", line 50, in run
self._run()
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/internal_util.py", line 101, in _run
self._process(record)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/internal.py", line 351, in _process
self._wm.write(record)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/writer.py", line 28, in write
self.open()
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/writer.py", line 24, in open
self._ds.open_for_write(self._settings.sync_file)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/datastore.py", line 77, in open_for_write
self._fp = open(fname, open_flags)
FileExistsError: [Errno 17] File exists: '/project/6051551/prsood/multi-modal-emotion/TripleModels/run_slurm/wandb/run-20230203_003340-6fnujhey/run-6fnujhey.wandb'
Thread WriterThread:
Traceback (most recent call last):
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/internal_util.py", line 50, in run
self._run()
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/internal_util.py", line 101, in _run
self._process(record)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/internal.py", line 351, in _process
self._wm.write(record)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/writer.py", line 28, in write
self.open()
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/writer.py", line 24, in open
self._ds.open_for_write(self._settings.sync_file)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/datastore.py", line 77, in open_for_write
self._fp = open(fname, open_flags)
FileExistsError: [Errno 17] File exists: '/project/6051551/prsood/multi-modal-emotion/TripleModels/run_slurm/wandb/run-20230203_003340-6fnujhey/run-6fnujhey.wandb'
Thread WriterThread:
Traceback (most recent call last):
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/internal_util.py", line 50, in run
self._run()
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/internal_util.py", line 101, in _run
self._process(record)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/internal.py", line 351, in _process
self._wm.write(record)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/writer.py", line 28, in write
self.open()
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/writer.py", line 24, in open
self._ds.open_for_write(self._settings.sync_file)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/internal/datastore.py", line 77, in open_for_write
self._fp = open(fname, open_flags)
FileExistsError: [Errno 17] File exists: '/project/6051551/prsood/multi-modal-emotion/TripleModels/run_slurm/wandb/run-20230203_003340-6fnujhey/run-6fnujhey.wandb'
wandb: ERROR Internal wandb error: file data was not synced
Problem at: ../tav_nn.py 106 main...
Traceback (most recent call last):
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 1078, in init
run = wi.init()
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 739, in init
_ = backend.interface.communicate_run_start(run_obj)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/interface/interface.py", line 235, in communicate_run_start
result = self._communicate_run_start(run_start)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 484, in _communicate_run_start
result = self._communicate(rec)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 255, in _communicate
return self._communicate_async(rec, local=local).get(timeout=timeout)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/interface/interface_sock.py", line 58, in _communicate_async
future = self._router.send_and_receive(rec, local=local)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/interface/router.py", line 94, in send_and_receive
self._send_message(rec)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/interface/router_sock.py", line 36, in _send_message
self._sock_client.send_record_communicate(record)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/lib/sock_client.py", line 216, in send_record_communicate
self.send_server_request(server_req)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
wandb: ERROR Abnormal program exit..
Traceback (most recent call last):
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 1078, in init
run = wi.init()
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 739, in init
_ = backend.interface.communicate_run_start(run_obj)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/interface/interface.py", line 235, in communicate_run_start
result = self._communicate_run_start(run_start)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 484, in _communicate_run_start
result = self._communicate(rec)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 255, in _communicate
return self._communicate_async(rec, local=local).get(timeout=timeout)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/interface/interface_sock.py", line 58, in _communicate_async
future = self._router.send_and_receive(rec, local=local)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/interface/router.py", line 94, in send_and_receive
self._send_message(rec)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/interface/router_sock.py", line 36, in _send_message
self._sock_client.send_record_communicate(record)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/lib/sock_client.py", line 216, in send_record_communicate
self.send_server_request(server_req)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "../tav_nn.py", line 183, in <module>
main()
File "../tav_nn.py", line 106, in main
run = wandb.init(project=project_name, entity="ddi" , config = args)
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 1116, in init
raise Exception("problem") from error_seen
Exception: problem
Problem at:Problem at:Problem at: ../tav_nn.py 106 main
../tav_nn.py 106../tav_nn.py main106
main
Traceback (most recent call last):
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 1078, in init
run = wi.init()
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 698, in init
timeout=self.settings.init_timeout, on_progress=self._on_progress_init
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/lib/mailbox.py", line 259, in wait
raise MailboxError("transport failed")
wandb.errors.MailboxError: transport failed
Traceback (most recent call last):
Traceback (most recent call last):
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 1078, in init
run = wi.init()
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 1078, in init
run = wi.init()
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 698, in init
timeout=self.settings.init_timeout, on_progress=self._on_progress_init
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/lib/mailbox.py", line 259, in wait
raise MailboxError("transport failed")
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 698, in init
timeout=self.settings.init_timeout, on_progress=self._on_progress_init
wandb.errors.MailboxError: transport failed
File "/project/6051551/prsood/sarcasm_venv/lib/python3.7/site-packages/wandb/sdk/lib/mailbox.py", line 259, in wait
raise MailboxError("transport failed")
wandb.errors.MailboxError: transport failed
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 213168 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 213170 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 213171 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 213169) of binary: /project/6051551/prsood/sarcasm_venv/bin/python3
2023-02-03 00:33:54,480 - wandb.wandb_agent - INFO - Cleaning up finished run: 6fnujhey
My accelerate configuration (output of accelerate env):
- `Accelerate` version: 0.16.0
- Platform: Linux-3.10.0-1160.80.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core
- Python version: 3.7.7
- Numpy version: 1.21.4
- PyTorch version (GPU?): 1.10.0 (False)
- `Accelerate` default config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: FSDP
  - mixed_precision: no
  - use_cpu: False
  - dynamo_backend: NO
  - num_processes: 4
  - machine_rank: 0
  - num_machines: 1
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - deepspeed_config: {}
  - fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_offload_params': True, 'fsdp_sharding_strategy': 2, 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_transformer_layer_cls_to_wrap': 'TransformerBlock'}
  - megatron_lm_config: {}
  - downcast_bf16: no
The relevant part of my Python script; the traceback above points at the wandb.init() call in main() (../tav_nn.py line 106):
import os

import numpy as np
import pandas as pd
import torch
import wandb

# arg_parse and runModel come from elsewhere in this project; their imports are omitted in this snippet.


def main():
    os.environ["TOKENIZERS_PARALLELISM"] = "true"
    project_name = "MLP_test_text"
    args = arg_parse(project_name)

    run = wandb.init(project=project_name, entity="ddi", config=args)
    config = wandb.config

    np.random.seed(config.seed)
    torch.random.manual_seed(config.seed)

    param_dict = {
        'epoch': config.epoch,
        'patience': config.patience,
        'lr': config.learning_rate,
        'clip': config.clip,
        'batch_size': 8,  # config.batch_size
        'weight_decay': config.weight_decay,
        'model': config.model,
        'T_max': config.T_max,
        'seed': config.seed,
        'label_task': config.label_task,
    }

    df = pd.read_pickle(f"{args.dataset}.pkl")

    if param_dict['label_task'] == "sentiment":
        number_index = "sentiment"
        label_index = "sentiment_label"
    else:
        number_index = "emotion"
        label_index = "emotion_label"

    df_train = df[df['split'] == "train"]
    df_test = df[df['split'] == "test"]
    df_val = df[df['split'] == "val"]
    df = df[~df['timings'].isna()]  # Still seeing what the best configuration is for these

    """
    Due to data imbalance we reweigh our CrossEntropyLoss.
    For each class we compute 1 - (num_class / len(df)); the rest of the expression
    just orders the classes consistently and converts the result to a tensor.
    """
    weights = torch.Tensor(
        list(dict(sorted(dict(1 - (df[number_index].value_counts() / len(df))).items())).values())
    )
    label2id = df.drop_duplicates(label_index).set_index(label_index).to_dict()[number_index]
    id2label = {v: k for k, v in label2id.items()}

    model_param = {
        'output_dim': len(weights),
        'dropout': config.dropout,
        'early_div': config.early_div,
    }
    param_dict['weights'] = weights
    param_dict['label2id'] = label2id
    param_dict['id2label'] = id2label

    print(f" in main \n param_dict = {param_dict} \n model_param = {model_param} \n df {args.dataset} , with df_size = {len(df)} \n ")

    world_size = torch.cuda.device_count()
    print(f"world_size = {world_size}", flush=True)

    runModel("cuda", world_size, df_train, df_val, df_test, param_dict, model_param, run)


if __name__ == '__main__':
    main()