Hi,
Overall my code is running quite well. However, there are two bottlenecks I am running into. One is automation related and is probably best dealt with in a separate post. The bigger one is that the runs themselves complete quite fast (about half a second per wandb.run), but the actual upload then takes several seconds to minutes at the “wandb: Waiting for W&B process to finish… (success)” step. Is it possible to direct the sweep or wandb.run to skip uploading after every run and instead upload as a batch at the end, or every n runs? That way, I could run a local analysis script once the actual runs finish while waiting for the upload to complete. I am also not sure, but perhaps combining the tables across multiple runs before uploading might increase the upload speed? Where would I find the JSON table files locally if I wanted to access them this way?
I am using the command-line interface to run 3 instances of the wandb agent in parallel (which I think is the maximum possible), which is perhaps not the smartest way of handling so many runs.
Weights & Biases is designed to stream logs in real time, which is why it uploads data after every run. However, you can set WANDB_MODE=offline to train offline and sync the results later. In this mode, data is logged to a local directory during training, and you can then manually sync your runs to the cloud. Here’s how you can do it:
# Set the environment variable before initializing any runs
import os
os.environ["WANDB_MODE"] = "offline"

# Your training code here
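If you prefer to set this in code rather than through the environment variable, wandb.init also accepts a mode argument. A minimal sketch (the project name is just a placeholder):

import wandb

# mode="offline" writes the run to the local wandb directory instead of uploading it
run = wandb.init(project="my-project", mode="offline")
# Your training and logging code here
run.finish()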
After your training is done, you can sync your runs to the cloud with the following command:
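wandb sync wandb/offline-run-*

This syncs every offline run directory under the default ./wandb folder; you can also pass a single directory, e.g. wandb sync wandb/offline-run-<date>_<time>-<run-id>, to upload runs one at a time. The offline-run-* naming is the usual layout, but check your local wandb folder to confirm the exact directory names.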
Regarding your question about combining tables across multiple runs to increase upload speed, it’s not clear whether this would actually speed up the process. However, you can find the JSON table files locally in the directory specified by the WANDB_DIR environment variable. If WANDB_DIR is not set, the default directory is ./wandb.
import os

# Set WANDB_DIR to the directory where you want wandb to write its local files
os.environ["WANDB_DIR"] = os.path.abspath("your/directory")
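If you want to inspect or combine these tables locally, here is a minimal sketch. It assumes that tables logged with wandb.Table are written as *.table.json files somewhere under each run directory, and that each file contains “columns” and “data” keys; both are assumptions about the on-disk layout, so please verify them against your own wandb folder.

import glob
import json
import os

wandb_dir = os.environ.get("WANDB_DIR", "./wandb")

# Recursively collect the table JSON files from every local run directory
table_files = glob.glob(os.path.join(wandb_dir, "**", "*.table.json"), recursive=True)

columns = None
combined_rows = []
for path in table_files:
    with open(path) as f:
        table = json.load(f)  # assumed schema: {"columns": [...], "data": [[...], ...]}
    columns = columns or table["columns"]
    combined_rows.extend(table["data"])

print(f"Collected {len(combined_rows)} rows from {len(table_files)} table files")

This keeps the analysis entirely local; whether merging tables this way actually reduces upload time is something you would need to measure.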
Remember that the wandb process is separate from your training process, so it should not block your training. If you’re experiencing significant delays, it might be due to network issues or because you’re logging a large amount of data. If you’re logging less than once a second and less than a few megabytes of data at each step, the effect on your training performance should be negligible.
Can you also provide the following information for me, please:
Can you explain your experiment a bit more?
What type of data are you logging?
Can you provide the debug.log and debug-internal.log files? These files are under your local folder wandb/run-<date>_<time>-<run-id>/logs in the same directory where you’re running your code.
Hi @bkaplowitz , since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!