Rate Limit Exceeded


Recently, I have started getting this error when browsing the runs/workspace of my latest project.

For info, it is a segmentation task trained in PyTorch using the fastai callback.

My log looks like this:

0: *Quadro RTX 8000,         48.6GB, tensor_cores=72
1: GeForce RTX 2070 SUPER,   8.0GB, tensor_cores=40
 Selecting GPU : Quadro RTX 8000
Exception ignored in: <function _releaseLock at 0x7f3232131ee0>
Traceback (most recent call last):
  File "/home/tcapelle/miniconda3/envs/pytorch_18/lib/python3.8/logging/__init__.py", line 227, in _releaseLock
    def _releaseLock():
WandbCallback requires use of "SaveModelCallback" to log best model
wandb: 429 encountered ({"error":"rate limit exceeded"}), retrying request
wandb: Network error resolved after 0:00:08.499452, resuming normal operation.
wandb: 429 encountered ({"error":"rate limit exceeded"}), retrying request
wandb: 429 encountered ({"error":"rate limit exceeded"}), retrying request
wandb: Network error resolved after 0:00:09.337473, resuming normal operation.
wandb: 429 encountered ({"error":"rate limit exceeded"}), retrying request
wandb: Network error resolved after 0:00:08.679539, resuming normal operation.

This happens in the training notebooks and on the wandb page afterwards.

Thanks for flagging! We’re looking into it :tea:

It appears the issue is caused by one of three things: logging too frequently from one run, logging from too many runs in parallel, or loading too much data in the UI. If we can track down which one it is, that will help us debug. Is it possible for you to share the run page/link?

The team also deployed a related fix today. Could you please confirm whether you’re still facing the issue today?

TIA! :slight_smile:

Thanks, Thomas. Since the link is visible to everyone, I deleted your comment to keep it private. I hope that’s okay.

I’ll get back once the team takes a look :slight_smile:

It appears to be working fine now.

Thank you, please let me know in case this or any other issue reappears :smiley:

Same problem here. It appears to be some IP-based restriction on WandB’s side. Not very happy with this :frowning: especially since it’s blocking training which hasn’t even started, so I’ve no idea why it’s reaching arbitrary rate limits.

Hi Neel! It’s nice to meet you, I’m Leslie from the support team. I have increased the rate limits for your account.

Hi @lesliewandb, I am encountering the same issue. Can you please help me fix it?

Hi, I’m getting the same issue here.

Pretty sure in my case it’s logging from too many runs in parallel (200). How do I decrease the logging frequency?
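
One way to cut the number of log calls per run is to buffer metrics and forward them only every N steps. Below is a minimal sketch, not an official W&B feature: `ThrottledLogger` and `every_n` are hypothetical names, and it assumes the callable you pass in accepts `(metrics, step=...)` the way `wandb.log` does. It keeps only the latest value per key between flushes, so intermediate values are dropped. Whether this alone is enough to stay under the limit with 200 parallel runs is untested.

```python
from typing import Any, Callable, Dict


class ThrottledLogger:
    """Buffer metrics and forward them to `log_fn` only every `every_n` calls.

    Hypothetical helper for illustration; `log_fn` is assumed to accept
    (metrics, step=...), e.g. `wandb.log`.
    """

    def __init__(self, log_fn: Callable[..., Any], every_n: int = 50) -> None:
        if every_n < 1:
            raise ValueError("every_n must be >= 1")
        self.log_fn = log_fn
        self.every_n = every_n
        self._latest: Dict[str, Any] = {}
        self._calls = 0

    def log(self, metrics: Dict[str, Any], step: int) -> None:
        # Keep only the most recent value of each metric between flushes.
        self._latest.update(metrics)
        self._calls += 1
        if self._calls % self.every_n == 0:
            self.flush(step)

    def flush(self, step: int) -> None:
        # Send whatever is buffered; call this once more at the end of training.
        if self._latest:
            self.log_fn(dict(self._latest), step=step)
            self._latest = {}
```

In a training loop this would be used roughly as `logger = ThrottledLogger(wandb.log, every_n=50)`, then `logger.log({"loss": loss}, step=i)` inside the loop and `logger.flush(step=i)` after it.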

This seems relevant:

The W&B API is rate limited by IP and API key. New accounts are restricted to 200 requests per minute. This rate allows you to run approximately 15 processes in parallel and have them report without being throttled. If the wandb client detects it’s being limited, it will back off and retry sending the data in the future. If you need to run more than 15 processes in parallel, send an email to contact@wandb.com.
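
For anyone wondering what "back off and retry" usually means in practice: a common approach is exponential backoff with jitter. The sketch below only illustrates that general technique and is not the wandb client's actual code. `backoff_delays`, the 2-second base, and the 25% jitter are assumptions, picked because they are roughly consistent with the retry times in the 429 log lines posted in this thread (about 2, 4, 9, and 18 seconds).

```python
import random
from typing import List, Optional


def backoff_delays(
    base: float = 2.0,
    max_attempts: int = 4,
    jitter: float = 0.25,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the wait (in seconds) before each retry attempt.

    The delay doubles on every attempt, and a random extra of up to
    `jitter` (as a fraction of the delay) is added so that many clients
    do not all retry at the same instant.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        nominal = base * (2 ** attempt)  # 2, 4, 8, 16, ...
        delays.append(nominal * (1.0 + rng.uniform(0.0, jitter)))
    return delays


if __name__ == "__main__":
    for i, delay in enumerate(backoff_delays(), start=1):
        print(f"retry {i}: wait {delay:.2f}s")
```

With many processes sharing one IP and API key, every process runs its own retry schedule, which is why the warnings keep appearing even when each individual run sends very little.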

I don’t mind the client sending less frequently but I’d like to get rid of the warnings clogging up my log files:

1/1 [==============================] - ETA: 0s
1/1 [==============================] - 2s 2s/step
wandb: 429 encountered (Filestream rate limit exceeded, retrying in 2.440871229689856 seconds), retrying request
wandb: 429 encountered (Filestream rate limit exceeded, retrying in 4.7759137728990915 seconds), retrying request
wandb: 429 encountered (Filestream rate limit exceeded, retrying in 9.637983936851159 seconds), retrying request

1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 23ms/step
wandb: 429 encountered (Filestream rate limit exceeded, retrying in 17.76077947047857 seconds), retrying request
wandb: 429 encountered (Filestream rate limit exceeded, retrying in 2.2036541589775873 seconds), retrying request
wandb: 429 encountered (Filestream rate limit exceeded, retrying in 4.225510290209322 seconds), retrying request

1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 25ms/step
wandb: 429 encountered (Filestream rate limit exceeded, retrying in 9.133716207985449 seconds), retrying request
wandb: 429 encountered (Filestream rate limit exceeded, retrying in 18.076738516025497 seconds), retrying request
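
As a stopgap for the clogged log files, the retry lines can be filtered out after the fact. This is a minimal sketch that assumes the lines always have the shape shown above (`wandb: 429 encountered (...), retrying request`); `drop_wandb_retry_lines` is a hypothetical helper name, and it does not stop the client from printing the lines in the first place.

```python
import re
from typing import Iterable, Iterator

# Matches lines such as:
#   wandb: 429 encountered (Filestream rate limit exceeded, retrying in 2.4 seconds), retrying request
_RETRY_LINE = re.compile(r"^wandb: 429 encountered \(.*\), retrying request\s*$")


def drop_wandb_retry_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield every line except wandb's 429 retry notices."""
    for line in lines:
        if not _RETRY_LINE.match(line):
            yield line
```

For a log file this could be applied as `cleaned = list(drop_wandb_retry_lines(open("train.log")))`, where `train.log` is a placeholder path.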

Looks like this is where the logging happens

Maybe need to set a larger retry_polling_interval on RunStatusChecker?


Doesn’t appear to help. I tried

# put this line after wandb.init()
# https://community.wandb.ai/t/753/14
wandb.run._run_status_checker._retry_polling_interval = 50  # type: ignore

Same problem here. Can you please help me fix it?

same issue!

wandb: 429 encountered (Filestream rate limit exceeded, retrying in 2.4 seconds.), retrying request
wandb: 429 encountered (Filestream rate limit exceeded, retrying in 4.7 seconds.), retrying request
wandb: 429 encountered (Filestream rate limit exceeded, retrying in 2.0 seconds.), retrying request
wandb: 429 encountered (Filestream rate limit exceeded, retrying in 4.1 seconds.), retrying request

My runs barely log anything: they only log a single number at the end. Why is wandb complaining?

This is the code:

def experiment_compute_diveristy_coeff_single_dataset_then_combined_datasets_with_domain_weights():
    """
    Get divs using pt ft, pt (rand, rand ft?)
    - div c4
    - div wt = wt-103
    Then with unioned datasets
    - div c4+wt, uniform [0.5, 0.5]
    - # div c4+wt, data set size proportions (using GBs)
    - div c4+wt, respect doremi
    - div c4+wt, respect the pile
    - div c4+wt, respect gpt3 weights
    then repeat all with pt (no ft)
    """
    # Standard-library and numpy imports used further down (datetime, json, Path, np)
    import datetime
    import json
    import random
    from pathlib import Path
    import numpy as np
    from diversity.data_mixtures import get_uniform_data_mixture_for_c4_wt103, get_doremi_based_data_mixture_for_c4_wt103, get_llama_v1_based_data_mixture_for_c4_wt103
    probabilities = []
    data_mixture_name = None
    streaming = True
    data_files = [None]
    seed = 0
    # -- Setup wandb
    import wandb
    # - Dryrun
    # mode = 'dryrun'; num_batches = 3
    mode = 'dryrun'; num_batches = 3; seed = random.randint(0, 2**32 - 1)

    # - Online (real experiment)
    mode='online'; num_batches = 600; seed = random.randint(0, 2**32 - 1)
    # - c4 wt single
    # path, name = 'c4', 'en'
    # path, name = "wikitext", 'wikitext-103-v1'
    # path, name = 'Skylion007/openwebtext', None
    # - c4 wt mix
    # path, name, data_files = ['c4', 'wikitext'], ['en', 'wikitext-103-v1'], [None, None]
    # probabilities, data_mixture_name = get_uniform_data_mixture_for_c4_wt103()
    # probabilities, data_mixture_name = get_doremi_based_data_mixture_for_c4_wt103()
    # probabilities, data_mixture_name = get_llama_v1_based_data_mixture_for_c4_wt103()
    # probabilities, data_mixture_name = [0.75, 0.25], '[0.75, 0.25]' 
    # probabilities, data_mixture_name = [0.25, 0.75], '[0.25, 0.75]' 
    # - pile, pile cc single 
    # path, name = 'EleutherAI/pile', 'all'
    # path, name = 'conceptofmind/pile_cc', 'sep_ds'
    # - 5 subsets of pile using hf data set viewer (parquet)) 
    from diversity.pile_subset_urls import urls_hacker_news, urls_nih_exporter, urls_pubmed, urls_uspto
    # path, name, data_files = 'conceptofmind/pile_cc', 'sep_ds', [None]
    # path, name, data_files = 'parquet', 'hacker_news', urls_hacker_news
    # path, name, data_files = 'parquet', 'nih_exporter', urls_nih_exporter
    # path, name, data_files = 'parquet', 'pubmed', urls_pubmed
    path, name, data_files = 'parquet', 'uspto', urls_uspto
    # - 5 subsets of the pile interleaved
    # from diversity.pile_subset_urls import urls_hacker_news, urls_nih_exporter, urls_pubmed, urls_uspto
    # from diversity.data_mixtures import get_uniform_data_mixture_5subsets_of_pile, get_doremi_data_mixture_5subsets_of_pile, get_llama_v1_data_mixtures_5subsets_of_pile
    # path, name, data_files = ['conceptofmind/pile_cc'] + ['parquet'] * 4, ['sep_ds'] + ['hacker_news', 'nih_exporter', 'pubmed', 'uspto'], [None] + [urls_hacker_news, urls_nih_exporter, urls_pubmed, urls_uspto]
    # probabilities, data_mixture_name = get_uniform_data_mixture_5subsets_of_pile()
    # probabilities, data_mixture_name = get_llama_v1_data_mixtures_5subsets_of_pile(name)
    # probabilities, data_mixture_name = get_doremi_data_mixture_5subsets_of_pile(name)
    # - not changing
    batch_size = 512
    today = datetime.datetime.now().strftime('%Y-m%m-d%d-t%Hh_%Mm_%Ss')
    run_name = f'{path} div_coeff_{num_batches=} ({today=} ({name=}) {data_mixture_name=} {probabilities=})'
    print(f'\n---> {run_name=}\n')

    # - Init wandb
    debug: bool = mode == 'dryrun'
    run = wandb.init(mode=mode, project="beyond-scale", name=run_name, save_code=True)
    wandb.config.update({"num_batches": num_batches, "path": path, "name": name, "today": today, 'probabilities': probabilities, 'batch_size': batch_size, 'debug': debug, 'data_mixture_name': data_mixture_name, 'streaming': streaming, 'data_files': data_files, 'seed': seed})
    # run.notify_on_failure() # https://community.wandb.ai/t/how-do-i-set-the-wandb-alert-programatically-for-my-current-run/4891

    # -- Get probe network
    from datasets import load_dataset, interleave_datasets  # interleave_datasets is used when mixing datasets below
    from datasets.iterable_dataset import IterableDataset
    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    probe_network = GPT2LMHeadModel.from_pretrained("gpt2")
    device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
    probe_network = probe_network.to(device)

    # -- Get data set
    def my_load_dataset(path, name, data_files=data_files):
        print(f'{path=} {name=} {streaming=} {data_files=}')
        if path == 'json' or path == 'bin' or path == 'csv':
            return load_dataset(path, data_files=data_files_prefix+name, streaming=streaming, split="train").with_format("torch")
        elif path == 'parquet':
            return load_dataset(path, data_files=data_files, streaming=streaming, split="train").with_format("torch")
        else:
            return load_dataset(path, name, streaming=streaming, split="train").with_format("torch")
    # - get data set for real now
    if isinstance(path, str):
        dataset = my_load_dataset(path, name, data_files)
    else:
        # - Interleaving datasets
        print('- Interleaving datasets')
        datasets = [my_load_dataset(path, name, data_files).with_format("torch") for path, name, data_files in zip(path, name, data_files)]
        # datasets = [my_load_dataset(path, name).with_format("torch") for path, name in zip(path, name)]
        if any('parquet' == p for p in path) or path == 'parquest':  # idk why I need to do this, I checked very carefully and deleted all columns so interleaved data set matched but when doing this with c4 & wikitext it fails but with the parquet it works https://discuss.huggingface.co/t/why-does-deleting-the-columns-before-giving-it-to-interleave-work-but-sometimes-it-does-not-work/50879
            dataset_descriptions = [dataset.description for dataset in datasets]  # print description if available
            # - make sure all datasets have the same columns to avoid interleave to complain
            all_columns = [col for dataset in datasets for col in dataset.column_names]
            columns_to_remove = [col for dataset in datasets for col in dataset.column_names if col != 'text']
            columns_to_remove = list(set(columns_to_remove))  # remove duplicates
            datasets = [dataset.remove_columns(columns_to_remove) for dataset in datasets]
            # - interleave
            dataset_descriptions = [dataset.description for dataset in datasets]  # print description if available
        dataset = interleave_datasets(datasets, probabilities)
        # dataset = dataset.remove_columns(columns_to_remove)
    # datasets.iterable_dataset.IterableDataset
    # datasets.arrow_dataset.Dataset
    # dataset = IterableDataset(dataset) if type(dataset) != IterableDataset else dataset  # to force dataset.take(batch_size) to work in non-streaming mode
    raw_text_batch = dataset.take(batch_size) if streaming else dataset.select(range(batch_size))
    column_names = next(iter(raw_text_batch)).keys()

    # - Prepare functions to tokenize batch
    def preprocess(examples):
        return tokenizer(examples["text"], padding="max_length", max_length=128, truncation=True, return_tensors="pt")
    remove_columns = column_names  # remove all keys that are not tensors to avoid bugs in collate function in task2vec's pytorch data loader
    def map(batch):
        return batch.map(preprocess, batched=True, remove_columns=remove_columns)
    tokenized_batch = map(raw_text_batch)

    # -- Compute diversity coefficient
    print(f'-- Compute diversity coefficient')
    print(f'{seed=}, {streaming=}')
    # - Debug run
    # results: dict = get_diversity_coefficient(dataset, map, probe_network, num_batches=3, seed=seed, debug=True, shuffle=False)  # (quick debug) hardcoded for debugging
    # results: dict = get_diversity_coefficient(dataset, map, probe_network, num_batches=3, seed=seed, debug=True, shuffle=True)  # (slow debug) hardcoded for debugging
    # results: dict = get_diversity_coefficient(dataset, map, probe_network, num_batches=3, seed=seed, debug=False, shuffle=False)  # (real) hardcoded for debugging
    # - Real run
    # assert not debug, f'Err: {debug=} for real run'
    results: dict = get_diversity_coefficient(dataset, map, probe_network, num_batches=num_batches, seed=seed, debug=debug, shuffle=False)
    # results: dict = get_diversity_coefficient(dataset, map, probe_network, num_batches=num_batches, seed=seed, debug=debug, shuffle=True)
    # - Log results
    div_coeff, div_coeff_ci = results['div_coeff'], results['div_coeff_ci']
    print(f'{div_coeff=} {div_coeff_ci=}')
    wandb.log({'div_coeff': div_coeff, 'div_coeff_ci': div_coeff_ci})

    # -- Save results or not
    save_results = True
    if save_results:
        output_dir = Path(f'~/data/div_coeff/{today}').expanduser()
        output_dir.mkdir(parents=True, exist_ok=True)
        np.save(output_dir / f'distance_matrix{today}.npy', results['distance_matrix'])
        np.save(output_dir / f'results{today}.npy', results)
        # Save results as a pretty-printed JSON
        results = {key: str(value) for key, value in results.items()}
        with open(output_dir / f'results{today}.json', 'w') as f:
            json.dump(results, f, indent=4)
        # - wandb save
        base_path = str(output_dir.parent)
        wandb.save(str(output_dir / f'distance_matrix{today}.npy'), base_path=base_path)
        wandb.save(str(output_dir / f'results{today}.npy'), base_path=base_path)
        wandb.save(str(output_dir / f'results{today}.json'), base_path=base_path)

Those are the only times I call wandb @_scott

Hi @brando @weikang we rolled back a change made to rate limiting, so if you run your code again it should work. Please let us know if you are still encountering this issue, or if that’s now resolved for you.

At least in my case, this keeps happening.

Is there a way to just mute the notifications while it gets resolved?
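
One thing that may be worth trying is the `WANDB_SILENT` environment variable, which the W&B docs describe as silencing wandb log statements. The sketch below sets it from Python before wandb is initialised. `silence_wandb_console` is a hypothetical helper name, and whether this setting hides the 429 retry lines specifically is an assumption I have not verified.

```python
import os


def silence_wandb_console() -> None:
    """Ask wandb to stop printing its status lines to the console.

    Assumption: this must run before `wandb.init()` (ideally before
    `import wandb`) for the setting to be picked up.
    """
    os.environ["WANDB_SILENT"] = "true"


silence_wandb_console()
# import wandb
# wandb.init(project="my-project")  # "my-project" is a placeholder name
```

The same variable can also be exported in the shell that launches the training script instead of being set in code.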

In my case it still happens, and my code does not involve a significant amount of communication.