transformers: Tokenizers throwing warning "The current process just got forked, Disabling parallelism to avoid deadlocks.. To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)"

I know this warning appears because the transformers library was updated to 3.x, and I know the warning says to set TOKENIZERS_PARALLELISM = true / false.

My question is: where should I set TOKENIZERS_PARALLELISM = true / false? Is it when defining the tokenizer, like

tok = Tokenizer.from_pretrained('xyz', TOKENIZERS_PARALLELISM=True)  # this doesn't work

or is it when encoding text, like

tok.encode_plus(text_string, some=some, some=some, TOKENIZERS_PARALLELISM=True)  # this also didn't work

Suggestions anyone?

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 47
  • Comments: 17 (5 by maintainers)

Most upvoted comments

This is happening whenever you use multiprocessing (often used by data loaders). The way to disable this warning is to set the TOKENIZERS_PARALLELISM environment variable to the value that makes the most sense for you. By default, we disable the parallelism to avoid any hidden deadlock that would be hard to debug, but you might be totally fine keeping it enabled in your specific use case.

You can try to set it to true, and if your process seems to be stuck, doing nothing, then you should use false.
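
For example, here is a minimal sketch of setting it in Python before any tokenization happens ("xyz" is just a placeholder model name; exporting TOKENIZERS_PARALLELISM=false in the shell before launching the script works as well):

import os

# Must be set before the fast tokenizer is used for the first time,
# ideally at the very top of the script.
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # or "true" to keep parallelism

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xyz")  # "xyz" is a placeholder model name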

We’ll improve this message to help avoid any confusion (Cf https://github.com/huggingface/tokenizers/issues/328)

I may be a rookie, but it seems like it would be useful to indicate that this is an environment variable in the warning message.

I suspect this may be caused by loading data. In my case, it happens when my dataloader starts working.

Despite the documentation saying that use_fast defaults to False, adding use_fast=False, so that the call is AutoTokenizer.from_pretrained(model_name, use_fast=False), removed this warning for me. If I just use AutoTokenizer.from_pretrained(model_name), the warning pops up again.
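
For reference, a minimal sketch of that workaround (as I understand it, use_fast=False gives you the slow, pure-Python tokenizer, which does not use the internal parallelism that produces this warning):

from transformers import AutoTokenizer

model_name = "xyz"  # placeholder model name

# The slow (pure-Python) tokenizer does not spawn the Rust thread pool,
# so the fork warning is not triggered.
slow_tok = AutoTokenizer.from_pretrained(model_name, use_fast=False)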

After testing, I found that the warning is triggered when data from a dataloader is tokenized and the loop over that dataloader is broken out of before the loader is exhausted. Here is a code example:

# For example, the following code will trigger the warning
for texts in train_dataloader:
    _ = tokenizer.batch_encode_plus(texts)
    # the loader has not been fully traversed,
    # but the texts have been tokenized
    break
for texts in test_dataloader:
    # warning is printed here
    break

# ...and the following code will not trigger the warning
for texts in train_dataloader:
    # the loader has not been fully traversed,
    # and the texts are not tokenized
    break
for texts in test_dataloader:
    # no warning
    break

You are totally right! In the latest version 3.0.2, the warning message should be a lot better, and it will trigger only when necessary.

You must be using a tokenizer before the multiprocessing starts. When your process gets forked, you see this message because the library detects that a fork is happening and that some kind of parallelism was used before.
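
A minimal sketch of that sequence (hypothetical example, assuming a fast tokenizer and a PyTorch DataLoader with num_workers > 0, which forks worker processes on Linux; "xyz" is a placeholder model name):

from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xyz")  # fast tokenizer by default

# 1. Batch-encode in the parent process: the tokenizer's internal parallelism gets used.
_ = tokenizer(["some text", "more text"], padding=True)

# 2. Fork worker processes: the library detects the earlier parallelism
#    and prints the TOKENIZERS_PARALLELISM warning.
loader = DataLoader(["a", "b", "c", "d"], batch_size=2, num_workers=2)
for batch in loader:
    pass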

I want to know if we can ignore this warning. What bad effects will it have? Will it affect the training results, or is it just a little slower? If the environment variable is changed according to the above solution, what is the cost of doing so?

@hzphzp there is an explanation on Stack Overflow: https://stackoverflow.com/questions/62691279/how-to-disable-tokenizers-parallelism-true-false-warning/72926996#72926996
