tokenizers: RuntimeError: Already borrowed

We’re using transformers (3.5.0) with a fast tokenizer (0.9.3) in production, but sometimes a RuntimeError with Already borrowed is raised (this might come from Rust’s borrowing mechanisms?). This actually happens quite often, but I’m not yet sure why, or how to reproduce it.

However, this is where the error is raised:

https://github.com/huggingface/tokenizers/blob/598ce61229c789465966682687fa12a90ec58074/bindings/python/py_src/tokenizers/implementations/base_tokenizer.py#L107-L123

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 19
  • Comments: 46 (3 by maintainers)

Most upvoted comments

The error still exists in: transformers==4.3.2, tokenizers==0.10.1. I am using gunicorn (with threads) with flask, and the error shows up when parallel requests are made.

The problem does not exist in transformers==3.0.2, tokenizers==0.8.1.

Alright, that’s what I feared. This is happening because you have a single tokenizer that is used by 2 different threads. While the tokenizer is encoding on one thread, if the other thread tries to modify it, this error is raised because the tokenizer cannot be modified while it is in use.

I think the easiest way to fix it, for now, will be to ensure you have an instance of the tokenizer for each thread.

We should be able to fix this in transformers by making sure we update the truncation/padding info only if necessary (cc @LysandreJik @thomwolf). And we should also be able to improve this error to make it clearer on tokenizers.
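
For example, a minimal sketch of the per-thread approach using threading.local (the checkpoint name and helper function are only an illustration, not taken from this thread):

import threading

from transformers import AutoTokenizer

# One tokenizer per thread: each thread lazily creates its own instance,
# so no two threads ever mutate the same underlying Rust object.
_local = threading.local()


def get_thread_tokenizer():
    if not hasattr(_local, "tokenizer"):
        _local.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    return _local.tokenizer


def encode(text):
    # Safe to call concurrently: every thread works on its own tokenizer.
    return get_thread_tokenizer()(text, truncation=True, max_length=512)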

For those who may not be able to use the latest branch of this repository due to experimental work or other custom modifications: Wrapping the request into a mutex acquire/release statement does the job as well, as done here.

from threading import Lock
MUTEX = Lock()

MUTEX.acquire()
try:
    input_ids = self.tokenizer(...)
    output = self.model(...)
finally:
    MUTEX.release()
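
Equivalently (a small self-contained sketch; the bert-base-uncased checkpoint is only a placeholder), the lock can be used as a context manager so it is released even if tokenization raises:

from threading import Lock

from transformers import AutoTokenizer

MUTEX = Lock()
# Placeholder: any fast tokenizer that is shared between threads.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def encode(text):
    # "with" acquires the lock on entry and releases it on exit, so the
    # shared tokenizer is never mutated by two threads at the same time.
    with MUTEX:
        return tokenizer(text, padding="max_length", max_length=128)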

Good discussion. But I don’t quite understand why this truncation/padding info has to be global. It could be passed as a parameter so that each tokenize call would be thread-safe.

Yes, you cannot do this.

The tokenizer is thread-safe, but not meant to be used concurrently (hence the error, which says that 2 threads are trying to access the same thing at the same time, which is not allowed).

import os

os.environ["TOKENIZERS_PARALLELISM"] = "0"

from concurrent.futures import ThreadPoolExecutor
from transformers import AutoTokenizer

PARALLELISM = 2

raw_text = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Contrary to popular belief, Lorem Ipsum is not simply random text.
It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage,
and going through the cites of the word in classical literature, discovered the undoubtable source.
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC.
This book is a treatise on the theory of ethics, very popular during the Renaissance. The
first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.
"""


def preprocess_text(text, max_length=512):
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    sentences = text.split("\n")
    sentences_to_keep = []
    sentences_to_keep_nbr_token = 0
    for sentence in sentences:
        tokens_nbr = len(tokenizer(sentence, return_special_tokens_mask=True)["input_ids"])
        if sentences_to_keep_nbr_token + tokens_nbr <= max_length:
            sentences_to_keep.append(sentence)
            sentences_to_keep_nbr_token += tokens_nbr
        else:
            break
    return tokenizer(
        " ".join(sentences_to_keep), padding="max_length", max_length=max_length
    )


with ThreadPoolExecutor(max_workers=PARALLELISM) as executor:
    futures = [executor.submit(preprocess_text, raw_text) for i in range(PARALLELISM)]
    return_value = [future.result() for future in futures]
    print(return_value)

This works for instance (each thread gets its own copy of the tokenizer).

In the case where you are reusing threads for more tasks:

import os

os.environ["TOKENIZERS_PARALLELISM"] = "0"

import threading
from concurrent.futures import ThreadPoolExecutor
from transformers import AutoTokenizer

PARALLELISM = 2

raw_text = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. 
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. 
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Contrary to popular belief, Lorem Ipsum is not simply random text. 
It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. 
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, 
and going through the cites of the word in classical literature, discovered the undoubtable source. 
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. 
This book is a treatise on the theory of ethics, very popular during the Renaissance. The 
first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.
"""

TOKENIZER = {}


def get_tokenizer():
    _id = threading.get_ident()
    tokenizer = TOKENIZER.get(_id, None)
    if tokenizer is None:
        tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
        TOKENIZER[_id] = tokenizer
    return tokenizer


def preprocess_text(text, max_length=512):
    tokenizer = get_tokenizer()
    sentences = text.split("\n")
    sentences_to_keep = []
    sentences_to_keep_nbr_token = 0
    for sentence in sentences:
        tokens_nbr = len(tokenizer(sentence, return_special_tokens_mask=True)["input_ids"])
        if sentences_to_keep_nbr_token + tokens_nbr <= max_length:
            sentences_to_keep.append(sentence)
            sentences_to_keep_nbr_token += tokens_nbr
        else:
            break
    return tokenizer(
        " ".join(sentences_to_keep), padding="max_length", max_length=max_length
    )


with ThreadPoolExecutor(max_workers=PARALLELISM) as executor:
    futures = [executor.submit(preprocess_text, raw_text) for i in range(PARALLELISM)]
    return_value = [future.result() for future in futures]
    print(return_value)

should work, and each thread will get its own tokenizer.

Sharing a tokenizer across threads is fixable but not desirable: it would just slow everything down, since we would most likely wrap it in a mutex and each thread would end up waiting for the others. Given that tokenizers are relatively small objects, giving each thread its own instance seems better.

Lock-free sharing is just too complex for what it would bring (and it would prevent ANY modification of the underlying tokenizer, which is exactly what you are doing without realizing it).

tokenizer(...) and tokenizer(..., padding="max_length") need to modify the underlying object since the padding strategy is part of it.
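
You can observe this statefulness directly (a small sketch; the checkpoint is arbitrary, and it assumes a recent tokenizers version where the backend exposes its padding settings):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Freshly loaded: no padding strategy is configured on the Rust backend yet.
print(tokenizer.backend_tokenizer.padding)  # None

tokenizer("test", padding="max_length", max_length=16)

# The call above wrote the strategy into the shared backend object; this is
# exactly the mutation that collides with a concurrent encode on another thread.
print(tokenizer.backend_tokenizer.padding)  # now a dict describing the strategy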

As a side note, another way to fix it (which I don’t recommend) is:

import os

os.environ["TOKENIZERS_PARALLELISM"] = "0"

from concurrent.futures import ThreadPoolExecutor
from transformers import RobertaTokenizerFast

PARALLELISM = 2
tokenizer = RobertaTokenizerFast.from_pretrained("xlm-roberta-base")
tokenizer2 = RobertaTokenizerFast.from_pretrained("xlm-roberta-base")
# This mutates tokenizer2 to include the strategy before sharing
tokenizer2("test", padding="max_length", max_length=512)

raw_text = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. 
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. 
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Contrary to popular belief, Lorem Ipsum is not simply random text. 
It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. 
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, 
and going through the cites of the word in classical literature, discovered the undoubtable source. 
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. 
This book is a treatise on the theory of ethics, very popular during the Renaissance. The 
first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.
"""


def preprocess_text(text, tokenizer, max_length=512):
    sentences = text.split("\n")
    sentences_to_keep = []
    sentences_to_keep_nbr_token = 0
    for sentence in sentences:
        tokens_nbr = len(tokenizer(sentence, return_special_tokens_mask=True)["input_ids"])
        if sentences_to_keep_nbr_token + tokens_nbr <= max_length:
            sentences_to_keep.append(sentence)
            sentences_to_keep_nbr_token += tokens_nbr
        else:
            break
    return tokenizer2(
        " ".join(sentences_to_keep), padding="max_length", max_length=max_length
    )


with ThreadPoolExecutor(max_workers=PARALLELISM) as executor:
    futures = [
        executor.submit(preprocess_text, raw_text, tokenizer)
        for i in range(PARALLELISM)
    ]
    return_value = [future.result() for future in futures]
    print(return_value)

I want to add a comment to illustrate a specific example for which we found a workaround. We also faced this error when running preprocessing in an aiohttp API with concurrent requests. Neither #12550 nor setting TOKENIZERS_PARALLELISM=0 helped. Our preprocessing logic consists of the following steps:

  • Tokenize individual sentences (without padding) to get the number of tokens in each sentence
  • Combine sentences up to the max length (we either take a full sentence or drop it), using their token counts to guarantee the final max length
  • Tokenize the combined sentences (with padding enabled to max length)

Here is a full example to reproduce the error:

import os

os.environ["TOKENIZERS_PARALLELISM"] = "0"

from concurrent.futures import ThreadPoolExecutor
from transformers import RobertaTokenizerFast

PARALLELISM = 2
tokenizer = RobertaTokenizerFast.from_pretrained("./tokenizer/")

raw_text = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. 
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. 
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Contrary to popular belief, Lorem Ipsum is not simply random text. 
It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. 
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, 
and going through the cites of the word in classical literature, discovered the undoubtable source. 
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. 
This book is a treatise on the theory of ethics, very popular during the Renaissance. The 
first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.
"""


def preprocess_text(text, tokenizer, max_length=512):
    sentences = text.split("\n")
    sentences_to_keep = []
    sentences_to_keep_nbr_token = 0
    for sentence in sentences:
        tokens_nbr = len(tokenizer(sentence, return_special_tokens_mask=True)["input_ids"])
        if sentences_to_keep_nbr_token + tokens_nbr <= max_length:
            sentences_to_keep.append(sentence)
            sentences_to_keep_nbr_token += tokens_nbr
        else:
            break
    return tokenizer(" ".join(sentences_to_keep),
                     padding='max_length',
                     max_length=max_length)


with ThreadPoolExecutor(max_workers=PARALLELISM) as executor:
    futures = [executor.submit(preprocess_text, raw_text, tokenizer) for i in range(PARALLELISM)]
    return_value = [future.result() for future in futures]

Even a parallelism of 2 is enough to trigger the RuntimeError: Already borrowed.

The workaround we found for this situation is to create 2 separate tokenizer instances, one for each truncation/padding configuration:

  • tokenizer_a for tokenizing without padding/truncation
  • tokenizer_b for tokenizing with padding to max_length (truncation ignored)

By changing the code as follows, we no longer hit this error, even with more concurrency:

tokenizer_a = RobertaTokenizerFast.from_pretrained("./tokenizer/")
tokenizer_b = RobertaTokenizerFast.from_pretrained("./tokenizer/")

def preprocess_text(text, tokenizer_a, tokenizer_b, max_length=512):
    sentences = text.split("\n")
    sentences_to_keep = []
    sentences_to_keep_nbr_token = 0
    for sentence in sentences:
        tokens_nbr = len(tokenizer_a(sentence, return_special_tokens_mask=True)["input_ids"])
        if sentences_to_keep_nbr_token + tokens_nbr <= max_length:
            sentences_to_keep.append(sentence)
            sentences_to_keep_nbr_token += tokens_nbr
        else:
            break
    return tokenizer_b(" ".join(sentences_to_keep),
                       padding='max_length',
                       max_length=max_length)

with ThreadPoolExecutor(max_workers=PARALLELISM) as executor:
    futures = [executor.submit(preprocess_text, raw_text, tokenizer_a, tokenizer_b) for i in range(100)]
    return_value = [future.result() for future in futures]

I hope this may be helpful for some of you.

@Narsil - I can confirm the observation of @oborchers

I can reproduce with these two:

# server.py
from allennlp.predictors.predictor import Predictor
from fastapi import FastAPI

app = FastAPI()
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/transformer-qa.2021-02-11.tar.gz")


@app.get("/predict")
def predict_answer(passage: str, question: str):
    result = predictor.predict(
        passage=passage,
        question=question
    )
    return result["best_span_str"]

# client.py
import asyncio

import aiohttp


async def main():
    url = "http://localhost:8000/predict"
    params = dict(
        passage="The Matrix is a 1999 science fiction action film written and directed by The Wachowskis, starring Keanu Reeves, Laurence Fishburne, Carrie-Anne Moss, Hugo Weaving, and Joe Pantoliano.",
        question="Who stars in The Matrix?",
    )
    coros = (fetch(url, params) for _ in range(2))
    await asyncio.gather(*coros)


async def fetch(url, params=None):
    async with aiohttp.ClientSession() as session:
        async with session.get(url, params=params) as response:
            print(await response.json())


if __name__ == "__main__":
    asyncio.run(main())

If you change the client to fetch only 1 coro, you do not hit the error. But if you have 2, you get RuntimeError: Already borrowed.

This may come incredibly late, but if you are working with micro-services and are willing to replace the direct call with a POST request, I would much rather suggest to:

All my scaling and threading headaches when working with this in pure fastapi/flask fashion have been resolved since then.

It’s about the design choice that was made for the padding strategy.

Making it stateless would mean that every single call from Python to Rust has to pass it along, meaning a string crosses the Python->Rust boundary on every single call.

It turns out that Python -> Rust is not a free boundary; some work has to happen at each crossing. We didn’t take actual measurements, but making the Rust side purely stateless could hurt performance quite a bit.

Since in most cases users use either padding or no padding (usually training vs. inference), keeping it stateful is correct most of the time. The last version above shows how to effectively keep only 2 tokenizers whose state never changes.

Hope that helps.

asyncio doesn’t change anything about how your example fails. It’s the threading that causes the issue, not async (since tokenizers block the thread anyway).

Thanks for providing a solid testing script @jackhodkinson I have created a PR within transformers to reduce the amount of such errors: https://github.com/huggingface/transformers/pull/12550

Unfortunately, there’s no way to completely eliminate those errors without a major revamp of the encode function, as truncation and padding are part of the core struct of a tokenizer. I think it should cover 99% of the cases though, because padding and truncation options shouldn’t change that often in practice.

Please read the PR for more details about what the problem is and how it attempts to solve it.

I am having the same problem. Simple reproduction would be:

  • FastAPI endpoint
  • TextClassifier pipeline loaded and stored under app.states
  • Query runs the pipeline
  • Query multiple requests at the same time
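
A sketch of that setup with the mutex workaround from earlier in the thread applied, so concurrent requests serialize access to the shared pipeline (the model name and endpoint are illustrative, not from the report above):

from threading import Lock

from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()
# Placeholder model; the pipeline (and its fast tokenizer) is shared by all requests.
app.state.classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
app.state.lock = Lock()


@app.get("/classify")
def classify(text: str):
    # FastAPI runs sync endpoints in a thread pool, so two requests can hit
    # the same pipeline at once; the lock prevents concurrent tokenizer use.
    with app.state.lock:
        return app.state.classifier(text)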

Hi, I have the same problem with gunicorn. For some models it works, but for others it fails. I notice a difference between the 2 models:

This fails:

self.token_indexer.encode(x, max_length=350, truncation=True)

This seems to work:

self.token_indexer.encode(x, truncation=True)

The tokenizer is loaded at startup in gunicorn. When I receive a request, I try to tokenize the batch of text (probably in another thread). Is it because the set_truncation_and_padding function tries to modify the backend tokenizer (self._tokenizer), which is already owned by the first thread? In the second case (which works) the _tokenizer is not modified because max_length is at its default.

Could we pass this as an argument of the backend encoding function instead of modifying the backend tokenizer object?

I solved this problem using a thread lock in Python.

from threading import Lock 
lock = Lock()

lock.acquire()
model.encode()
lock.release()

Did I understand correctly that this is not a bug, but rather a misunderstanding of the non-thread-safe nature of the Python->Rust boundary? Should the issue be closed then? Or maybe, as part of a fix, one should measure what it would cost to make the calls stateless?

The easiest and least intrusive way, IMHO, is to use a Python queue, which is thread-safe by design.

Let’s assume you have N threads. Instead of creating one tokenizer instance per thread, you create M tokenizer instances, where M could be as low as 1 (the default), which is equivalent to using a simple lock. During initialization you put the M tokenizer instances into the queue, and afterwards you only use queue.get() and queue.put() whenever you need to access one of them.

The latter should be done inside a try: ... finally: block, so that a tokenizer is always guaranteed to be returned to the queue, e.g. in case of exceptions.

queue.get() blocks the calling thread as long as there are no free tokenizer instances and unblocks it immediately once another thread puts a tokenizer back into the queue. As Python queues are FIFOs, it is also guaranteed that all instances in the queue are used round-robin.

The necessary code is minimal and always thread-safe, and you can decouple the number of threads from the number of tokenizer instances. This makes resource usage very controllable as well.
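
A minimal sketch of this pattern (the checkpoint and M=2 are placeholders):

import queue

from transformers import AutoTokenizer

M = 2  # size of the tokenizer pool

# A FIFO queue acts as the thread-safe pool of tokenizer instances.
pool = queue.Queue()
for _ in range(M):
    pool.put(AutoTokenizer.from_pretrained("xlm-roberta-base"))


def encode(text, max_length=512):
    # Blocks until an instance is free; always return it, even on exceptions.
    tokenizer = pool.get()
    try:
        return tokenizer(
            text, padding="max_length", truncation=True, max_length=max_length
        )
    finally:
        pool.put(tokenizer)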

Pros & cons:

The problem with the above approach is that, as long as M is less than N, there will be thread contention under heavy load. Most general-purpose operating systems make no guarantee that waiting threads are scheduled in FIFO order. This means there is no latency guarantee that, e.g., your gRPC or web-server thread gets hold of a tokenizer instance before another thread that queried the queue later. In most cases this is not an issue, but if your server is under heavy load, this is why you often see high latency spikes. There is a reason realtime operating systems exist that make those guarantees.

I.e. if you require strict latency bounds, you need M == N.

Side note:

The issue here is not a tokenizer bug; it’s a misunderstanding by the user of the guarantees the tokenizers package makes with respect to multi-threading. If a package is not thread-safe, the user needs to handle the consequences rather than assume it’s a bug in the package. Thread-safe operation has overhead, especially in interpreted languages like Python with a single GIL, and if you only want to use one thread, that overhead shouldn’t be the default.

@jackhodkinson: Thank you very much for a reproducible example! @Narsil: Thanks for tackling the issue so super fast. Will check when back from holiday 💯

Do you mind sharing for other users, maybe?

Instead of loading the tokenizer before the thread fork, load it afterwards.

If you use torch.Dataset, for instance, this means loading the tokenizer in Dataset.__init__ instead of passing it in.
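
For example (a sketch, assuming a torch.utils.data.Dataset; the checkpoint and field names are placeholders):

import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer


class TextDataset(Dataset):
    def __init__(self, texts, max_length=128):
        # Load the tokenizer here instead of passing a shared instance in,
        # so every copy of the dataset owns its own tokenizer.
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.texts = texts
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
        )
        return {k: torch.tensor(v) for k, v in enc.items()}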

Did you try not sharing the tokenizer among multiple threads? (The easiest way is to load the tokenizer on each thread instead.)

There are some protections implemented, but there is only so much the library can do against that.

This happens with the fast tokenizer (TokenizerFast) for me. The workaround is to not use it.

The application runs in a Docker container with gunicorn like:

$ gunicorn --workers 1 --threads 2 --worker-class gthread