tokenizers: RuntimeError: Already borrowed
We’re using `transformers` (3.5.0) with a fast tokenizer (`tokenizers` 0.9.3) in production, but sometimes a `RuntimeError: Already borrowed` is raised (this might come from Rust’s borrowing mechanism?). This actually happens quite often, but I’m not sure yet why or how to reproduce it. However, this is where the error is raised:
About this issue
- State: open
- Created 4 years ago
- Reactions: 19
- Comments: 46 (3 by maintainers)
Links to this issue
- Fix: Create engine and tokenizer on each call instead of using singletons (!229) · Merge requests · GitLab.org / ModelOps / AI Assisted (formerly Applied ML) / Code Suggestions / Model Gateway · GitLab
Commits related to this issue
- This will reduce "Already borrowed error": Original issue https://github.com/huggingface/tokenizers/issues/537 The original issue is caused by transformers calling many times mutable functions on th... — committed to Narsil/transformers by Narsil 3 years ago
- This will reduce "Already borrowed error": (#12550) * This will reduce "Already borrowed error": Original issue https://github.com/huggingface/tokenizers/issues/537 The original issue is caused... — committed to huggingface/transformers by Narsil 3 years ago
The error still exists in transformers==4.3.2, tokenizers==0.10.1. I am using gunicorn (with threads) with flask, and the error shows up when parallel requests are made.
The problem does not exist in transformers==3.0.2, tokenizers==0.8.1.
Alright, that’s what I feared. This is happening because you have a single tokenizer that is used by two different threads. While the tokenizer is encoding on one thread, if the other thread tries to modify it, this error happens because the tokenizer cannot be modified while it is being used.
I think the easiest way to fix it, for now, will be to ensure you have an instance of the tokenizer for each thread.
We should be able to fix this in `transformers` by making sure we update the truncation/padding info only if necessary (cc @LysandreJik @thomwolf). And we should also be able to improve this error on the `tokenizers` side to make it clearer.
For those who may not be able to use the latest branch of this repository due to experimental work or other custom modifications: wrapping the request in a mutex acquire/release statement does the job as well, as done here.
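For illustration, a minimal sketch of that mutex workaround, assuming a single shared fast tokenizer guarded by a `threading.Lock` (the model name and encode kwargs are placeholders):

```python
import threading

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
tokenizer_lock = threading.Lock()

def encode(texts):
    # Only one thread at a time may touch the shared tokenizer, so the
    # underlying Rust object is never mutated while another call borrows it.
    with tokenizer_lock:
        return tokenizer(texts, padding="max_length", truncation=True, max_length=128)
```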
Good discussion. But I don’t quite understand why this truncation/padding info has to be global. It could be passed as a parameter so that each tokenize call would be thread-safe.
Yes, you cannot do this. The `tokenizer` is thread-safe, but not meant to be used concurrently (hence the error, which says two threads are trying to access the same thing at the same time, which is not allowed).

Giving each thread its own copy of the tokenizer works, for instance. In the case where you are reusing threads for more tasks, the same idea should work, and each thread will get its own tokenizer.
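The original snippets are not preserved in this excerpt; a minimal sketch of the per-thread approach for reused worker threads, using `threading.local` so each thread lazily builds its own instance (model name is a placeholder):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

from transformers import AutoTokenizer

_local = threading.local()

def get_tokenizer():
    # Each worker thread lazily creates and caches its own tokenizer, so no
    # two threads ever share (and mutate) the same underlying Rust object.
    if not hasattr(_local, "tokenizer"):
        _local.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    return _local.tokenizer

def encode(text):
    return get_tokenizer()(text, padding="max_length", truncation=True, max_length=128)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(encode, ["some text"] * 100))
```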
Sharing a tokenizer across threads is fixable but not desirable; it would just slow everything down, since we would most likely put a mutex around it and make each thread wait its turn. Given that tokenizers are relatively small objects, giving each thread its own seems better.
Lock-free sharing is just too complex for what it would bring (and it would prevent ANY modification of the underlying tokenizer, which is what you are doing without realizing it).
`tokenizer(...)` and `tokenizer(..., padding="max_length")` need to modify the underlying object, since the padding strategy is part of it. As a side note, another way to fix it (which I don’t recommend) is:
I want to add a comment to illustrate a specific example for which we found a workaround. We also faced this error when running preprocessing on an aiohttp API with concurrent requests. Neither #12550 nor setting `TOKENIZERS_PARALLELISM=0` helped with it. Our preprocessing logic is made of two steps. Here is a full example to reproduce the error:
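The original script is not reproduced in this excerpt; a condensed sketch of the pattern being described (one shared fast tokenizer called concurrently with two different truncation/padding configurations, the aiohttp handlers reduced to a thread pool; model name and texts are placeholders):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # one shared fast tokenizer

def preprocess(text):
    # Step 1: tokenize without padding/truncation.
    plain = tokenizer(text)
    # Step 2: tokenize with padding/truncation, which mutates the shared Rust tokenizer.
    padded = tokenizer(text, padding="max_length", truncation=True, max_length=64)
    return plain, padded

async def main(parallelism=2):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        tasks = [loop.run_in_executor(pool, preprocess, "some text") for _ in range(1000)]
        await asyncio.gather(*tasks)  # parallelism >= 2 can already raise "Already borrowed"

asyncio.run(main())
```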
Even a parallelism of 2 is enough to trigger the `RuntimeError: Already borrowed`.

The workaround we found for this situation is to create two separate tokenizer instances, one for each truncation/padding configuration:
- `tokenizer_a` for tokenizing without padding/truncation
- `tokenizer_b` for tokenizing with padding to `max_length`, truncation ignored

So by changing the code as follows, we no longer have this error even with more concurrency:
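The modified code is not preserved here; a sketch of the two-instance workaround (model name and exact kwargs are placeholders):

```python
from transformers import AutoTokenizer

# Two independent instances, one per configuration, so neither instance
# ever needs to change its truncation/padding state at request time.
tokenizer_a = AutoTokenizer.from_pretrained("bert-base-uncased")  # no padding/truncation
tokenizer_b = AutoTokenizer.from_pretrained("bert-base-uncased")  # padded configuration

def preprocess(text):
    step_1 = tokenizer_a(text)
    step_2 = tokenizer_b(text, padding="max_length", max_length=64)
    return step_1, step_2
```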
I hope this may be helpful for some of you.
@Narsil - I can confirm the observation of @oborchers
I can reproduce with these two:

If you change the client to fetch only 1 coro you do not hit the error, but if you have 2 you get `RuntimeError: Already borrowed`.
This may come incredibly late, but if you are working with micro-services and are willing to exchange the base call for a POST request, I would much rather suggest to:
All my scaling and threading headaches when working with this in pure fastapi/flask fashion have been resolved since then.
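The concrete recipe is not preserved in this excerpt; purely as an illustration of the idea of swapping the in-process tokenizer call for a POST request to a separate, single-purpose tokenization service (every name and URL below is hypothetical):

```python
import requests

# Hypothetical endpoint of a dedicated tokenization service; the tokenizer
# lives only in that process, so the web workers never share one at all.
TOKENIZER_URL = "http://tokenizer-service:8000/tokenize"

def tokenize_remote(texts):
    response = requests.post(TOKENIZER_URL, json={"texts": texts}, timeout=5)
    response.raise_for_status()
    return response.json()
```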
It’s about the choice that was made for `padding_strategy`.
Making it stateless means that every single call from Python to Rust needs to pass it along, i.e. there is a string crossing the Python->Rust boundary on every single call.
It turns out that the Python -> Rust boundary is not free; some conversion work has to happen. We didn’t make actual measurements, but making the Rust side purely stateless could hurt quite a bit.
Since in most cases users use either one padding strategy or none (usually training vs. inference), being stateful is correct in most cases. The last version above showcases how to effectively have two "stateless" tokenizers, one per configuration.
Hope that helps.
`asyncio` doesn’t change anything about how your example should fail. It’s the threading that’s causing issues, not async (since `tokenizers` will block the thread anyway).

Thanks for providing a solid testing script @jackhodkinson. I have created a PR within `transformers` to reduce the amount of such errors: https://github.com/huggingface/transformers/pull/12550

Unfortunately, there’s no way to completely erase those errors without a major revamp of the `encode` function, as `truncation` and `padding` are part of the core struct of a `tokenizer`. I think it should cover 99% of the cases though, because padding and truncation options shouldn’t be changed that often in reality. Please read the PR for more details about what the problem is and how it attempts to solve it.
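Roughly, the idea behind the PR (this is a simplified sketch, not the actual diff, and it assumes a tokenizers version whose backend exposes a `truncation` property) is to only mutate the backend tokenizer when the requested configuration actually differs from its current state:

```python
def set_truncation_if_needed(backend_tokenizer, max_length, stride=0, strategy="longest_first"):
    # Current truncation state of the Rust tokenizer: a dict, or None when disabled.
    current = backend_tokenizer.truncation
    target = {"max_length": max_length, "stride": stride, "strategy": strategy}
    # Only mutate when the requested configuration differs, so repeated calls
    # with identical settings stay read-only and cannot trigger the borrow error.
    if current is None or any(current.get(k) != v for k, v in target.items()):
        backend_tokenizer.enable_truncation(**target)
```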
I am having the same problem. Simple reproduction would be:
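The snippet itself is missing from this excerpt; a minimal sketch of such a reproduction (one shared tokenizer, two threads, alternating padding settings; model name is a placeholder):

```python
from concurrent.futures import ThreadPoolExecutor

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # shared across threads

def encode(i):
    # Alternating configurations force the shared Rust tokenizer to be mutated
    # while other threads may still be encoding with it.
    if i % 2:
        return tokenizer("some text", padding="max_length", truncation=True, max_length=32)
    return tokenizer("some text")

with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(encode, range(10_000)))  # can raise RuntimeError: Already borrowed
```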
Hi, I have the same problem with gunicorn. For some models it works, but for others it fails. I noticed a difference between the two models:

This fails: `self.token_indexer.encode(x, max_length=350, truncation=True)`
This seems to work: `self.token_indexer.encode(x, truncation=True)`

The tokenizer is loaded at startup in gunicorn. When I receive a request, I try to tokenize the batch of text (probably in another thread). Is it because the `set_truncation_and_padding` function tries to modify the backend tokenizer (`self._tokenizer`), which is already owned by the first thread? In the second case (which works), `_tokenizer` is not modified because `max_length` is at its default. Could we pass this as an argument to the backend encoding function instead of modifying the backend tokenizer object?
I solved this problem using a thread lock in Python.
Did I understand correctly that it is not a bug, but a slight misunderstanding of the non-thread-safe nature of the Python->Rust boundary? Should it be closed then? Or maybe, as part of a fix, one should measure what it would cost to make the calls stateless?
The easiest and least intrusive way, IMHO, is using a Python `queue`, which is thread-safe per se.

Let’s assume you have `N` threads. Instead of creating one tokenizer instance per thread, one creates `M` tokenizer instances, where `M` could be as low as 1 as a default value, which is equivalent to using a simple lock. During initialization you put `M` instances of your tokenizer into the queue, and afterwards you only use `queue.get()` and `queue.put()` when you need to access any of these instances.

The latter should be done inside a `try: ... finally:` block, so that a tokenizer is always guaranteed to be returned to the queue, e.g. in case of exceptions. `queue.get()` will block the calling thread as long as no free tokenizer instance is available, and will immediately unblock the thread once another thread puts a tokenizer back into the queue. As Python queues are FIFOs, it is also guaranteed that all elements inside the queue are used round-robin.

The necessary code is minimal and always thread-safe, and you can decouple the number of threads from the number of tokenizer instances. This makes resource usage very controllable as well.
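A minimal sketch of this queue approach, assuming `M` pre-built instances of a placeholder model:

```python
import queue

from transformers import AutoTokenizer

M = 4  # number of tokenizer instances, independent of the number of threads N
tokenizer_pool = queue.Queue()
for _ in range(M):
    tokenizer_pool.put(AutoTokenizer.from_pretrained("bert-base-uncased"))

def encode(texts):
    tok = tokenizer_pool.get()  # blocks until an instance is free (FIFO)
    try:
        return tok(texts, padding="max_length", truncation=True, max_length=128)
    finally:
        tokenizer_pool.put(tok)  # always hand the instance back, even on exceptions
```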
Pros & Cons:
The problem with the above approach is that as long as `M` is less than `N`, there will be thread contention under heavy load. Most normal operating systems don’t make any guarantees that waiting threads are scheduled in FIFO order. This means there is no latency guarantee for, e.g., your gRPC or webserver thread to get hold of a tokenizer instance before another thread that reached the queue later. In most cases this is not an issue, but if your server is under heavy load, that is why you often see high latency spikes. There is a reason realtime operating systems exist that make those guarantees. I.e., if you require strict latency deadlines, you need to have `M == N`.
side-note
The issue here is not a tokenizer bug; it’s a misunderstanding on the user’s side about the guarantees the tokenizer package makes in terms of multi-threading. If a package is not thread-safe, the user needs to take care of the consequences and shouldn’t assume it is a bug in the package, because thread-safe operation has overheads, especially in interpreted languages such as Python with a single GIL, and if you only want to use one thread, this overhead shouldn’t be the default.
@jackhodkinson : Thank you very much for a reproducible! @Narsil: Thanks for tackling the issue so super fast. Will check when back from holiday 💯
Do you mind sharing it for other users, maybe?
Instead of loading the tokenizer before the thread fork, load it afterwards.
If you use a torch `Dataset`, for instance, that means loading the tokenizer in `Dataset.__init__` instead of passing it in.
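A minimal sketch of that suggestion, assuming a PyTorch `Dataset` and a placeholder model name:

```python
from torch.utils.data import Dataset
from transformers import AutoTokenizer

class TextDataset(Dataset):
    def __init__(self, texts, model_name="bert-base-uncased"):
        self.texts = texts
        # Create the tokenizer here rather than passing in a shared instance,
        # so every copy of the dataset owns its own tokenizer.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.tokenizer(self.texts[idx], truncation=True, max_length=128)
```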
Did you try not sharing the tokenizer among multiple threads? (The easiest way is to load the tokenizer on each thread instead.) There are some protections implemented, but there is only so much that the lib can do against that.
This happens with the fast tokenizer (`TokenizerFast`) for me. The workaround is not using it.
The application runs in a Docker container with `gunicorn`, like: