kombu: Slow SQS task processing due to `wait_time_seconds` regression in v5.3.2
We recently updated from Kombu 5.3.1 to 5.3.2 and saw a dramatic decrease in the number of tasks that can be processed. Here is an example of a single worker processing a full queue; both graphs show the number of messages deleted (processed) per minute.
v5.3.1 (about 500 messages per minute):
Here is the exact same worker and messages being processed on v5.3.2 (about 5 messages per minute):
We traced this down to just the upgrade of Kombu. Digging into the details, we found that the throughput limitation on 5.3.2 was solely due to one setting: `wait_time_seconds`. When we looked at the logs, we found that after a message was processed (which only took 100ms), the worker sat idle for 9.9 seconds before processing the next task, even when the queue was completely full. Sure enough, `wait_time_seconds` was set to 10 seconds. On 5.3.1, Kombu correctly realizes there is more on the queue and doesn't wait `wait_time_seconds` before processing the next message.
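For reference, this option maps to SQS long polling. Here is a minimal plain-boto3 sketch (not Kombu internals; the queue name is just illustrative) of what a long-poll `ReceiveMessage` call looks like:

```python
import boto3

# Plain boto3, not Kombu internals: WaitTimeSeconds enables long polling.
# The call returns as soon as messages are available, or after
# WaitTimeSeconds if the queue stays empty.
sqs = boto3.client("sqs", region_name="us-west-2")
queue_url = sqs.get_queue_url(QueueName="my-queue")["QueueUrl"]  # queue name is illustrative

response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=10,  # long poll: should return immediately when the queue has messages
)
for message in response.get("Messages", []):
    # process the message, then delete it so it is not redelivered
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```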
We are using Amazon SQS with celery[sqs] v5.2.7; here are the broker settings:
```python
CELERY_BROKER_TRANSPORT_OPTIONS = {
    "region": "us-west-2",
    "visibility_timeout": 14400,
    "polling_interval": 1,
    "wait_time_seconds": 10,
}
```
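For completeness, a minimal sketch of how these options are applied (the app name and broker URL are illustrative, not our exact setup):

```python
from celery import Celery

# Illustrative wiring only; the uppercase CELERY_BROKER_TRANSPORT_OPTIONS
# setting above maps to broker_transport_options on the app config.
app = Celery("worker", broker="sqs://")
app.conf.broker_transport_options = {
    "region": "us-west-2",
    "visibility_timeout": 14400,
    "polling_interval": 1,
    "wait_time_seconds": 10,
}
```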
Let me know if there is anything else I can provide.
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Reactions: 8
- Comments: 24 (15 by maintainers)
Commits related to this issue
- Borrowing botocore implementation of get_response TL;DR - The use of boto3 in #1759 resulted in relying on blocking (synchronous) HTTP requests, which caused the performance issue reported in #1783. ... — committed to rafidka/kombu by rafidka 9 months ago
- Use the correct protocol for SQS requests TL;DR - The use of boto3 in #1759 resulted in relying on blocking (synchronous) HTTP requests, which caused the performance issue reported in #1783. `kombu`... — committed to rafidka/kombu by rafidka 9 months ago
- Use the correct protocol for SQS requests (#1807) * Use the correct protocol for SQS requests TL;DR - The use of boto3 in #1759 resulted in relying on blocking (synchronous) HTTP requests, which ... — committed to celery/kombu by rafidka 7 months ago
I have an update that explains the issue @dtorres-sf reported.
TL;DR - The use of boto3 in #1759 resulted in relying on blocking (synchronous) HTTP requests, which caused the performance problem reported in this issue. Continue reading for a detailed discussion and proposals.
`kombu` previously crafted AWS requests manually, as explained in detail in #1726, which resulted in an outage when botocore temporarily changed the default protocol to JSON (before rolling back due to the impact on Celery and Airflow). To fix that, I submitted #1759, which changes kombu to use boto3 instead of manually crafting AWS requests. This way, when boto3 changes the default protocol, kombu won't be impacted.
While working on #1759, I did extensive debugging to understand the multi-threading nature of kombu. What I discovered is that there isn't actual multi-threading in the true sense of the word, but rather an event loop that runs on the same thread and process and orchestrates the communication with SQS. As such, it didn't appear to me that there was anything to worry about with my change, and the testing I did didn't uncover any issues. However, it turns out that while kombu's event loop doesn't have actual multi-threading, its reliance on pycurl (and thus libcurl) meant that the requests to AWS were being done asynchronously. On the other hand, boto3 requests are always done synchronously, i.e. they are blocking requests.
The above meant that my change introduced blocking on kombu's event loop. This is fine in most cases, since requests to SQS are pretty fast. However, when long polling is used, a call to SQS's ReceiveMessage can last up to 20 seconds (depending on the user's configuration).
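As a toy illustration (not kombu's actual event loop), this is what a synchronous long poll does to a single-threaded loop: nothing else can run until the call returns, even if messages are already waiting:

```python
import time

def blocking_receive(wait_time_seconds=10):
    # Stand-in for a synchronous boto3 ReceiveMessage long poll.
    time.sleep(wait_time_seconds)
    return []

def event_loop(iterations=3):
    for _ in range(iterations):
        start = time.monotonic()
        blocking_receive()  # the whole loop is stuck here for the full wait
        # message handling, timers, and heartbeats are all delayed until now
        print(f"iteration took {time.monotonic() - start:.1f}s")

event_loop()
```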
To solve this problem, I worked on a non-blocking boto3 client and am currently testing it and fixing some issues. Assuming all goes well and it works fine, I will push the changes to my repository so you can test them.
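For illustration only, one generic way to keep a single-threaded loop responsive while a blocking SDK call is in flight is to hand the call off to a worker thread; this is just a sketch of the general idea, not necessarily the approach taken in the actual fix:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def blocking_receive():
    time.sleep(10)  # stand-in for a synchronous long poll
    return []

executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(blocking_receive)

while not future.done():
    # the loop stays free to service other events while the poll is in flight
    time.sleep(0.1)

messages = future.result()
print(f"received {len(messages)} messages")
```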
Going to release a new patch version today.
We might revert that to temporarily fix it: https://github.com/celery/kombu/pull/1799
Can confirm we are seeing the same problems as @samueltrevena above.