OpenSearch: [BUG] Performance regression in `top_hits` aggregation
I believe I have isolated a performance regression in the `top_hits` aggregation between ES 7.9 and ES 7.10.2 (which means the regression is present in OpenSearch 1.0 as well).
To Reproduce
I added the following operation to the default nyc_taxis track in Rally:
```json
{
  "name": "top_hits_agg",
  "warmup-iterations": 10,
  "iterations": 50,
  "clients": 2,
  "operation": {
    "operation-type": "search",
    "body": {
      "size": 0,
      "query": {
        "range": {
          "dropoff_datetime": {
            "from": "2015-01-01 00:00:00",
            "to": "2015-01-01 12:00:00"
          }
        }
      },
      "aggs": {
        "2": {
          "terms": {
            "field": "payment_type",
            "size": 1000
          },
          "aggregations": {
            "3": {
              "date_histogram": {
                "field": "dropoff_datetime",
                "fixed_interval": "5s"
              },
              "aggregations": {
                "1": {
                  "top_hits": {
                    "size": 1,
                    "sort": [
                      {
                        "dropoff_datetime": {
                          "order": "desc"
                        }
                      }
                    ]
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
```
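As a quick sanity check outside of Rally, the inner `body` object above can also be sent directly to the cluster. A minimal sketch, assuming it has been saved to a file named `top_hits_query.json` and that `jq` is available to extract the server-side timing:

```sh
# Illustrative only: run the aggregation once against the 7.9 instance and
# print the server-reported "took" time in milliseconds.
curl -s -H 'Content-Type: application/json' \
  --data-binary @top_hits_query.json \
  'http://127.0.0.1:9790/nyc_taxis/_search' | jq '.took'
```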
After populating both a 7.9.0 and a 7.10.2 cluster with the nyc_taxis dataset, I ran a race with the following commands:

```sh
esrally race --pipeline=benchmark-only --distribution-version=7.9.0 --track=nyc_taxis --target-host=127.0.0.1:9790 --report-file=es_790_report_$(date +%s) --user-tag='version:7.9,race:top_hits_and_date' --include-tasks=top_hits_agg,date_histogram_agg
esrally race --pipeline=benchmark-only --distribution-version=7.10.2 --track=nyc_taxis --target-host=127.0.0.1:9710 --report-file=es_710_report_$(date +%s) --user-tag='version:7.10.2,race:top_hits_and_date' --include-tasks=top_hits_agg,date_histogram_agg
```
Expected behavior
Performance of the `top_hits` aggregation should be very similar between the two versions. The data below compares ES 7.9 as the baseline and ES 7.10.2 as the contender. As expected, the `date_histogram_agg` operation performs more or less at parity between the two. However, the `top_hits_agg` operation is consistently 20-30% slower on 7.10.2 than on 7.9:
| Metric | Task | Baseline | Contender | Diff | Unit |
|---|---|---|---|---|---|
| Cumulative indexing time of primary shards | | 125.509 | 110.375 | -15.1333 | min |
| Min cumulative indexing time across primary shard | | 125.509 | 110.375 | -15.1333 | min |
| Median cumulative indexing time across primary shard | | 125.509 | 110.375 | -15.1333 | min |
| Max cumulative indexing time across primary shard | | 125.509 | 110.375 | -15.1333 | min |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min |
| Min cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min |
| Median cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min |
| Max cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min |
| Cumulative merge time of primary shards | | 41.7306 | 47.5379 | 5.80733 | min |
| Cumulative merge count of primary shards | | 93 | 89 | -4 | |
| Min cumulative merge time across primary shard | | 41.7306 | 47.5379 | 5.80733 | min |
| Median cumulative merge time across primary shard | | 41.7306 | 47.5379 | 5.80733 | min |
| Max cumulative merge time across primary shard | | 41.7306 | 47.5379 | 5.80733 | min |
| Cumulative merge throttle time of primary shards | | 4.12925 | 4.25008 | 0.12083 | min |
| Min cumulative merge throttle time across primary shard | | 4.12925 | 4.25008 | 0.12083 | min |
| Median cumulative merge throttle time across primary shard | | 4.12925 | 4.25008 | 0.12083 | min |
| Max cumulative merge throttle time across primary shard | | 4.12925 | 4.25008 | 0.12083 | min |
| Cumulative refresh time of primary shards | | 2.60575 | 2.36 | -0.24575 | min |
| Cumulative refresh count of primary shards | | 82 | 79 | -3 | |
| Min cumulative refresh time across primary shard | | 2.60575 | 2.36 | -0.24575 | min |
| Median cumulative refresh time across primary shard | | 2.60575 | 2.36 | -0.24575 | min |
| Max cumulative refresh time across primary shard | | 2.60575 | 2.36 | -0.24575 | min |
| Cumulative flush time of primary shards | | 5.45558 | 5.48922 | 0.03363 | min |
| Cumulative flush count of primary shards | | 35 | 34 | -1 | |
| Min cumulative flush time across primary shard | | 5.45558 | 5.48922 | 0.03363 | min |
| Median cumulative flush time across primary shard | | 5.45558 | 5.48922 | 0.03363 | min |
| Max cumulative flush time across primary shard | | 5.45558 | 5.48922 | 0.03363 | min |
| Total Young Gen GC time | | 0.258 | 1.24 | 0.982 | s |
| Total Young Gen GC count | | 32 | 136 | 104 | |
| Total Old Gen GC time | | 0 | 0 | 0 | s |
| Total Old Gen GC count | | 0 | 0 | 0 | |
| Store size | | 25.0414 | 24.3655 | -0.67586 | GB |
| Translog size | | 5.12227e-08 | 5.12227e-08 | 0 | GB |
| Heap used for segments | | 0.193222 | 0.0988235 | -0.0944 | MB |
| Heap used for doc values | | 0.0354309 | 0.034874 | -0.00056 | MB |
| Heap used for terms | | 0.0408325 | 0.0370789 | -0.00375 | MB |
| Heap used for norms | | 0 | 0 | 0 | MB |
| Heap used for points | | 0 | 0 | 0 | MB |
| Heap used for stored fields | | 0.116959 | 0.0268707 | -0.09009 | MB |
| Segment count | | 32 | 30 | -2 | |
| Min Throughput | date_histogram_agg | 1.50023 | 1.50023 | -1e-05 | ops/s |
| Median Throughput | date_histogram_agg | 1.50035 | 1.50035 | -0 | ops/s |
| Max Throughput | date_histogram_agg | 1.50054 | 1.50054 | 0 | ops/s |
| 50th percentile latency | date_histogram_agg | 548.563 | 549.301 | 0.73863 | ms |
| 90th percentile latency | date_histogram_agg | 557.053 | 559.417 | 2.36497 | ms |
| 99th percentile latency | date_histogram_agg | 571.816 | 569.581 | -2.23542 | ms |
| 100th percentile latency | date_histogram_agg | 578.69 | 575.319 | -3.37139 | ms |
| 50th percentile service time | date_histogram_agg | 547.429 | 548.28 | 0.85141 | ms |
| 90th percentile service time | date_histogram_agg | 555.608 | 558.266 | 2.65809 | ms |
| 99th percentile service time | date_histogram_agg | 570.955 | 568.385 | -2.57038 | ms |
| 100th percentile service time | date_histogram_agg | 577.466 | 573.309 | -4.1576 | ms |
| error rate | date_histogram_agg | 0 | 0 | 0 | % |
| Min Throughput | top_hits_agg | 0.720015 | 0.559013 | -0.161 | ops/s |
| Median Throughput | top_hits_agg | 0.747501 | 0.577278 | -0.17022 | ops/s |
| Max Throughput | top_hits_agg | 0.759315 | 0.582273 | -0.17704 | ops/s |
| 50th percentile latency | top_hits_agg | 2569.6 | 3373.35 | 803.747 | ms |
| 90th percentile latency | top_hits_agg | 2609.98 | 3455.39 | 845.411 | ms |
| 99th percentile latency | top_hits_agg | 2622.45 | 3502.29 | 879.836 | ms |
| 100th percentile latency | top_hits_agg | 2694.13 | 3503.48 | 809.352 | ms |
| 50th percentile service time | top_hits_agg | 2569.6 | 3373.35 | 803.747 | ms |
| 90th percentile service time | top_hits_agg | 2609.98 | 3455.39 | 845.411 | ms |
| 99th percentile service time | top_hits_agg | 2622.45 | 3502.29 | 879.836 | ms |
| 100th percentile service time | top_hits_agg | 2694.13 | 3503.48 | 809.352 | ms |
| error rate | top_hits_agg | 0 | 0 | 0 | % |
Plugins: None
Host/Environment:
- OS: Ubuntu 20.04
- Version: ES 7.9 and ES 7.10.2
- Host type: c6i.8xlarge
Additional context
The Elasticsearch distributions are the stock versions downloaded as tar files, extracted, and started with the following command:

```sh
ES_JAVA_OPTS="-Xms8g -Xmx8g" ./bin/elasticsearch
```

The only non-default setting in the yml configuration file is the port number. Both instances run on the same machine, though load is driven to only one of them at a time (not concurrently). The service is bound to localhost and Rally runs on the same machine.
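For completeness, a sketch of that per-instance configuration, assuming the two tarballs are extracted side by side (the directory names are illustrative; the ports match the `--target-host` values in the Rally commands above):

```sh
# Illustrative only: give each instance its own HTTP port so both can be
# installed on one host (only one is ever under load at a time).
echo 'http.port: 9790' >> elasticsearch-7.9.0/config/elasticsearch.yml
echo 'http.port: 9710' >> elasticsearch-7.10.2/config/elasticsearch.yml
```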
About this issue
- State: closed
- Created 3 years ago
- Comments: 21 (10 by maintainers)
I believe this Lucene issue actually does fix the regression. I didn't realize that the testing tool was setting `BEST_COMPRESSION` by default for the given workload, but when I run the test against an index that has been configured with `index.codec=default`, the regression almost completely disappears.

Test script:
Note that `BEST_COMPRESSION` is in fact quite a bit slower on the new versions compared to ES 7.9, but that's because the new Lucene codec was changed to significantly improve compression at the cost of slower performance. The performance change is minimal when comparing the versions with the default codec (`BEST_SPEED`).

@andrross so the change came with https://github.com/opensearch-project/OpenSearch/commit/31c026f25ccbbc02bb643351338786297a2dff6a, where the upgrade to Lucene 8.7.0 happened and basically switched to the new compression.
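Relatedly, for anyone reproducing the default-codec vs. `BEST_COMPRESSION` comparison described above, the codec can be chosen explicitly at index-creation time via the standard `index.codec` setting; a sketch (the index names and port are illustrative):

```sh
# Stored fields compressed with the default codec (BEST_SPEED).
curl -s -X PUT 'http://127.0.0.1:9710/nyc_taxis_default_codec' \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"index": {"codec": "default"}}}'

# Stored fields compressed with BEST_COMPRESSION.
curl -s -X PUT 'http://127.0.0.1:9710/nyc_taxis_best_compression' \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"index": {"codec": "best_compression"}}}'
```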
I experimented and rebuilt an OpenSearch 2.0 SNAPSHOT with `Lucene86Codec`; the latencies went back to Elasticsearch 7.9.x levels:
Elasticsearch 7.9.x
OpenSearch 2.0 (SNAPSHOT)
OpenSearch 2.0 (SNAPSHOT) with Lucene86Codec
I actually don't know what we should be doing in this case; the change seems to be related to the `Lucene87Codec` implementation.

@reta I've been benchmarking against OpenSearch 1.2, which uses Lucene 8.10.1, and it shows performance comparable to all the ES versions greater than 7.9.
Here's the quick test I'm running against the four versions on my machine (all populated with the nyc_taxis dataset but otherwise idle). It consistently produces results like the following:
Also, I've tested this "top hits" query against ES 7.11 and ES 7.15, and both appear to exhibit the regression relative to ES 7.9.
I profiled ES 7.9 versus OpenSearch 1.2. It looks like OpenSearch 1.2 may be spending more time doing decompression.
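The thread doesn't say which profiler was used; as one way to run a similar comparison, async-profiler can capture a CPU flame graph of each node while the query loop is being driven against it (the paths, duration, and process-name match below are illustrative):

```sh
# Illustrative only: find the node's JVM pid and record a 60-second CPU
# flame graph while the top_hits query is running; repeat per version and
# compare time spent in stored-fields decompression.
PID=$(jps | awk '/Elasticsearch|OpenSearch/ {print $1}')
./profiler.sh -e cpu -d 60 -f /tmp/top_hits_flamegraph.html "$PID"
```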
I do indeed get different results for the query in question here, so it looks like the referenced bug may be a factor.