magento2: Asynchronous indexing can break large servers

Large stores with extremely active product databases can suffer many related problems with indexers not being kept up to date. Products can disappear from the website, product data can be out of date on the website, excessive cache flushing can cause huge server load spikes.

This is due to the manner in which triggers are created by the core magento code in https://github.com/magento/magento2/blob/9544fb243d5848a497d4ea7b88e08609376ac39e/lib/internal/Magento/Framework/Mview/View/Subscription.php#L201

INSERT INTO will insert a new record regardless of whether one exists or not. REPLACE INTO will as the name says, replace a record - and in this case it will generate a new version_id since that is an auto_increment field.

Preconditions (*)

An extremely large database. Example 500,000 products
An active update schedule

Steps to reproduce (*)

Create a sample database with 500,000 products
Enable Asynchronous indexing
Reduce crons to every 10 minutes
Run a mass script to change the inventory for the products every 30 seconds. So within 10 minutes, you will have made 10 million updates

Expected result (*)

When the indexer runs, it will index the latest product information once for each index

Actual result (*)

It will actually try to index each product 20 times, once for each update

Additional information to reproduce the issue

Change the price of the product with SKU 24-MB01 20 times via API rest/all/V1/products/24-MB01
Doing the price change data inserted in the table catalog_product_price_cl
Adjust the DEFAULT_BATCH_SIZE in file lib/internal/Magento/Framework/Mview/View.php with 10 from 1000.
Enable the DB logs via the bin/magento dev:query-log:enable command.
grep the text to get the SQL query on table catalog_product_price_cl. Please find attached the screenshot of db.log:

The query always returns entity_id 1 each time.

In this case, it seems that entity_id 1 will process many times, instead of only 1 time, and should be on the latest one.

Please provide Severity assessment for the Issue as Reporter. This information will help during Confirmation and Issue triage processes.

Severity: S0 - Affects critical data or functionality and leaves users without workaround.

About this issue

Original URL
State: closed
Created 4 years ago
Reactions: 9
Comments: 54 (34 by maintainers)

Most upvoted comments

Sigh …

@sidolov, @ihor-sviziev : can we please delete this stalebot crappy thing? What’s the use of people opening issues if nobody looks at them and then they get auto closed? That’s just depressing and might scare people away from opening more issues in the future …

(not saying that this issue is valid, but it feels like it should at least be investigated instead of auto closed)

hostep on Apr 8, 2021

We’ve build a module to address this issue some years ago: https://github.com/ho-nl/magento2-ReachDigital_BetterIndexer

Changes the query used: https://github.com/ho-nl/magento2-ReachDigital_BetterIndexer/blob/master/src/Plugin/ChangelogPlugin.php#L39-L44
Has better crash recovery if the indexers fails on a certain batch, we assume previous batches are fine: https://github.com/ho-nl/magento2-ReachDigital_BetterIndexer/blob/master/src/Plugin/ViewPlugin.php#L60

And a few other fixes I don’t specifically remember.

paales on Apr 3, 2023

Some status update in case people are interested.

We’ve been recently trying out quality patch from Adobe which is also supposed to fix this issue. We’ve been running it on a Magento 2.4.3-p3 shop for a couple of days, at first look it seems to work fine, but it’s now been 2 days in a row that we’ve noticed that the index cronjob is just sitting there for hours not doing anything while waiting on the following SQL query to finish:

DELETE cl_duplicate
FROM catalog_product_attribute_cl AS cl_duplicate,
     catalog_product_attribute_cl AS cl_origin
WHERE cl_duplicate.version_id < cl_origin.version_id
  AND cl_duplicate.version_id > 93264739
  AND cl_duplicate.version_id <= 93496654
  AND cl_origin.version_id > 93264739
  AND cl_origin.version_id <= 93496654
  AND cl_duplicate.entity_id <=> cl_origin.entity_id

When I attempt to run this query on my local machine with dbdump from the server, it’s also just running for many many minutes without seemingly to progress in any way (I’ve stopped it after more than 10 minutes). The catalog_product_attribute_cl table in our case contains 226253 rows and an EXPLAIN query returns this:

mysql> EXPLAIN DELETE cl_duplicate FROM catalog_product_attribute_cl AS cl_duplicate, catalog_product_attribute_cl AS cl_origin WHERE cl_duplicate.version_id < cl_origin.version_id AND cl_duplicate.version_id > 93264739 AND cl_duplicate.version_id <= 93496654 AND cl_origin.version_id > 93264739 AND cl_origin.version_id <= 93496654 AND cl_duplicate.entity_id <=> cl_origin.entity_id;
+----+-------------+--------------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
| id | select_type | table        | partitions | type  | possible_keys | key     | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+--------------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
|  1 | DELETE      | cl_duplicate | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL | 113126 |   100.00 | Using where |
|  1 | SIMPLE      | cl_origin    | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL | 113126 |    10.00 | Using where |
+----+-------------+--------------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
2 rows in set (0.00 sec)

(I’m not a db expert, so I can’t really interpret these results, so just providing for extra info)

So it sounds like that quality patch is not fully fixing this issue here, I think we’ll switch back to the community PR, that one seems to work better (which we have been running for some weeks on another project, using Magento 2.4.5-p2)

hostep on May 17, 2023

I’ve tested the PR as well and it seems to be even more efficient than the ReachDigital_BetterIndexer module.

With the same dataset, the PR only takes 4 loops and fills up the loops with 1000 items always (except for the last loop which just contains the remainder). The ReachDigital_BetterIndexer module however didn’t fill up every loop with 1000 items, only the first and third loop contained 1000 items in my case, the others were ~650, ~750 and ~450 items big.

So both are good solutions at first sight, we’ll probably try out the PR on a real project in production soon-ish, as the ReachDigital_BetterIndexer contains more than just this bugfix and I’m a bit hesitant of adding more changes then only a fix for this problem without investigating them in more detail.

Thanks to all the people who commented here with your findings and proposed solutions, it really helped. Let’s hope Adobe will also start taking this issue a bit more seriously, because it’s a really big problem in my opinion that causes a lot of unnecessary resource usage on servers for bigger Magento projects when using the partial reindexers.

Update: some actual numbers, after the PR was implemented, the partial reindexing system took 14 minutes to run fully. Without the PR, it was running for more than 2 hours, before I gave up and the indexer:status still showed that nothing had changed, so without the PR this would have taken a very very long time to finish.

hostep on Apr 6, 2023

I don’t think this was specifically covered in the initial report, but on a related note we have discovered a major functional problem with the Magento partial indexer (indexer_update_all_views) starting with Magento 2.4.3 including the latest version (2.4.6)

The indexer processes records from the changelog in batches. It gets the IDs to process here:

https://github.com/magento/magento2/blob/2.4.6/lib/internal/Magento/Framework/Mview/View.php#L308

This will generate a query like this

SELECT
DISTINCT $changeLogTable.entity_id 
FROM changeLogTable 
WHERE (version_id > $fromVersionId) AND (version_id <= $toVersionId) 
GROUP BY `entity_id` LIMIT $batchSize

This will SELECT the 1000 rows greater than the $fromVersionId all the way to $toVersionId. As a result of the GROUP BY they will also be ordered by the entity_id

Notably $toVersionId is not $fromVersionId + $batchSize. It’s the maximum version ID from the changelog table at the time when processing of the backlog begins (fetched here: https://github.com/magento/magento2/blob/2.4.6/lib/internal/Magento/Framework/Mview/View.php#L255)

So let’s say you have a large backlog greater than fromVersionId and have entity_ids starting 1,4,7 etc…

It will process those, even if they are not in the range of $fromVersionId + $batchSize

Next it will increment fromVersionId by batchSize and then run through the loop again

https://github.com/magento/magento2/blob/2.4.6/lib/internal/Magento/Framework/Mview/View.php#L313

This means with a large backlog it is not guaranteed that the entities from the previous batch are actually the ones that get processed.

I’m observing this on a project right now with a very large backlog and what it’s doing is just processing the same set of entity_ids over and over for each batch, which are the lowest ones and aren’t even the ones that are in the batch it’s supposed to be processing. Essentially none of the data is indexed correctly by the partial indexer.

mpchadwick on Apr 11, 2023

✅ Jira issue https://jira.corp.adobe.com/browse/AC-7009 is successfully created for this GitHub issue.

github-jira-sync-bot on Nov 11, 2022

Whoever interested to try adaptation of @dooblem approach use these patches:

Index: vendor/magento/framework/Mview/View.php
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/vendor/magento/framework/Mview/View.php b/vendor/magento/framework/Mview/View.php
--- a/vendor/magento/framework/Mview/View.php
+++ b/vendor/magento/framework/Mview/View.php	(date 1701886312378)
@@ -314,7 +314,21 @@
                 break;
             }
             $vsFrom += $batchSize;
-            $action->execute($ids);
+
+            //TASKID: port of https://github.com/magento/magento2/issues/30012#issuecomment-1829364973 approach
+            $nextVersionId = intval(max(array_column($ids, 'max_version_id') ?: [0]));
+            if ($nextVersionId > 0) {
+                $vsFrom = $nextVersionId;
+            }
+
+            $isMultidimensional = isset($ids[0]) && is_array($ids[0]);
+
+            if ($isMultidimensional) {
+                $action->execute(array_map('intval', array_column($ids, 'entity_id')));
+            } else {
+                $action->execute($ids);
+            }
+            $this->getState()->setVersionId($vsFrom)->save();
         }
     }

and walker using max version id and having

Index: vendor/magento/framework/Mview/View/ChangeLogBatchWalker.php
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/vendor/magento/framework/Mview/View/ChangeLogBatchWalker.php b/vendor/magento/framework/Mview/View/ChangeLogBatchWalker.php
--- a/vendor/magento/framework/Mview/View/ChangeLogBatchWalker.php
+++ b/vendor/magento/framework/Mview/View/ChangeLogBatchWalker.php	(date 1701886718709)
@@ -40,8 +40,8 @@
         if (!$connection->isTableExists($changelogTableName)) {
             throw new ChangelogTableNotExistsException(new Phrase("Table %1 does not exist", [$changelogTableName]));
         }
-
-        $select = $connection->select()->distinct(true)
+        //TASKID: adaptation of https://github.com/magento/magento2/issues/30012#issuecomment-1829364973 approach
+        $select = $connection->select()
             ->where(
                 'version_id > ?',
                 $fromVersionId
@@ -50,10 +50,16 @@
                 'version_id <= ?',
                 $toVersion
             )
+            //TASKID: adaptation of https://github.com/magento/magento2/issues/30012#issuecomment-1829364973 approach
+            ->having('max_version_id > ?', $fromVersionId)
+            ->having('max_version_id <= ?', $toVersion)
             ->group([$changelog->getColumnName()])
+            //TASKID: adaptation of https://github.com/magento/magento2/issues/30012#issuecomment-1829364973 approach
+            ->order('max_version_id ASC')
             ->limit($batchSize);
+        //TASKID: adaptation of https://github.com/magento/magento2/issues/30012#issuecomment-1829364973 approach
+        $select->from($changelogTableName, [$changelog->getColumnName(), 'max_version_id' => new \Zend_Db_Expr('MAX(version_id)')]);

-        $select->from($changelogTableName, [$changelog->getColumnName()]);
-        return $connection->fetchCol($select);
+        return $connection->fetchAll($select);
     }
 }

To me works like a rocket. We have about 5K products but update them frequently so changelogs are huge more that 200K, so now instead of walking CL by 1000 any partial reindex can process the whole catalog just in 5 iteration. Additionally after every iteration the Mview version is saved to DB.

ilnytskyi on Dec 6, 2023

Also on crash it will not leave the indexer in a processing state: https://github.com/ho-nl/magento2-ReachDigital_BetterIndexer/blob/master/src/Plugin/ViewPlugin.php#L197

Just FYI in case you weren’t aware of this as this is a well hidden feature in Magento, but this can also be fixed by adding the following in your app/etc/env.php file since Magento 2.4.3 (it’s a completely different approach but should more or less give the same end result):

    'indexer' => [
        'use_application_lock' => true,
    ]

Documentation: https://developer.adobe.com/commerce/php/development/components/indexing/#using-application-lock-mode-for-reindex-processes

hostep on Apr 6, 2023

Since it is rather hidden in this issue, I would like to mention again that there is a promising pull request already, which might fix the issue.

sprankhub on Mar 31, 2023

Hello @hostep,

Let me reopen this for further investigation.

Thanks

engcom-Hotel on Aug 26, 2022

Hi, sorry to respond on a closed issue, but I think this is still relevant and I didn’t found any other related issue.

The problem clearly comes from the ChangeLogBatchWalker::walk() function.

It misses a limit offset so it actually loops multiple times on the same values. Here is a simple example:

Fill your changelog table with 10000 rows of only 3 distinct entity_ids (1, 2, 3) Default batching is 1000.

walk() function will perform a SELECT DISTINCT on rows 0 to 10000 with LIMIT 1000

it will return: 1, 2, 3

on second call, it will perform a SELECT DISTINCT on rows 1000 to 10000 with LIMIT 1000

which also returns 1, 2, 3 …

same for 8 more useless calls!

the easy fix would be to add an offset argument to the walk function so, on second call, we could perform a SELECT DISTINCT on rows 0 to 10000 with LIMIT 1000 OFFSET 1000 which will return, in our example, no result, and end the walk loops

Hope it helps, Regards

xhoaraujx on Aug 16, 2022

Hello, @gamort. I’ve checked this issue on the 2.4-develop branch and seems like it’s already fixed. Could you please check it on the latest codebase?

engcom-Foxtrot on Apr 13, 2021

Hello @gamort @hostep @bigbangx,

Thanks for your input on this issue. I have some findings and would like to share them here. In order to try to reproduce the issue, I have followed the below steps:

Change the price of the product with SKU 24-MB01 20 times via API rest/all/V1/products/24-MB01
Doing the price change data inserted in the table catalog_product_price_cl
Adjust the DEFAULT_BATCH_SIZE in file lib/internal/Magento/Framework/Mview/View.php with 10 from 1000.
Enable the DB logs via the bin/magento dev:query-log:enable command.
grep the text to get the SQL query on table catalog_product_price_cl. Please find attached the screenshot of db.log:

The query always returns entity_id 1 each time.

In this case, it seems that entity_id 1 will process many times, instead of only 1 time, and should be on the latest one.

Thanks

engcom-Hotel on Nov 10, 2022

Having had this crushing our relatively small store (few thousand products, a dozen customer groups, a dozen catalog price rules) by hanging up the process for indexer_update_all_views for hours: I’m trying a different approach.

Rather than building out a massive batch walker update to deal with multiple version_ids per entity_id and get the max for each one, I’m updating the generic table definitions and triggers so that:

The entity_id is inserted if it isn’t present already
If it already exists the version_id is updated to the newest auto_increment value since all we really care about is that the “latest” version_id in the changelog table for an entity_id is greater than the pointer for that table stored in mview_state.

Yes, there’s additional overhead, but it’s orders of magnitude less than churning the indexers constantly. This still required manual intervention listed below to deploy, but until the index is added the code continues to function similarly to Magento’s current implementation.

Setting indexers to “update on save”
Flushing cache
Deleting the existing _cl tables manually
Reset the mview_state.version_id to 0 for all indexers
Setting the indexers back to “update by schedule” to recreate the _cl tables
Invalidating existing indexes and reindexing to ensure all changes are captured

fix-nonunique-changelog-indexers.txt

CHallski on Nov 9, 2022

I believe this issue is still valid, could you still try to reproduce this @engcom-Hotel?

What you should look for (if I understand the issue correctly) is for duplicated entity_id’s in the changelog tables (like catalogsearch_fulltext_cl or catalogrule_product_cl or catalog_product_attribute_cl, …) and then figure out if when the partial reindexing system runs, if the same entity_id is being processed once or more then once. The expectation is that the same entity_id is only processed once in a single partial reindex run.

Thanks!

hostep on Oct 22, 2021

Hello @gamort,

We have noticed that this issue has not been updated for a period of 14 Days. Hence we assume that this issue is fixed now, so we are closing it. Please raise a fresh ticket or reopen this ticket if you need more assistance on this.

Regards

engcom-Hotel on Oct 22, 2021

@hostep , @ihor-sviziev we will take a look at it and prioritize in case it’s still reproducible

sidolov on Apr 9, 2021

The ReachDigital_BetterIndexer module however didn’t fill up every loop with 1000 items, only the first and third loop contained 1000 items in my case, the others were ~650, ~750 and ~450 items big.

Not sure if there is any perf difference measurable the batch size is choosen arbitrarily at 1000, so no guaranteers that more will be faster.

as the ReachDigital_BetterIndexer contains more than just this bugfix

Yeah it contains another important step, it can be that memory becomes an issue when running very large batches or faulty indexes.

By default it will only save the indexer staat after everything is indexed: https://github.com/magento/magento2/blob/9544fb243d5848a497d4ea7b88e08609376ac39e/lib/internal/Magento/Framework/Mview/View.php#L309

In de indexer it will save after every batch, which makes recovering a lot easier. https://github.com/ho-nl/magento2-ReachDigital_BetterIndexer/blob/master/src/Plugin/ViewPlugin.php#L167-L169

Also on crash it will not leave the indexer in a processing state: https://github.com/ho-nl/magento2-ReachDigital_BetterIndexer/blob/master/src/Plugin/ViewPlugin.php#L197

paales on Apr 6, 2023

An explanation from a colleague which maybe helps clarify if there is confusion

suppose we had this list: 1, 9, 2, 5, 20, 3
and we used a batch size of 2
first, the broken logic would process 1,2 and then skip 9
next, it would process 2,3 and skip 5
finally it would process 3,20

mpchadwick on Mar 30, 2023

@ihor-sviziev: can this be re-opened and re-validated? Now that there is more info.

hostep on Aug 25, 2022

Hello @hostep and @ihor-sviziev

I’ll share my final conclusion after some test.

I migrated from 2.3.4 to 2.4.3-p1 recently. After a full product and stock import, I have some very long scheduled index and even some are stuck forever. It creates 800k line with only 30k distinct for “catalogrule_product_cl” as an Example.

(I have 0 rule in place, but the indexer takes more than 3 days, never finished… this is another issue, a foreach product than a foreach rule)

But for me, by reading the source code,

There is an issue when the mview is being updated if the same entity_id is present in two different chunk of the batch size. On 2.3.4, they first retrieve the max 100k versions, make them unique and batch on them

2.3.4 https://github.com/magento/magento2/blob/5f3b86ab4bd3e3b94e65429fed33f748d29c1bbe/lib/internal/Magento/Framework/Mview/View.php#L292-L311

https://github.com/magento/magento2/blob/5f3b86ab4bd3e3b94e65429fed33f748d29c1bbe/lib/internal/Magento/Framework/Mview/View/Changelog.php#L145-L166

3.4.3-p1

https://github.com/magento/magento2/blob/8afdb9858d238392ecb5dbfe556068ec1af137bc/lib/internal/Magento/Framework/Mview/View.php#L299-L315

https://github.com/magento/magento2/blob/8afdb9858d238392ecb5dbfe556068ec1af137bc/lib/internal/Magento/Framework/Mview/View/ChangeLogBatchWalker.php#L35-L58

Franck.

fgarnier-hublot on Nov 25, 2021

@hostep, yeah, that’s my point - it shouldn’t reproduce.

ihor-sviziev on Oct 22, 2021

Yeah indeed, I would indeed handle making them unique when reading and not when writing.

But it appears this already happens like you indicate due to the distinct call, so maybe the issue is no longer reproducible at this time?

hostep on Oct 22, 2021