magento2: 2.3 Customer module Recurring setup script performance problems.

When running bin/magento setup:upgrade on a Magento CE 2.3.x installation (or Magento Open Source), there is an unexpected delay in the recurring setup script execution for the Magento_Customer module, every time bin/magento setup:upgrade is run. This is more pronounced on large data sets (>500K customers).

References

Preconditions (*)

  1. Magento CE 2.2.x (or 2.3.x) -> 2.3.x upgraded codebase (pre DB upgrade)
  2. A large customer database (>500K records).

Steps to reproduce (*)

  1. After codebase upgrade, proceed to run bin/magento setup:upgrade
  2. Observe execution delay on process step:
Module 'Magento_Customer':
Running data recurring...

Repeat these steps and you will notice that, since there is a recurring upgrade script, it runs every time.

Expected result (*)

  1. No recurring data scripts run, or they are more performant.

Actual result (*)

  1. Recurring data scripts run with each attempt to upgrade the DB.

After the upgrade finishes, you can run bin/magento setup:upgrade again and you will hit this problem again. I am not sure of the need for, or reason behind, a recurring upgrade, but from the reference posted at the top of this issue it’s clear the intent is to handle reindexing on upgrades. This seems unwise and leaves room for abusing recurring upgrade scripts with patch-like behavior or long-running processes, which can delay deployment times.

Do you have any background regarding the nature of the change?

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 26
  • Comments: 59 (33 by maintainers)

Most upvoted comments

Let’s celebrate the first birthday of this issue 🎂 🎉 🎉

If you have cron enabled (which is also natural for production environments), the reindex for an invalid index will be performed in the background, so the case where you run setup:upgrade with an invalid index is very rare; in that case the reindex will be performed, which is OK for that case.

@o-iegorov I’m nearly certain this portion of your recent statement is factually incorrect. The logic in this recurring data upgrade (the “fixed” version in 2.4) doesn’t account for schedule vs. realtime modes separately; it simply calls reindexAll when it believes the indexer should be run: https://github.com/magento/magento2/commit/0fd8a5146cdf4e524150e68f89085d90f0d42be3#diff-0ac6816ed3ec11b7a9c59731fae99d4bR43-R44

Calling reindexAll in the above will simply result in the entire index being rebuilt regardless of the indexer mode. So the penalty still exists, and it would not (although I have not tested it) result in asynchronous execution of the indexer, unless I’m missing some other fundamental change in the 2.4 codebase regarding indexers.

Hi, we’ve recently run into this issue within a Magento installation with 1100K customers. I’ve been investigating, and this is what I found, just in case it is useful for someone.

I know this issue is about setup:upgrade performance related to the customer_grid indexer, and this comment is about the customer_grid indexer’s inner performance, but since it also affects setup:upgrade when reindexing all, I thought it would make sense to post it here.

About this comment:

It works just because it’s invalidated and reindexed in the background by a cron job. There is no mview processor for this indexer, just a dummy.

Although it’s true it has only a dummy mview, the indexer does not get invalidated and reindexed by a cron job; it is reindexed synchronously upon Customer and Customer Address save, at \Magento\Customer\Model\Customer::reindex and \Magento\Customer\Model\Address::reindex, respectively. The index only gets invalidated when a customer attribute used in the grid is added, modified, or deleted, so a full reindex is needed to rebuild the grid table properly.

At https://support.magento.com/hc/en-us/articles/360025481892-New-customer-records-are-not-displayed-in-the-Customers-grid-after-importing-them-from-CSV it says customer_grid index is not supported by “Update by schedule” due to performance reasons, but it does not specify any detail.

Digging a little deeper, we arrive soon at https://github.com/magento/magento2/blob/2.4-develop/app/code/Magento/Customer/Model/Indexer/Source.php, the data source provider for customer grid data. It provides an iterator to supply data to be indexed:

    /**
     * Retrieve an iterator
     *
     * @return Traversable
     */
    public function getIterator()
    {
        $this->customerCollection->setPageSize($this->batchSize);
        $lastPage = $this->customerCollection->getLastPageNumber();
        $pageNumber = 1;
        do {
            $this->customerCollection->clear();
            // setCurPage() translates into a LIMIT/OFFSET on the underlying query.
            $this->customerCollection->setCurPage($pageNumber);
            foreach ($this->customerCollection->getItems() as $key => $value) {
                yield $key => $value;
            }
            $pageNumber++;
        } while ($pageNumber <= $lastPage);
    }

Benchmarking this method, we found that at each step the execution time increases a bit. After many steps, the time elapsed per step can increase even 10x. A quick look at the code reveals the issue.

At each step, the same query is performed to retrieve data, with a different SQL LIMIT offset value. With LIMIT [offset,] row_count and assuming a batch size of 10000, consecutive queries would look something like this (very simplified):

  • SELECT * FROM huge_table LIMIT 0, 10000
  • SELECT * FROM huge_table LIMIT 10000, 10000
  • SELECT * FROM huge_table LIMIT 20000, 10000
  • (…)
  • SELECT * FROM huge_table LIMIT 1090000, 10000
  • SELECT * FROM huge_table LIMIT 1100000, 10000

MySQL starts building query results and returns them as soon as it has the needed number. That is easy for the first query, but for the last one it has to generate internally (due to joins, ordering, etc.) offset + 10000 rows, only to return the last 10000, discarding the first offset rows. In short:

  • Step 1: MySQL generates 10000 rows, returns 10000 rows.
  • Step 2: MySQL generates 20000 rows, returns 10000 rows.
  • Step 3: MySQL generates 30000 rows, returns 10000 rows.
  • (…)
  • Step 109: MySQL generates 1090000 rows, returns 10000 rows.
  • Step 110: MySQL generates 1100000 rows, returns 10000 rows.
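The pattern described above is easy to demonstrate outside Magento. Below is a minimal sketch in Python with SQLite (the table and column names are made up for illustration): keyset pagination (WHERE id > last_id) seeks through the primary-key index once, while LIMIT/OFFSET pagination forces the engine to generate and discard all preceding rows on every page.

```python
import sqlite3

# In-memory table standing in for the customer grid source (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE huge_table (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO huge_table (id, email) VALUES (?, ?)",
    [(i, f"user{i}@example.com") for i in range(1, 1001)],
)

BATCH = 100

def offset_pages():
    # LIMIT/OFFSET: the engine must produce offset + BATCH rows per page.
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT id, email FROM huge_table ORDER BY id LIMIT ? OFFSET ?",
            (BATCH, offset),
        ).fetchall()
        if not rows:
            return
        yield rows
        offset += BATCH

def keyset_pages():
    # Keyset: seek directly via the primary-key index; cost per page stays flat.
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, email FROM huge_table WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, BATCH),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]

offset_rows = [r for page in offset_pages() for r in page]
keyset_rows = [r for page in keyset_pages() for r in page]
assert offset_rows == keyset_rows  # same data, very different cost profile
```

Both iterators yield identical data; the difference is that the offset variant's per-page cost grows linearly with the offset, which is exactly the behavior observed in the benchmark above.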

A real example, using a query generated by the indexer; note the offset and the elapsed time (two screenshots omitted: the same query with a small offset and with a very large offset).

1.7 ms vs 20.4 s is a huge difference. Our solution looks like this:

    /**
     * Retrieve an iterator
     *
     * @return Traversable
     */
    public function getIterator()
    {
        // Number of id pages, based on the filtered collection's total count.
        $customerIdLastPage = ceil($this->count() / $this->customerIdsBatchSize);

        if (0 < $customerIdLastPage) {
            $customerCollection = clone $this->customerCollection;
            $customerIdPageNumber = 0;

            do {
                // Cheap id-only query; offsets here stay inexpensive.
                $customerIds = $this->customerCollection->getAllIds(
                    $this->customerIdsBatchSize,
                    $customerIdPageNumber * $this->customerIdsBatchSize
                );

                foreach (array_chunk($customerIds, $this->batchSize) as $customerIdsChunk) {
                    // Replace the WHERE clause with an indexed IN (...) filter.
                    $customerCollection->clear();
                    $customerCollection->resetData();
                    $customerCollection->getSelect()->reset(\Magento\Framework\DB\Select::WHERE);
                    $customerCollection->addFieldToFilter(
                        $this->getIdFieldName(),
                        ['in' => array_map('intval', $customerIdsChunk)]
                    );

                    foreach ($customerCollection->getItems() as $key => $value) {
                        yield $key => $value;
                    }
                }

                $customerIdPageNumber++;
            } while ($customerIdPageNumber < $customerIdLastPage);
        }
    }

Explained:

  • We take advantage of $this->customerCollection already having its filters applied to retrieve which customer ids should be affected by the reindex.
  • We split customer id retrieval into chunks. We set customerIdsBatchSize to 100000 to avoid retrieving 1100K customer ids at once.
  • For each chunk of customer ids, we split again into chunks of the batch size, to return data in batches of that size.
  • Once we know which customer ids we have to fetch data for, we can remove the collection’s WHERE part and replace it with the customer ids to be processed. This is possible because the filters were already applied when retrieving the customer ids.
  • Using the customer id in the WHERE clause lets MySQL use the column index to perform the search faster, and avoids generating unneeded rows because of the offset.
  • Separately from this code, we have reduced the batch size from 10000 to 100, to avoid locking tables for a long time (we prefer to query more often, now that the query issue is solved) and to instantiate fewer customer items at once (10000 → 100) to reduce memory usage (it also seems to help execution speed, but I’m not 100% sure). This is optional.
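The two-level batching described above can be sketched in plain Python with SQLite (a sketch of the approach only; table and variable names are illustrative, not Magento APIs): fetch ids in large cheap pages, then load full rows in small indexed IN (...) chunks.

```python
import sqlite3

# Stand-in for the customer entity table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (entity_id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(i, f"c{i}@example.com") for i in range(1, 251)],
)

ID_BATCH = 100   # ids fetched per id page (plays the role of customerIdsBatchSize)
ROW_BATCH = 10   # full rows loaded per IN (...) chunk (plays the role of batchSize)

def iter_rows():
    offset = 0
    while True:
        # Cheap query: ids only, served from the primary-key index.
        ids = [r[0] for r in conn.execute(
            "SELECT entity_id FROM customer ORDER BY entity_id LIMIT ? OFFSET ?",
            (ID_BATCH, offset),
        )]
        if not ids:
            return
        # Load full rows in small chunks filtered by id: no large offsets
        # on the expensive (joined, in the real case) data query.
        for i in range(0, len(ids), ROW_BATCH):
            chunk = ids[i:i + ROW_BATCH]
            marks = ",".join("?" * len(chunk))
            yield from conn.execute(
                f"SELECT entity_id, email FROM customer "
                f"WHERE entity_id IN ({marks}) ORDER BY entity_id",
                chunk,
            )
        offset += ID_BATCH

rows = list(iter_rows())
```

The offset is only ever applied to the id-only query, which is cheap to paginate, while the heavy data query always filters by primary key.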

The result for us is that as the steps execute, the execution time of each remains almost constant. This may not make a difference for small reindexes, but it really does for databases with large customer tables.

Also, we’ve separately implemented the mview system and “Update by schedule” for this indexer; we’ll check how it works and find out which performance issues the Magento page didn’t explain. I’ll let you know if I find something new about that.

@davidalger Please read the provided link carefully: (screenshot of the linked article)

It works just because it’s invalidated and reindexed in the background by a cron job. There is no mview processor for this indexer, just a dummy.

The upgrade calls reindexAll only when the indexer is invalid, which is a rare case in production. When the invalidation was performed by some setup script (while adding an attribute, for example), the reindex should be performed, but not on every setup:upgrade. The reindex itself in this indexer is very tricky: for example, it creates related database tables and cannot be replaced with a simple invalidation. But if you know of cases where the current solution can be improved, please create a related PR.

This fix was also delivered to 2.3-develop and will be part of the 2.3.6 release.

Hi @dambrogia, hi @allamsettiramesh, I’m reopening this as it was not fixed. (screenshot)

Reopening to verify with a vanilla CE instance and 500k customer accounts.

Hi @vbuck, @Nazar65, @o-iegorov.

Thank you for your report and collaboration!

The issue was fixed by Magento team. The fix was delivered into magento/magento2:2.4-develop branch(es). Related commit(s):

The fix will be available with the upcoming 2.4.0 release.

@magento-engcom-team so you are asserting that https://github.com/magento/magento2/blob/2.3-develop/app/code/Magento/Customer/Setup/RecurringData.php is not run on every setup:upgrade? This is a big problem for a real store, probably not noticeable in a dev box with sample data.

The issue was about a reindex in a recurring script and the solution has a reindex in a recurring script.

@sdzhepa Is there any status update on what happened to this ticket? You state you “suppose the issue has been resolved”, but I don’t see anything related to that within this thread of comments/replies. Also, the tags Reproduced on 2.3.x and Fixed in 2.2.x are quite conflicting (added on Dec 5). Tagging the issue as Ready for Work on Dec 5 and then removing the assignment on Dec 5 without any reference as to what changed is also quite confusing if this “has been resolved”.

I’m using 2.3 EE and I’m seeing >1hr update times because Magento is reindexing the customer_grid index on every bin/magento setup:upgrade statement (occurs in the Magento_Customer module).

Is there a reason this needs to happen? It seems like this should not happen during the update.

The purpose of the update script is to install/update/modify schema between versions. The purpose of the indexers is to speed up lookups.

The actions seem mutually exclusive and separate from each other. Can anyone explain why this reindex is needed during the update? And if it is in fact needed, can anyone elaborate on what we can do to improve its performance?

From what I read this is solved in versions 2.3.6 and 2.4.0. If you’re running versions below these, I’d recommend upgrading.

Too bad that it takes 6 months for a confirmed and fixed issue to be released, not to speak of the 17 months it took to actually fix it. In case you use Composer, this is a quick way to patch your install:

cd vendor/magento/module-customer
curl -S https://github.com/magento/magento2/commit/0fd8a5146cdf4e524150e68f89085d90f0d42be3.diff | patch -p5 
curl -S https://github.com/magento/magento2/commit/436d0ae410101e526ac9326483788153de507f26.diff | patch -p5 

@o-iegorov Very interesting. I missed that note on the page. Thanks for the followup. Also g2k regarding 2.3.6 release. 👍

@davidalger @hostep I do agree with you that it’s not a perfect solution and could be improved, but it significantly improves the situation when no customer attributes were affected. Feel free to create a pull request with an improvement based on your suggestions.

For those waiting for a fix, I suggest using a composer patch that removes vendor/magento/module-customer/Setup/RecurringData.php; here’s mine:

From 7e3448e72335f31cb8fd52a1fedee23b265075bb Mon Sep 17 00:00:00 2001
From: Lorenzo Stramaccia <lorenzo.stramaccia@magespecialist.it>
Date: Thu, 14 Feb 2019 17:56:34 +0100
Subject: [PATCH] Remove Magento/Customer/Setup/RecurringData.php

---
 .../Magento/Customer/Setup/RecurringData.php  | 43 -------------------
 1 file changed, 43 deletions(-)
 delete mode 100644 vendor/magento/module-customer/Setup/RecurringData.php

diff --git a/vendor/magento/module-customer/Setup/RecurringData.php b/vendor/magento/module-customer/Setup/RecurringData.php
deleted file mode 100644
index fbef4c05d126..000000000000
--- a/vendor/magento/module-customer/Setup/RecurringData.php
+++ /dev/null
@@ -1,43 +0,0 @@
-<?php
-/**
- * Copyright © Magento, Inc. All rights reserved.
- * See COPYING.txt for license details.
- */
-
-namespace Magento\Customer\Setup;
-
-use Magento\Framework\Indexer\IndexerRegistry;
-use Magento\Framework\Setup\InstallDataInterface;
-use Magento\Framework\Setup\ModuleContextInterface;
-use Magento\Framework\Setup\ModuleDataSetupInterface;
-use Magento\Customer\Model\Customer;
-
-/**
- * Upgrade registered themes.
- */
-class RecurringData implements InstallDataInterface
-{
-    /**
-     * @var IndexerRegistry
-     */
-    private $indexerRegistry;
-
-    /**
-     * Init
-     *
-     * @param IndexerRegistry $indexerRegistry
-     */
-    public function __construct(IndexerRegistry $indexerRegistry)
-    {
-        $this->indexerRegistry = $indexerRegistry;
-    }
-
-    /**
-     * {@inheritdoc}
-     */
-    public function install(ModuleDataSetupInterface $setup, ModuleContextInterface $context)
-    {
-        $indexer = $this->indexerRegistry->get(Customer::CUSTOMER_GRID_INDEXER_ID);
-        $indexer->reindexAll();
-    }
-}

Regarding the pull request, I don’t know what to say; we have to wait. I don’t know if @orlangur is away, on holiday, or something else; however, I opened it on the 14th of February.

So when the customer grid in the admin needs a reindex, the whole world should wait? Some data in an admin grid might look outdated, so let’s make the downtime of the whole shop longer?

I’ve read the thread, but still: why not schedule the indexer so that it can be run asynchronously later, instead of executing it directly in the upgrade script? The upgrade itself takes seconds to execute, but all those Recurring.php scripts take dozens of seconds, or sometimes minutes. I agree that some things must be checked to keep the DB consistent, but does it make sense that the shop is in maintenance mode only because something in a grid or in a sales report has changed?

Whatever.

Any news on this? It’s a huge issue; I’m seeing performance problems because of it on many large projects.

@LiamTrioTech: have you read this comment and this comment? It says it is fixed in 2.4.0 and will be fixed in 2.3.6 as well.

@o-iegorov Where do you see that the customer grid index only supports reindex on save? I have this index running in production today in schedule mode. This is an index using the materialized view pattern in M2, and it would make little sense for it to only support On Save.

When the indexer is in an invalid state, the Magento cron job will perform a full reindex.

This may be correct, but what I’m saying is that what the upgrade does is call reindexAll which does not mark the index as invalid, it runs the index synchronously. So IF the upgrade runs while the indexer is in an invalid state, or should the indexer be marked invalid by say an upgrade routine that’s adding an attribute to the customer grid, the index will still run synchronously during the upgrade rather than simply letting the cron run the reindex to cleanup the invalid state.

@ihor-sviziev I don’t have time at the moment to create a PR to further enhance this on 2.4 (unfortunately). On the 2.3 project that highlighted this for me with a 40 minute grid reindex, I’m simply deleting the Recurring Data script from the customer module as a workaround. I’m mainly posting for the sake of others as I read the original comment to infer something regarding the asynchronicity of the indexer as it relates to the setup upgrade routine.

If the customer grid index is invalid it will perform a reindex during setup:upgrade and the store will be in maintenance for minutes during deploy. If this is the Magento solution i will update my patch to remove the new RecurringData script.

@slackerzz we have progress! The internal team has started working on this issue! https://github.com/magento/magento2/pull/26163#issuecomment-583422293

Please leave it open until #21235 is merged 😃

(this bot removing all these labels is kind of annoying to be honest)