magento2: Layout corruption with php-fpm (a Magento 1 problem returns)

Preconditions

  1. Run Magento 2 (any version) under php-fpm. Any valid php version - tested with php 5.6 to 7.0.11

Steps to reproduce

  1. Hit it with heavy traffic including xmlrpc calls
  2. Visit a page on frontend or backend

Expected result

Actual result

  1. Page displays without head tag (sometimes - intermittent)

This is fixed by clearing the layout cache. This displays in the same way as an issue in Magento 1 - see https://www.c3media.co.uk/blog/c3-news/security-fix-might-take-site

The fix in Magento 1 was to add the following to bootstrap.php:

if (function_exists('libxml_disable_entity_loader')) {
    libxml_disable_entity_loader(false);
}

Why is this needed? Because there are a number of places in Magento 2 where $oldval = libxml_disable_entity_loader(true) is called and then later libxml_disable_entity_loader($oldval). In theory this should be fine, but the value is not thread-safe, so when running under php-fpm it only requires a call to occur in between these statements for the entire thread to end up set to TRUE. Why this then corrupts to cache it was never clear in Magento 1, but that was the upshot.

The above fix was added in Magento 1.9.2.0, but was not included in Magento 2. In one file (vendor/magento/framework/Xml/Security.php) there is an acknowledgment of the issue:

        /**
         * If running with PHP-FPM we perform an heuristic scan
         * We cannot use libxml_disable_entity_loader because of this bug
         * @see https://bugs.php.net/bug.php?id=64938
         */

…but sadly it still calls libxml_disable_entity_loader despite the warning. This is also called in some of the Zend and Symfony components, so it’s not really reasonable to handle all these situations. Thus the need for the suggested fix.

Not entirely sure how to fix this in the short-term, as unlike Magento 2 we cannot just edit the bootstrap.php file in Magento 2.1.0. Any suggestions?

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Comments: 32 (12 by maintainers)

Commits related to this issue

Most upvoted comments

The real production case for which I’ve spent 40 hours debugging:

Symptoms Page with empty HTML. Literally any page can be empty. (we catched it on production on checkout/success page)

Full cache flush helps temporary.

Pre-requisites

  • AWS Infra:
    • front load balancer with two Varnish+nginx servers
    • internal load balancer with several Apache+mod_php servers
    • redis master + redis replica
  • Magento cache configuration:
    • master node + replica for read requests
    • write_control option is by default (Zend_Cache_Core feature)
  • High load during promo campaign

How It Happens

  • customer sends request when Varnish and Redis cache are empty
  • Magento builds layout and saves it in 3 cache keys: layout XML, page layout (for example: “1column”), JSON with layout structure
  • after every cache save Zend_Cache_Core checks written value by reading it again and comparing to original value
    • if values do not match it deletes created cache key
  • under high load and in case of Redis master+slave configuration Redis doesn’t have enough time to replicate changes
  • Zend_Cache_Core decides that value was not written correctly and tries to delete it (deletion happens on Master node of redis)
  • sometimes it happens that only 1st item from layout cache is saved (XML), but 2nd (1column) and 3rd (JSON structure) are not saved to cache
  • on next page load of the same URL customer gets to 2nd Varnish node which doesn’t have cache
  • Magento starts to build layout from partially cached data, but fails to do it correctly
    • Magento reads 1st item (XML) from cache
    • then it tries to read 2nd item (“1column”) but do not check returned value (!)
    • then it builds layout structure JSON and saves broken value to cache (“root”, “head” and many other blocks/containers are absent, because of pageLayout==false)
    • and finally caches empty HTML in Varnish

How to easily reproduce

  1. Open redis-cli and find cache keys of your page: KEYS *LAYOUT* You may need to filter by layouts hash.
  2. Delete *_PAGE_LAYOUT cache keys (redis command “DEL”)
  3. Delete STRUCTURE_LAYOUT cache keys (redis command “DEL”)
  4. If you have Varnish - flush it. If you have full page cache - delete those items from redis manually.
  5. Refresh page

Actual result: Empty page, no HTML code at all. Error in logs “before.body.end” doesn’t exists.

Root cause Magento does not check return of cache item with page layout value (for example: “1column”, “3columns”)

Permanent fix Store Layout XML and Page Layout items in one cache item (*_PAGE_CACHE_MERGED) It will eliminate the case when one item is lost and also it should give very tiny performance improvement, because 1 item is saved/loaded instead of two.

Pull requests

Other action items

  • Check how magento behaves on Redis cache with write_control=false

Sorry, that it’s been a while. This bug came back to mind when I noticed one of sites still had layout cache disabled in production as a precaution. Lo and behold, the bug still exists in the latest version of Magento (currently 2.2.3). So here’s what I’ve discovered.

Steps to Reproduce:

  1. Install Magento with Redis cache (I used db 0 for sessions, 1 for cache, 2 for FPC).

  2. Navigate to a page not cached by FPC (I used customer/account/login).

  3. Delete the page layout cache key in Redis. (Found by doing a var_dump of $cacheIdPageLayout in Magento\Framework\View\Model\Layout\Merge::load).

  4. Refresh the page and behold the headlessness of your broken page.

Now this is just how to quickly reproduce what happens under heavy traffic as mentioned in the initial bug report. We have only experienced this happening under heavy traffic as well. The problem is in the logic of this function:

// Magento\Framework\View\Model\Layout\Merge
public function load($handles = [])
...
$cacheId = $this->getCacheId();
$cacheIdPageLayout = $cacheId . '_' . self::PAGE_LAYOUT_CACHE_SUFFIX;
$result = $this->_loadCache($cacheId);
if ($result) {
    $this->addUpdate($result);
    $this->pageLayout = $this->_loadCache($cacheIdPageLayout);
    return $this;
}
...

We can see here that there is a check as to whether the layout updates were loaded from cache, but if they were, it’s just assumed that page layout will be loaded from cache as well. The problem is that when Redis reaches its maxmemory setting (as it might under heavy traffic), it will begin evicting keys (see https://redis.io/topics/lru-cache). So if the page layout key is evicted while the layout key is not, the bug occurs. To fix this bug, we have to check for a successful load of both layout and page layout from the cache.

I will create a PR with a fix for this. Hopefully Magento can get this fixed quickly and we’ll all be able to turn layout cache back on!

@TomashKhamlai why was this issue closed while PR https://github.com/magento/magento2/pull/16428 and https://github.com/magento/magento2/pull/14129 have never been merged? It seems to me that https://github.com/mrIntegrator/magento2/blob/2.3-develop/lib/internal/Magento/Framework/View/Model/Layout/Merge.php#L441 hasn’t changed either and I assume this issue still exists.

We’ve been having the same problem with a M2.3 store when redis is set so maxmemory-policy allkeys-lru. Currently testing with volatile-lru.