unit: apcu_fetch(): Failed to acquire read lock on production testing
Hello,
We are trying to replace php-fpm with nginx unit, running inside docker, but after tuning the setups to the same resources and configuration, we get this error from apcu lib:
Noticed exception 'ErrorException' with message 'apcu_fetch(): Failed to acquire read lock' in /var/www/.../ApcuStore.php:23
The code there is very simple:
    /**
     * @param array $items
     * @param int $ttl
     */
    public function store(array $items, int $ttl): void
    {
        apcu_store($items, null, $ttl);
    }

    public function fetch($identifiers): array
    {
        return apcu_fetch($identifiers);
    }
Any idea why this is happening? Do we need to change something in regards to concurrency locks now that php is running in embed mode?
Thank you in advance! R
About this issue
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 18 (11 by maintainers)
Commits related to this issue
- Isolation: switch to fork(2) & unshare(2) on Linux. On GitHub, @razvanphp & @hbernaciak both reported issues running the APCu PHP module under Unit. When using this module they were seeing errors li... — committed to ac000/unit by ac000 2 years ago
- Isolation: Switch to fork(2) & unshare(2) on Linux. On GitHub, @razvanphp & @hbernaciak both reported issues running the APCu PHP module under Unit. When using this module they were seeing errors li... — committed to nginx/unit by ac000 2 years ago
Here’s what I’ve found so far.
tl;dr
The issue seems to be caused by our usage of clone(2) (well actually we call the kernel system call directly and not the glibc wrapper).
Gory details
This all applies to Linux (maybe others…)
Some background
The APCu module uses a mmap(2)'d file as a shared memory cache between processes. At the start of this memory is a header, and in that header is a lock structure for whatever locking mechanism is in use, which on Linux will most likely be POSIX read/write locks (pthread_rwlock_*), which use a pthread_rwlock_t structure to store state.
One of the members of the pthread_rwlock_t structure is __cur_writer; that will be important.
Test setup
Latest(ish) unit (1.28.0-30-ga0327445) compiled from source, latest APCu (5.1.22) compiled from source
Unit configuration
PHP test
Then hit it with ApacheBench
The problem
We get errors like 'apcu_fetch(): Failed to acquire read lock' due to glibc (well, I've only been debugging this on Fedora) detecting a potential deadlock situation and failing the lock acquisition, via checks that compare __cur_writer against the calling thread's ID.
Which is kind of weird. There are no threads involved, and a process will acquire a read or write lock, do its thing, then release it.
Testing the same exact thing under php-fpm showed no issues.
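To make the expected pattern concrete, here is a minimal sketch of a process-shared rwlock sitting at the start of a shared mapping, with a fork(2)'d child that takes the write lock, does its thing, and releases it. This is not APCu's code: the struct shm_cache layout and the shm_cache_* helper names are made up for illustration, and an anonymous shared mapping stands in for the mmap(2)'d file to keep the example short.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical layout: lock first, data after it (not APCu's real header). */
struct shm_cache {
    pthread_rwlock_t lock;
    long value;
};

/* Map a shared region and initialise a process-shared rwlock inside it. */
struct shm_cache *shm_cache_create(void)
{
    struct shm_cache *c = mmap(NULL, sizeof(*c), PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (c == MAP_FAILED)
        return NULL;

    pthread_rwlockattr_t attr;
    if (pthread_rwlockattr_init(&attr) != 0) {
        munmap(c, sizeof(*c));
        return NULL;
    }
    /* Without PTHREAD_PROCESS_SHARED the lock is only valid in one process. */
    pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    int err = pthread_rwlock_init(&c->lock, &attr);
    pthread_rwlockattr_destroy(&attr);
    if (err != 0) {
        munmap(c, sizeof(*c));
        return NULL;
    }
    c->value = 0;
    return c;
}

/* fork(2) a child that increments the value under the write lock.
 * Returns 0 if the child locked, updated and unlocked cleanly, else -1. */
int shm_cache_bump_in_child(struct shm_cache *c)
{
    assert(c != NULL);
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        if (pthread_rwlock_wrlock(&c->lock) != 0)
            _exit(1);
        c->value++;
        pthread_rwlock_unlock(&c->lock);
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

/* Read the value under the read lock; returns -1 if locking fails. */
long shm_cache_read(struct shm_cache *c)
{
    assert(c != NULL);
    if (pthread_rwlock_rdlock(&c->lock) != 0)
        return -1;
    long v = c->value;
    pthread_rwlock_unlock(&c->lock);
    return v;
}
```

Calling shm_cache_create() once from main(), then shm_cache_bump_in_child() a few times followed by shm_cache_read(), should show the increments made by the children. With fork(2) there is nothing here that ought to look like a deadlock to the C library.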
… some time later …
Having a look again with gdb after we have acquired a lock (either one) showed the pthread_rwlock_t structure as follows (this is for unit)
At first glance it all looks sane.
However, remember that __cur_writer member? That's supposed to store the PID of the process which currently holds the write lock.
Let's look at what processes we have
Here __cur_writer == pid of the main unit process. Hmm, that’s not right! (It should be one of the php application processes).
Just for a sanity check, I looked at the same thing with php-fpm
Here __cur_writer == pid of the php-fpm worker process. Good, as expected.
I started suspecting it could have something to do with how we create the processes, using clone rather than fork(2).
Indeed, forcing unit to use fork(2) (which it will do if there is no clone available) made the problem go away. I also replicated this issue with a simple test program to verify.
Solution
So to recap, the problem is that when glibc checks the current thread id against what's stored in __cur_writer, they both seem to be set to the PID of the main unit process, and so glibc thinks the process is trying to acquire a lock (either one) when it already has the write lock, but it doesn't really.
At the moment I'm not entirely sure how to fix this.
Currently our processes are created something like
whereas if you use fork(2) which is implemented with clone, you get
I believe the key bits here are the CLONE_CHILD_SETTID flag and the setting of child_tidptr, which in glibc's case points into some core thread memory area which I assume updates THREAD_SELF->tid as used in the above glibc check.
More investigation required.
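The difference between the two process-creation paths can be sketched in isolation. This is not the test program mentioned earlier, just an illustration under assumptions: the raw system call is made with only SIGCHLD as flags (the exact flags unit passes are not reproduced here), the argument order is the x86-64/aarch64 one, and rdlock_result_in_child(), try_read_lock() and the GOT_* codes are invented names. The parent takes the write lock, creates a child one way or the other, and the child reports what a timed read-lock attempt returned.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Exit codes the child uses to report the outcome (hypothetical names). */
enum { GOT_LOCK = 0, GOT_EDEADLK = 1, GOT_TIMEOUT = 2, GOT_OTHER = 3 };

/* Child side: try to take the read lock, giving up after about 200 ms. */
static int try_read_lock(pthread_rwlock_t *lock)
{
    assert(lock != NULL);
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_nsec += 200L * 1000 * 1000;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }
    int err = pthread_rwlock_timedrdlock(lock, &deadline);
    if (err == 0) {
        pthread_rwlock_unlock(lock);
        return GOT_LOCK;
    }
    if (err == EDEADLK)
        return GOT_EDEADLK;
    if (err == ETIMEDOUT)
        return GOT_TIMEOUT;
    return GOT_OTHER;
}

/* Parent holds the write lock, then creates a child via fork(2) or a raw
 * clone system call. Returns the child's GOT_* code, or -1 on setup failure. */
int rdlock_result_in_child(int use_raw_clone)
{
    pthread_rwlock_t *lock = mmap(NULL, sizeof(*lock), PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (lock == MAP_FAILED)
        return -1;

    pthread_rwlockattr_t attr;
    pthread_rwlockattr_init(&attr);
    pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_rwlock_init(lock, &attr);
    pthread_rwlockattr_destroy(&attr);

    /* The parent becomes the writer recorded in the lock. */
    pthread_rwlock_wrlock(lock);

    /* Raw clone: no CLONE_CHILD_SETTID and no child_tidptr, so the C library
     * gets no chance to refresh its cached thread id in the child. */
    pid_t pid = use_raw_clone
        ? (pid_t)syscall(SYS_clone, (unsigned long)SIGCHLD, 0UL, 0UL, 0UL, 0UL)
        : fork();
    if (pid == 0)
        _exit(try_read_lock(lock));

    int status = 0, result = -1;
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);

    pthread_rwlock_unlock(lock);
    pthread_rwlock_destroy(lock);
    munmap(lock, sizeof(*lock));
    return result;
}
```

With fork(2) the child has its own thread id, so it simply waits for the writer and times out (GOT_TIMEOUT). With the raw clone call, a glibc that performs the check described above would be expected to return EDEADLK straight away, because the child's cached id still matches the parent's; other C libraries, or other glibc versions, may behave differently.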
Updated branch with improved version.
Great @ac000! if you still need anything, just let me know 😃