msphpsql: Network timeout leads to full driver crash when using connection pooling: SMux Provider Physical connection is not usable

Driver version

ODBC Driver 17 (libmsodbcsql-17.1.so.0.1), php_sqlsrv 5.2.0, unixODBC 2.3.6

unixODBC is compiled with the following commands, as recommended in bug #756:

# clean existing
rm -f /usr/lib/x86_64-linux-gnu/libodbcinst.so*
rm -f /usr/lib/x86_64-linux-gnu/libodbc.so*

# try to upgrade unixodbc-dev
export CPPFLAGS="-DSIZEOF_LONG_INT=8"
apt-get -y install gcc-5 g++-5
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 60 --slave /usr/bin/g++ g++ /usr/bin/g++-5
cd ~
echo "Configuring the unixODBC 2.3.6 Driver Manager"
wget ftp://ftp.unixodbc.org/pub/unixODBC/unixODBC-2.3.6.tar.gz
tar -xzf unixODBC-2.3.6.tar.gz
cd unixODBC-2.3.6/
./configure --prefix=/usr --libdir=/usr/local/lib --sysconfdir=/etc --disable-gui --disable-drivers --enable-iconv --with-iconv-char-enc=UTF8 --with-iconv-ucode-enc=UTF16LE 1> odbc_con.log 2> make_err.log

echo "Building and Installing the unixODBC 2.3.6 Driver Manager"
make
make install
echo "####### build ready ###### "
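As a sanity check after the build above, it can help to confirm which driver manager the ODBC driver will actually load. This is a sketch wrapped in a function so it can be run on demand; it assumes odbcinst is on PATH after make install, and that --libdir=/usr/local/lib was used as in the configure line above:

```shell
# Sketch: verify the installed unixODBC driver manager (assumes odbcinst
# is on PATH and the build used --libdir=/usr/local/lib as shown above).
verify_unixodbc() {
  odbcinst -j                        # reports the unixODBC version and ini file paths
  ls -l /usr/local/lib/libodbc.so*   # libraries installed by `make install`
}
```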

SQL Server version

Azure SQL Database

Client operating system

Ubuntu 16.04

PHP version

php 7.1.16 on Apache prefork with 10 processes

Table schema

see https://gist.github.com/jeff1985/ad9316fdb3313bf1a9e0b0926254df83#file-msodbcsqlbug-db-structure-sql

Problem description

We have identified a problem in php_sqlsrv that causes downtime of our PHP application when connection pooling is used with the latest stable drivers and a network error occurs.

We have seen this problem occur multiple times with a PHP application hosted on Azure Linux Web Apps, with PHP running in a Docker container.

After a timeout or connection error, the database driver enters a failure state in which all subsequent database requests keep throwing errors for as long as traffic keeps flowing. I suspect that the broken connection is not released from the pool despite being in a faulty state.

How to reproduce

I’m able to reproduce the problem in a development environment by using an unstable mobile internet connection.

In my repro script I select one row from the database and log the result to a file. Additionally, I generate moderate server load by running ab -n 100000 -c 10 http://xxx.local/web-frontends/public/test_incomplete_data.php. My mobile internet connection has an expected error rate of about 5%.

I can see that the first requests are served correctly, but after some initial connection errors occur, the whole server becomes unusable. All subsequent requests throw an exception with the error message: SQLSTATE[08S02]: [Microsoft][ODBC Driver 17 for SQL Server]SMux Provider: Physical connection is not usable [xFFFFFFFF]

During this time I have no problems accessing websites or querying the same database using the same internet connection from another client system.

Even after I stop incoming requests, the server remains in this faulty state for approximately 2-5 minutes and does not respond correctly even to single requests.

Only after a pause does it recover and respond to requests again. In a production environment you will never get such a pause, because users keep trying to reload the failed website. This means the server will stay in this faulty state for hours, and the only way to recover will be a server reboot.

The resulting error rate in my measurements is 98%! Furthermore, 98% of the requests have a response time below 100 ms, which clearly indicates that no new connection attempt is being made.

Expected behavior and actual behavior

The expected behavior in the case of a sporadic network error or timeout is:

  1. the server fails to connect to the database
  2. a PDOException is raised on the PHP side for each connection timeout
  3. after each timeout, the failed connection is removed from the connection pool
  4. on subsequent requests, a new connection is opened, allowing an immediate reconnect to the database

The actual behavior is:

  1. the server fails to connect to the database
  2. a PDOException is raised on the PHP side for each connection timeout
  3. all subsequent requests fail, even after internet connectivity has fully recovered
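The difference between the two sequences can be sketched as a tiny shell simulation. This is purely illustrative (the connect function and the pool variable are hypothetical stand-ins, not the real driver or pooling code): a failed connection should never enter the pool, so the next request opens a fresh one instead of reusing a broken handle.

```shell
# Hypothetical simulation of the EXPECTED pooling behavior (not real
# driver code): a connection that fails never enters the pool.
pool=""                       # holds at most one pooled "connection" token

connect() {                   # $1 simulates network state: "ok" or "fail"
  if [ -n "$pool" ]; then
    echo "reused pooled connection"
    return 0
  fi
  if [ "$1" = "fail" ]; then
    return 1                  # expected step 3: failure is NOT pooled
  fi
  pool="conn-$$"
  echo "opened new connection"
}

connect fail || echo "request 1: PDOException raised"   # steps 1-2
connect ok                                              # step 4: fresh connect succeeds
connect ok                                              # healthy connection is reused
```

The reported bug corresponds to the broken handle staying in the pool, so every later connect call keeps handing back the unusable connection.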

Repro code

All relevant logs and a repro script are available here: https://gist.github.com/ad9316fdb3313bf1a9e0b0926254df83

Note: I removed db credentials from logs and code.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 26 (14 by maintainers)

Most upvoted comments

@jeff1985 may I suggest that, after the step you posted in your first post here:

cd unixODBC-2.3.6/

you run the following to apply the changes mentioned:

wget https://patch-diff.githubusercontent.com/raw/lurcher/unixODBC/pull/16.patch
patch < 16.patch

And then continue like you did before.
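Putting the pieces together, the patch-and-rebuild flow could look like the helper below. This is a sketch, not an official procedure: it assumes the 2.3.6 source tree from the original build is still in ~/unixODBC-2.3.6, and it simply repeats the configure flags from the issue after applying the PR #16 patch. It is defined as a function so nothing runs until you review and call it.

```shell
# Hypothetical helper (assumption: ~/unixODBC-2.3.6 contains the source
# tree from the original build). Applies the unixODBC PR #16 patch and
# rebuilds with the same configure flags used in the issue.
rebuild_unixodbc() {
  cd ~/unixODBC-2.3.6/ || return 1
  wget https://patch-diff.githubusercontent.com/raw/lurcher/unixODBC/pull/16.patch
  patch < 16.patch
  ./configure --prefix=/usr --libdir=/usr/local/lib --sysconfdir=/etc \
    --disable-gui --disable-drivers --enable-iconv \
    --with-iconv-char-enc=UTF8 --with-iconv-ucode-enc=UTF16LE
  make && make install
}
```

Call rebuild_unixodbc manually, then restart Apache/PHP-FPM so the patched driver manager is picked up.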

@jeff1985 Great! Glad it solved the issue. I can’t take all the credit. @yitam helped with the repro and testing the fix and @v-chojas debugged and coded the actual fix.

As far as waiting for unixODBC 2.3.7 goes, I don’t have any insight into that release schedule, so I don’t know if you want to wait that long. We currently run 2.3.6 in our test suites for the ODBC driver. However, the changes in our patch should be safe for production, given that the previous code is basically broken for connection pooling with a Unicode driver and the scope of our changes is very narrow (you can review the changes yourself in the PR). We have no problem supporting your configuration with the patch we submitted.

Regards, David

You are my hero, @David-Engel !

I tested your patch. It solves the errors in my local repro, as well as both reported scenarios on Azure infrastructure! Well done!

Is this patch safe to use in a production environment, or should I wait for the official release of unixODBC 2.3.7?

Hello @David-Engel

I would like to give you feedback after going into production with the fix.

We had been testing the fix in non-production environments for weeks with no issues, and it has also been running in a production environment for about a week without problems. However, after executing an Azure SQL Database failover to a different region (via a failover group), a similar connection issue occurred even with the fix in place. Again, some requests failed with the message “The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable.” Based on my searching, this message seems to be related to connection resiliency. I saw similar behavior in a test environment during a load test with a low load profile executing INSERT queries. I suggest you add a master/slave failover scenario to your test suite in one of the upcoming releases!

After more testing I see that this only happens during a database failover procedure. The connection issues seem to be successfully mitigated by the patch under normal conditions, so I’m able to use it in production! My thanks to your team: @yitam and @v-chojas

@jeff1985 It looks like we found the problem in the unixODBC driver manager. We were initially able to reproduce your scenario of constant errors using the “disable the internet” method, and then again against an Azure SQL DB instance after changing the DTU as you mentioned. We compiled a private build of unixODBC with fixes and verified that it resolves both scenarios. Connections recovered better and never got into a persistently errored state after the changes. We submitted a pull request for the changes to unixODBC here: https://github.com/lurcher/unixODBC/pull/16

You may not want to wait for the next unixODBC release for testing so there are a couple options:

  1. You can get the source from the PR above and build it yourself. That’s what we ran and would be equivalent to 2.3.7 prerelease + our changes.
  2. We can give you a tar.gz copy of the unixODBC-2.3.6 source that just has the changes we submitted in our PR.

Regards, David