mysql: packets.go: read tcp [ip-address]: connection timed out
UPDATE: My resolution was to remove all “Idle” connections from the pool. See this comment:
https://github.com/go-sql-driver/mysql/issues/257#issuecomment-53886663
I am currently experiencing a stalled or broken web app after an idle period of between 15 and 48 minutes. The most critical issue is reproduced as follows:
- Visit a URL, any URL on the site, and load the page completely (as in, the page actually loads and the logs show a complete page has loaded).
- Close browser, and wait.
A typical request is logged like this:
2014/07/13 15:29:54 INFO template rendering: index
2014/07/13 15:29:54 METRIC, URL: /, HANDLER TIME: 7.2339ms, CTX TIME: 5.0894ms, TOTAL TIME: 12.3258ms
After a long period (ranging from 15m to 48m) with no interaction, the system suddenly logs the lines below - the web app has been idle the entire time:
[MySQL] 2014/07/13 16:00:09 packets.go:32: read tcp remote-mysql-server-address:3306: connection timed out
[MySQL] 2014/07/13 16:00:09 packets.go:118: write tcp remote-mysql-server-address:3306: broken pipe
2014/07/13 16:00:10 INFO template rendering: index
2014/07/13 16:00:10 METRIC, URL: /, HANDLER TIME: 8.8574ms, CTX TIME: 31m19.2606723s, TOTAL TIME: 31m19.2695329s
Notice that "TOTAL TIME" is 31 minutes and 19 seconds? Also notice the MySQL driver error logged at the same time?
There was no activity / no web request made. The web app was simply idle.
The most critical issue is what comes next after these log messages: _the very next web request stalls completely, never returning a response_:
user@govm1:~$ wget http://localhost
--2014-07-13 17:11:18-- http://localhost/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... Read error (Connection timed out) in headers.
Retrying.
--2014-07-13 17:26:19-- (try: 2) http://localhost/
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html.4'
[ <=> ] 6,310 --.-K/s in 0.001s
2014-07-13 17:26:20 (9.61 MB/s) - `index.html.4' saved [6310]
And it sits idle, with no response, for 15 minutes until wget times out and retries.
Now, if I make a 2nd or 3rd request immediately after this one stalls, or at any time while it is stalled, the Go web app responds and returns a full page for those other requests. No issues. Then the cycle starts over from the last request I make before letting the site sit idle.
After this 15m, you can guess exactly what is logged next:
[MySQL] 2014/07/13 17:26:57 packets.go:32: read tcp remote-mysql-server-address:3306: connection timed out
[MySQL] 2014/07/13 17:26:57 packets.go:118: write tcp remote-mysql-server-address:3306: broken pipe
2014/07/13 17:26:57 INFO template rendering: index
2014/07/13 17:26:57 METRIC, URL: /, HANDLER TIME: 6.8938ms, CTX TIME: 15m39.1718434s, TOTAL TIME: 15m39.1787398s
Another 15m wait time.
I eliminated Windows Azure, the cluster VIP, and the firewall on the Linux VM running the Go web app as suspects, because running wget http://localhost locally on the same box produces the same "stalled" request that never completes and never sends anything back.
There are a number of factors in my web app so I will try to outline them accordingly.
Using:
- Go 1.3
- go-sql-driver/mysql Version 1.2 (2014-06-03)
- Ubuntu 12.04 LTS, ~June 2014 Updates
- Windows Azure
Note that the Linux box running MySQL is different from the Linux box running the cluster of Go apps - they are in separate dedicated Cloud Services. The MySQL VM is a single VM, no clustering.
Here is some related code:
// global handle for our DB
var db *sql.DB

// CLI parameter
var dbdsn string

func init() {
	flag.StringVar(&dbdsn, "dbdsn", "root:root@tcp(localhost:3306)/prod?timeout=5s&tls=false&autocommit=true", "Specifies the MySql DSN connection.")
	flag.Parse()

	var err error
	db, err = sql.Open("mysql", dbdsn)
	if err != nil {
		log.Printf("ERROR in sql.Open(): %v", err)
	}
	//db.SetMaxIdleConns(5)

	// verify the DSN is set up properly
	err = db.Ping()
	if err != nil {
		panic("PANIC when pinging db: " + err.Error()) // proper error handling instead of panic in your app
	}
}
// **********
// * omitted is the Gorilla MUX router and http handler registrations
// **********
func ArticleHandler(w http.ResponseWriter, r *http.Request, c *Context) (err error) {
	m := NewArticle(c)
	id := c.Vars["id"]

	var pid int
	var title, body, excerpt, date, slug, fi, fv, region, region_slug string
	err = db.QueryRow(
		"SELECT p.ID, p.post_title, p.post_content, p.post_excerpt, p.post_date, p.post_name, "+
			"(SELECT fpim.meta_value FROM wp_postmeta fpim WHERE fpim.meta_key = '_wp_attached_file' AND fpim.post_id = (SELECT fpim2.meta_value FROM wp_postmeta fpim2 WHERE fpim2.post_id = p.ID AND fpim2.meta_key = '_thumbnail_id' LIMIT 1) LIMIT 1) AS featured_image, "+
			"(SELECT fpim3.meta_value FROM wp_postmeta fpim3 WHERE fpim3.meta_key = 'fv_video' AND fpim3.post_id = p.ID LIMIT 1) AS featured_video, "+
			"t.name as region, t.slug as region_slug "+
			"FROM wp_posts p "+
			"JOIN wp_term_relationships tr ON tr.object_id=p.ID "+
			"JOIN wp_term_taxonomy tt ON tt.term_taxonomy_id=tr.term_taxonomy_id "+
			"JOIN wp_terms t ON t.term_id=tt.term_id "+
			"WHERE p.post_name=? AND p.post_type='post' AND p.post_status='publish' AND p.post_date <= UTC_TIMESTAMP() "+
			"AND tr.object_id=p.ID AND tt.parent = (SELECT t3.term_id FROM wp_terms t3 WHERE t3.name=? LIMIT 1) LIMIT 1",
		id, RegionsParentCategory).
		Scan(&pid, &title, &body, &excerpt, &date, &slug, &fi, &fv, &region, &region_slug)
	if err != nil {
		if err == sql.ErrNoRows {
			// snipped code for redirects
			// article was not found
			return handleNotFound(w, r, c)
		} else {
			log.Printf("ERROR in .Scan(): %v", err)
		}
	} else {
		m.Region = Region{
			Name: region,
			Slug: region_slug,
		}
		m.Id = pid
		m.Title = title
		m.Body = template.HTML(body) // render the raw html
		m.Excerpt = excerpt
		m.Datetime = date
		m.Slug = slug
		m.FeaturedImageUrl = fi
		m.FeaturedVideoUrl = fv
	}

	web.RenderTemplate(w, "article", m)
	return
}
5 more DB queries per request
In addition to this query, the "Context" you see being passed into the handler runs 4 to 6 additional SQL queries. Therefore, each "article" handler runs about 5 to 7 SQL queries at minimum, using the exact same pattern and the same db global variable you see above.
Timeouts / errors are always on the same DB query
Here’s one of the “context” queries as a comparison:
rows2, err := db.Query(
	"SELECT p.post_title, p.post_name "+
		"FROM wp_posts p "+
		"WHERE p.post_type='page' AND p.post_status='publish' AND p.post_date <= UTC_TIMESTAMP() "+
		"AND p.post_parent = (SELECT p2.ID FROM wp_posts p2 WHERE p2.post_name=? LIMIT 1) "+
		"ORDER BY p.menu_order",
	FooterPagesParentNameSlug)
if err != nil {
	log.Printf("ERROR in AllPages .Query() : %v", err)
} else {
	defer rows2.Close()
	c.AllFooterPages = make([]FooterPage, 0)
	for rows2.Next() {
		var name, slug string
		err := rows2.Scan(&name, &slug)
		if err != nil {
			log.Printf("ERROR in AllPages row.Scan() : %v", err)
		} else {
			p := FooterPage{
				Page: Page{
					Title: name,
					Slug:  slug,
				},
			}
			c.AllFooterPages = append(c.AllFooterPages, p)
		}
	}
}
Nothing special there.
Note that I call defer rows2.Close() only if there was no error. Perhaps that is part of the issue? Under load tests, this particular SQL query is the one that logs errors - either no response, or the MySQL driver timing out.
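For comparison, a variant of that pattern which always releases the connection and also surfaces errors that end the iteration early might look like this (a sketch, not code from the app; rows2.Err() is the standard database/sql call for iteration errors):

rows2, err := db.Query("SELECT p.post_title, p.post_name FROM wp_posts p ...", FooterPagesParentNameSlug)
if err != nil {
	log.Printf("ERROR in AllPages .Query(): %v", err)
	return
}
// Safe to defer unconditionally once err is nil; Close is a no-op
// if the rows were already fully iterated.
defer rows2.Close()
for rows2.Next() {
	var name, slug string
	if err := rows2.Scan(&name, &slug); err != nil {
		log.Printf("ERROR in AllPages row.Scan(): %v", err)
		continue
	}
	c.AllFooterPages = append(c.AllFooterPages, FooterPage{Page: Page{Title: name, Slug: slug}})
}
// rows2.Err() reports anything that terminated the loop early,
// e.g. a connection dropped mid-stream.
if err := rows2.Err(); err != nil {
	log.Printf("ERROR in AllPages iteration: %v", err)
}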
Questions
Why am I getting request timeouts logged in excess of 15 to 30 minutes on an idle site? That seems like a bug in the MySQL driver I am using, possibly holding a connection open. But the last HTTP request was successful and returned a complete page and template.
I even have timeout=5s set in the connection string. Even if it is a problem with the MySQL server, why is a 15-minute request/timeout logged? Where did that request come from?
It could still be a MySQL driver issue blocking the request from completing - or maybe it is being blocked by the dedicated MySQL VM and there is an issue there. If that is the case, why is nothing logged? And what is this seemingly random timeout of 15m to 48m? It is usually 15m or 31m, but sometimes 48m is logged.
The "15m" multiples in those timeouts (at 15m, 31m, and 48m, allowing for some padding in seconds) are very interesting.
Thanks in advance.
About this issue
- Original URL
- State: closed
- Created 10 years ago
- Reactions: 4
- Comments: 58 (25 by maintainers)
Commits related to this issue
- add max idle timeout to fix https://github.com/go-sql-driver/mysql/issues/257 — committed to shusson/variantstore by shusson 7 years ago
I found it to be the state of the connection pool: the pool thinks there is still a connection open when in fact it has been closed remotely by mysqld.
Scroll up for my entire story and debugging. I verified this with netstat, where I saw the Go application holding a connection to port 3306 on the remote server, yet the remote server running MySQL no longer had any open pipes (after the timeout I noted above). I also verified that upon making the application active again, opening many MySQL connections, I saw open pipes on both servers. But again, after some time, the remote mysqld server would drop the open "idle" connection (still not sure whether it is mysqld or Windows Azure doing it).
The root problem is that the net package and/or this MySQL driver does not detect the dropped connection; it thinks there is still an open idle connection when there isn't. So on the next query attempt, the driver uses this stale idle connection that no longer exists, blocks, and eventually times out after a long period. Any additional queries work fine, because the connection pool hands out the one idle connection first and then creates new connections for the rest. The new connections are fine; the idle one is the problem, because it simply isn't connected any more.
I resolved my issues by getting rid of all idle connections. See my posts above for the history and resolutions. But in short:
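In code, that amounts to keeping zero idle connections in the pool. A minimal sketch (SetMaxIdleConns is the standard database/sql call; the trade-off of disabling idle pooling is discussed in the next comment):

// Keep zero idle connections: a connection is closed as soon as it is
// returned to the pool, so a remotely-dropped socket can never be
// handed back out. The cost is a fresh TCP + auth handshake for each
// new burst of queries.
db.SetMaxIdleConns(0)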
Had the same problem. I didn't want to set SetMaxIdleConns to 0, since that would effectively disable connection pooling. What I did instead was change SetConnMaxLifetime to a value less than the setting on the MySQL server that closes idle connections. MySQL was set to close an idle connection after 10 seconds, so I changed it to 9 seconds. Bug was fixed. The default value keeps idle connections as long as possible, so the connection pool thinks a particular connection is alive when in fact MySQL has already closed it. The connection pool then attempts to use the connection and you get the error.
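A minimal sketch of that fix (assuming Go 1.6+, where database/sql gained SetConnMaxLifetime; the 9-second value matches the server-side 10-second idle timeout described above):

// The server closes idle connections after wait_timeout = 10s, so
// retire pooled connections just before the server can kill them.
db.SetConnMaxLifetime(9 * time.Second)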
So I had a theory… Since the MySQL query stalls on execution after the application has been idle for a long period, perhaps Windows Azure does something to that TCP connection. Was there a persistent connection in the Go MySQL driver that maybe Azure doesn't like across cloud services? (A cloud service is a collection of resources for scaling, e.g. all web servers in one cloud service to auto-scale up and down, all database servers in another cloud service, all backend job processing in another cloud service, etc.) Perhaps Azure didn't like a new TCP connection after being idle for so long? So I set out to investigate… Also, SetMaxIdleConns() had me curious…
Sure enough, netstat showed a single persistent connection from my wwwgo app to my MySQL server's IP address and port 3306 - even after letting it sit for several hours without any interaction, it was still open and persistent.
This got me thinking… Maybe Azure didn't like long-running TCP connections with no activity. So I set the pool to keep zero idle connections, as in the sketch above.
Running netstat shows no persistent connection now. And I have not seen the original problem since! I cannot reproduce it any longer! Not only that, but load-testing the site no longer gives the too many files open error on a single-CPU instance, nor do I get the i/o socket error I got with the multi-core MAXCPUS setting. Cool.
So the issue seems to be with a long-running persistent connection that never closes.
I know I know… The next words out of anyone’s mouth are, “Are you sure you are calling rows.Close(), and that you are iterating over the entire collection? That will hold the connection open if you don’t!”
The answer is that I either do Scan() on single rows, or I follow the pattern shown above for rows.Close() and iterating. Note that I call defer rows2.Close() only if there was no error. I have also guaranteed I am always iterating and always calling rows.Close() in this pattern. Some may say, "well, there still may be data in the rows and you should iterate over them." Well, during my load tests today of 1000 concurrent users, off and on in 15m intervals several times, there was only the 1 persistent connection open to MySQL at all times. This tells me that rows.Close() is being called and I am iterating over all rows - else I would have more persistent connections open, right?
Shouldn't a long-running idle connection be dropped?
Entity Framework, for Microsoft SQL Server, doesn’t do this “long running” idle connection pooling. It closes all queries.
If it is to remain open, shouldn’t there be a timed “ping” across the wire to keep it active?
Similar to how Android TCP connections are handled, where you constantly send a ping to keep the connection alive.
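A minimal sketch of that suggestion (hypothetical interval; note that db.Ping() checks out one connection from the pool, so this keeps the pool exercised rather than pinging every idle connection individually):

// Periodically ping the database so an otherwise-idle app keeps
// traffic flowing on at least one pooled connection.
go func() {
	for range time.Tick(30 * time.Second) {
		if err := db.Ping(); err != nil {
			log.Printf("keepalive ping failed: %v", err)
		}
	}
}()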
Regarding your biggest question: read this part of our README and pay special attention to the link at the end. If I guess right, setting that parameter may help you and do what you expect from timeout. What you want is probably one of the server side timeouts, just read through the linked MySQL documentation for the matching one.
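For readers arriving later: the timeout parameter in the DSN above bounds establishing the connection, while later driver releases (after v1.2) added readTimeout and writeTimeout DSN parameters that bound individual I/O calls. A sketch, assuming one of those later versions:

// Hypothetical DSN for a later driver version: timeout bounds the
// dial, while readTimeout/writeTimeout bound each I/O call, so a
// dead socket fails in seconds instead of minutes.
dsn := "root:root@tcp(remote-mysql-server-address:3306)/prod?timeout=5s&readTimeout=30s&writeTimeout=30s"
db, err := sql.Open("mysql", dsn)
if err != nil {
	log.Fatalf("sql.Open: %v", err)
}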
Answer 1: No, you should only Open() and Close() the DB once. It's a connection pool, not a single connection. You only need one per user and server. Your main problem with the code above is the defer Close() in init - do not do that. Defer it in main. The deferred function is called when the scope is left; putting it in init will not help you, as init is done and Close is called before main is executed.
Answer 2: I don't know what problems you have with Prepare, but I guess it's connected with the opening and closing of the connection pool. Prepared statements are invalid when the connection pool they were prepared in is closed.
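A sketch of the shape Answer 1 describes (hypothetical main; the point is only where Open and the deferred Close live):

var db *sql.DB

func main() {
	var err error
	db, err = sql.Open("mysql", dbdsn)
	if err != nil {
		log.Fatalf("sql.Open: %v", err)
	}
	// Deferred in main, the pool stays open for the life of the process;
	// deferred in init, it would already be closed by the time main ran.
	defer db.Close()

	// ... register handlers, then serve ...
	if err := http.ListenAndServe(":80", nil); err != nil {
		log.Println(err)
	}
}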
Please get back to us here if that didn’t help you.
@eexit Did you read this comment?
14400 sec is too long. One minute is enough for most use cases. Even if you configure your entire DC (OS, switch, router, etc.), a TCP connection may be lost for various reasons (bug in router firmware, unstable power voltage, electrical noise, etc.).
(Happened to see this thread due to the Go ticket…)
This sounds like a plain, honest TCP timeout. I don't know much about MySQL or the driver, but what usually happens is this:
1. Something between the peers (a NAT, a firewall, a load balancer - here quite possibly Azure's infrastructure) silently drops the idle connection without sending a RST or FIN to either side, so neither end knows it is gone.
2. The client eventually writes to the dead connection; the write is accepted locally and handed to the kernel.
3. The kernel retransmits the unacknowledged data with exponential backoff until the retransmission limit is reached (on Linux, tcp_retries2 defaults to 15 retries, which works out to roughly 15 to 30 minutes), and only then does the blocked operation fail with "connection timed out" - which matches the timeouts logged above.
The problem with setting a deadline is that for a lot of apps it is either going to be annoyingly low, so you risk failures on genuinely slow queries (for example when the database is under heavier load), or it will still take annoyingly long before the deadline is reached, thus not really solving the problem.
What’s really worth looking into is why step number 1 happens in the first place, and fix that. TCP timeouts and/or reconfiguring the network settings/gear can usually get you quite far. For some apps, deadlines can be nice in addition, but I don’t think they’re usable as a general workaround.
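For completeness: in modern Go (1.8+), this kind of per-query deadline is expressed with a context. A sketch, keeping in mind the caveat above that choosing the value is the hard part:

// Fail the query after 10 seconds instead of hanging for the kernel's
// TCP retransmission timeout (on the order of 15 minutes).
var one int
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
if err := db.QueryRowContext(ctx, "SELECT 1").Scan(&one); err != nil {
	log.Printf("query failed or deadline exceeded: %v", err)
}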