okhttp: java.net.SocketTimeoutException from HTTP/2 connection leaves dead okhttp clients in pool
Tried writing a unit test w/ TestButler on Android w/ no luck, so I’ll write up the steps to reproduce this and include some sample code. This happens if you connect to an HTTP/2 server and your network goes down while the okhttp client is connected to it:
- create an okhttp client
- tell it to read from the HTTP/2 server
- bring the network down
- tell it to read from the HTTP/2 server (it’ll get a SocketTimeoutException)
- bring the network back up
- tell it to read from the HTTP/2 server again (it’ll be stuck w/ SocketTimeoutExceptions)
- if you create new http clients at this point, it’ll work, but the dead http client will eventually come back in the pool and fail.
okhttp client should attempt to reopen the HTTP/2 connection instead of being stuck in this state
Code sample for Android (create a trivial view w/ a button and a textview):
public class MainActivity extends AppCompatActivity {
    OkHttpClient okhttpClient = new OkHttpClient();

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        Button loadButton = (Button) findViewById(R.id.loadButton);
        TextView outputView = (TextView) findViewById(R.id.outputView);
        loadButton.setOnClickListener(view -> Observable.fromCallable(() -> {
                    Request request = new Request.Builder()
                            .url(<INSERT URL TO YOUR HTTP/2 SERVER HERE>)
                            .build();
                    Response response = okhttpClient.newCall(request).execute();
                    return response.body().string();
                })
                .subscribeOn(Schedulers.io())
                .observeOn(AndroidSchedulers.mainThread())
                .subscribe(outputView::setText, t -> outputView.setText(t.toString()))
        );
    }
}
About this issue
- State: closed
- Created 7 years ago
- Reactions: 31
- Comments: 148 (21 by maintainers)
Commits related to this issue
- Configure http/2 ping interval Help reap zombie connections. https://github.com/square/okhttp/issues/3146#issuecomment-418168869 — committed to spotify/styx by deleted user 6 years ago
- Added workaround for okhttp issue https://github.com/square/okhttp/issues/3146 — committed to snabble/Android-SDK by ajungg 5 years ago
- Force HTTP/1.1 in the player OkHttp has a bug leading to the player hanging when rapidly seeking through the video: https://github.com/square/okhttp/issues/3146. — committed to proxer/ProxerAndroid by rubengees 5 years ago
- TaskRunner, an abstraction over ExecutorService I want to tighten up our executors for a few reasons - Fix daemon vs. non-daemon problems - Fix code unloading problems - Be able to wait for async ... — committed to square/okhttp by swankjesse 5 years ago
- Degrade connections after a timeout This is based roughly on the 'Degraded Connections' proposal here https://github.com/square/okhttp/issues/3146#issuecomment-471196032 I'm using 1000 ms instead of... — committed to square/okhttp by swankjesse 5 years ago
- Degrade connections after a timeout (3.14.x branch) This is a manual cherry-pick of 09da07c2c8981f88346adb818ce42512d9f2f288 See also the degraded connections proposal. https://github.com/square/okh... — committed to square/okhttp by swankjesse 4 years ago
- Degrade connections after a timeout (3.12.x branch) This is a cherry-pick of 6a9a64c8f131b33bdd9b7077ce4e2456db0dcd19 See also the degraded connections proposal. https://github.com/square/okhttp/iss... — committed to square/okhttp by swankjesse 4 years ago
- Workaround for https://github.com/square/okhttp/issues/3146 — committed to SonarSource/orchestrator by henryju 4 years ago
- Bump okhttp3 to 3.14.9. According to https://github.com/square/okhttp/issues/3146#issuecomment-569986444 an issue with stale connections that caused SocketTimeoutException errors was fixed in 3.14.5. — committed to atlassian-labs/atlassian-slack-integration-server by utluiz 3 years ago
- Bump okhttp3 to 3.14.9. According to https://github.com/square/okhttp/issues/3146#issuecomment-569986444 an issue with stale connections that caused SocketTimeoutException errors was fixed in 3.14.5. — committed to mgoyal2-atl/atlassian-slack-integration-server by utluiz 3 years ago
- Workaround for OkHttp Interrupt issues. Relates to https://github.com/square/okhttp/issues/3146. This was from https://github.com/androidx/media/pull/71. There is a draft PR https://github.com/squar... — committed to androidx/media by yschimke 2 years ago
- Workaround for OkHttp Interrupt issues. Relates to https://github.com/square/okhttp/issues/3146. This was from https://github.com/androidx/media/pull/71. There is a draft PR https://github.com/squar... — committed to google/ExoPlayer by yschimke 2 years ago
- [#185024471] Initial Subscription inaccuracies still exist - [x] close down all cached connections upon socket time out (see)[https://github.com/square/okhttp/issues/3146] - [x] units and impls — committed to xenonview-com/view-java-sdk by lwoydziak a year ago
I think I’m seeing another manifestation of this on 3.5.0, when the server forcibly closes the connection.
We try to establish both an h2 and an HTTP/1.1 connection. The server responds with 200 to both:
Then at some point we try to read from the HTTP/2 connection, which fails in checkNotClosed and throws a StreamResetException.
Then, since this is media, we do something that causes a seek to 0 in the media, which needs to reopen the request from the beginning. At this point, we see the same exception as is posted above:
This seems very similar to the other cases here, which all seem to be related to an ungraceful shutdown of a connection that nevertheless remains pooled.
I’ve also confirmed that disabling the ConnectionPool “works around” this issue:
Also still getting this problem on an emulator with API 22 and 3.14.4. I also get the SocketTimeoutException after 2 minutes (what my readTimeout is set to) instead of 10 seconds (what my connectTimeout is set to). The workaround using .connectionPool(new ConnectionPool(0, 1, TimeUnit.NANOSECONDS)) still works. I’d say it’s time to re-open this 😦. Steps to reproduce are the same as the OP's.
I can confirm the issue doesn’t exist when using a real device (Note 9, API 29).
Degraded Connections
Here’s a proposal for a fix.
When the HTTP/2 reader hasn’t received any frames for 500 ms and a stream times out on a read, we degrade the HTTP/2 connection by setting a new degraded field to true. The stream remains degraded until any data is received. The connection pool will not return degraded connections. Instead it will establish new connections.
When a connection becomes degraded we also send a degraded ping and set a new awaitingDegradedPong field to true. We have at most one degraded ping in flight at a time. The motivation of this ping is to trigger a pong to be received.
500 ms?
Thrashing in and out of the degraded state will be bad for performance if a busy connection has a few bad streams. If the connection has received something within 500 ms, it’s likely a bad stream and not a bad connection.
Interaction with Ping Interval?
The pings here are independent of the OkHttpClient’s pingInterval, if one is set.
Drawbacks
The HTTP/2 code is pretty busy already, and this adds more. Keeping a timestamp of the most recent frame could be particularly annoying. We should use nanoTime(), not currentTimeMillis() for this.
This addresses read timeouts only. We can’t ping our way out of write timeouts; the pings will be queued up behind other outbound data! I need to study this further.
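The bookkeeping described in this proposal can be sketched in plain Java. This is only an illustration under stated assumptions, not OkHttp's actual implementation: the class name DegradedConnectionTracker, its method names, the injected clock, and the pingsSent counter (which stands in for writing a real PING frame) are all hypothetical. Only the degraded and awaitingDegradedPong fields, the 500 ms threshold, and the use of nanoTime() come from the proposal.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical sketch of the proposal's state; not OkHttp's real code. */
final class DegradedConnectionTracker {
    // Threshold from the proposal: no frames received for 500 ms.
    static final long DEGRADED_THRESHOLD_NANOS = TimeUnit.MILLISECONDS.toNanos(500);

    // Monotonic clock. Pass System::nanoTime in real use; injectable so it can be tested.
    private final LongSupplier nanoClock;
    private long lastFrameAtNanos;
    private boolean degraded;
    private boolean awaitingDegradedPong;
    private int pingsSent; // assumption: stands in for actually sending a PING frame

    DegradedConnectionTracker(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
        this.lastFrameAtNanos = nanoClock.getAsLong();
    }

    /** Called whenever the reader receives any frame, including a pong. */
    synchronized void onFrameReceived() {
        lastFrameAtNanos = nanoClock.getAsLong();
        degraded = false;
        awaitingDegradedPong = false;
    }

    /** Called when a stream times out on a read. */
    synchronized void onStreamReadTimeout() {
        long idleNanos = nanoClock.getAsLong() - lastFrameAtNanos;
        // A recent frame suggests a bad stream rather than a bad connection.
        if (idleNanos < DEGRADED_THRESHOLD_NANOS) return;
        degraded = true;
        // At most one degraded ping in flight at a time.
        if (!awaitingDegradedPong) {
            awaitingDegradedPong = true;
            pingsSent++;
        }
    }

    /** The pool would skip this connection while it is degraded. */
    synchronized boolean isEligibleForPool() {
        return !degraded;
    }

    synchronized int pingsSent() {
        return pingsSent;
    }
}
```

Injecting the clock keeps the 500 ms rule testable without sleeping. As the drawbacks note, a sketch like this covers read timeouts only.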
FYI, we found a workaround…set the connectionPool in the builder so it uses a new connection pool w/ a size of zero and also turn off HTTP/2 support by setting a new protocolList in the builder with only HTTP/1.1 support.
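A minimal sketch of that workaround, assuming a reasonably recent OkHttp 3.x or 4.x on the classpath. The class name WorkaroundClient is made up for illustration; the ConnectionPool arguments are the ones quoted elsewhere in this thread.

```java
import java.util.Collections;
import java.util.concurrent.TimeUnit;
import okhttp3.ConnectionPool;
import okhttp3.OkHttpClient;
import okhttp3.Protocol;

/** Hypothetical holder for the workaround configuration described above. */
final class WorkaroundClient {
    static OkHttpClient create() {
        return new OkHttpClient.Builder()
                // Keep zero idle connections so a dead connection is never reused.
                .connectionPool(new ConnectionPool(0, 1, TimeUnit.NANOSECONDS))
                // Offer only HTTP/1.1 so HTTP/2 is never negotiated.
                .protocols(Collections.singletonList(Protocol.HTTP_1_1))
                .build();
    }
}
```

This gives up connection reuse and HTTP/2 multiplexing entirely, so every request pays for a fresh connection. Other comments in this thread also report heavy allocation with the pool disabled.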
This is fixed in 4.3. Keeping this open until I backport #5638 to 3.12.x and 3.14.x.
Guys, be aware of a side effect of the temporary fix of disabling the connection pool cache.
We began to receive a lot of complaints from our users about our app hanging, so we started to explore and profile the app to find the problem. After a lot of searching we found that our app was allocating a lot of objects in a short amount of time. First we saw a lot of these logs from the garbage collector
Then we found this when profiling the app
Normally you would find primitive types like “int”, “char”, etc. in the first position of a dump.
Every time we make a request, a new connection is put in the ConnectionPool, which triggers a cleanupRunnable, which in turn runs a while(true) loop. Inside this infinite loop a cleanUp() method is called, which loops over the connections list using an iterator of an ArrayDeque that creates a new Deque object every time it is called, thus allocating Deque objects without mercy. Because of the rate of object creation, the GC kicks in many times to try to free up memory, and this had a side effect: it was blocking our app’s background threads, thus blocking the app flow.
The GC was in concurrent mode, and this mode does not block app threads, but in reality they were being blocked anyway.
These allocated Deque objects will eventually be destroyed by the GC, but the issue here is the rate of object creation, which triggers the GC many times whenever an HTTP request is made.
Still having problems with 3.12.12 on Samsung Galaxy A7 (2018) SM-A750FN/DS, Android 10 (One UI 2.0).
Unless I set custom parameters as mentioned above:
okhttp 3.11: the SocketTimeoutException is still not fixed; it still appears.
Thanks for the repro.
You don’t need to disable the connectionPool; just run the following code inside your BroadcastReceiver when the network changes
Since SocketTimeoutException is a subclass of IOException, I am not sure whether the following approach will work. Can one of you please get back to me?
I am calling evictAll() in the catch block of IOException.
Also how do we check if a connection is stale or not?
With Apache HttpClient, there is a way to do it to set a flag for checking stale connections. Wondering how OkHttp3 checks for it internally before it uses the connection.
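The evictAll() idea raised above could look roughly like the sketch below. It is an untested assumption, not a confirmed fix: the class name EvictOnTimeout and the single retry are hypothetical. ConnectionPool.evictAll() only closes idle connections, so whether it removes the stale connection depends on that connection being idle when the exception is caught.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

/** Hypothetical helper: evict pooled connections after a timeout, then retry once. */
final class EvictOnTimeout {
    private final OkHttpClient client;

    EvictOnTimeout(OkHttpClient client) {
        this.client = client;
    }

    String fetch(String url) throws IOException {
        Request request = new Request.Builder().url(url).build();
        try {
            return execute(request);
        } catch (SocketTimeoutException e) {
            // Closes and removes idle connections from this client's pool.
            client.connectionPool().evictAll();
            // Retry once; a second failure propagates to the caller.
            return execute(request);
        }
    }

    private String execute(Request request) throws IOException {
        // try-with-resources closes the response body in every case.
        try (Response response = client.newCall(request).execute()) {
            return response.body().string();
        }
    }
}
```

Catching SocketTimeoutException rather than the broader IOException keeps the eviction from firing on unrelated failures such as DNS errors.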
In the last month, since we had this issue crop up, we had 14 occurrences, across 5 OS versions, 6 manufacturers and 12 models.
OS Versions:
- Android 12 - 5 instances
- Android 10 - 4 instances
- Android 11 - 2 instances
- Android 8.1.0 - 2 instances
- Android 9 - 1 instance
Models:
- Archos Alba - 2 instances
- Samsung Galaxy A52s 5G - 1 instance
- Xiaomi 11T Pro - 1 instance
- Xiaomi Poco X3 NFC - 1 instance
- Google Pixel 4A - 1 instance
- Samsung Galaxy A12 - 1 instance
- Samsung Galaxy S20 FE - 1 instance
- Samsung Galaxy S8 - 1 instance
- Samsung Galaxy S9 - 1 instance
- Samsung Galaxy S9+ - 1 instance
- Sony Xperia 10 III - 1 instance
- Motorola E7 Power - 1 instance
I’ve just pushed an update to our users, changing the connection pool and protocols, as per one of the first posts.
I’m unable to provide any more info for the time being; we’ve mostly run into this issue when using our ForceUpdateInterceptor to, well, force our users to update their application. Here is the code snippet:
I’ll report back whether the issue still occurs with the aforementioned suggestion.
This was all with OkHttp version 5.0.0-alpha.7 and previous alphas.
I think the correct fix for now is in Media3/ExoPlayer, adding an explicit response.close()
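For callers outside ExoPlayer, closing the response explicitly can be sketched with try-with-resources, since okhttp3.Response implements Closeable. The class and method names here are hypothetical, and this is not the Media3/ExoPlayer change itself.

```java
import java.io.IOException;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

/** Hypothetical example of always closing an OkHttp response. */
final class ClosingCaller {
    static int statusOf(OkHttpClient client, String url) throws IOException {
        Request request = new Request.Builder().url(url).build();
        // The response is closed on normal return and on exceptions alike.
        try (Response response = client.newCall(request).execute()) {
            return response.code();
        }
    }
}
```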
Hi @swankjesse ,
I am able to reproduce this issue on ExoPlayer v2.15.1 (OkHttp v4.9.1).
It is quite easy to reproduce, even on the ExoPlayer demo app. FYI @ojw28
I tried ExoPlayer’s demo content: https://storage.googleapis.com/wvmedia/clear/hevc/tears/tears.mpd And I force the OkHttp client’s protocol,
I can see in Charles that the protocol is HTTP/2.0 and ALPN is h2.
Steps to reproduce:
@vellrya if you can reproduce this, we can fix it. As is it’s unclear what the cause is, and even if it’s in OkHttp and not the OS itself.
Problem still exists in OkHttp 4.2.0, 3.14.3, 3.12.5 - checked on genymotion emulator (turn on and off airplane mode)
@swankjesse but I think this (a timed ping) is a bad solution. I think that if an exception occurs, the connection is most probably broken. You say “There are ways a stream will time out that don’t signal a connectivity problem.” That situation may occur when the read byteCount is set too large, but it is very rare. Code:
So I insist that the connection should be released when a TimeoutException occurs. Alternatively, a ping could be executed when the TimeoutException occurs.
I can confirm this is still an issue.
It would be great to get a fix for this. Any release date?
Pretty sure this issue is another manifestation of this one:
https://github.com/square/okhttp/issues/3118
@robertszuba I’ll take a further look, since I was able to repro with ExoPlayer with the same symptom, I was focusing on that.
It’s likely these are two separate bugs in that case.
Your repro seems quite simple, I’ll try to reproduce with it on the weekend and get back to you.
I can’t repro with pings on (still on an emulator), so if you are OK with the additional traffic, keeping the radio awake, etc., that is worth trying.
The more I look into doing smart things in Android, the more I suspect that the Android network engineers know what they are doing and the defaults are pretty good.
So far I just suspect a bug in the emulator network emulation.
We definitely get enough events from Android we can choose to listen to, and actively drop the connection/force close the socket. But it’s non-trivial code.
It might be best implemented as a custom Android SocketFactory, that listens for changes to the active network, and ties each socket to the network at creation time (through either default active network, or by looking at the local address).
I’ve got a good repro in a React Native test app that shows network state and the connection pool, so I’m going to explore the best options to resolve automatically within OkHttp.
After some debugging we found the following (not sure if this helps).
From what I understand, this runnable, which is always running while the connection is healthy, reads from the Http2Connection BufferSource, which then calls an Http2Reader that interprets the frame in the buffered data and calls back the handler, which is the runnable itself. The runnable then finds an Http2Stream by ID to delegate the correct frame information to.
When turning mobile data off and then back on in the Android app, this Http2Connection.ReaderRunnable class stopped working; I could no longer hit a breakpoint in this runnable.
When the phone has been running in the background for a few minutes, the socket is essentially disconnected, but RealConnection.isHealthy() is true. All requests fail with a TimeoutException at this point, the connection always stays in the connection pool, and subsequent requests also fail with a TimeoutException. The app must be killed and restarted to resolve this.
@swankjesse Thanks for the quick response. We tried setting pingInterval(1, TimeUnit.SECONDS) and it seems to be behaving properly now. I don’t want to say it’s fixed yet as we need to do more testing, but will report back after a few days.
@c0dehunter try setting a ping interval on your OkHttpClient?
https://square.github.io/okhttp/3.x/okhttp/okhttp3/OkHttpClient.Builder.html#pingInterval-long-java.util.concurrent.TimeUnit-
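A minimal sketch of the ping interval suggestion, assuming an OkHttp version that provides OkHttpClient.Builder.pingInterval. The class name PingingClient is hypothetical, and the 1 second value simply mirrors the comment above rather than being a recommendation.

```java
import java.util.concurrent.TimeUnit;
import okhttp3.OkHttpClient;

/** Hypothetical holder for a client that pings its HTTP/2 connections. */
final class PingingClient {
    static OkHttpClient create() {
        return new OkHttpClient.Builder()
                // Send HTTP/2 pings at this interval; a missing pong fails the connection.
                .pingInterval(1, TimeUnit.SECONDS)
                .build();
    }
}
```

As noted elsewhere in this thread, pings cost extra traffic and keep the radio awake on mobile devices, so a longer interval may be a better trade-off.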
it clears out on the second call. It looks like what happens is the pool gets zombie connections. Next time you grab one of the zombies out of the pool, it throws that exception but is removed. The original bug was that the zombies got stuck in the pool. That said, this isn’t great behavior either, so we’ve just left the pool size at zero…
Any updates for a fix?
@alessandrojp do you know if the ExoPlayer team is aware of this issue? We’ve run into it only with exoplayer as well.
So our attempts to write to the socket are failing silently? Might need to steal the automatic pings that we added for web sockets.