okhttp: Infinite loop in connection retry
For some reason (I haven’t been able to reproduce this locally yet), under heavy load some of our okhttp clients might get into an infinite loop (eating up one or more CPU cores) when connection failure retry is enabled. It seems this affects some connections only in the connection pool.
If I disable connection failure retry then a SocketException
is thrown at https://github.com/square/okhttp/blob/master/okhttp/src/main/java/okhttp3/internal/connection/RealConnection.kt#L594 with the following stack trace:
java.net.SocketException: Socket is closed at java.net.Socket.setSoTimeout(Unknown Source)
at sun.security.ssl.BaseSSLSocketImpl.setSoTimeout(Unknown Source)
at sun.security.ssl.SSLSocketImpl.setSoTimeout(Unknown Source)
at okhttp3.internal.connection.RealConnection.newCodec$okhttp(RealConnection.kt:594)
at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:83)
at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:245)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:82)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:74)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:197)
at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:502)
In this case a certain percentage of the requests fail constantly suggesting a stuck pooled faulty connection which is never evicted.
The only option is to restart the affected application.
I think the cause is related to request cancellation and concurrency. Might be the socket is closed right after creation by a different thread?
This issue has been observed since at least 3.14.0. Also happening with latest: 4.4.0
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 22 (1 by maintainers)
Commits related to this issue
- Add test demonstrating rogue connection seeder. https://github.com/square/okhttp/issues/5819 — committed to square/okhttp by dave-r12 4 years ago
- Flag connection as unhealthy when underlying socket is closed. https://github.com/square/okhttp/issues/5819 — committed to square/okhttp by dave-r12 4 years ago
- Flag connection as unhealthy when underlying socket is closed. https://github.com/square/okhttp/issues/5819 — committed to square/okhttp by dave-r12 4 years ago
- Add test demonstrating rogue connection seeder. https://github.com/square/okhttp/issues/5819 — committed to square/okhttp by dave-r12 4 years ago
Thanks for the release. Haven’t seen the issue with 4.5.0 for a week now.
@danielimre ahh right. I spotted one code path that might explain this. But it’s pretty extraordinary.
When the application gets in this state are any requests making it to the server? Or no requests make it?
Thanks @dave-r12 . Meanwhile, I also figured out. I tried several scenario but couldn’t reproduce the exact same issue locally. My best guess is still some race condition in error tracking and request cancellation. As the issue only surfaces when connection retries are enabled a possible workaround would be to limit the maximum retries. e.g. in
okhttp3.internal.http.RetryAndFollowUpInterceptor
.Unfortunately, I couldn’t spend more time on this issue and had to look for an alternative 😦
Thanks for the report. I’m disappointed to learn you’re seeing this with 4.4. With that update we tried to address issues similar to this.
Will try to reproduce and fix with urgency!