okhttp: okhttp fails with IOException: gzip finished without exhausting source but GZIPInputStream works

What kind of issue is this?

This issue can’t be reproduced in a test. I’ll do my best to explain.

>> GET http://myserver.mycompany.com/.../businesses.20180104.json.gz
<< 200 OK
<< connection -> [keep-alive]
<< accept-ranges -> [bytes]
<< content-disposition -> [attachment; filename="businesses.20180104.json.gz"; filename*=UTF-8''businesses.20180104.json.gz] 
<< content-type -> [application/x-gzip]
<< content-length -> [3384998203]
<< date -> [Fri, 05 Jan 2018 00:43:32 GMT]
<< etag -> [0e49d5fa7ba9f68058bfbb4a98bef032c3a73871]
<< last-modified -> [Thu, 04 Jan 2018 23:54:26 GMT]
<< x-artifactory-id -> [9732f56568ea1e3d:59294f65:160b8066066:-8000]
<< x-checksum-md5 -> [451ca1b1414e7b511de874e61fd33eb2]
<< x-artifactory-filename -> [businesses.20180104.json.gz]
<< server -> [Artifactory/5.3.0]
<< x-checksum-sha1 -> [0e49d5fa7ba9f68058bfbb4a98bef032c3a73871]

As you can see, the server doesn’t set a Content-Encoding = gzip header, so I do that in an interceptor.
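The actual interceptor isn't shown in the report, so the following is only a sketch of one way it could be done: a network interceptor that re-labels the response so the body is treated as gzip downstream (the class name ForceGzipEncodingInterceptor is invented for this example).

import okhttp3.Interceptor;
import okhttp3.Response;

import java.io.IOException;

// Illustrative only: re-label the response so the body is treated as gzip downstream.
class ForceGzipEncodingInterceptor implements Interceptor {
    @Override public Response intercept(Chain chain) throws IOException {
        Response response = chain.proceed(chain.request());
        return response.newBuilder()
            .header("Content-Encoding", "gzip")
            .removeHeader("Content-Length") // the compressed length no longer describes the decoded body
            .build();
    }
}

// Registered as a network interceptor, e.g.:
// new OkHttpClient.Builder().addNetworkInterceptor(new ForceGzipEncodingInterceptor()).build();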

Each record is a newline-delimited JSON string that is inserted into Couchbase. There are around 12 million records in total. Using okhttp, processing fails after about 130,000 records with the following exception:

Caused by: java.io.IOException: gzip finished without exhausting source
	at okio.GzipSource.read(GzipSource.java:100)
	at okio.RealBufferedSource$1.read(RealBufferedSource.java:430)

However, if I don’t set the Content-Encoding header (thus skipping GzipSource) and instead wrap the response’s input stream in a GZIPInputStream, everything works as expected. I’ve also tried setting Transfer-Encoding = chunked on the response and removing the Content-Length header, but to no avail.
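Reduced to a sketch, the working variant looks roughly like this (the URL path is a placeholder and the per-record Couchbase handling is replaced by a counter):

import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

import static java.nio.charset.StandardCharsets.UTF_8;

public class ManualGunzip {
    public static void main(String[] args) throws Exception {
        OkHttpClient client = new OkHttpClient();
        Request request = new Request.Builder()
            .url("http://myserver.mycompany.com/placeholder/businesses.20180104.json.gz") // placeholder path
            .build();

        // No Content-Encoding header is added, so OkHttp hands back the raw gzip bytes.
        try (Response response = client.newCall(request).execute();
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                 new GZIPInputStream(response.body().byteStream()), UTF_8))) {
            long records = 0;
            while (reader.readLine() != null) {
                records++; // each line is one newline-delimited JSON record (Couchbase insert omitted)
            }
            System.out.println("records read: " + records);
        }
    }
}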

So the question is: if GZIPInputStream doesn’t have a problem, why does GzipSource? And since it does, why doesn’t it report what it thinks the issue is? I have a test that runs against a smaller file of 100 records, and it works.

I’ve seen https://github.com/square/okhttp/issues/3457, but unlike that reporter, I can’t capture the hex body of a 3.4 GB stream.

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 3
  • Comments: 44 (12 by maintainers)

Most upvoted comments

I introduced this behavior and can explain it.

Gzip is a self-terminating format. The content of the stream itself indicates when you’ve read everything.

If ever there’s data beyond the self-reported end, this data is effectively unreachable. This is potentially problematic for two reasons:

  • HTTP/1 connection pooling. If we don’t consume the entire response body of call N, we can’t use the connection for call (N+1).
  • HTTP response caching. We only persist response values once they’re completely downloaded.

I made things strict to help detect problems like this. It’s possible this check is too strict and we should silently ignore the extra data.
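A minimal sketch of that strict behavior (not from this issue): any bytes left over after the gzip trailer trigger the exception.

import okio.Buffer;
import okio.BufferedSink;
import okio.GzipSink;
import okio.GzipSource;
import okio.Okio;

public class StrictGzipDemo {
    public static void main(String[] args) throws Exception {
        Buffer wire = new Buffer();

        // One complete gzip member...
        BufferedSink gzip = Okio.buffer(new GzipSink(wire));
        gzip.writeUtf8("{\"id\":1}\n");
        gzip.close();

        // ...followed by bytes beyond the self-reported end of the stream.
        wire.writeUtf8("trailing data the gzip stream can never reach");

        // Throws java.io.IOException: gzip finished without exhausting source.
        Buffer out = new Buffer();
        try (GzipSource gunzip = new GzipSource(wire)) {
            while (gunzip.read(out, 8192) != -1) { }
        }
    }
}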

@yschimke I think there was a mistake in how I took the previous hex dump. This time, I did the following:

1. protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) throws Exception {
2.    if (finished) {
3.        in.skipBytes(in.readableBytes());
4.        return;
5.    }
    ...
}

I put a breakpoint on line 3 of JdkZlibDecoder.decode above, and for every invocation of it I dumped the contents of the ByteBuf to a file by manually invoking the following method that I wrote: ByteBufUtils.dumpByteBuf("yelp-dump.txt", in)

// Requires static imports of ByteBufUtil.appendPrettyHexDump, Files.newBufferedWriter,
// StandardCharsets.UTF_8, and StandardOpenOption.{APPEND, CREATE}.
public static void dumpByteBuf(String out, ByteBuf msg) {
    // Render the buffer's readable bytes as Netty's pretty hex dump.
    StringBuilder buf = new StringBuilder(StringUtil.NEWLINE);
    appendPrettyHexDump(buf, msg);

    // Append the dump to the output file, creating it on first use.
    try (BufferedWriter w = newBufferedWriter(Paths.get(out), UTF_8, APPEND, CREATE)) {
        w.write(buf.toString());
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}

That produced the attached dump. I see that it starts with 1f 8b, and that the same sequence appears more than once. Does this prove my theory of multiple streams?

yelp-dump.txt
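If the multiple-streams theory is right, the difference between the two decoders can be sketched like this: a modern GZIPInputStream keeps reading when another 1f 8b member follows a trailer, while GzipSource stops after the first member and then finds unread bytes. This is an illustration with made-up payloads, not the actual dump.

import okio.Buffer;
import okio.BufferedSink;
import okio.GzipSink;
import okio.GzipSource;
import okio.Okio;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;

public class MultiMemberGzipDemo {
    public static void main(String[] args) throws Exception {
        // Two concatenated gzip members, each starting with the 1f 8b magic bytes.
        Buffer wire = new Buffer();
        for (String chunk : new String[] {"{\"id\":1}\n", "{\"id\":2}\n"}) {
            BufferedSink sink = Okio.buffer(new GzipSink(wire));
            sink.writeUtf8(chunk);
            sink.close(); // close() finishes one complete member
        }
        byte[] bytes = wire.readByteArray();

        // GZIPInputStream continues into the next member after a trailer.
        ByteArrayOutputStream decoded = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            int b;
            while ((b = in.read()) != -1) decoded.write(b);
        }
        System.out.println("GZIPInputStream decoded " + decoded.size() + " bytes");

        // GzipSource reads only the first member, then sees leftover bytes and throws
        // java.io.IOException: gzip finished without exhausting source.
        Buffer out = new Buffer();
        try (GzipSource gunzip = new GzipSource(new Buffer().write(bytes))) {
            while (gunzip.read(out, 8192) != -1) { }
        }
    }
}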