rust-snappy: Seems to be incompatible with python and java versions of snappy

A simple working example demonstrating that szip and Python’s python-snappy wheel are incompatible:

* Assume a file "data.json" exists
* In python3, write
```
import snappy
data = open('data.json', 'r').read()
c = snappy.compress(data.encode('UTF-8'))
open('data.json.sz', 'wb').write(c)
```
* This results in a binary file on disk. Running `szip -d data.json.sz` yields:
```
$ szip -d out.json.sz
out.json.sz: snappy: corrupt input (expected stream header but got unexpected chunk type byte 148)
```
* However, python's snappy thinks the file is fine; the following gives the original contents right back:
```
import snappy
snappy.decompress(open('data.json.sz', 'rb').read())
```

So szip cannot interpret python's compressed output. The reverse is also true:

* Assume a file "data.json" exists (and data.json.sz does not)
* Run `szip data.json`
* In python3, write the following, which fails with the error shown after it:
```
import snappy
data = open('data.json.sz', 'rb').read()
c = snappy.decompress(data)
```
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/snappy/snappy.py", line 92, in uncompress
    return _uncompress(data)
snappy.UncompressError: Error while decompressing: invalid input
```

I should add this isn’t just a quirk of python; the Java library I’m using ( https://github.com/xerial/snappy-java ) works fine and is interoperable with python’s snappy. So rust-snappy seems to be the odd one out.

I don’t know much about the internals of this algorithm but maybe it’s possible they’re implementing different specifications of the snappy format?

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 19 (9 by maintainers)

Most upvoted comments

Thanks @BurntSushi for taking the time to answer here, and of course for making the library in the first place.

Agreed, I came here advocating a similar API for raw as there is for framed; it was at least on topic for this thread, but it has anyway descended into a narrow winding road. Therefore, I’ll concede to your expertise and take some time to reflect on your informative response and probably life choices in general…

If I boil down a feature request for a specific use case, I’ll open a new issue like a normal person. 😉

Thanks for your patience.

Hi @BurntSushi I just read your blog post and it really spoke to me, so I would like to say a little bit about how we used this crate at TriNetX and how much it helped us.

We originally had a python microservice that did some CPU- and memory-intensive data munging. Somewhat suddenly, the size of data it needed to process multiplied by a factor of 20, and it simply couldn’t keep up. Not wanting to scale up our instance size for no reason, we rewrote it in Rust. The memory benefits were immense, but the total processing time didn’t go down enough. Profiling showed that 60% of our time was spent gzipping and g-unzipping data, which is simply not acceptable.

A few google searches later and we discovered snappy, but didn’t realize that snappy was actually two incompatible formats (framed and un-framed) since the python and java libraries both had compatible defaults. When I got to getting a rust snappy library, the bytes simply weren’t compatible, and there was no obvious reason why, so I made this issue.

In a very short span of time, you helped me to diagnose the issue, understand the underlying issues, and fix everything so that it’s compatible. Your library now works beautifully in production, and upon profiling, we spend approximately 5% of our time de/compressing data, since your library is so fast. We were able to significantly improve the user experience and increase the amount of data they can request at a time, allowing them to do better and more legitimate medical data analyses.

I don’t know if this issue was one of the “low effort negative issues” that you get faced with, or simply some frustrating homework dropped on you by a stranger on the internet, but your crate – and therefore you as a maintainer – had a huge positive effect on me, and on my team. Thank you.

That’s a good point. When I started with this I didn’t realize snappy had two formats (framed and non-framed). For whatever reason this project defaults to framed and the python and java projects default to non-framed, and it’s not obvious (until pointed out) how to deal with that.

I’ve converted everything to framed and it seems to work quite well, so I’ll close the issue. Thanks.
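For anyone landing here with the same confusion, the two formats can usually be told apart from the first few bytes. A framed stream opens with a fixed stream identifier chunk (`0xff`, a 3-byte length of 6, then the ASCII bytes `sNaPpY`), while a raw block opens with the uncompressed length encoded as a varint. The sketch below uses only the standard library; `sniff_snappy` and `read_varint` are hypothetical helper names, and the check is a heuristic, not a validator.

```python
# Stream identifier chunk that opens a framed Snappy stream:
# chunk type 0xff, 3-byte little-endian length 6, then "sNaPpY".
FRAMED_MAGIC = b"\xff\x06\x00\x00sNaPpY"


def read_varint(data: bytes) -> int:
    """Decode the little-endian base-128 varint at the start of `data`."""
    value = 0
    # A raw Snappy length fits in 32 bits, so at most 5 varint bytes.
    for shift, byte in zip(range(0, 35, 7), data):
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear: this was the last varint byte
            return value
    raise ValueError("truncated or oversized varint")


def sniff_snappy(data: bytes) -> str:
    """Guess whether `data` is framed or raw Snappy (heuristic only)."""
    if data.startswith(FRAMED_MAGIC):
        return "framed"
    # Assumption: anything else is treated as a raw block.
    return f"raw (claims {read_varint(data)} uncompressed bytes)"
```

This lines up with the szip error in the original report: byte 148 (`0x94`) is not the `0xff` identifier szip expects, but it is a plausible first byte of a raw block's length varint.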

szip uses the snappy frame format by default. It’s not clear to me from this bug report whether you’ve accounted for that or not. Sounds like you should be using the stream compression functions in python-snappy, or using the raw format in szip.
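A minimal sketch of the first suggestion, assuming python-snappy is installed and that its `stream_compress` and `stream_decompress` functions read from and write to file-like objects. The wrapper names `compress_framed` and `decompress_framed` are hypothetical, and this has not been checked against every python-snappy release.

```python
import io


def compress_framed(data: bytes) -> bytes:
    """Compress `data` into the Snappy frame format (hypothetical helper)."""
    import snappy  # python-snappy; imported lazily so the sketch loads without it

    dst = io.BytesIO()
    snappy.stream_compress(io.BytesIO(data), dst)
    return dst.getvalue()


def decompress_framed(data: bytes) -> bytes:
    """Decompress a framed Snappy stream back to the original bytes."""
    import snappy

    dst = io.BytesIO()
    snappy.stream_decompress(io.BytesIO(data), dst)
    return dst.getvalue()
```

Output written this way should be what `szip -d` expects by default, and `decompress_framed` should accept a file produced by plain `szip`. The plain `snappy.compress` and `snappy.decompress` calls from the original report operate on the raw format instead.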