pikepdf: pikepdf tests will fail with qpdf 10.6, but this can be fixed without breaking compatibility

When running pikepdf’s tests against qpdf 10.6, the following failures occur:

b = b'\x7f'

    @given(binary())
    def test_codec_involution(b):
        # For all binary strings, there is a pdfdoc decoding. The encoding of that
        # decoding recovers the initial string. (However, not all str have a pdfdoc
        # encoding.)
>       assert b.decode('pdfdoc').encode('pdfdoc') == b
E       AssertionError: assert b'\x9f' == b'\x7f'
E         At index 0 diff: b'\x9f' != b'\x7f'
E         Use -v to get the full diff

and

s = '\x1f'

    @given(text())
    def test_break_encode(s):
        try:
            encoded_bytes = s.encode('pdfdoc')
        except ValueError as e:
            allowed_errors = [
                "'pdfdoc' codec can't encode character",
                "'pdfdoc' codec can't process Unicode surrogates",
                "'pdfdoc' codec can't encode some characters",
            ]
            if any((allowed in str(e)) for allowed in allowed_errors):
                return
            raise
        else:
>           assert encoded_bytes.decode('pdfdoc') == s
E           AssertionError: assert '˜' == '\x1f'
E             Strings contain only whitespace, escaping them using repr()
E             - '\x1f'
E             + '˜'

tests/test_codec.py:52: AssertionError

This is most likely because of qpdf/qpdf#606, which added previously omitted Unicode conversions for PDF Doc Encoding code points 0x18 through 0x1f and 0x7f. If you want to test mapping to an invalid code point, you can pick something lower than 0x18, which should map to the invalid character. Anyway, I’m not sure what the correct fix is for your test.
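
To make the change concrete, here is a minimal sketch of the affected low range, assuming the mapping given in the PDFDocEncoding table of the PDF specification, where 0x18 through 0x1f are stand-alone accents. The table name and the `decode_low` helper are hypothetical, chosen for illustration; this is not qpdf's or pikepdf's code. 0x7f is left out because the traceback above shows only its round-trip result.

```python
# Illustrative only: a stand-alone table, not qpdf's or pikepdf's implementation.
# Assumed from the PDFDocEncoding table in the PDF specification.
PDFDOC_LOW_TO_UNICODE = {
    0x18: "\u02d8",  # BREVE
    0x19: "\u02c7",  # CARON
    0x1A: "\u02c6",  # MODIFIER LETTER CIRCUMFLEX ACCENT
    0x1B: "\u02d9",  # DOT ABOVE
    0x1C: "\u02dd",  # DOUBLE ACUTE ACCENT
    0x1D: "\u02db",  # OGONEK
    0x1E: "\u02da",  # RING ABOVE
    0x1F: "\u02dc",  # SMALL TILDE
}


def decode_low(byte):
    """Return the assumed Unicode character for a byte in 0x18..0x1f, else None."""
    return PDFDOC_LOW_TO_UNICODE.get(byte)


# The second failing test encodes '\x1f' and decodes the result. If the encoded
# byte is 0x1f, decoding now yields SMALL TILDE rather than the control character.
decoded = decode_low(0x1F)
print(repr(decoded), decoded == "\x1f")
```

Under these assumptions, the sketch reproduces the `'˜' == '\x1f'` mismatch in the traceback: the byte survives encoding but no longer decodes to itself.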

I plan to release qpdf 10.6 most likely tomorrow, February 8, and to prepare everything today. Other than version numbers and final release mechanics, qpdf’s main branch is what 10.6 will look like. At this moment, I haven’t yet updated configure.ac and the libtool versions, but I will be doing that shortly.

About this issue

State: closed
Created 2 years ago
Comments: 40 (40 by maintainers)

Most upvoted comments

I’d like to make a statement in my release announcement like, “Distributions: this version of qpdf resolves the test failures with pikepdf. A new version of pikepdf is also about to be released whose tests pass against qpdf 10.6.2.” Any objection?

Just pushed now… good thing I checked, because I hadn’t actually pushed and would have been surprised not to see my release build finished when I got back…

jberkenbilt on Feb 16, 2022

@mara004 You piqued my curiosity, so I downloaded the file. There is something wrong with it – I commented on #288.

jberkenbilt on Feb 14, 2022

Mostly tests.

The problem was that qpdf (the C++ library pikepdf uses internally) was not handling certain characters correctly in PDF strings that were encoded in a certain way. The bug had been in qpdf for a very long time and was only discovered recently. While it could in principle affect real-life users of the library, it would only do so for certain relatively unusual operations, not including any of the transformations that qpdf/pikepdf are so often used for, and then only with very unusual files that encoded these characters in this somewhat unusual way. The characters that were not encoded properly were mostly stand-alone accents, which are quite rare on their own, and usually files that would need them would use a different encoding.

So, bottom line: while this could affect real-life use of the library, you could probably go the rest of your life and never encounter the combination of PDF file and use case where this bug would actually matter.

As for the failing tests, they do not indicate anything wrong with pikepdf itself. The tests were, in a sense, incorrect because they were relying on incorrect behavior of qpdf.

@jbarlow83 can add to this if needed for any details specific to pikepdf.

jberkenbilt on Feb 13, 2022