cbor2: unexpected exceptions raised while parsing untrusted inputs using cbor2.loads

Things to check first

I have searched the existing issues and didn’t find my bug already reported there
I have checked that my bug is still present in the latest release

cbor2 version

5.5.1

Python version

3.10.12

What happened?

I have a script that parses untrusted data using the cbor2.loads method. The script checks whether the provided data is CBOR-encoded.

The implementation was as follows:

try:
    cbor2.loads(b'\x959;{{{{{{{{{{{{{')
except CBORDecodeError:
    print('no cbor encoded')

For some inputs, I’ve noticed that MemoryError is raised instead of CBORDecodeError.

To better understand the problem and check whether this was an isolated case, I ran a fuzzer against the cbor2.loads method.

It seems that cbor2.loads does not handle untrusted data robustly: in the worst case it tries to allocate far more memory than is available (see the MemoryError case in the code above).
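One plausible reading of the MemoryError input, sketched under the head encoding from RFC 8949: the third item is a text string whose 8-byte length field is filled with 0x7b bytes, so the declared length is enormous while only a few bytes actually follow. `read_head` is a hypothetical helper written for this illustration, not part of cbor2, and whether cbor2 fails at exactly this point is an assumption.

```python
def read_head(data: bytes, offset: int = 0) -> tuple[int, int, int]:
    """Return (major_type, argument, next_offset) for the CBOR head at offset.

    Hypothetical helper for illustration only; it handles additional-information
    values 0..27 and rejects anything else.
    """
    initial = data[offset]
    major_type = initial >> 5  # top 3 bits of the initial byte
    info = initial & 0x1F      # low 5 bits of the initial byte
    if info < 24:
        # The argument is stored directly in the initial byte.
        return major_type, info, offset + 1
    if info <= 27:
        size = 1 << (info - 24)  # 1, 2, 4 or 8 argument bytes follow
        raw = data[offset + 1 : offset + 1 + size]
        if len(raw) != size:
            raise ValueError("truncated head")
        return major_type, int.from_bytes(raw, "big"), offset + 1 + size
    raise ValueError("indefinite-length or reserved additional information")


payload = b'\x959;{{{{{{{{{{{{{'

offset = 0
heads = []
for _ in range(3):
    major_type, argument, offset = read_head(payload, offset)
    heads.append((major_type, argument))

# heads[0]: major type 4 (array) declaring 21 items
# heads[1]: major type 1 (negative integer) with a 2-byte argument
# heads[2]: major type 3 (text string) with an 8-byte declared length
declared_length = heads[2][1]
remaining = len(payload) - offset
print(f"declared text length: {declared_length:#x}, bytes actually left: {remaining}")
```

A decoder that allocates a buffer of the declared length before checking how much input remains would attempt a multi-exabyte allocation here, which is consistent with the MemoryError.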

I was able to find the following exceptions raised by cbor2.loads (all reproduced using cbor2 5.5.1/python 3.10.12/Ubuntu 20.4):

# OverflowError: timestamp out of range for platform time_t
cbor2.loads(b'\xc1\x1b\x9b\x9b\x9b\x00\x00\x00\x00\x00')
# OSError: [Errno 75] Value too large for defined data type
cbor2.loads(b'\xc1\x1b\x16\x16\x16\x16\x16\x16\x16\x16\x16\x16\x16\x16\x16\x16')
# MemoryError:
cbor2.loads(b'\x959;{{{{{{{{{{{{{')
# TypeError: object of type 'int' has no len()
cbor2.loads(b'\xd8%\x00\x10`\x00\x00\x00`\x10\x00\x00\x00\x00\x00\x00')
# SystemError: <built-in function loads> returned NULL without setting an error
cbor2.loads(b'\xd8\x1e\x84\xff\xff\xff\xff')
# re.error: unbalanced parenthesis at position 0
cbor2.loads(b'\xd8#A)')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 1: invalid start byte
cbor2.loads(b'b\n\xff')

I was trying to analyze how it could be improved but it is not an easy task for somebody who does not maintain this code. Is it possible to improve it somehow?

Ideally, cbor2.loads would raise CBORDecodeError whenever the input is not valid CBOR.
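Until the decoder behaves that way, a caller-side workaround could look roughly like the sketch below. `is_decodable` and `REPORTED_ERRORS` are hypothetical names, the exception list is only what the fuzzing runs in this report surfaced (so it may be incomplete), and the decoder is passed in as a callable so the sketch does not depend on any particular cbor2 version.

```python
import re
from typing import Any, Callable, Tuple, Type

# Exception types reported in this issue for cbor2 5.5.1. This list is an
# assumption based on the fuzzing results and may not be exhaustive.
REPORTED_ERRORS: Tuple[Type[BaseException], ...] = (
    OverflowError,
    OSError,
    MemoryError,
    TypeError,
    SystemError,
    re.error,
    UnicodeDecodeError,
)


def is_decodable(
    data: bytes,
    decode: Callable[[bytes], Any],
    expected_errors: Tuple[Type[BaseException], ...] = (),
) -> bool:
    """Return True if decode(data) succeeds, False on a known failure type.

    Exceptions outside expected_errors and REPORTED_ERRORS still propagate,
    so genuinely new failure modes are not silently swallowed.
    """
    try:
        decode(data)
    except tuple(expected_errors) + REPORTED_ERRORS:
        return False
    return True
```

With cbor2 installed, this would be called as `is_decodable(payload, cbor2.loads, expected_errors=(cbor2.CBORDecodeError,))`. It cannot guard against a crash inside the C extension, and catching MemoryError does not stop the decoder from attempting the allocation first.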

How can we reproduce the bug?

Code to reproduce mentioned exceptions:

import cbor2
cbor2.loads(b'\xc1\x1b\x9b\x9b\x9b\x00\x00\x00\x00\x00')
cbor2.loads(b'\xc1\x1b\x16\x16\x16\x16\x16\x16\x16\x16\x16\x16\x16\x16\x16\x16')
cbor2.loads(b'\x959;{{{{{{{{{{{{{')
cbor2.loads(b'\xd8%\x00\x10`\x00\x00\x00`\x10\x00\x00\x00\x00\x00\x00')
cbor2.loads(b'\xd8\x1e\x84\xff\xff\xff\xff')
cbor2.loads(b'\xd8#A)')
cbor2.loads(b'b\n\xff')
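Run as a plain script, the lines above stop at the first uncaught exception, so the later inputs are never reached. A small driver such as the following sketch records the exception type for every input instead. `collect_exception_types` is a hypothetical helper, and the decoder is passed in as a parameter so the function itself does not need cbor2 to be importable.

```python
from typing import Any, Callable, List, Optional, Tuple

# The seven inputs from this report, copied verbatim.
REPRO_INPUTS = [
    b'\xc1\x1b\x9b\x9b\x9b\x00\x00\x00\x00\x00',
    b'\xc1\x1b\x16\x16\x16\x16\x16\x16\x16\x16\x16\x16\x16\x16\x16\x16',
    b'\x959;{{{{{{{{{{{{{',
    b'\xd8%\x00\x10`\x00\x00\x00`\x10\x00\x00\x00\x00\x00\x00',
    b'\xd8\x1e\x84\xff\xff\xff\xff',
    b'\xd8#A)',
    b'b\n\xff',
]


def collect_exception_types(
    inputs: List[bytes], decode: Callable[[bytes], Any]
) -> List[Tuple[bytes, Optional[str]]]:
    """Decode every input and pair it with the raised exception's class name.

    Inputs that decode without error are paired with None.
    """
    results = []
    for data in inputs:
        try:
            decode(data)
        except Exception as exc:  # MemoryError and SystemError are included
            results.append((data, type(exc).__name__))
        else:
            results.append((data, None))
    return results
```

With cbor2 installed, `collect_exception_types(REPRO_INPUTS, cbor2.loads)` would list one exception name per input. The names will differ between cbor2 versions, and a crash in native code would still terminate the process.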

cbor2 was tested using the Atheris fuzzer and the following code:

import sys
import atheris
import pprint

with atheris.instrument_imports():
    import cbor2


EXCEPTIONS = {}
pp = pprint.PrettyPrinter(indent=4)

def fuzz_cbor2(data):
    try:
        cbor2.loads(data)
    except cbor2.CBORError:
        # CBORError is expected for some data
        pass
    except Exception as e:
        if type(e) not in EXCEPTIONS:
            EXCEPTIONS[type(e)] = data.hex()
            print(f"Found new exception {e}")
            print("************** status *************")
            pp.pprint(EXCEPTIONS)


if __name__ == "__main__":
    atheris.Setup(sys.argv, fuzz_cbor2)
    atheris.Fuzz()

About this issue

State: closed
Created 6 months ago
Comments: 27 (27 by maintainers)

Most upvoted comments

I’ve fixed the problems originally reported here. I believe that, to fix all the problems thoroughly, a rewrite would be needed, but I don’t have the bandwidth for that, and I have to draw the line somewhere in order to move on to other projects. I’ve released v5.6.0 which contains these fixes.

agronholm on Jan 17, 2024

What difference would that make, and to whom?

Good question, I would suspect that downstream consumers of this library would want to know if they’re running a potentially insecure version (i.e. < v5.6.0). The easiest way to accomplish that would probably be issuing a security advisory, which would then automatically be picked up by tools like Dependabot and Snyk 👍

mschwager on Jan 30, 2024

Great minds think alike! I was actually fuzzing cbor2 with Atheris very recently too. However, I focused on the C implementation and looked for memory corruption bugs. I did manage to find at least one, and there may be more. I would recommend setting up regular fuzz testing for this project. Here’s what I came up with.

My Dockerfile for reproduction (note some paths may have to change, like aarch64):

FROM debian:12-slim

RUN apt update && apt install -y \
    clang \
    git \
    python3-full \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

RUN python3 --version

RUN mkdir /app
WORKDIR /app

# Subject to change by upstream
# https://github.com/google/atheris/issues/36
ENV LIBFUZZER_LIB "/usr/lib/llvm-14/lib/clang/14.0.6/lib/linux/libclang_rt.fuzzer_no_main-aarch64.a"

ENV VIRTUAL_ENV "/opt/venv"
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH "$VIRTUAL_ENV/bin:$PATH"

# https://github.com/google/atheris#building-from-source
RUN python3 -m pip install --no-binary atheris atheris
RUN git clone https://github.com/google/atheris.git
RUN python3 -m pip install atheris/

# https://github.com/google/atheris/blob/master/native_extension_fuzzing.md#step-1-compiling-your-extension
ENV CC "/usr/bin/clang"
ENV CFLAGS "-fsanitize=address,undefined,fuzzer-no-link"
ENV CXX "/usr/bin/clang++"
ENV CXXFLAGS "-fsanitize=address,undefined,fuzzer-no-link"
ENV LDSHARED "/usr/bin/clang -shared"

# https://github.com/agronholm/cbor2
ENV CBOR2_BUILD_C_EXTENSION "1"
RUN git clone https://github.com/agronholm/cbor2.git
RUN python3 -m pip install cbor2/

# Subject to change by upstream, but it's just a sanity check
RUN nm "$VIRTUAL_ENV/lib/python3.11/site-packages/_cbor2.cpython-311-aarch64-linux-gnu.so" \
    | grep asan \
    && echo "Found ASAN" \
    || echo "Missing ASAN"

# Allow Atheris to find fuzzer sanitizer shared libs
# https://github.com/google/atheris/blob/master/native_extension_fuzzing.md#option-a-sanitizerlibfuzzer-preloads
ENV LD_PRELOAD "$VIRTUAL_ENV/lib/python3.11/site-packages/asan_with_fuzzer.so"

# Skip memory allocation failures for now
ENV ASAN_OPTIONS "allocator_may_return_null=1"

COPY fuzz.py fuzz.py
ENTRYPOINT ["python", "fuzz.py"]

And my fuzz harness:

#!/usr/bin/python3

import sys
import atheris

with atheris.instrument_imports():
    # _cbor2 ensures the C library is imported
    from _cbor2 import loads


# Inspired by: https://github.com/google/oss-fuzz/blob/master/projects/ujson/ujson_fuzzer.py
def TestOneInput(data):
    try:
        loads(data)
    except Exception:
        # We're searching for memory corruption, not Python exceptions
        pass


def main():
    # Since everything interesting in this fuzzer is in native code, we can
    # disable Python coverage to improve performance and reduce coverage noise.
    atheris.Setup(sys.argv, TestOneInput, enable_python_coverage=False)
    atheris.Fuzz()


if __name__ == "__main__":
    main()

Build, then run the Docker image:

$ docker build -t cbor2-fuzz -f Dockerfile .
$ docker run -v $(pwd):/tmp/output/ cbor2-fuzz -artifact_prefix=/tmp/output/

This then produces a crash like:

...
SUMMARY: AddressSanitizer: SEGV /usr/include/python3.11/object.h:537:9 in Py_DECREF
==1==ABORTING
MS: 2 CMP-CrossOver- DE: "\001\010\000\000"-; base unit: 1b850b729fc7234e8dcac9406224b06d235affb7
0xae,0xae,0xae,0xae,0xae,0xae,0xae,0xae,0xae,0x1,0x8,0xc2,0x98,0x43,0xd9,0x1,0x0,0xd8,0x24,0x9f,0x0,0x0,0xae,0xae,0xff,0xc2,0x6c,0xa7,0x99,
\256\256\256\256\256\256\256\256\256\001\010\302\230C\331\001\000\330$\237\000\000\256\256\377\302l\247\231
artifact_prefix='/tmp/output/'; Test unit written to /tmp/output/crash-95e879135b949a863283a11eaf98bb7b3b109783
Base64: rq6urq6urq6uAQjCmEPZAQDYJJ8AAK6u/8Jsp5k=

Which we can confirm like so:

$ echo -n "rq6urq6urq6uAQjCmEPZAQDYJJ8AAK6u/8Jsp5k=" | python -m cbor2.tool -d
Segmentation fault: 11
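To feed the same input to the decoder from Python rather than through the command-line tool, the Base64 string can be turned back into bytes. The sketch below only decodes the artifact and checks it against the hex dump in the fuzzer output; the constant names are made up for this example, and the decoder call is left as a comment because it would crash an affected build.

```python
import base64

# Base64 form of the crashing input, as printed by the fuzzer.
CRASH_B64 = "rq6urq6urq6uAQjCmEPZAQDYJJ8AAK6u/8Jsp5k="

# The same input transcribed from the fuzzer's comma-separated hex dump.
CRASH_HEX = "aeaeaeaeaeaeaeaeae0108c29843d90100d8249f0000aeaeffc26ca799"

crash_input = base64.b64decode(CRASH_B64)

# Both representations in the fuzzer output describe the same bytes.
assert crash_input == bytes.fromhex(CRASH_HEX)
print(len(crash_input), crash_input.hex())

# On an affected build, passing these bytes to the C decoder is expected to
# segfault, so the call is not executed here:
#   import cbor2
#   cbor2.loads(crash_input)
```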

This appears at the following location:

==1==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0xffff96e7cc38 bp 0xfffffec31660 sp 0xfffffec31580 T0)
==1==The signal is caused by a READ memory access.
==1==Hint: address points to the zero page.
    #0 0xffff96e7cc38 in Py_DECREF /usr/include/python3.11/object.h:537:9
    #1 0xffff96e7cc38 in decode_definite_string /app/cbor2/source/decoder.c:653:9
    #2 0xffff96e7cc38 in decode_string /app/cbor2/source/decoder.c:718:15
    #3 0xffff96e79778 in decode /app/cbor2/source/decoder.c:1735:27
...

Which seems to be this code:

https://github.com/agronholm/cbor2/blob/850545ca33c1541de397ef2e6c6e1af221d4a0f8/source/decoder.c#L653

I’m not sure about exploitability here. Memory corruption in C code has more potential for exploitation than Python exceptions. I also did notice this big warning in the Py_DECREF docs. I’m not sure if that’s applicable in this situation, but again, it’s cause for concern.

mschwager on Dec 26, 2023

The second largest concern is that MemoryError, as it has the potential for a DoS attack.

agronholm on Dec 20, 2023

The most concerning error is that SystemError that says it returned NULL without setting an error. This looks like a bug in the C decoder implementation.

agronholm on Dec 20, 2023