numexpr: Warn that evaluate() should not be used on user input

The evaluate() function eventually calls eval() on the data provided. eval() is extremely dangerous when supplied with user input and to my knowledge it isn’t mentioned that the function does this. I would add a warning in the documentation about this. As a proof-of-concept, the following code should execute the command ‘echo verybad’ on your computer when ran:

import numexpr
s = """
(lambda fc=(
    lambda n: [
        c for c in 
            ().__class__.__bases__[0].__subclasses__() 
            if c.__name__ == n
        ][0]
    ):
    fc("function")(
        fc("Popen")("echo verybad",shell=True),{}
    )()
)()
"""
numexpr.evaluate(s)

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 60 (36 by maintainers)

Commits related to this issue

Most upvoted comments

I am in the process to release 2.8.7, with the suggestions here. If you want to test how the candidate looks like, please go to https://github.com/pydata/numexpr/pull/453 and give it a try. My plan is to do a release as soon as possible (hopefully by tomorrow).

Also, and after talking with @robbmcleod , I have added an advert in the README where it is said that the project is looking for (much needed) new maintainers. If anyone here is ready for tackling that, please speak. Thanks!

@robbmcleod what, if anything, blocks the above two fixes from being applied, to at least fix the known unnecessary breakage? (See #452 if you prefer a proper pull request.)

In the longer term, it looks to me like everyone here agrees that moving to ast would be better, but will take work.

I may be interested in becoming a maintainer and/or contributing to that work, but this is not a promise at this point.

One can also take the opportunity to produce wheels for forthcoming Python 3.12. Although 3.12 is not final yet (Oct, 2nd is the tentative date), Python folks will not be introducing ABI changes after existing 3.12rc2, so extensions built on it should work well with the forthcoming 3.12 final. Also, the NumPy team is already producing wheels for 3.12, so this dependency should be ready too.

@robbmcleod I’m willing to help in doing the release in case you don’t have lots of time right now. BTW, thanks for all the time that you have put in the project so far; you have done a most excellent job in maintaining the project.

Perhaps a more reliable approach would be to use ast.parse, and reject the tree if we find statements, lambdas, dunder import, etc?

Maybe this issue can be closed now, right?

Alright, I’m going to try and provide a bit of history of this project because we have so many people coming here without context.

NumExpr (NE) was originally written by David Cooke in the late 2000s for Python 2.5 as a way to accelerate NumPy calculations. He’s no longer involved in open source and no one knows where he is or how to contact him. Because NE predates many things in modern Python, the source has a lot of technical debt and it implements its own Abstract Syntax Tree (AST), because the ast module didn’t exist back then (being a Python 2.7 innovation). NE turned 15-years old this year.

Incidentally the ast module documentation is still incomplete dogshit in 2023 and one should look at:

https://greentreesnakes.readthedocs.io/en/latest/

, if you want to understand how it actually works.

Francesc Alted took over maintenance in that void because PyTables used NumExpr to do queries. I was using NumExpr 2015-16 to accelerate some scientific calcs without having to implement customer functions in the C-API all the time. After I quit my Swiss post-doc and moved back to Canada I started on a project to make “NumExpr 3.0” which had the potential to fix all the shortfalls in 2.0. However, it was an (overly) ambitious project, and I got a paying job, and my free time evaporated. The 3.0 branch does use the ast module to parse expressions. However, I completely re-wrote the Python part of NE because it’s frankly a mess of mutable arguments going into the NE AST and I added the ability to parse multiple lines with temporary variables (which existed in the virtual machine as just a single 4k block) among numerous other improvements. Fransesc asked me if I could take over the maintenance side of things and I agreed, since it was the reasonable thing to do.

My thinking at that time was that people would stop using NE and switch to Numba. I saw NE3 as having a potentially niche for “write once; execute on big data” scripts where the user didn’t want to unravel their vectorized calculations to use Numba.  Little did I know at that time that a lot of people were using NE2 for a purpose it was never (IMO) intended to be used for: parsing user inputs. This is very clear if you look at the myriad of reported issues on this repo where people are using NumExpr to parse singletons (and hence running into issues with the NE AST), whereas in my opinion NE was intended to be a blocked-calculation virtual machine to avoid being calculation limited by memory bandwidth. NE is extremely inefficient for parsing singleton inputs. CPython itself is more efficient. I’ve tried repeatedly in the Issues tracker to discourage people from using NE in this way, to no avail.

I personally do not have the bandwidth to implement a new AST in the package. In 2017 yes, but not in 2023. Now, if someone wants to take over maintenance of the project, I’d be thrilled to hand it over. It’s possible money could be sourced from one of the open-source funding agencies. The adjacent packages: NumPy, Pandas, and PyTables are all funded. Francesc has approached me in the past asking if I wanted to be part of one of the grant applications, and I said no. I’ve also been asked to consult for companies using NumExpr internally and again I said no (although this has dried up over the past couple of years). For me, it’s not about the money, it’s about my personal time.

If I ask someone, “can you please write a unit test for this edge case?” and I get a no code response, I’m not going to be able to fix the problem in 15 minutes on my lunch break. To be clear: I don’t use NumExpr in a professional context. I wrote my own virtual machine for professional use. There’s no benefit to me in continuing to maintain this project.

I came across this from a test failure in my X-ray data analysis codes that uses a library (pyFAI) that uses numexpr.NumExpr on an expression like ‘4.0e-9*x’) (see #449).

I am shocked to learn that numexpr uses eval, and somewhat alarmed at the simplistic approaches proposed here to disallow dunder names.

Trying to parse Python expressions yourself is foolhardy, especially since Python exposes its own parser with ast. You might, for example, consider replacing eval() with the asteval module (https://github.com/newville/asteval). It is true that I am the author, but this is far from a shameless plus to use code: I support it so that my other codes can work safely.

With asteval you could certainly throw out many of the “supported nodes” you do not like (loops, conditionals, etc) and use it only for evaluations of expressions. You could join the discussion about what attributes of Python objects are unsafe. For details, see the list of disallowed attributes at https://newville.github.io/asteval/motivation.html#how-safe-is-asteval.

I added the means to turn the sanitize=True default behavior off, by setting an environment variable,

set NUMEXPR_SANITIZE=0

Generally speaking I think this shouldn’t be any more so a security hole than allowing people to pass sanitize=False is.

I tested with pandas against the tests I found that referenced numexpr or evaluate and they all passed. I wasn’t able to run the full pandas test suite as I had some access violation.

Otherwise everything is good to release 2.8.6. I’ll give everyone a day to comment.

If no one can think of adding any more tests to this I’ll prepare another release?

I’ll try and test locally against Pandas as well.

@lithomas1 and company,

I see a few approaches here.

  1. We can deprecate the implementation of the __ filter for a immediate release and put it back in for a future release.
  2. We can put in an option to disable the security check. However, I do think it should default to be on.

In order to avoid forcing everyone to do an emergency release, we probably need to do both, assuming the option defaults to sanitize.