nbconvert: Bleach seems to be significantly slower than lxml in 7.1.x+

Description

After updating nbconvert from 6.5.0 to 7.2.2 for use here at GitHub we discovered the notebook rendering service degraded in render time (almost twice as long to render). We looked into what might have caused this and it looks as though the switch to bleach for sanitization is the culprit here. I’m wondering if we could provide a configuration for users to pass that would allow them to choose the cleaning library they would prefer to use considering it could have a major performance impact.

In the meantime our team should be able to safely upgrade to 7.0.0 from 6.5.0 though moving forward we’d love to keep in close sync with the latest releases, a couple steps behind if not update to date.

I recognize this may have been a deliberate design decision, I’m interested in looking into making an PR myself. Any pointers/recommendations would be appreciated 🙇🏾 .

Relevant PR: https://github.com/jupyter/nbconvert/pull/1854

Screenshots

7.0.0 Profile Screen Shot 2022-10-27 at 2 32 22 PM

7.0.0 Flamegraph Screen Shot 2022-10-27 at 2 31 18 PM

7.2.2 Profile Screen Shot 2022-10-27 at 12 26 31 PM 7.2.2 Flamegraph Screen Shot 2022-10-27 at 12 26 14 PM

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 16 (13 by maintainers)

Commits related to this issue

Most upvoted comments

As an update, nh3 0.2.0 is now sufficiently configurable to do what bleach is doing to HTML… except for the contents of style attributes/tags, which can contain all kinds of evil things.

It looks like a maximally secure approach would still have to use bleach if style appeared anywhere, but could use a faster, allowlist-compatible sanitizer, and generate the same XML DOM.

Further, as the stdlib XML parser is used in a number of parsing-related places, one might still want to prefer having lxml around, even if nh3 was used for the cleaning process.

So there’s certainly an opportunity to garner a number of performance gains, but before doing so, having an actual benchmark suite with e.g. asv in place would be a big step forward.

For the time being, a workaround, if you’re willing to try it, is to patch lxml back into place in the filter dict:

from nbconvert.exporters.templateexporter import default_filters
from lxml.html.clean import clean_html
default_filters["clean_html"] = clean_html

This essentially reverts the change made to the filters dict in b40bb13a44d5632f20e8bb6d13049dccb6dbfa75.

It should be noted, though, that lxml.clean_html may be going extinct: https://bugs.launchpad.net/lxml/+bug/1958539

Of note: bleach 6.0 out (with some breaking changes, the impact of which I haven’t yet assessed), but is also now officially deprecated. We’ll need to do something within the next year or so.