nbconvert: Bleach seems to be significantly slower than lxml in 7.1.x+
Description
After updating nbconvert from 6.5.0 to 7.2.2 for use here at GitHub we discovered the notebook rendering service degraded in render time (almost twice as long to render). We looked into what might have caused this and it looks as though the switch to bleach for sanitization is the culprit here. I’m wondering if we could provide a configuration for users to pass
that would allow them to choose the cleaning library they would prefer to use considering it could have a major performance impact.
In the meantime our team should be able to safely upgrade to 7.0.0 from 6.5.0 though moving forward we’d love to keep in close sync with the latest releases, a couple steps behind if not update to date.
I recognize this may have been a deliberate design decision, I’m interested in looking into making an PR myself. Any pointers/recommendations would be appreciated 🙇🏾 .
Relevant PR: https://github.com/jupyter/nbconvert/pull/1854
Screenshots
7.0.0 Profile

7.0.0 Flamegraph

7.2.2 Profile
7.2.2 Flamegraph

About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 16 (13 by maintainers)
As an update,
nh3 0.2.0is now sufficiently configurable to do what bleach is doing to HTML… except for the contents ofstyleattributes/tags, which can contain all kinds of evil things.It looks like a maximally secure approach would still have to use
bleachifstyleappeared anywhere, but could use a faster, allowlist-compatible sanitizer, and generate the same XML DOM.Further, as the stdlib XML parser is used in a number of parsing-related places, one might still want to prefer having
lxmlaround, even ifnh3was used for the cleaning process.So there’s certainly an opportunity to garner a number of performance gains, but before doing so, having an actual benchmark suite with e.g.
asvin place would be a big step forward.For the time being, a workaround, if you’re willing to try it, is to patch
lxmlback into place in the filter dict:This essentially reverts the change made to the filters dict in b40bb13a44d5632f20e8bb6d13049dccb6dbfa75.
It should be noted, though, that
lxml.clean_htmlmay be going extinct: https://bugs.launchpad.net/lxml/+bug/1958539Of note: bleach 6.0 out (with some breaking changes, the impact of which I haven’t yet assessed), but is also now officially deprecated. We’ll need to do something within the next year or so.