nbconvert: Bleach seems to be significantly slower than lxml in 7.1.x+
Description
After updating nbconvert
from 6.5.0 to 7.2.2 for use here at GitHub we discovered the notebook rendering service degraded in render time (almost twice as long to render). We looked into what might have caused this and it looks as though the switch to bleach
for sanitization is the culprit here. I’m wondering if we could provide a configuration for users to pass
that would allow them to choose the cleaning library they would prefer to use considering it could have a major performance impact.
In the meantime our team should be able to safely upgrade to 7.0.0 from 6.5.0 though moving forward we’d love to keep in close sync with the latest releases, a couple steps behind if not update to date.
I recognize this may have been a deliberate design decision, I’m interested in looking into making an PR myself. Any pointers/recommendations would be appreciated 🙇🏾 .
Relevant PR: https://github.com/jupyter/nbconvert/pull/1854
Screenshots
7.0.0 Profile
7.0.0 Flamegraph
7.2.2 Profile
7.2.2 Flamegraph
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 16 (13 by maintainers)
As an update,
nh3 0.2.0
is now sufficiently configurable to do what bleach is doing to HTML… except for the contents ofstyle
attributes/tags, which can contain all kinds of evil things.It looks like a maximally secure approach would still have to use
bleach
ifstyle
appeared anywhere, but could use a faster, allowlist-compatible sanitizer, and generate the same XML DOM.Further, as the stdlib XML parser is used in a number of parsing-related places, one might still want to prefer having
lxml
around, even ifnh3
was used for the cleaning process.So there’s certainly an opportunity to garner a number of performance gains, but before doing so, having an actual benchmark suite with e.g.
asv
in place would be a big step forward.For the time being, a workaround, if you’re willing to try it, is to patch
lxml
back into place in the filter dict:This essentially reverts the change made to the filters dict in b40bb13a44d5632f20e8bb6d13049dccb6dbfa75.
It should be noted, though, that
lxml.clean_html
may be going extinct: https://bugs.launchpad.net/lxml/+bug/1958539Of note: bleach 6.0 out (with some breaking changes, the impact of which I haven’t yet assessed), but is also now officially deprecated. We’ll need to do something within the next year or so.