pdf.js: Text in SVG pages is not aligned to unicode codepoints
See issue https://github.com/mozilla/pdf.js/issues/8546
When creating HTML pages, the text is perfectly fine.
When creating SVG pages, the text looks fine, but copy/paste yields only mojibake.
SVG pages can be embedded with <img src="/path/page.svg"> tags, which is often handier than <iframe src="/path/page.html">.
Steps to reproduce (PDF: r2l.pdf):
gulp dist-install
node examples/node/pdf2svg.js /tmp/r2l.pdf
firefox svgdump/*.svg
Copy/paste the text to an editor.
@brendandahl commented that this function already exists, but it is only called when creating HTML pages: https://github.com/mozilla/pdf.js/blob/ad74f6e7410420dc6ae27edc863a2ef906d77b57/src/core/fonts.js#L731
I would donate $500 for a fix. https://www.bountysource.com/issues/55452997-text-in-svg-pages-is-not-aligned-to-unicode-codepoints
About this issue
- State: closed
- Created 6 years ago
- Comments: 27 (3 by maintainers)
Hey there. I’ve investigated the issue and will most likely post a quick fix sometime this week. If I’m correct, for most documents the problem will disappear if I straighten out both the text and the embedded fonts’ cmaps to use proper Unicode. The HTML renderer may even benefit from this if the selection layer then matches the text shape more closely. I’ll then investigate possible edge cases, like several fonts being baked into a single font file, which will likely require splitting the font file back into pieces. Suggestions are welcome, though.
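To make the "straighten out the text" part concrete, here is a minimal sketch of the remapping step. It is not pdf.js code: `remapToUnicode` and the `toUnicode` map are hypothetical names, and the sample mapping (private-use codes standing in for Hebrew letters) is made up for illustration. It assumes every raw character code in the SVG text can be looked up in a ToUnicode-style table and leaves unmapped characters untouched.

```javascript
// Hypothetical sketch, not the pdf.js API: replace the raw char codes that
// end up in SVG text with the Unicode strings a ToUnicode-style map assigns.
function remapToUnicode(text, toUnicode) {
  let out = "";
  // for...of iterates by code point, so surrogate pairs stay intact.
  for (const ch of text) {
    const code = ch.codePointAt(0);
    // Fall back to the original character when no mapping exists.
    out += toUnicode.has(code) ? toUnicode.get(code) : ch;
  }
  return out;
}

// Made-up mapping for illustration only.
const toUnicode = new Map([
  [0xe001, "\u05e9"], // HEBREW LETTER SHIN
  [0xe002, "\u05dc"], // HEBREW LETTER LAMED
  [0xe003, "\u05d5"], // HEBREW LETTER VAV
  [0xe004, "\u05dd"], // HEBREW LETTER FINAL MEM
]);

console.log(remapToUnicode("\ue001\ue002\ue003\ue004", toUnicode));
```

The same table would also have to drive the embedded font's cmap, so that the remapped code points still select the original glyphs; that part is not shown here.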
For the above document, I believe you’ll need to support Unicode ligatures (see https://github.com/mozilla/pdf.js/blob/c2cbeaa34d81bbb7cced856e2888867df587a1fa/src/core/fonts.js#L780) and we’ll have to create a GSUB table.
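As a rough illustration of why ligatures call for a GSUB table: when one glyph stands for several Unicode characters, the font needs a rule that turns that character sequence back into the glyph. The sketch below only collects such rules in memory; it does not write the binary GSUB table. `collectLigatures`, the glyph IDs, and the sample map are all hypothetical and not taken from pdf.js.

```javascript
// Hypothetical sketch: find glyphs whose Unicode value spans several code
// points. Each one needs a ligature rule (component sequence -> glyph),
// which is the kind of data a GSUB ligature substitution lookup encodes.
function collectLigatures(glyphToUnicode) {
  const ligatures = [];
  for (const [glyphId, str] of glyphToUnicode) {
    // Array.from splits by code point rather than by UTF-16 unit.
    const components = Array.from(str, (ch) => ch.codePointAt(0));
    if (components.length > 1) {
      ligatures.push({ components, glyphId });
    }
  }
  // Put longer sequences first so "ffi" is preferred over "fi".
  ligatures.sort((a, b) => b.components.length - a.components.length);
  return ligatures;
}

// Made-up glyph IDs for illustration only.
const glyphToUnicode = new Map([
  [10, "f"],
  [11, "fi"],
  [12, "ffi"],
]);

console.log(collectLigatures(glyphToUnicode));
```

Single-character glyphs are skipped because the cmap alone can handle them; only the multi-character entries would need substitution rules.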