pdfminer.six: PDF text bboxes don't match their real location
Hi, I am using pdfminer.six==20181108.
I recently noticed that a parsed PDF sometimes has many "\t" characters in its text. When that happens, the LTTextBoxes do not match the real text location on the PDF page: the bbox of a character is too large for the character itself and covers the locations of adjacent characters.
Is there a way to solve this by setting some parameters in LAParams? Has anyone else run into this issue before?
This is my simple function for PDF text extraction:
```python
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator


def load_pdf(pdf_stream):
    # pdf_stream is a binary file-like object, e.g. open(path, "rb")
    doc = PDFDocument(PDFParser(pdf_stream))
    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams(detect_vertical=False))
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    pgs = []
    for pg in PDFPage.create_pages(doc):
        interpreter.process_page(pg)
        # get_result() returns the LTPage layout of the page just processed
        pgs.append(device.get_result())
    return pgs
```
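To check whether the character boxes really overlap, a rough diagnostic along these lines might help. It is only a sketch: the names `iter_chars` and `find_overlaps` and the tolerance `tol` are made up for illustration and are not part of pdfminer.six. It relies on duck typing, assuming that layout containers (pages, text boxes, text lines) are iterable and that character objects are non-iterable leaves with a `bbox` and a `get_text()` method.

```python
def iter_chars(item):
    """Yield leaf objects that look like characters (have bbox and get_text)."""
    try:
        children = iter(item)
    except TypeError:
        # Not iterable, so this is a leaf. Leaves without a bbox
        # (e.g. inserted spaces/newlines) or without text are skipped.
        if hasattr(item, "bbox") and hasattr(item, "get_text"):
            yield item
        return
    for child in children:
        yield from iter_chars(child)


def find_overlaps(layout, tol=0.5):
    """Return pairs of consecutive characters whose boxes overlap horizontally.

    tol is a hypothetical tolerance in PDF units; tune it for your documents.
    """
    chars = list(iter_chars(layout))
    hits = []
    for prev, cur in zip(chars, chars[1:]):
        px0, py0, px1, py1 = prev.bbox
        cx0, cy0, cx1, cy1 = cur.bbox
        # Only compare characters whose vertical extents overlap (same line).
        same_line = min(py1, cy1) - max(py0, cy0) > 0
        # Flag the pair when the earlier box runs past the start of the next one.
        if same_line and cx0 >= px0 and px1 - cx0 > tol:
            hits.append((prev.get_text(), prev.bbox, cur.get_text(), cur.bbox))
    return hits
```

With the function above, this could be called as `find_overlaps(load_pdf(f))` on an open binary file `f`. If the flagged pairs mostly start with "\t", that would suggest the oversized boxes belong to the tab characters themselves rather than to the text box grouping.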
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 23 (7 by maintainers)
@jstockwin unfortunately I can’t share it publicly, but I can use it for testing on my end. Thanks for the pointer - I’ll check it out!
@zangell44 Are you also able to share said PDF to help with debugging?
If you wanted to take a look, I suspect a good place to start would be here: https://github.com/pdfminer/pdfminer.six/blob/develop/pdfminer/layout.py#L292. I’m not particularly familiar with that part of the code, so I can’t give you any more pointers than that, I’m afraid…