pdfplumber: UnicodeEncodeError: 'charmap' codec can't encode character '\uf0b7' in position 908: character maps to <undefined>

Describe the bug

With the following code:

import sys
import pdfplumber

filename = sys.argv[1]
# print("Filename: " + filename)

with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

And the following sample file: 2.pdf

I receive the error:

Traceback (most recent call last):
  File "C:\Code\Misc\PdfParser\src\Test.PdfParser\bin\Debug\net6.0\pdf.py", line 10, in <module>
    print(text)
  File "C:\Users\joelc\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\uf0b7' in position 908: character maps to <undefined>

Python then exits with status code 1.
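
For context, the traceback ends inside cp1252.py and is raised from print(text), which suggests the failure may occur when the console encodes the output rather than during extraction. The following is a minimal sketch of that behavior, assuming stdout uses cp1252 as in the traceback above. The helper write_text, the constant BULLET, and the in-memory stream standing in for sys.stdout are illustrative names only; nothing here calls pdfplumber.

```python
import io

# U+F0B7 is a private-use code point (often a bullet glyph from a symbol font).
BULLET = "\uf0b7"


def write_text(text, encoding, errors="strict"):
    """Print `text` to an in-memory stream that stands in for sys.stdout.

    Returns the bytes that would have been sent to the console.
    Hypothetical helper for illustration only.
    """
    raw = io.BytesIO()
    # newline="\n" keeps the output identical across platforms.
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()


if __name__ == "__main__":
    sample = "item " + BULLET

    # cp1252 has no mapping for U+F0B7, so strict encoding fails.
    try:
        write_text(sample, "cp1252")
    except UnicodeEncodeError as exc:
        print("cp1252 failed:", exc.reason)

    # UTF-8 can represent every code point, so this succeeds.
    print(write_text(sample, "utf-8"))

    # Keeping cp1252 but substituting unencodable characters also succeeds.
    print(write_text(sample, "cp1252", errors="replace"))
```

If this is indeed the cause, the equivalent change in the script above would be to call sys.stdout.reconfigure(encoding="utf-8") before the loop (available since Python 3.7), or to set the PYTHONIOENCODING=utf-8 environment variable. Whether the console font can display the private-use glyph is a separate question.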

Have you tried repairing the PDF?

Please try running your code with pdfplumber.open(..., repair=True) before submitting a bug report.

I modified the code to pass repair=True and received a different error. This error persisted even after installing Ghostscript.

Traceback (most recent call last):
  File "C:\Code\Misc\PdfParser\src\Test.PdfParser\bin\Debug\net6.0\pdf.py", line 7, in <module>
    with pdfplumber.open(filename, repair=True) as pdf:
  File "C:\Users\joelc\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfplumber\pdf.py", line 78, in open
    stream = _repair(path_or_fp, password=password)
  File "C:\Users\joelc\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfplumber\repair.py", line 15, in _repair
    raise Exception(
Exception: Cannot find Ghostscript, which is required for repairs.
Visit https://www.ghostscript.com/ for installation instructions.
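
One possible reason the error persists after installation is that the Ghostscript executable is not on PATH for the Python process. The sketch below checks for that. It assumes the executable is found by searching PATH, and the candidate names (gs, gswin64c, gswin32c) are assumptions based on common Ghostscript builds, not confirmed from pdfplumber's source. find_ghostscript is a hypothetical helper.

```python
import shutil

# Assumed executable names: "gs" on Linux/macOS, "gswin64c"/"gswin32c" for
# the Windows console builds. Adjust to match the actual installation.
CANDIDATES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript(candidates=CANDIDATES):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    found = find_ghostscript()
    print(found or "No Ghostscript executable found on PATH")
```

If this prints no path, adding the Ghostscript bin directory to PATH and opening a new terminal may be enough for the repair step to find it.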

Code to reproduce the problem

Without repair=True

import sys
import pdfplumber

filename = sys.argv[1]
# print("Filename: " + filename)

with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

With repair=True

import sys
import pdfplumber

filename = sys.argv[1]
# print("Filename: " + filename)

with pdfplumber.open(filename, repair=True) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

PDF file

Please attach any PDFs necessary to reproduce the problem.

If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.

2.pdf

Expected behavior

What did you expect the result should have been?

Return the text from the document.

Actual behavior

What actually happened, instead?

Error message as shown above.

Screenshots

If applicable, add screenshots to help explain your problem.

N/A

Environment

  • pdfplumber version: 0.10.2
  • Python version: 3.10.4
  • OS: Windows 11

About this issue

  • State: closed
  • Created a year ago
  • Comments: 20 (7 by maintainers)

Most upvoted comments

Thank you, @jchristn. Both of your notes are very helpful. I’ll investigate and update you here.