pdfplumber: UnicodeEncodeError: 'charmap' codec can't encode character '\uf0b7' in position 908: character maps to <undefined>
Describe the bug
With the following code:
import sys
import pdfplumber
filename = sys.argv[1]
# print("Filename: " + filename)
with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
And the following sample file: 2.pdf
I receive the error:
Traceback (most recent call last):
  File "C:\Code\Misc\PdfParser\src\Test.PdfParser\bin\Debug\net6.0\pdf.py", line 10, in <module>
    print(text)
  File "C:\Users\joelc\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\uf0b7' in position 908: character maps to <undefined>
And python returns with status code 1.
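Note on the traceback: the failing frame is print(text), and the encoder involved is cp1252.py, so the exception appears to be raised while writing to the Windows console rather than inside extract_text(). The character '\uf0b7' lies in the Unicode Private Use Area and has no cp1252 mapping. A minimal sketch of two possible workarounds follows, assuming the console encoding is indeed the cause; the helper names to_console_safe and use_utf8_stdout are hypothetical and not part of pdfplumber.

```python
import sys


def to_console_safe(text, encoding="cp1252"):
    """Replace characters that `encoding` cannot represent with '?'.

    Hypothetical helper: the text is round-tripped through the target
    encoding so that a later print() cannot raise UnicodeEncodeError.
    """
    return text.encode(encoding, errors="replace").decode(encoding)


def use_utf8_stdout():
    """Switch stdout to UTF-8 where the stream supports it (Python 3.7+)."""
    reconfigure = getattr(sys.stdout, "reconfigure", None)
    if reconfigure is not None:
        reconfigure(encoding="utf-8", errors="replace")


# Example with the character from the traceback; the replaced form is
# safe to print on a cp1252 console.
sample = "\uf0b7 first bullet"
print(to_console_safe(sample))
```

In the script above, either call use_utf8_stdout() once before the loop or wrap the argument as print(to_console_safe(text)). Setting the environment variable PYTHONIOENCODING=utf-8 before running the script is another way to change the output encoding without touching the code.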
Have you tried repairing the PDF?
Please try running your code with pdfplumber.open(..., repair=True) before submitting a bug report.
I modified the code to pass repair=True and received a different error. The error persisted even after I installed Ghostscript.
Traceback (most recent call last):
  File "C:\Code\Misc\PdfParser\src\Test.PdfParser\bin\Debug\net6.0\pdf.py", line 7, in <module>
    with pdfplumber.open(filename, repair=True) as pdf:
  File "C:\Users\joelc\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfplumber\pdf.py", line 78, in open
    stream = _repair(path_or_fp, password=password)
  File "C:\Users\joelc\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfplumber\repair.py", line 15, in _repair
    raise Exception(
Exception: Cannot find Ghostscript, which is required for repairs.
Visit https://www.ghostscript.com/ for installation instructions.
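Since this error persists after installation, one thing worth checking is whether a Ghostscript executable is visible on PATH from the same Python process. The sketch below assumes the lookup is PATH-based and that the executable is named gs, gswin64c, or gswin32c; both points are assumptions, and find_ghostscript is a hypothetical helper, not a pdfplumber function.

```python
import shutil

# Assumption: executable names commonly used by Ghostscript installs
# (gs on Linux/macOS, gswin64c / gswin32c on Windows).
CANDIDATES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript(candidates=CANDIDATES):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


print(find_ghostscript() or "No Ghostscript executable found on PATH")
```

If this prints the "not found" message on Windows, the Ghostscript bin directory may need to be added to PATH, followed by opening a new terminal so the change is picked up.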
Code to reproduce the problem
Without repair=True
import sys
import pdfplumber
filename = sys.argv[1]
# print("Filename: " + filename)
with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
With repair=True
import sys
import pdfplumber
filename = sys.argv[1]
# print("Filename: " + filename)
with pdfplumber.open(filename, repair=True) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
PDF file
Please attach any PDFs necessary to reproduce the problem.
If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.
Expected behavior
What did you expect the result should have been?
Return the text from the document.
Actual behavior
What actually happened, instead?
Error message as shown above.
Screenshots
If applicable, add screenshots to help explain your problem.
N/A
Environment
- pdfplumber version: 0.10.2
- Python version: 3.10.4
- OS: Windows 11
Additional context
Add any other context/notes about the problem here.
About this issue
- State: closed
- Created a year ago
- Comments: 20 (7 by maintainers)
Thank you, @jchristn. Both of your notes are very helpful. I’ll investigate and update you here.