tesserocr: Cannot get text with French language

Hi!

I’m trying to use tesserocr with the French language, but I keep getting a Unicode decoding error:

    from PIL import Image
    from tesserocr import PyTessBaseAPI

    api = PyTessBaseAPI(lang='fra')
    api.SetImage(Image.open("20170509_182040.jpg"))
    api.SetSourceResolution(300)
    api.GetUTF8Text()

Returns:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "tesserocr.pyx", line 2033, in tesserocr.PyTessBaseAPI.GetUTF8Text (tesserocr.cpp:18137)
      File "tesserocr.pyx", line 294, in tesserocr._free_str (tesserocr.cpp:2567)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 341: invalid continuation byte

The English version works, though:

    api = PyTessBaseAPI()
    api.SetImage(Image.open("20170509_182040.jpg"))
    api.SetSourceResolution(300)
    api.GetUTF8Text()

Returns:

    'The text that I want'
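For readers unfamiliar with the message in the French traceback, here is a minimal sketch of what "invalid continuation byte" means in plain Python, independent of tesserocr. In UTF-8, 0xc3 is the first byte of a two-byte character such as "é", and the decoder fails when the byte after it is not a valid second byte. The helper `try_decode` and the sample byte strings are made up for illustration; they are not taken from the OCR output in this issue.

```python
# "é" (U+00E9) is two bytes in UTF-8: lead byte 0xc3, continuation byte 0xa9.
E_ACUTE = "\u00e9".encode("utf-8")


def try_decode(raw: bytes) -> str:
    """Decode raw as UTF-8; on failure, return a short description of the error."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        bad_byte = raw[exc.start]
        return f"error: {exc.reason} (byte {bad_byte:#x} at position {exc.start})"


# Hypothetical sample data, not the actual bytes returned by Tesseract.
valid = b"caf" + E_ACUTE       # 0xc3 followed by its continuation byte
broken = b"caf\xc3 au lait"    # 0xc3 followed by a space instead

print(try_decode(valid))
print(try_decode(broken))
```

This only reproduces the same class of error. It does not show why the French run produced such bytes.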

This is my installation:

    tesserocr.__version__
    '2.1.3'

    tesserocr.tesseract_version()
    'tesseract 3.05.00\n leptonica-1.74.1\n libjpeg 8d : libpng 1.6.29 : libtiff 4.0.7 : zlib 1.2.8\n'

MacOS Sierra

Is this a known issue, or do I need to change something to get it to work?

Thanks for your help!

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 22 (9 by maintainers)

Most upvoted comments

Both Ubuntu and Mac use en_US.UTF-8. It’s magic.
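The comment above compares locale settings between machines. A quick way to see what a given Python process actually uses is sketched below with the standard `locale` and `os` modules. The function name `describe_locale` is hypothetical, and this is only a diagnostic aid: the excerpt here does not state which setting, if any, was changed to resolve the issue.

```python
import locale
import os


def describe_locale() -> dict:
    """Collect the locale-related settings visible to this Python process."""
    return {
        # Environment variables that commonly select the locale.
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        # Calling setlocale() without a second argument only queries the value.
        "ctype_locale": locale.setlocale(locale.LC_CTYPE),
        # Encoding Python would pick for text I/O, without changing the locale.
        "preferred_encoding": locale.getpreferredencoding(False),
    }


for key, value in describe_locale().items():
    print(f"{key}: {value}")
```

Running this on two machines, for example one where the French run works and one where it fails, makes it easy to compare their settings side by side.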

@sirfz Yes it solved the problem thanks for your help 😉