tesseract: LSTM training process broken with new unicharset_extractor
tesseract 4.00.00dev-658-g3493785-2149 leptonica-1.74.4 libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8
Found AVX Found SSE
Error while extracting unicharset
=== Phase UP: Generating unicharset and unichar properties files ===
[Sat Sep 9 16:08:44 DST 2017] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.4s8doNdbQW/san_latn/san_latn.unicharset --norm_mode 1 /tmp/tmp.4s8doNdbQW/san_latn/san_latn.Arial_Unicode_MS.exp0.box /tmp/tmp.4s8doNdbQW/san_latn/san_latn.FreeSerif.exp0.box /tmp/tmp.4s8doNdbQW/san_latn/san_latn.FreeSerif_Italic.exp0.box /tmp/tmp.4s8doNdbQW/san_latn/san_latn.Sanskrit_2003.exp0.box /tmp/tmp.4s8doNdbQW/san_latn/san_latn.Siddhanta.exp0.box /tmp/tmp.4s8doNdbQW/san_latn/san_latn.Times_New_Roman_Italic.exp0.box
Extracting unicharset from box file /tmp/tmp.4s8doNdbQW/san_latn/san_latn.Arial_Unicode_MS.exp0.box
Invalid Unicode codepoint: 0xffffffc3
IsValidCodepoint(ch):Error:Assert failed:in file normstrngs.cpp, line 225
ERROR: /tmp/tmp.4s8doNdbQW/san_latn/san_latn.unicharset does not exist or is not readable
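A note on the value in the assert: 0xffffffc3 is exactly what results when the single byte 0xC3 is held in a signed 8-bit char and then widened to 32 bits. 0xC3 is also the UTF-8 lead byte for U+00C0 through U+00FF (for example "ñ"), which is plausible in a san_latn box file. Whether that is what actually happens inside normstrngs.cpp is an assumption, not something the log confirms. The Python sketch below only reproduces the arithmetic; `as_signed_char`, `widen_to_uint32` and `is_valid_codepoint` are hypothetical helpers written for illustration, not Tesseract functions.

```python
def as_signed_char(byte: int) -> int:
    """Interpret an 8-bit value the way a signed C char would."""
    return byte - 256 if byte >= 0x80 else byte


def widen_to_uint32(value: int) -> int:
    """Sign-extend to 32 bits, then reinterpret the bits as unsigned."""
    return value & 0xFFFFFFFF


def is_valid_codepoint(cp: int) -> bool:
    """Hypothetical stand-in check: a Unicode scalar value (no surrogates)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


# "ñ" (U+00F1) is encoded in UTF-8 as the two bytes C3 B1.
utf8_bytes = "ñ".encode("utf-8")
lead_byte = utf8_bytes[0]  # 0xC3

# Treating the raw lead byte as a signed char and widening it gives the
# value seen in the assert message.
bogus = widen_to_uint32(as_signed_char(lead_byte))
print(hex(bogus))                 # 0xffffffc3
print(is_valid_codepoint(bogus))  # False

# Decoding the byte sequence as UTF-8 first yields a valid code point.
decoded = ord(utf8_bytes.decode("utf-8"))
print(hex(decoded))                 # 0xf1
print(is_valid_codepoint(decoded))  # True
```

If this reading is right, the input text itself is fine and the failure would come from validating raw bytes rather than decoded code points.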
About this issue
- State: closed
- Created 7 years ago
- Comments: 52 (7 by maintainers)
The absence of error messages or asserts does not necessarily mean that the new code works. See my comment on the pull request.