tesstrain: The trained language doesn't work on multi-lines

Greetings,

I trained Tesseract from scratch on a dataset of 100k lines (one font type). The training log reported "New best char error = 1.187":

At iteration 43291/86900/86900, Mean rms=0.315%, delta=0.532%, char train=2.044%, word train=3.886%, skip ratio=0%,  wrote checkpoint.

At iteration 43298/87000/87000, Mean rms=0.299%, delta=0.492%, char train=1.935%, word train=3.789%, skip ratio=0%,  wrote checkpoint.

At iteration 43304/87100/87100, Mean rms=0.279%, delta=0.429%, char train=1.802%, word train=3.404%, skip ratio=0%,  wrote checkpoint.

At iteration 43311/87200/87200, Mean rms=0.278%, delta=0.426%, char train=1.814%, word train=3.373%, skip ratio=0%,  wrote checkpoint.

At iteration 43322/87300/87300, Mean rms=0.252%, delta=0.337%, char train=1.187%, word train=2.861%, skip ratio=0%,  New best char error = 1.187Previous test incomplete, skipping test at iteration43224 wrote best model:data/PSC5/checkpoints/PSC51.187_43322.checkpoint wrote checkpoint.

At iteration 43344/87400/87400, Mean rms=0.255%, delta=0.361%, char train=1.272%, word train=2.976%, skip ratio=0%,  New worst char error = 1.272 wrote checkpoint.

At iteration 43356/87500/87500, Mean rms=0.25%, delta=0.329%, char train=1.199%, word train=2.885%, skip ratio=0%,  New worst char error = 1.199 wrote checkpoint.

At iteration 43367/87600/87600, Mean rms=0.278%, delta=0.591%, char train=1.158%, word train=3.084%, skip ratio=0%,  New best char error = 1.158 wrote checkpoint.

At iteration 43377/87700/87700, Mean rms=0.277%, delta=0.553%, char train=1.189%, word train=3.468%, skip ratio=0%,  New worst char error = 1.189 wrote checkpoint.

At iteration 43388/87800/87800, Mean rms=0.291%, delta=0.61%, char train=1.362%, word train=3.604%, skip ratio=0%,  New worst char error = 1.362 wrote checkpoint.

At iteration 43396/87900/87900, Mean rms=0.287%, delta=0.602%, char train=1.338%, word train=3.475%, skip ratio=0%,  New worst char error = 1.338 wrote checkpoint.

At iteration 43413/88000/88000, Mean rms=0.293%, delta=0.595%, char train=1.255%, word train=3.899%, skip ratio=0%,  New worst char error = 1.255 wrote checkpoint.

At iteration 43421/88100/88100, Mean rms=0.303%, delta=0.683%, char train=3.811%, word train=4.078%, skip ratio=0%,  New worst char error = 3.811 wrote checkpoint.

At iteration 43426/88200/88200, Mean rms=0.303%, delta=0.687%, char train=3.804%, word train=4.154%, skip ratio=0%,  New worst char error = 3.804 wrote checkpoint.

At iteration 43431/88300/88300, Mean rms=0.294%, delta=0.671%, char train=3.74%, word train=3.768%, skip ratio=0%,  New worst char error = 3.74 wrote checkpoint.

At iteration 43443/88400/88400, Mean rms=0.271%, delta=0.596%, char train=3.528%, word train=3.262%, skip ratio=0%,  New worst char error = 3.528 wrote checkpoint.

At iteration 43449/88500/88500, Mean rms=0.268%, delta=0.623%, char train=3.563%, word train=3.266%, skip ratio=0%,  New worst char error = 3.563 wrote checkpoint.
.
.
.
At iteration 44578/99800/99800, Mean rms=0.261%, delta=0.519%, char train=6.131%, word train=2.909%, skip ratio=0%,  wrote checkpoint.

At iteration 44586/99900/99900, Mean rms=0.27%, delta=0.516%, char train=6.189%, word train=3.061%, skip ratio=0%,  wrote checkpoint.

At iteration 44600/100000/100000, Mean rms=0.3%, delta=0.633%, char train=8.195%, word train=3.399%, skip ratio=0%,  wrote checkpoint.

I tested the trained language on an image that has 18 lines. I got very bad results:

p p@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@g@

 

@@@@@

p@@@@@@@@@
@@@@@@
p@p@@@@@@@@@@
p@@@@@@@@

None of the extracted text was correct. I then segmented the image into lines and tested each line separately, and about 85% of the characters were recognized correctly. Am I missing a step?

Thank you

About this issue

State: closed
Created 3 years ago
Comments: 17 (3 by maintainers)

Most upvoted comments

The Makefile here uses the --psm 13 option (raw line, use the whole image) to create training data from the gt images. When tesseract recognizes multiple lines (--psm 3 or --psm 6), it crops each line image automatically and passes it to the network with an additional 4px of padding. If the margin (line spacing) in the gt images is significantly larger than what tesseract's automatic crop produces, the trained model may be overfitted to that margin size.

So I think you first need to check what kind of images (the result of line segmentation) are actually being input to the network.
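
As a rough way to run the check described above, the sketch below measures the blank margin above and below the text in a line image and re-crops it to the ink plus 4px of padding. It is only an approximation under stated assumptions: the image is a grayscale list of pixel rows (0 = black, 255 = white), only vertical margins are handled, and the helper names (ink_rows, vertical_margins, crop_with_padding) are hypothetical. Tesseract's real cropping comes from its own layout analysis, not from this code.

```python
def ink_rows(image, threshold=128):
    """Return the indices of rows that contain at least one dark (ink) pixel."""
    return [y for y, row in enumerate(image) if any(p < threshold for p in row)]


def vertical_margins(image, threshold=128):
    """Return (top, bottom) blank margins in pixels, or None if the image is empty."""
    rows = ink_rows(image, threshold)
    if not rows:
        return None
    top = rows[0]
    bottom = len(image) - 1 - rows[-1]
    return top, bottom


def crop_with_padding(image, padding=4, threshold=128, background=255):
    """Crop to the rows containing ink, then add `padding` blank rows on each side.

    The 4px default mirrors the padding mentioned in the comment above; the
    cropping rule itself is an assumption made for illustration.
    """
    rows = ink_rows(image, threshold)
    if not rows:
        return [list(r) for r in image]
    width = len(image[0])
    top_pad = [[background] * width for _ in range(padding)]
    bottom_pad = [[background] * width for _ in range(padding)]
    body = [list(r) for r in image[rows[0]:rows[-1] + 1]]
    return top_pad + body + bottom_pad


# Hypothetical gt line image: 40px tall, with text occupying rows 15..24 only.
gt_image = [[255] * 5 for _ in range(40)]
for y in range(15, 25):
    gt_image[y][2] = 0

margins = vertical_margins(gt_image)
recropped = crop_with_padding(gt_image)
print("gt margins (top, bottom):", margins)
print("gt height:", len(gt_image), "re-cropped height:", len(recropped))
```

If the gt margins come out much larger than the padding, the training images probably look quite different from the crops that --psm 3 or --psm 6 feed to the network.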

nagadomi on Jul 14, 2021

The --ptsize option in tesstrain.py/text2image is not a pixel size. I’m not very familiar with it, but when I check the output in an image viewer, it’s larger than expected. If you pass the --save_box_tiff option to tesstrain.py, the tiff images are saved in --output_dir and you can check them. For any font size, the line image is resized to the network input size before being fed in (so, naturally, the resolution will differ). The network input size is specified at the beginning of the --net_spec option. For example, [1,36,0, .... resizes the height of the image to 36px.
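
To make the resizing arithmetic concrete, here is a small sketch that reads the input height from the start of a --net_spec string and computes the scale factor for a line image of a given size. The spec string and the image size are example values, not the poster's actual settings. Scaling the width by the same factor is an assumption made for illustration, and the helper names are hypothetical.

```python
import re


def input_height(net_spec):
    """Extract the height from the leading '[batch,height,width,...' of a net spec."""
    match = re.match(r"\s*\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)", net_spec)
    if match is None:
        raise ValueError("net_spec does not start with '[batch,height,width'")
    return int(match.group(2))


def resized_size(width, height, target_height):
    """Return (new_width, new_height, scale) for a proportional resize.

    Assumption: the width is scaled by the same factor as the height.
    """
    scale = target_height / height
    return max(1, round(width * scale)), target_height, scale


# Example spec only; the output layer size is a placeholder.
spec = "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx192 O1c1]"
target = input_height(spec)

# Hypothetical line image of 1200x72 pixels.
new_width, new_height, scale = resized_size(1200, 72, target)
print(f"input height {target}px, scale {scale}, resized to {new_width}x{new_height}")
```

A taller source line only changes the scale factor: the network still sees a 36px-high image, so a large blank margin takes height away from the glyphs themselves.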

nagadomi on Jul 15, 2021

@akmalkady you are confusing (text) recognition and (page) segmentation. Tesseract’s recognition (like all modern OCR engines) operates on line images. The CLI and API also have page segmentation (at various levels), but this is not model-driven (trained/neural) but algorithmic (rule-based).

So in the most basic use case, you pass a line image to the CLI and set --psm 13 (raw line): this does no segmentation at all. But you can also use --psm 6 (block) with region images or --psm 3 (page) with full-page images. This runs layout analysis, passes the segmented (and cropped) lines to recognition, and finally aggregates the results into the output for that page.

In the training phase, segmentation (for obvious reasons) is not used, so you are effectively in PSM 13.
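
The mapping between input type and page segmentation mode can be sketched as a small helper that builds the tesseract command line. The PSM numbers come from the comment above. The model name PSC5 is inferred from the checkpoint path in the log, and the image file names and tessdata directory are hypothetical. The helper only builds the argument list and does not run tesseract.

```python
# Page segmentation modes named in the comment above.
PSM_BY_INPUT = {
    "line": 13,   # raw line: no segmentation, the whole image is one text line
    "block": 6,   # region image: a single uniform block of text
    "page": 3,    # full page: automatic page segmentation
}


def build_tesseract_command(image_path, lang, input_kind, tessdata_dir=None):
    """Return the argv list for recognizing `image_path`, printing text to stdout."""
    psm = PSM_BY_INPUT[input_kind]
    command = ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]
    if tessdata_dir is not None:
        command += ["--tessdata-dir", tessdata_dir]
    return command


# Hypothetical file names; "PSC5" is assumed to be the trained model's name.
line_cmd = build_tesseract_command("line_01.png", "PSC5", "line")
page_cmd = build_tesseract_command("page_18_lines.png", "PSC5", "page", "data")
print(" ".join(line_cmd))
print(" ".join(page_cmd))
```

Running the same page with "block" and "page" and comparing against the per-line results is one way to tell whether the problem lies in segmentation or in recognition.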

bertsky on Jul 13, 2021