fscrawler: OCR settings being ignored

Describe the bug

Although OCR is disabled in my job file, fscrawler still runs tesseract on the PDF files it encounters while scanning. Furthermore, OCR is done with the default settings instead of the settings provided in the job file:

root     29857  6120  0 09:46 pts/3    00:00:00 tesseract /tmp/apache-tika-6540379283841536434.tmp /tmp/apache-tika-87294301792944645.tmp -l eng --psm 1 -c page_separator= -c preserve_interword_spaces=0 txt

The job file looks like this:

ocr:
  enabled: false
  language: "de"
  pdf_strategy: "no_ocr"

Versions:

fscrawler-es6-2.7-20190528.081956-40
CentOS Linux 7.6

About this issue

State: closed
Created 5 years ago
Comments: 21 (10 by maintainers)

Most upvoted comments

So… I don’t know what has changed in my setup, but since I rebuilt my system, fscrawler behaves as it is supposed to. I think this issue can be closed.

budachst on Jun 17, 2019

Yeah, but I don’t see how you disable the OCR parser in line 86. It looks like the opposite to me, actually. Just for fun, I also renamed the tesseract binary so that Tika can’t find it anymore. This, of course, effectively disables tesseract. I will prepare a little test with mixed office documents and see if I can confirm my assumptions. If I can, I will get you some documents for your integration tests.

budachst on Jun 6, 2019

I can try that, but I wonder what Tika will do if it cannot find tesseract at that location. Will it just continue indexing the document it is currently processing, or will it simply fail and return nothing?

If fscrawler sets up Tika once for the entirety of the job, then you should be able to disable Tika’s internal OCR using this setup:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>

I also think that this explains why Tika runs tesseract with the default language instead of the language set in the fscrawler job file. At least, this would make sense to me.
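As a rough sanity check, the Tika config above can be inspected before it is handed to Tika, to confirm that it really lists the Tesseract parser as excluded. The sketch below uses only Python's standard library. The helper names (`excluded_parsers`, `ocr_is_excluded`) are made up for illustration and are not part of fscrawler or Tika. It only checks the XML text; it does not prove that Tika or fscrawler actually loads the file.

```python
import xml.etree.ElementTree as ET

# Class name taken from the config posted in this thread.
TESSERACT = "org.apache.tika.parser.ocr.TesseractOCRParser"

# The Tika config from the comment above, embedded so the sketch is self-contained.
TIKA_CONFIG = """<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
"""


def excluded_parsers(config_xml):
    """Return the set of class names listed in <parser-exclude> elements."""
    # Encode to bytes so the XML declaration's encoding attribute is honoured.
    root = ET.fromstring(config_xml.encode("utf-8"))
    return {el.get("class") for el in root.iter("parser-exclude")}


def ocr_is_excluded(config_xml):
    """True if the config explicitly excludes the Tesseract OCR parser."""
    return TESSERACT in excluded_parsers(config_xml)


if __name__ == "__main__":
    print("Tesseract excluded:", ocr_is_excluded(TIKA_CONFIG))
```

If the check passes and tesseract processes still show up in `ps` during a crawl, the config file is probably not the one Tika is actually using.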

budachst on Jun 6, 2019