fscrawler: OCR settings being ignored
Describe the bug
Although I have OCR disabled in my job options file, fscrawler still runs tesseract on the PDF files it encounters while scanning. Furthermore, OCR is done with the default settings instead of the settings provided in the job file:
root 29857 6120 0 09:46 pts/3 00:00:00 tesseract /tmp/apache-tika-6540379283841536434.tmp /tmp/apache-tika-87294301792944645.tmp -l eng --psm 1 -c page_separator= -c preserve_interword_spaces=0 txt
The job file looks like this:
ocr:
  enabled: false
  language: "de"
  pdf_strategy: "no_ocr"
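To make the mismatch concrete, here is a minimal Python sketch that pulls the `-l` (language) value out of the tesseract command line shown in the process listing and compares it with the job file setting. The helper name `tesseract_language` and the variable names are hypothetical and only serve this illustration; the command line and the configured value are copied from this report.

```python
import shlex


def tesseract_language(cmdline):
    """Return the value passed to tesseract's -l flag, or None if it is absent."""
    args = shlex.split(cmdline)
    for i, arg in enumerate(args[:-1]):
        if arg == "-l":
            return args[i + 1]
    return None


# Command line copied from the process listing in this report.
observed = (
    "tesseract /tmp/apache-tika-6540379283841536434.tmp "
    "/tmp/apache-tika-87294301792944645.tmp -l eng --psm 1 "
    "-c page_separator= -c preserve_interword_spaces=0 txt"
)

# Value from the job file in this report.
configured_language = "de"

language = tesseract_language(observed)
if language != configured_language:
    print(f"tesseract runs with -l {language}, job file says {configured_language}")
```

If the job file settings had reached tesseract, the two values would match, or no tesseract process would be running at all, since OCR is disabled.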
Versions:
- fscrawler-es6-2.7-20190528.081956-40
- CentOS Linux 7.6
About this issue
- State: closed
- Created 5 years ago
- Comments: 21 (10 by maintainers)
So… I don’t know what has changed in my setup, but since I rebuilt my system, fscrawler behaves as it is supposed to. I think this issue can be closed.
Yeah, but I don’t see how you disable the OCR parser in line 86. It looks like the opposite to me, actually. Just for fun, I have also renamed the tesseract binary so that Tika can’t find it anymore, which of course effectively disables tesseract. I will prepare a little test with mixed office documents and see if I can confirm my assumptions. If I can, I will get you some documents for your integration tests.
I can try that, but I wonder what Tika will do if it cannot find tesseract at that location. Will it just continue indexing the document it is currently processing, or will it simply fail and return nothing?
If fscrawler sets up Tika once for the entirety of the job, then you should be able to disable Tika’s internal OCR using this setup:
I also think that this explains why Tika runs tesseract with the default language instead of the language set in the fscrawler job file. At least, this would make sense to me.