fscrawler: FSCrawler can't index .doc or .docx elements
Describe the bug
Whenever I try to index .doc or .docx files I get a warning and the files don’t get indexed.
07:57:41,337 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [C:\ELK\temp\es\LBR\2016\test.docx] -> org/apache/poi/hemf/extractor/HemfExtractor
Everything works fine with .pdf documents, so I expected the same with Word documents.
Versions:
- OS: Windows 10
- Elasticsearch Version 7.5.2
- FSCrawler Version 2.7
EDIT:
So I recreated a .docx file with a few sentences and it worked. So what does the above error mean?
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 27 (11 by maintainers)
I guess that you would need to change some libs in the FSCrawler lib dir, or revert https://github.com/dadoonet/fscrawler/pull/855 and compile the project again.
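Before swapping libs, it may help to check whether any jar in the lib dir actually provides the class named in the warning. The following is only a rough sketch: it assumes the warning points at a class missing from the classpath (the log line alone does not confirm that), and the function name `jars_containing` and the directory argument are made up for illustration.

```python
import zipfile
from pathlib import Path


def jars_containing(lib_dir, class_entry):
    """Return the names of the jars under lib_dir that contain class_entry.

    class_entry is the path of the class inside the jar, for example
    "org/apache/poi/hemf/extractor/HemfExtractor.class".
    """
    matches = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if class_entry in archive.namelist():
                    matches.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that are not valid zip archives.
            continue
    return matches


# Example call (the path is an assumption, adjust it to your install):
# jars_containing(r"C:\ELK\fscrawler\lib",
#                 "org/apache/poi/hemf/extractor/HemfExtractor.class")
```

An empty result would suggest that the jars shipped in the lib dir do not include that class, which would be consistent with a library version mismatch.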
Thank you for the file. That’s definitely a bug in FSCrawler which has been introduced by #855
To fix it, I “just” need to pull in this PR: #865 but there’s still “a blocker” in that one as I have seen a regression. I need to revisit it at some point.
Oh damn, I thought it was a typo and sent it to elastic.com 🤦‍♂️
Maybe it’s because only the metadata is indexed for the .docx file, not the content itself. Could you share the JSON document that has been generated for the .docx file?
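One possible way to fetch that JSON document and see whether it has any extracted text is sketched below. It is not an official FSCrawler tool: the field names `file.filename` and `content` are assumed from FSCrawler's default mapping, and the base URL and index name passed to `search` are placeholders you would replace with your own.

```python
import json
import urllib.request


def build_query(filename):
    """Build a search body matching documents by file name.

    Assumes the file name is stored under "file.filename".
    """
    return {"query": {"match": {"file.filename": filename}}}


def content_lengths(response):
    """Map each hit's _id to the length of its "content" field (0 if absent)."""
    hits = response.get("hits", {}).get("hits", [])
    return {
        hit["_id"]: len(hit.get("_source", {}).get("content") or "")
        for hit in hits
    }


def search(base_url, index, filename):
    """POST the query to <base_url>/<index>/_search and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(build_query(filename)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Example call (URL and index name are placeholders):
# content_lengths(search("http://localhost:9200", "my_job", "test.docx"))
```

A length of 0 for the .docx hit would mean the document was indexed with metadata only, which is the case described above.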