fscrawler: FSCrawler can't index .doc or .docx files

Describe the bug

Whenever I try to index .doc or .docx files, I get a warning and the files don’t get indexed.

07:57:41,337 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [C:\ELK\temp\es\LBR\2016\test.docx] -> org/apache/poi/hemf/extractor/HemfExtractor

It all works fine with .pdf documents, so I expected the same for Word documents.
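For context, the `TikaDocParser` named in the warning relies on Apache Tika for text extraction, so the failure can usually be reproduced outside FSCrawler by pointing Tika at the same file. The snippet below is only a rough stand-in, not FSCrawler's actual code; it assumes the Tika jars and the same POI jars that ship in the FSCrawler lib directory are on the classpath.

```java
import java.io.File;
import org.apache.tika.Tika;

public class ReproduceDocxFailure {
    public static void main(String[] args) throws Exception {
        // Path taken from the warning above; adjust as needed.
        File docx = new File("C:\\ELK\\temp\\es\\LBR\\2016\\test.docx");

        try {
            // Tika auto-detects the format and extracts the plain text.
            String text = new Tika().parseToString(docx);
            System.out.println("Extracted " + text.length() + " characters");
        } catch (NoClassDefFoundError e) {
            // A slash-separated class name such as
            // org/apache/poi/hemf/extractor/HemfExtractor usually means a
            // POI class is missing from (or incompatible with) the classpath.
            System.err.println("Missing POI class: " + e.getMessage());
        }
    }
}
```

If the same error shows up here, the problem is most likely with the POI/Tika jars on the classpath rather than with the document itself.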

Versions:

  • OS: Windows 10
  • Elasticsearch Version 7.5.2
  • FSCrawler Version 2.7

EDIT:

So I recreated a .docx file with a few sentences and it worked. So what does the above error mean?

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 27 (11 by maintainers)

Most upvoted comments

I guess you would need to change some libs in the FSCrawler lib dir, or revert this https://github.com/dadoonet/fscrawler/pull/855 and compile the project again.
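To check whether the jars currently sitting in the lib dir actually provide the class from the warning, a small diagnostic like the following can be run with those jars on the classpath. This is just a troubleshooting sketch, not part of FSCrawler.

```java
import java.security.CodeSource;

public class CheckPoiClass {
    public static void main(String[] args) {
        // Dotted form of the slash-separated name from the FSCrawler warning.
        String className = "org.apache.poi.hemf.extractor.HemfExtractor";
        try {
            Class<?> clazz = Class.forName(className);
            CodeSource src = clazz.getProtectionDomain().getCodeSource();
            System.out.println("Found " + className
                    + (src != null ? " in " + src.getLocation() : ""));
        } catch (ClassNotFoundException e) {
            System.out.println(className + " is not provided by any jar on the classpath");
        }
    }
}
```

Run it with the FSCrawler lib directory on the classpath, e.g. `java -cp "lib/*;." CheckPoiClass` on Windows (directory layout assumed). If the class cannot be loaded, swapping the POI jars or reverting the PR and rebuilding, as suggested above, is the workaround.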

Thank you for the file. That’s definitely a bug in FSCrawler that was introduced by #855.

To fix it, I “just” need to pull in this PR: #865, but there’s still “a blocker” in that one, as I have seen a regression. I need to revisit it at some point.

Oh damn, I thought it was a typo and sent it to elastic.com 🤦‍♂

Maybe it’s because only the metadata are indexed for the .docx file, not the content itself. Could you share the JSON document that has been generated for the .docx file?
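For reference, the JSON document FSCrawler generates looks roughly like the sketch below (field names approximate, values illustrative); the point to check is whether `content` is empty while `meta` and `file` are populated:

```json
{
  "content": "",
  "meta": {
    "title": "…",
    "author": "…"
  },
  "file": {
    "filename": "test.docx",
    "extension": "docx",
    "indexing_date": "…"
  }
}
```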