fscrawler: FSCrawler can't index .doc or .docx files

Describe the bug

Whenever I try to index .doc or .docx files, I get a warning and the files don’t get indexed.

07:57:41,337 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [C:\ELK\temp\es\LBR\2016\test.docx] -> org/apache/poi/hemf/extractor/HemfExtractor

It all works fine with .pdf documents, so I expected the same for Word documents.
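For context, the `TikaDocParser` named in the warning relies on Apache Tika for text extraction, so the failure can usually be reproduced outside FSCrawler by pointing Tika at the same file. The snippet below is only a rough stand-in, not FSCrawler's actual code; it assumes the Tika jars and the same POI jars that ship in the FSCrawler lib directory are on the classpath.

```java
import java.io.File;
import org.apache.tika.Tika;

public class ReproduceDocxFailure {
    public static void main(String[] args) throws Exception {
        // Path taken from the warning above; adjust as needed.
        File docx = new File("C:\\ELK\\temp\\es\\LBR\\2016\\test.docx");

        try {
            // Tika auto-detects the format and extracts the plain text.
            String text = new Tika().parseToString(docx);
            System.out.println("Extracted " + text.length() + " characters");
        } catch (NoClassDefFoundError e) {
            // A slash-separated class name such as
            // org/apache/poi/hemf/extractor/HemfExtractor usually means a
            // POI class is missing from (or incompatible with) the classpath.
            System.err.println("Missing POI class: " + e.getMessage());
        }
    }
}
```

If the same error shows up here, the problem is most likely with the POI/Tika jars on the classpath rather than with the document itself.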

Versions:

  • OS: Windows 10
  • Elasticsearch Version 7.5.2
  • FSCrawler Version 2.7

EDIT:

So I recreated a .docx file with a few sentences and it worked. So what does the above error mean?

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 27 (11 by maintainers)

Most upvoted comments

I guess you would need to change some libs in the FSCrawler lib dir, or revert this https://github.com/dadoonet/fscrawler/pull/855 and compile the project again.
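To check whether the jars currently sitting in the lib dir actually provide the class from the warning, a small diagnostic like the following can be run with those jars on the classpath. This is just a troubleshooting sketch, not part of FSCrawler.

```java
import java.security.CodeSource;

public class CheckPoiClass {
    public static void main(String[] args) {
        // Dotted form of the slash-separated name from the FSCrawler warning.
        String className = "org.apache.poi.hemf.extractor.HemfExtractor";
        try {
            Class<?> clazz = Class.forName(className);
            CodeSource src = clazz.getProtectionDomain().getCodeSource();
            System.out.println("Found " + className
                    + (src != null ? " in " + src.getLocation() : ""));
        } catch (ClassNotFoundException e) {
            System.out.println(className + " is not provided by any jar on the classpath");
        }
    }
}
```

Run it with the FSCrawler lib directory on the classpath, e.g. `java -cp "lib/*;." CheckPoiClass` on Windows (directory layout assumed). If the class cannot be loaded, swapping the POI jars or reverting the PR and rebuilding, as suggested above, is the workaround.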

Thank you for the file. That’s definitely a bug in FSCrawler that was introduced by #855.

To fix it, I “just” need to pull in this PR: #865, but there’s still “a blocker” in that one, as I have seen a regression. I need to revisit it at some point.

Oh damn, I thought it was a typo and sent it to elastic.com 🤦‍♂

Maybe it’s because only the metadata are indexed for the .docx file, not the content itself. Could you share the JSON document that has been generated for the .docx file?
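For reference, the JSON document FSCrawler generates looks roughly like the sketch below (field names approximate, values illustrative); the point to check is whether `content` is empty while `meta` and `file` are populated:

```json
{
  "content": "",
  "meta": {
    "title": "…",
    "author": "…"
  },
  "file": {
    "filename": "test.docx",
    "extension": "docx",
    "indexing_date": "…"
  }
}
```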