quarkus: Native-Image: Newly created project with Tika extension can't extract anything

Describe the bug I tried creating a Tika project following the Quarkus documentation. I think any PDF would produce the same result, but here’s the one I tried: americanexpress_01.pdf

Expected behavior To give a little context, I should be able to extract text from any document Tika supports, so I simply need the whole Tika library bundled in the native application, which doesn’t seem to be what is happening right now.

Actual behavior With the documentation’s example, when doing the curl POST with the PDF above, I get:

Error: Could not find referenced cmap stream Identity-H

Looking at the stack trace, I realise that this resource, which belongs to PDFBox (a Tika dependency), is missing from the bundled native image, and the same seems to apply to all other Tika dependencies…

Configuration

quarkus.package.type=native
quarkus.package.uber-jar=true
quarkus.log.console.enable=false
quarkus.native.add-all-charsets=true
quarkus.native.additional-build-args=-H:ReflectionConfigurationFiles=reflection-config.json,-H:ResourceConfigurationFiles=resources-config.json 

What I tried As a quick test, I added a resources-config.json to the project:

{
  "resources": [
    {
      "pattern": ".*"
    }
  ]
}

and now success: the native app can extract the text of my file (yay). But trying different PDFs led to different errors, and this time they were runtime errors.
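To narrow down which resources are actually missing, rather than including everything with `.*`, a small probe along these lines could be run inside the native binary at startup. This is only a sketch: `ResourceProbe` is a hypothetical helper (not part of Quarkus or Tika), and the two resource paths are my assumptions about where FontBox and PDFBox keep these files, so they may need adjusting for the versions actually on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports which classpath resources cannot be found.
final class ResourceProbe {

    private ResourceProbe() {}

    /** Returns the subset of the given paths that is NOT visible on the classpath. */
    static List<String> missing(List<String> paths) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        List<String> notFound = new ArrayList<>();
        for (String path : paths) {
            // getResource returns null when the resource is not available
            if (loader.getResource(path) == null) {
                notFound.add(path);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        // Assumed resource locations; verify them against the jars in use.
        List<String> candidates = List.of(
                "org/apache/fontbox/cmap/Identity-H",
                "org/apache/pdfbox/resources/glyphlist/glyphlist.txt");
        for (String path : missing(candidates)) {
            System.out.println("Missing resource: " + path);
        }
    }
}
```

Running the same probe on the JVM and in the native binary should show which entries only disappear in the native build, which can then be listed with narrower patterns in resources-config.json.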

So, considering I followed the documentation properly and the native-image build finishes without any error, I’m really wondering what I’m doing wrong. This is supposed to parse any file and it doesn’t, because a lot of dependencies seem to be missing from the native image.

Is the Quarkus Tika extension really supposed to be usable as is, or do I have to create an extension with everything I need myself, as done in TikaProcessor.java (which doesn’t seem to include many things)?

For my use case, is there a way I can include everything from Tika?

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 18 (9 by maintainers)

Most upvoted comments

You’re right, the link seems dead. There is this Medium post that explains the base configuration well.

Otherwise, Oracle has more stable documentation here