quarkus: Apache Tika cannot parse Microsoft DOCX format in native mode

A test project that reproduces the problem is available here.

Steps to reproduce:

  • create a native executable: ./mvnw package -Pnative
  • start the binary: ./target/otaibe-apache-tika-docx-native-1.0-SNAPSHOT-runner
  • call the service
    • Option 1 : curl -v -H "Content-Type: application/octet-stream" -X POST --data-binary @src/test/resources/test_bg.docx http://localhost:11025/parse
    • Option 2 : mvn package -D%test.service.http.port=11025
  • the running binary logs the following exception:
2020-01-14 14:43:40,589 ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (executor-thread-1) HTTP Request to /parse failed, error id: 7eca2481-63eb-44e0-8c4c-4d57968f69ec-1: org.jboss.resteasy.spi.UnhandledException: org.apache.xerces.parsers.ObjectFactory$ConfigurationError: Provider org.apache.xerces.parsers.XIncludeAwareParserConfiguration not found
        at org.jboss.resteasy.core.ExceptionHandler.handleApplicationException(ExceptionHandler.java:106)
        at org.jboss.resteasy.core.ExceptionHandler.handleException(ExceptionHandler.java:372)
        at org.jboss.resteasy.core.SynchronousDispatcher.writeException(SynchronousDispatcher.java:209)
        at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:496)
        at org.jboss.resteasy.core.SynchronousDispatcher.lambda$invoke$4(SynchronousDispatcher.java:252)
        at org.jboss.resteasy.core.SynchronousDispatcher.lambda$preprocess$0(SynchronousDispatcher.java:153)
        at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:363)
        at org.jboss.resteasy.core.SynchronousDispatcher.preprocess(SynchronousDispatcher.java:156)
        at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:238)
        at io.quarkus.resteasy.runtime.standalone.RequestDispatcher.service(RequestDispatcher.java:73)
        at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler.dispatch(VertxRequestHandler.java:120)
        at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler.access$000(VertxRequestHandler.java:36)
        at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler$1.run(VertxRequestHandler.java:85)
        at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
        at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:2011)
        at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1535)
        at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1426)
        at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
        at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
        at java.lang.Thread.run(Thread.java:748)
        at org.jboss.threads.JBossThread.run(JBossThread.java:479)
        at com.oracle.svm.core.thread.JavaThreads.threadStartRoutine(JavaThreads.java:460)
        at com.oracle.svm.core.posix.thread.PosixJavaThreads.pthreadStartRoutine(PosixJavaThreads.java:193)
Caused by: org.apache.xerces.parsers.ObjectFactory$ConfigurationError: Provider org.apache.xerces.parsers.XIncludeAwareParserConfiguration not found
        at org.apache.xerces.parsers.ObjectFactory.newInstance(Unknown Source)
        at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
        at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
        at org.apache.xerces.parsers.DOMParser.<init>(Unknown Source)
        at org.apache.xerces.parsers.DOMParser.<init>(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.<init>(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderFactoryImpl.newDocumentBuilder(Unknown Source)
        at org.apache.poi.ooxml.util.DocumentHelper.newDocumentBuilder(DocumentHelper.java:91)
        at org.apache.poi.ooxml.util.DocumentHelper.readDocument(DocumentHelper.java:165)
        at org.apache.poi.openxml4j.opc.internal.ContentTypeManager.parseContentTypesFile(ContentTypeManager.java:392)
        at org.apache.poi.openxml4j.opc.internal.ContentTypeManager.<init>(ContentTypeManager.java:104)
        at org.apache.poi.openxml4j.opc.internal.ZipContentTypeManager.<init>(ZipContentTypeManager.java:54)
        at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:258)
        at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:721)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:302)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:110)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at io.quarkus.tika.TikaParser.parseStream(TikaParser.java:85)
        at io.quarkus.tika.TikaParser.getMetadata(TikaParser.java:68)
        at io.quarkus.tika.TikaParser.getMetadata(TikaParser.java:64)
        at org.otaibe.apache.tika.docx.nerror.TikaParserResource.getContentType(TikaParserResource.java:52)
        at org.otaibe.apache.tika.docx.nerror.TikaParserResource.hello(TikaParserResource.java:38)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:151)
        at org.jboss.resteasy.core.MethodInjectorImpl.lambda$invoke$3(MethodInjectorImpl.java:122)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
        at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
        at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
        at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:110)
        at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:122)
        at org.jboss.resteasy.core.ResourceMethodInvoker.internalInvokeOnTarget(ResourceMethodInvoker.java:594)
        at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTargetAfterFilter(ResourceMethodInvoker.java:468)
        at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invokeOnTarget$2(ResourceMethodInvoker.java:421)
        at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:363)
        at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTarget(ResourceMethodInvoker.java:423)
        at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:391)
        at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invoke$1(ResourceMethodInvoker.java:365)
        at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)
        at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)
        at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:110)
        at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:365)
        at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:477)
        ... 19 more
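
For readers wondering why this only fails in native mode: the "Caused by" section suggests that Xerces instantiates its parser configuration by class name at runtime. On the JVM that lookup succeeds as long as the jar is on the classpath. In a native image, a class that is only reached by name is typically not included unless it is registered for reflection at build time. The sketch below is a plain-JDK illustration of that lookup pattern, not the Xerces code itself; the class `ReflectiveLookup` and the method `instantiateByName` are hypothetical names used only here.

```java
import java.util.Optional;

public class ReflectiveLookup {

    // Tries to load a class by name and call its no-arg constructor,
    // which is the general pattern behind a "Provider ... not found" error.
    static Optional<Object> instantiateByName(String className) {
        try {
            Class<?> type = Class.forName(className);
            return Optional.of(type.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException | LinkageError e) {
            // Class missing, or no accessible no-arg constructor.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // A JDK class is always resolvable by name on the JVM.
        System.out.println(instantiateByName("java.util.ArrayList").isPresent());
        // The class from the stack trace resolves only if Xerces is on the classpath
        // (and, in a native image, only if it was registered for reflection).
        System.out.println(instantiateByName(
                "org.apache.xerces.parsers.XIncludeAwareParserConfiguration").isPresent());
    }
}
```

Running this on the JVM with and without Xerces on the classpath shows the same two outcomes that the JVM build and the native build produce in this issue.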

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 62 (34 by maintainers)

Most upvoted comments

Hi guys, is there a date when this will be fixed? I am planning to use Apache Tika with Quarkus in a microservice environment, and this bug is preventing the deployment of our stack.

Thank you, @sberyozkin. I will let you know about the progress.

@sberyozkin: please ignore the parts of my previous message about dependencies and about a dedicated call to be more productive. @geoand helped me a lot with the 'productivity' setup. Thank you for that, @geoand! The java.lang.NoSuchMethodException: org.apache.poi.xwpf.usermodel.XWPFSettings.<init>(org.apache.poi.openxml4j.opc.PackagePart) is fixed now. The next challenge is:

Caused by: java.util.MissingResourceException: Resource bundle not found org.apache.xerces.impl.msg.SAXMessages. Register the resource bundle using the option -H:IncludeResourceBundles=org.apache.xerces.impl.msg.SAXMessages.
        at com.oracle.svm.core.jdk.LocalizationSupport.getCached(LocalizationSupport.java:66)
        at java.util.ResourceBundle.getBundle(ResourceBundle.java:63)
        at org.apache.xerces.util.SAXMessageFormatter.formatMessage(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.getProperty(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.setProperty(Unknown Source)
        at org.apache.xmlbeans.impl.common.SAXHelper.trySetXercesSecurityManager(SAXHelper.java:119)
        at org.apache.xmlbeans.impl.common.SAXHelper.newXMLReader(SAXHelper.java:49)
        at org.apache.xmlbeans.impl.store.Locale.getSaxLoader(Locale.java:3055)
        ... 57 more
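
As the message above says, the bundle has to be registered with -H:IncludeResourceBundles=org.apache.xerces.impl.msg.SAXMessages when the native image is built. In a Quarkus project, one way to pass that flag is probably the quarkus.native.additional-build-args property, but that is an assumption worth checking against the Quarkus version in use. To see whether a given bundle resolves at all on the JVM before retrying the native build, a small check like the following can help. It is only a sketch with hypothetical names (`BundleCheck`, `bundleAvailable`), using nothing beyond the JDK.

```java
import java.util.MissingResourceException;
import java.util.ResourceBundle;

public class BundleCheck {

    // Returns true if the resource bundle with the given base name can be loaded.
    static boolean bundleAvailable(String baseName) {
        try {
            ResourceBundle.getBundle(baseName);
            return true;
        } catch (MissingResourceException e) {
            // Same exception type as in the trace above.
            return false;
        }
    }

    public static void main(String[] args) {
        // Resolves only if Xerces is on the classpath (and, in a native image,
        // only if the bundle was included at build time).
        System.out.println(bundleAvailable("org.apache.xerces.impl.msg.SAXMessages"));
    }
}
```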

@geoand, @gsmet, @sberyozkin: I can try to do it.

@tpenakov You can find some information at: https://quarkus.io/guides/writing-native-applications-tips

The more in-depth information, however, can be found here: https://quarkus.io/guides/writing-extensions.

The people on the Quarkus team would be glad to help you out should you decide to take this on.