OpenRefine: P304 as reference breaks the Wikidata schema

Hi:

I’m using OpenRefine v3.4 on Linux to upload data to Wikidata.

I found that using P304 in a reference breaks the schema: the UI doesn’t give any clear feedback, and the logs show this:

21:18:01.030 [                  command] Exception caught (15ms)
java.util.regex.PatternSyntaxException: Unknown inline modifier near index 6
(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?)([-–](\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?))?(, ?(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?)([-–](\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?))?)*
      ^
	at java.util.regex.Pattern.error(Pattern.java:1969)
	at java.util.regex.Pattern.group0(Pattern.java:2908)
	at java.util.regex.Pattern.sequence(Pattern.java:2065)
	at java.util.regex.Pattern.expr(Pattern.java:2010)
	at java.util.regex.Pattern.group0(Pattern.java:2919)
	at java.util.regex.Pattern.sequence(Pattern.java:2065)
	at java.util.regex.Pattern.expr(Pattern.java:2010)
	at java.util.regex.Pattern.compile(Pattern.java:1702)
	at java.util.regex.Pattern.<init>(Pattern.java:1352)
	at java.util.regex.Pattern.compile(Pattern.java:1028)
	at org.openrefine.wikidata.qa.scrutinizers.FormatScrutinizer.getPattern(FormatScrutinizer.java:68)
	at org.openrefine.wikidata.qa.scrutinizers.FormatScrutinizer.scrutinize(FormatScrutinizer.java:80)
	at org.openrefine.wikidata.qa.scrutinizers.SnakScrutinizer.scrutinizeSnakSet(SnakScrutinizer.java:71)
	at org.openrefine.wikidata.qa.scrutinizers.SnakScrutinizer.scrutinize(SnakScrutinizer.java:64)
	at org.openrefine.wikidata.qa.scrutinizers.StatementScrutinizer.scrutinize(StatementScrutinizer.java:36)
	at org.openrefine.wikidata.qa.EditInspector.inspect(EditInspector.java:107)
	at org.openrefine.wikidata.commands.PreviewWikibaseSchemaCommand.doPost(PreviewWikibaseSchemaCommand.java:92)
	at com.google.refine.RefineServlet.service(RefineServlet.java:187)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1166)
	at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:81)
	at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:132)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
	at org.mortbay.jetty.Server.handle(Server.java:326)
	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
	at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:938)
	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:755)
	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
	at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I’m not sure whether this happens with every value for P304. In my case the exact value was 69-91.

An example of the reference I’m trying to create is exactly the one for P31 at Q100407249.

Without using P304 everything seems to work as expected.

Hope this helps.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 31 (18 by maintainers)

Most upvoted comments

Please move the Wikidata discussion somewhere more appropriate. Let’s keep this issue focused on the solution to handling illegal regular expressions in OpenRefine, which, as @wetneb said, is as simple as adding a try/catch block, hence the good first issue label. All this extraneous discussion is just going to confuse or discourage newcomers who are trying to figure out what they need to code for a pull request.

This issue is really easy to solve as far as OpenRefine is concerned: just by ignoring regexes which our regex engine cannot parse.
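
A minimal sketch of that idea, for readers wondering what "ignoring" could look like in code. This is not OpenRefine's actual FormatScrutinizer code; the class name `SafePattern` and the method `tryCompile` are hypothetical. It only relies on `java.util.regex.Pattern.compile` throwing `PatternSyntaxException` for a pattern it cannot parse, as the stack trace above shows.

```java
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not part of OpenRefine.
final class SafePattern {
    private SafePattern() {}

    /**
     * Compiles a regular expression that comes from an external source.
     * Returns an empty Optional instead of throwing when java.util.regex
     * cannot parse it, so the caller can skip the check.
     */
    static Optional<Pattern> tryCompile(String regex) {
        try {
            return Optional.of(Pattern.compile(regex));
        } catch (PatternSyntaxException e) {
            // The constraint is maintained on Wikidata, not by the user,
            // so the unparsable pattern is ignored rather than propagated.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // "(?[" is rejected by java.util.regex, as in the reported stack trace.
        System.out.println(tryCompile("\\d+(?[A-Za-z]+)?").isPresent());
        // The same group written with "(?:" compiles.
        System.out.println(tryCompile("\\d+(?:[A-Za-z]+)?").isPresent());
    }
}
```

A caller would then run the format check only when the returned Optional is present, and do nothing for that constraint otherwise.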

So if I understand correctly, due to limitations in OpenRefine this problem can’t be fixed unless the project switches to a different regex engine?

No, I don’t think that’s the case. As @wetneb mentions right at the top of this thread:

Looking at the stack trace that’s a clear bug: since these regular expressions are user-supplied, we should make sure that they are ignored if they are invalid. They should be ignored rather than reported to the user since the user is not responsible for maintaining Wikidata constraints.

While it might also be appropriate for OpenRefine to look at alternative approaches for regular expression parsing, I think the key thing in this case is for there to be a graceful failure when encountering regular expressions that can’t be parsed for some reason. This should (in my opinion) include telling the user that the condition can’t be checked, but should not prevent the user from uploading the data in this case. Note that there is a difference between data that definitely doesn’t match a regular expression and data that cannot be checked against a regular expression; I’d suggest we are strict with the former and relaxed with the latter.
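
The distinction drawn above between "doesn’t match" and "cannot be checked" could be sketched as a three-valued result. This is only an illustration under assumptions: the class `FormatCheck`, the enum `Outcome` and its constant names are hypothetical and do not come from OpenRefine.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical sketch, not OpenRefine code.
final class FormatCheck {
    private FormatCheck() {}

    /** Result of checking one value against one format constraint. */
    enum Outcome {
        MATCHES,     // the value satisfies the constraint
        VIOLATES,    // the value definitely does not match: be strict
        UNCHECKABLE  // the regex could not be parsed: warn, but do not block
    }

    static Outcome check(String regex, String value) {
        try {
            Pattern pattern = Pattern.compile(regex);
            // matches() requires the whole value to match the pattern.
            return pattern.matcher(value).matches() ? Outcome.MATCHES : Outcome.VIOLATES;
        } catch (PatternSyntaxException e) {
            return Outcome.UNCHECKABLE;
        }
    }

    public static void main(String[] args) {
        System.out.println(check("\\d+(-\\d+)?", "69-91"));
        System.out.println(check("\\d+(-\\d+)?", "sixty-nine"));
        System.out.println(check("\\d+(?[A-Za-z]+)?", "69-91"));
    }
}
```

With such a result, the interface could show a warning for `UNCHECKABLE` while reserving blocking errors for `VIOLATES`.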

Something changed in the support for non-capturing groups: in the past the colon : was not always needed after (?, but now it is.

This may have been caused by a more recent change in the regexp engine used by Wikidata (now compatible with PCRE 7.7 or higher, and supporting the “PCRE v2” syntax, where the (? can be followed by more flags, notably because it now supports named subroutines and other newer features of PCRE v2).
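
To illustrate the colon point on the Java side, here is a small sketch. The constants `PAGE` and `PAGE_OR_RANGE` are a shortened rewrite of the pattern from the log with every `(?` group written as `(?:`; they are an assumption for illustration, not the constraint as stored on Wikidata.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, not OpenRefine code.
final class NonCapturingGroupDemo {
    private NonCapturingGroupDemo() {}

    // One page token such as "69", "12a" or "A3"; optional parts use (?:...).
    static final String PAGE =
        "(\\d+(?:[A-Za-z]+)?|[A-Za-z]+(?:\\d+(?:[A-Za-z]+)?)?)";

    // A single page, or a range joined by a hyphen or an en dash (U+2013).
    static final String PAGE_OR_RANGE = PAGE + "([-\u2013]" + PAGE + ")?";

    /** Returns true when java.util.regex accepts the pattern. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Without the colon, java.util.regex reports an unknown inline modifier.
        System.out.println(compiles("\\d+(?[A-Za-z]+)?"));
        // With the colon, the pattern compiles and accepts the reported value.
        System.out.println(compiles(PAGE_OR_RANGE));
        System.out.println(Pattern.matches(PAGE_OR_RANGE, "69-91"));
    }
}
```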

I commented on this and fixed it on the P304 talk page.

Just for the record:

The bug has been reported on the Wikidata P304 discussion page.

Which suggests that we should consider using a different regex engine whose features match Wikibase’s better.