OpenRefine: P304 as reference breaks the Wikidata schema
Hi:
I’m using OpenRefine (OR) v3.4 on Linux to upload data to Wikidata.
I found that using P304 breaks the schema: the UI doesn’t provide any clear feedback, and when looking at the logs I got this:
21:18:01.030 [ command] Exception caught (15ms)
java.util.regex.PatternSyntaxException: Unknown inline modifier near index 6
(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?)([-–](\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?))?(, ?(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?)([-–](\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?))?)*
^
at java.util.regex.Pattern.error(Pattern.java:1969)
at java.util.regex.Pattern.group0(Pattern.java:2908)
at java.util.regex.Pattern.sequence(Pattern.java:2065)
at java.util.regex.Pattern.expr(Pattern.java:2010)
at java.util.regex.Pattern.group0(Pattern.java:2919)
at java.util.regex.Pattern.sequence(Pattern.java:2065)
at java.util.regex.Pattern.expr(Pattern.java:2010)
at java.util.regex.Pattern.compile(Pattern.java:1702)
at java.util.regex.Pattern.<init>(Pattern.java:1352)
at java.util.regex.Pattern.compile(Pattern.java:1028)
at org.openrefine.wikidata.qa.scrutinizers.FormatScrutinizer.getPattern(FormatScrutinizer.java:68)
at org.openrefine.wikidata.qa.scrutinizers.FormatScrutinizer.scrutinize(FormatScrutinizer.java:80)
at org.openrefine.wikidata.qa.scrutinizers.SnakScrutinizer.scrutinizeSnakSet(SnakScrutinizer.java:71)
at org.openrefine.wikidata.qa.scrutinizers.SnakScrutinizer.scrutinize(SnakScrutinizer.java:64)
at org.openrefine.wikidata.qa.scrutinizers.StatementScrutinizer.scrutinize(StatementScrutinizer.java:36)
at org.openrefine.wikidata.qa.EditInspector.inspect(EditInspector.java:107)
at org.openrefine.wikidata.commands.PreviewWikibaseSchemaCommand.doPost(PreviewWikibaseSchemaCommand.java:92)
at com.google.refine.RefineServlet.service(RefineServlet.java:187)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1166)
at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:81)
at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:132)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:938)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:755)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I’m not sure if this happens with any value for P304. In my case the precise value was 69-91
An example of the reference I intend to add is exactly the one for P31 at Q100407249.
Without using P304 everything seems to work as expected.
Hope this helps.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 31 (18 by maintainers)
Commits related to this issue
- Ignore invalid regexes from Wikibase format constraints. Closes #3274. — committed to OpenRefine/OpenRefine by wetneb 3 years ago
- Ignore invalid regexes from Wikibase format constraints. (#3721) * Ignore invalid regexes from Wikibase format constraints. Closes #3274. * Add logging — committed to OpenRefine/OpenRefine by wetneb 3 years ago
- Ignore invalid regexes from Wikibase format constraints. (#3721) * Ignore invalid regexes from Wikibase format constraints. Closes #3274. * Add logging — committed to gitonthescene/OpenRefine by wetneb 3 years ago
Please move the Wikidata discussion someplace more appropriate. Let’s keep this issue focused on the solution to handling illegal regular expressions in OpenRefine, which, as @wetneb said, is as simple as adding a try/catch block, thus the `good first issue` label. All this extraneous discussion is just going to confuse/discourage newcomers who are trying to figure out what they need to code for a pull request.
This issue is really easy to solve as far as OpenRefine is concerned: just by ignoring regexes which our regex engine cannot parse.
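For newcomers, the try/catch approach mentioned above could look roughly like the sketch below. This is not the actual patch: `SafeRegexCompiler` and `tryCompile` are hypothetical names, and the first test string is only a shortened fragment of the P304 regex from the log.

```java
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not the real OpenRefine code.
class SafeRegexCompiler {

    // Returns the compiled pattern, or an empty Optional when
    // java.util.regex cannot parse the expression.
    static Optional<Pattern> tryCompile(String regex) {
        try {
            return Optional.of(Pattern.compile(regex));
        } catch (PatternSyntaxException e) {
            // Log and ignore instead of letting the exception abort the preview.
            System.err.println("Ignoring unparsable format constraint: " + e.getDescription());
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Shortened fragment of the regex from the log: "(?" followed by "[".
        System.out.println(tryCompile("(\\d+(?[A-Za-z]+)?)").isPresent());
        // A regex the Java engine accepts.
        System.out.println(tryCompile("\\d+(-\\d+)?").isPresent());
    }
}
```

A caller would simply skip the format check whenever the returned `Optional` is empty.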
No, I don’t think that’s the case. As @wetneb mentions right at the top of this thread:
While it might also be appropriate for OpenRefine to look at alternative approaches for regular expression parsing, I think the key thing in this case is for there to be a graceful failure when encountering regular expressions that can’t be parsed for some reason. This should (in my opinion) include telling the user that the condition can’t be checked, but not prevent the user from uploading the data in this case (note there is a difference here between data that definitely doesn’t match a regular expression and data that cannot be checked against a regular expression - I’d suggest we are strict with the former and relaxed with the latter)
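The distinction drawn above (data that definitely doesn’t match versus data that cannot be checked) could be modelled with a three-valued result. The following is only a sketch under that assumption; `FormatCheck`, `Outcome` and `blocksUpload` are made-up names, not part of OpenRefine.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical sketch of a strict/relaxed format check.
class FormatCheck {

    enum Outcome { MATCHES, VIOLATES, UNCHECKABLE }

    static Outcome check(String regex, String value) {
        Pattern pattern;
        try {
            pattern = Pattern.compile(regex);
        } catch (PatternSyntaxException e) {
            // The constraint itself cannot be parsed: report it, but do not
            // treat the value as a violation.
            return Outcome.UNCHECKABLE;
        }
        // The whole value must match the constraint.
        return pattern.matcher(value).matches() ? Outcome.MATCHES : Outcome.VIOLATES;
    }

    // Strict with real violations, relaxed with uncheckable constraints.
    static boolean blocksUpload(Outcome outcome) {
        return outcome == Outcome.VIOLATES;
    }

    public static void main(String[] args) {
        String simplePages = "\\d+(-\\d+)?";          // simplified stand-in for a page-range regex
        String unparsable = "(\\d+(?[A-Za-z]+)?)";     // fragment of the regex from the log
        System.out.println(check(simplePages, "69-91"));
        System.out.println(check(simplePages, "pp. 69"));
        System.out.println(check(unparsable, "69-91"));
    }
}
```

With this shape, the UI could show a warning for `UNCHECKABLE` while still allowing the upload.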
Something was changed in the support of non-capturing groups (in the past the colon `:` was not always needed after `(?`, but now it is).
This may have been caused by a more recent change in the regexp engine used by Wikidata (now compatible with PCRE 7.7 or higher, and supporting the “PCRE v2” syntax, where the `(?` can be followed by more flags, notably since it now supports named subroutines and other newer features of PCRE v2).
I commented on that and fixed it on the P304 talk page.
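To see the effect of the colon on the Java side, here is a small check. It assumes the fix amounts to writing the group as `(?:...)`, and it uses a shortened fragment of the P304 regex rather than the full expression; `NonCapturingGroupDemo` and `compiles` are illustrative names.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Illustrative only: compares the fragment from the log with a colon-fixed variant.
class NonCapturingGroupDemo {

    // True when java.util.regex accepts the expression.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String withoutColon = "\\d+(?[A-Za-z]+)?";   // "(?" directly followed by "["
        String withColon = "\\d+(?:[A-Za-z]+)?";     // explicit non-capturing group

        System.out.println("without colon compiles: " + compiles(withoutColon));
        System.out.println("with colon compiles:    " + compiles(withColon));
        System.out.println("\"69a\" matches fixed form: " + Pattern.matches(withColon, "69a"));
    }
}
```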
Just for the record:
The bug is reported on the Wikidata P304 discussion page.
Which suggests that we should consider using a different regex engine whose features match Wikibase’s better.