gtfs-validator: False positive mixed_case_recommended_field for numeric values
Describe the bug
The mixed_case_recommended_field warning is always triggered when a field contains digits, like route_short_name=A1 or trip_short_name=101.
Characters which are not letters should not be considered at all when comparing case.
Steps/Code to Reproduce
Validate a file which uses numbers in fields like route_short_name or trip_short_name.
Expected Results
No mixed_case_recommended_field warning for those fields.
Actual Results
{
  "code": "mixed_case_recommended_field",
  "severity": "WARNING",
  "totalNotices": 360,
  "sampleNotices": [
    {
      "filename": "routes.txt",
      "fieldName": "route_short_name",
      "fieldValue": "A1",
      "csvRowNumber": 2
    },
    {
      "filename": "routes.txt",
      "fieldName": "route_short_name",
      "fieldValue": "ZA12",
      "csvRowNumber": 3
    },
    {
      "filename": "trips.txt",
      "fieldName": "trip_short_name",
      "fieldValue": "301",
      "csvRowNumber": 2
    },
    {
      "filename": "trips.txt",
      "fieldName": "trip_short_name",
      "fieldValue": "5301",
      "csvRowNumber": 3
    },
    {
      "filename": "trips.txt",
      "fieldName": "trip_short_name",
      "fieldValue": "103",
      "csvRowNumber": 4
    },
    ...
}
Files used
https://mkuran.pl/gtfs/wkd.zip
Validator version
4.1.0
Operating system
Debian 12
Java version
openjdk version "17.0.6" 2023-01-17
About this issue
- State: closed
- Created a year ago
- Comments: 20 (5 by maintainers)
@MKuranowski Another possibility would be to match all words/tokens longer than 3 cased characters (no numbers, no non-case script characters), and then check that each conforms to some rules (off the top of my head: no all-caps, and at least some with initial caps?). I don’t know though… at some point this is just a warning / suggestion, and I think as long as we catch the bulk of the fields that should be corrected, users can ignore the warning for fields that are falsely flagged.
Also, as @isabelle-dr mentions here, in the future rules may be configurable so users that are dealing with a lot of false flags for this warning can opt to disable it.
@davidgamez @bdferris-v2 @isabelle-dr curious what your thoughts are.
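To make the token idea above concrete, here is a minimal Java sketch of one possible reading of it. Everything in it is an assumption for illustration: the class and method names are hypothetical, "greater than 3 cased characters" is read as a run of 4 or more `\p{LC}` letters, and the two rules (no all-caps token, at least one token with an initial capital) are just the ones floated above, not anything the validator implements.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the token-based heuristic; not the validator's code.
final class TokenCaseSketch {
    // Assumption: a "token" is a run of 4 or more cased letters.
    // Digits, spaces, punctuation and non-case scripts never become tokens.
    private static final Pattern CASED_TOKEN = Pattern.compile("\\p{LC}{4,}");

    /** Returns true if the value would be flagged under the assumed rules. */
    static boolean shouldFlag(String value) {
        Matcher m = CASED_TOKEN.matcher(value);
        boolean sawToken = false;
        boolean sawInitialCap = false;
        while (m.find()) {
            String token = m.group();
            sawToken = true;
            // Rule 1 (assumed): an all-caps token is enough to flag the value.
            if (token.equals(token.toUpperCase(Locale.ROOT))) {
                return true;
            }
            if (Character.isUpperCase(token.codePointAt(0))) {
                sawInitialCap = true;
            }
        }
        // Rule 2 (assumed): if there are tokens, at least one needs an initial capital.
        return sawToken && !sawInitialCap;
    }
}
```

Under these assumed rules, `MY ROUTE 100` and `route 1 boulevard` are flagged, while `A1`, `ZA12`, `301` and `Another good value` are not. Short all-caps tokens such as `RTE` fall below the length threshold and are not checked at all, which is one of the trade-offs of this reading.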
The main problem with `ZA12` is still not solved though; for that, 2 consecutive letters with the same case should be allowed. As I mentioned earlier, there are a lot of systems which use 2-letter codes for routes, so I don’t think this should be an issue.
The `v.toLowerCase() == v || v.toUpperCase() == v` check is not able to catch mixed-case situations (like `route 1B`) and, more importantly, still breaks for non-case scripts (`급행12` or `東西線`). The second situation could be mitigated by only matching on cased letters (`\p{LC}`), not just any letter from any script.
My regex flags `route 1B` correctly, but then has false positives on `Another good value` and `Avenue des Champs-Élysées`. Those contain all-lower-case words (`good` or `des`). I don’t think it’s possible to correctly categorize all 3 strings in a straightforward way, though.
`v.matches(".*\\p{LC}{3}.*") && (v.toLowerCase() == v || v.toUpperCase() == v)` seems to correctly categorize all of my test cases and the already-existing ones, with the sole exception of `route 1B`.
A much more reliable strategy would be to rely on Unicode character properties.
Catching 3 consecutive upper case letters can be done with something like `/\p{Lu}{3,}/u.test(value)`. 3 consecutive lower case letters are more tricky, as something like `/\p{Ll}{3,}/u.test(value)` also catches things like `Orange`. I think something like `/(?<!\p{LC})\p{Ll}{3,}/u.test(value)` (3 consecutive lower case letters not preceded by any other cased letter) works: https://regex101.com/r/GVPMmO/1.
@isabelle-dr I thought about this, but I think we want to keep it for long words with no spaces such as `Lancaster`?
Another approach would be to validate any values with strings of more than 1 letter, regardless of spaces. In this case `A 101` and `342B` would be ignored but `RTE 101`, `RTE23` and `route 1B` would be flagged. If we don’t want something like `RTE23` to be flagged, we could be more strict and ignore strings of mixed letters and numbers (but no spaces). So `A12387abi` would be ignored but `ROUTE A2348B` would be flagged.
@isabelle-dr How do we want to approach this? It looks like in some cases when numbers are present, we don’t want to throw the warning:
A1103B1205
In those cases we could skip the mixed case validation altogether, but how about longer strings? Should we have a minimum length, or a minimum number of letters in a string, before mixed case is considered? Thinking about strings like:
- MY ROUTE 100
- route 1 boulevard
- etc
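For comparison, the two regex-based checks quoted in this thread can be sketched side by side in Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, the JavaScript-style `/.../u.test(value)` calls are translated to `Pattern.matcher(...).find()`, and the `==` comparisons are replaced with `equals` plus `Locale.ROOT`. It is not the validator's actual implementation.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical side-by-side sketch of the two checks discussed in the thread.
final class MixedCaseHeuristics {
    // 3 or more consecutive upper-case letters, e.g. "RTE" in "RTE 101".
    private static final Pattern UPPER_RUN = Pattern.compile("\\p{Lu}{3,}");
    // 3 or more consecutive lower-case letters not preceded by a cased letter,
    // so "route" matches but the tail of "Orange" does not.
    private static final Pattern LOWER_RUN = Pattern.compile("(?<!\\p{LC})\\p{Ll}{3,}");
    // Any 3 consecutive cased letters; digits and non-case scripts never match.
    private static final Pattern THREE_CASED = Pattern.compile("\\p{LC}{3}");

    /** Check based on runs of same-case letters (Unicode Lu / Ll properties). */
    static boolean flaggedByRuns(String v) {
        return UPPER_RUN.matcher(v).find() || LOWER_RUN.matcher(v).find();
    }

    /** Check based on "has 3 cased letters and is entirely one case". */
    static boolean flaggedBySingleCase(String v) {
        return THREE_CASED.matcher(v).find()
            && (v.equals(v.toLowerCase(Locale.ROOT)) || v.equals(v.toUpperCase(Locale.ROOT)));
    }
}
```

With this sketch, neither method flags `A1`, `ZA12`, `301` or `급행12`, and both flag `RTE 101` and `MY ROUTE 100`. They differ where the thread says they do: `flaggedByRuns` flags `route 1B` but also `Another good value`, while `flaggedBySingleCase` leaves `Another good value` alone and misses `route 1B`.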