gtfs-validator: False positive mixed_case_recommended_field for numeric values

Describe the bug

The mixed_case_recommended_field warning is always triggered when a field contains digits, like route_short_name=A1 or trip_short_name=101.

Characters which are not letters should not be considered at all when comparing case.
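A minimal sketch of what "not considered at all" could look like: strip everything that is not a cased letter before any case comparison, and skip the check when nothing is left. The class and method names are hypothetical and are not taken from the validator's code base.

```java
import java.util.regex.Pattern;

class CasedLettersSketch {
    // Anything that is not an uppercase, lowercase, or titlecase letter.
    private static final Pattern NOT_CASED = Pattern.compile("[^\\p{Lu}\\p{Ll}\\p{Lt}]");

    /** Keeps only the cased letters of value; digits, spaces, punctuation and caseless scripts are dropped. */
    static String casedLetters(String value) {
        return NOT_CASED.matcher(value).replaceAll("");
    }

    /** True when value has no cased letters, so a case check has nothing to compare. */
    static boolean hasNothingToCompare(String value) {
        return casedLetters(value).isEmpty();
    }

    public static void main(String[] args) {
        // Sample values from the report above.
        for (String v : new String[] {"A1", "ZA12", "301", "5301"}) {
            System.out.println(v + " -> cased letters: \"" + casedLetters(v) + "\"");
        }
    }
}
```

With this, values such as 301 or 5301 would never reach the case comparison; A1 and ZA12 would still need a rule for short letter runs.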

Steps/Code to Reproduce

Validate a file which uses numbers in fields like route_short_name or trip_short_name.

Expected Results

No mixed_case_recommended_field warning for those fields.

Actual Results

{
      "code": "mixed_case_recommended_field",
      "severity": "WARNING",
      "totalNotices": 360,
      "sampleNotices": [
        {
          "filename": "routes.txt",
          "fieldName": "route_short_name",
          "fieldValue": "A1",
          "csvRowNumber": 2
        },
        {
          "filename": "routes.txt",
          "fieldName": "route_short_name",
          "fieldValue": "ZA12",
          "csvRowNumber": 3
        },
        {
          "filename": "trips.txt",
          "fieldName": "trip_short_name",
          "fieldValue": "301",
          "csvRowNumber": 2
        },
        {
          "filename": "trips.txt",
          "fieldName": "trip_short_name",
          "fieldValue": "5301",
          "csvRowNumber": 3
        },
        {
          "filename": "trips.txt",
          "fieldName": "trip_short_name",
          "fieldValue": "103",
          "csvRowNumber": 4
        },
        ...
}

Screenshots

No response

Files used

https://mkuran.pl/gtfs/wkd.zip

Validator version

4.1.0

Operating system

Debian 12

Java version

openjdk version "17.0.6" 2023-01-17

Additional notes

No response

About this issue

  • State: closed
  • Created a year ago
  • Comments: 20 (5 by maintainers)

Most upvoted comments

@MKuranowski Another possibility would be to match all words/tokens longer than 3 cased characters (no numbers, no non-case script characters), and then check that each conforms to some rules (off the top of my head: no all-caps, and at least some with initial caps?). I don’t know though… at some point this is just a warning/suggestion, and I think as long as we catch the bulk of the fields that should be corrected, users can ignore the warning for fields that are falsely flagged.

Also, as @isabelle-dr mentions here, in the future rules may be configurable so users that are dealing with a lot of false flags for this warning can opt to disable it.

@davidgamez @bdferris-v2 @isabelle-dr curious what your thoughts are.
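The token idea from this comment could be sketched roughly as follows. The exact rules are assumptions: a token is a run of 4 or more cased letters, a value is flagged when any such token is all-caps or when none of them starts with a capital, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenRulesSketch {
    // A run of more than 3 cased letters; digits and caseless scripts break the run.
    private static final Pattern LONG_CASED_TOKEN = Pattern.compile("[\\p{Lu}\\p{Ll}\\p{Lt}]{4,}");

    /** Collects every run of 4+ cased letters in value. */
    static List<String> longTokens(String value) {
        List<String> tokens = new ArrayList<>();
        Matcher m = LONG_CASED_TOKEN.matcher(value);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Assumed rules: no all-caps token, and at least one token with an initial capital. */
    static boolean isFlagged(String value) {
        List<String> tokens = longTokens(value);
        if (tokens.isEmpty()) {
            return false; // nothing long enough to judge, e.g. "ZA12" or "301"
        }
        boolean anyInitialCap = false;
        for (String token : tokens) {
            if (token.toUpperCase(Locale.ROOT).equals(token)) {
                return true; // all-caps token such as "ROUTE"
            }
            int first = token.codePointAt(0);
            if (Character.isUpperCase(first) || Character.isTitleCase(first)) {
                anyInitialCap = true;
            }
        }
        return !anyInitialCap;
    }

    public static void main(String[] args) {
        for (String v : new String[] {"ZA12", "MY ROUTE 100", "route 1B", "Another good value"}) {
            System.out.println(v + " -> flagged: " + isFlagged(v));
        }
    }
}
```

Under these assumed rules, short all-caps codes such as RTE 101 are not flagged, because RTE has only 3 letters; whether that is acceptable is a separate decision.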

The main problem with ZA12 is still not solved though; for that, 2 consecutive letters with the same case should be allowed. As I mentioned earlier, there are a lot of systems which use 2-letter codes for routes, so I don’t think this should be an issue.

The v.toLowerCase() == v || v.toUpperCase() == v check is not able to catch mixed-case situations (like route 1B), and, more importantly, still breaks for non-case scripts (급행12 or 東西線). The second situation could be mitigated by matching only on cased letters (\p{LC}), not just any letter from any script.
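To make the two failure modes concrete, here is a small sketch of the single-case check with and without a cased-letter guard. It uses equals rather than ==, since == on Java strings compares references, and Locale.ROOT to keep case mapping locale-independent. The cased-letter class is spelled out as [\p{Lu}\p{Ll}\p{Lt}], the three categories that \p{LC} groups together. All names are hypothetical.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class SingleCaseCheckSketch {
    // At least one uppercase, lowercase, or titlecase letter.
    private static final Pattern CASED = Pattern.compile("[\\p{Lu}\\p{Ll}\\p{Lt}]");

    /** True when value is unchanged by lowercasing or by uppercasing. */
    static boolean isSingleCase(String v) {
        return v.toLowerCase(Locale.ROOT).equals(v) || v.toUpperCase(Locale.ROOT).equals(v);
    }

    /** Same check, but only for values that contain at least one cased letter. */
    static boolean isSingleCaseCasedOnly(String v) {
        return CASED.matcher(v).find() && isSingleCase(v);
    }

    public static void main(String[] args) {
        String hangul = "\uAE09\uD58912"; // the Hangul example from the thread, written with escapes
        System.out.println("caseless script, unguarded: " + isSingleCase(hangul));
        System.out.println("caseless script, guarded:   " + isSingleCaseCasedOnly(hangul));
        System.out.println("route 1B, unguarded:        " + isSingleCase("route 1B"));
    }
}
```

The guard removes the false positives for caseless scripts and digit-only values, but route 1B still slips through and A1 is still flagged.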

My regex flags route 1B correctly, but then has false positives on Another good value and Avenue des Champs-Élysées. Those contain all-lower-case words (good or des). I don’t think it’s possible to correctly categorize all 3 strings in a straightforward way, though.

v.matches(".*\\p{LC}{3}.*") && (v.toLowerCase() == v || v.toUpperCase() == v) seems to correctly categorize all of my test cases and the already-existing ones, with the sole exception of route 1B.
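For reference, a runnable sketch of that combined condition. It is an illustration and not the validator's actual code: it uses find() on a precompiled pattern instead of matches(), equals instead of ==, and writes the cased-letter class as [\p{Lu}\p{Ll}\p{Lt}]. The class and method names are made up.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class ThreeCasedLettersSketch {
    // Three consecutive cased letters anywhere in the value.
    private static final Pattern THREE_CASED = Pattern.compile("[\\p{Lu}\\p{Ll}\\p{Lt}]{3}");

    /** Flags values that contain 3 consecutive cased letters and are entirely one case. */
    static boolean isFlagged(String v) {
        boolean singleCase =
            v.toLowerCase(Locale.ROOT).equals(v) || v.toUpperCase(Locale.ROOT).equals(v);
        return THREE_CASED.matcher(v).find() && singleCase;
    }

    public static void main(String[] args) {
        String[] samples = {"A1", "ZA12", "301", "MY ROUTE 100", "boulevard", "route 1B"};
        for (String v : samples) {
            System.out.println(v + " -> flagged: " + isFlagged(v));
        }
    }
}
```

A1, ZA12 and 301 pass because they have no run of 3 cased letters; route 1B passes because it is neither all-lower nor all-upper, which is the exception noted above.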

A much more reliable strategy would be to rely on Unicode character properties.

Catching 3 consecutive upper case letters can be done with something like /\p{Lu}{3,}/u.test(value). 3 consecutive lower case letters are trickier, as something like /\p{Ll}{3,}/u.test(value) also catches things like Orange. I think something like /(?<!\p{LC})\p{Ll}{3,}/u.test(value) (3 consecutive lower case letters not preceded by any other cased letter) works: https://regex101.com/r/GVPMmO/1.
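The expressions above are in JavaScript syntax. A possible Java port, as a sketch with hypothetical names, could look like the following. java.util.regex supports fixed-width negative lookbehind, and the cased-letter class is written as [\p{Lu}\p{Ll}\p{Lt}].

```java
import java.util.regex.Pattern;

class UnicodeRunsSketch {
    // Three or more consecutive uppercase letters, e.g. "ROUTE".
    private static final Pattern UPPER_RUN = Pattern.compile("\\p{Lu}{3,}");

    // Three or more lowercase letters not preceded by another cased letter,
    // so the "range" in "Orange" does not count.
    private static final Pattern LOWER_RUN =
        Pattern.compile("(?<![\\p{Lu}\\p{Ll}\\p{Lt}])\\p{Ll}{3,}");

    static boolean hasUpperRun(String v) {
        return UPPER_RUN.matcher(v).find();
    }

    static boolean hasLowerRun(String v) {
        return LOWER_RUN.matcher(v).find();
    }

    /** Flags a value when either kind of run is present. */
    static boolean isFlagged(String v) {
        return hasUpperRun(v) || hasLowerRun(v);
    }

    public static void main(String[] args) {
        String[] samples = {"ZA12", "Orange", "route 1B", "RTE 101", "Another good value"};
        for (String v : samples) {
            System.out.println(v + " -> flagged: " + isFlagged(v));
        }
    }
}
```

This flags route 1B and leaves ZA12 and Orange alone, but any all-lower-case word inside an otherwise fine value (good, des) also triggers it.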

I like this idea! We would still get the notice for something like ZA12, right? Should we also add the following condition so that we differentiate acronyms (such as ZA12) from the natural text (such as Route 1B)?

  • the field value contains at least one " " character

@isabelle-dr I thought about this, but I think we want to keep it for long words with no spaces such as Lancaster?

Another approach would be to validate any values with strings of more than 1 letter, regardless of spaces. In this case A 101 and 342B would be ignored but RTE 101, RTE23 and route 1B would be flagged. If we don’t want something like RTE23 to be flagged, we could be more strict and ignore strings of mixed letters and numbers (but no spaces). So A12387abi would be ignored but ROUTE A2348B would be flagged.
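One way to read the two variants in this comment is as a gate that decides whether a value is examined at all; what happens after the gate is left open here. This is only a sketch, with made-up method names and an assumed definition of "letter" as any Unicode letter (\p{L}).

```java
import java.util.regex.Pattern;

class ValidationGateSketch {
    private static final Pattern TWO_LETTERS = Pattern.compile("\\p{L}{2,}");
    private static final Pattern LETTER = Pattern.compile("\\p{L}");
    private static final Pattern DIGIT = Pattern.compile("\\p{Nd}");
    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    /** Lenient gate: examine any value with a run of more than 1 letter. */
    static boolean shouldValidate(String v) {
        return TWO_LETTERS.matcher(v).find();
    }

    /** Stricter gate: additionally skip values that mix letters and digits without any space. */
    static boolean shouldValidateStrict(String v) {
        boolean noWhitespace = !WHITESPACE.matcher(v).find();
        boolean lettersAndDigits = LETTER.matcher(v).find() && DIGIT.matcher(v).find();
        if (noWhitespace && lettersAndDigits) {
            return false; // e.g. "RTE23" or "A12387abi"
        }
        return shouldValidate(v);
    }

    public static void main(String[] args) {
        String[] samples = {"A 101", "342B", "RTE 101", "RTE23", "A12387abi", "ROUTE A2348B"};
        for (String v : samples) {
            System.out.println(v + " -> lenient: " + shouldValidate(v)
                + ", strict: " + shouldValidateStrict(v));
        }
    }
}
```

Single-word values without digits, such as Lancaster, still pass both gates, which matches the concern about long words with no spaces.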

@isabelle-dr How do we want to approach this? It looks like in some cases when numbers are present, we don’t want to throw the warning: A1, 103B, 1205.

In those cases we could skip the mixed case validation altogether, but how about longer strings? Should we have a length minimum or number of letters in a string before mixed case is considered? Thinking about strings like:

  • MY ROUTE 100
  • route 1
  • boulevard

etc