libpostal: Suite/Apartment parsing is not correct
@daguar and I have been experimenting with @straup’s new libpostal API and finding some weird stuff with unit numbers in U.S. addresses. In most cases, libpostal misinterprets unit numbers as house numbers, and groups terms like “suite” with the road name.
Here are some odd examples:
{
  "city": [
    "oakland"
  ],
  "house_number": [
    "123",
    "456"
  ],
  "postcode": [
    "94789"
  ],
  "road": [
    "main street apt"
  ],
  "state": [
    "ca"
  ]
}
{
  "city": [
    "oakland"
  ],
  "house_number": [
    "123"
  ],
  "postcode": [
    "94789"
  ],
  "road": [
    "main street suite 456"
  ],
  "state": [
    "ca"
  ]
}
{
  "city": [
    "oakland"
  ],
  "house_number": [
    "123"
  ],
  "postcode": [
    "94789"
  ],
  "road": [
    "main street # 456"
  ],
  "state": [
    "ca"
  ]
}
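Until the parser handles units itself, output like the three results above could be patched up after the fact. The following is only a rough sketch of such a workaround, not anything libpostal does: the function name `split_unit`, the short list of English designators, and the "unit" key are all assumptions made for illustration. It expects a space between the designator and its value, so "#456" is left alone.

```python
import re

# Hypothetical, deliberately short list of English unit designators.
# Matches a trailing designator with an optional value, e.g. " suite 456".
_UNIT_TAIL = re.compile(
    r"\s+(?P<word>apartment|apt|suite|ste|unit|#)\.?(?:\s+(?P<value>\S+))?$"
)


def split_unit(parsed):
    """Return a copy of `parsed` with trailing unit text moved out of "road"."""
    result = {key: list(values) for key, values in parsed.items()}
    roads = result.get("road", [])
    if not roads:
        return result
    match = _UNIT_TAIL.search(roads[0])
    if match is None:
        return result
    value = match.group("value")
    numbers = result.get("house_number", [])
    if value is None and len(numbers) > 1:
        # "main street apt" case: the unit was parsed as a second house number.
        value = numbers.pop()
    if value is None:
        return result  # designator without a recoverable value; leave as-is
    result["road"] = [roads[0][: match.start()]] + roads[1:]
    result["unit"] = [f"{match.group('word')} {value}"]
    return result


first = {"house_number": ["123", "456"], "road": ["main street apt"]}
print(split_unit(first))
```

This only covers the three shapes shown in this report; anything else passes through unchanged.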
About this issue
- State: closed
- Created 8 years ago
- Comments: 17 (11 by maintainers)
Commits related to this issue
- [names/ru] adding г. (город) prefix to Russian city names 50% of the time in various forms per #125 — committed to openvenues/libpostal by albarrentine 8 years ago
- [formatting] adding non-zero invert probabilities to all the former Soviet states. Other template insertions can still apply afterward for #125 — committed to openvenues/libpostal by albarrentine 8 years ago
- [addresses] adding pymorphy2 for converting Russian and Ukrainian place names (sticking with state and staet_district for the moment) to the locative case as mentioned in #125 — committed to openvenues/libpostal by albarrentine 8 years ago
- [fix] genitive case for Russian/Ukrainian toponyms, not locative (#125) — committed to openvenues/libpostal by albarrentine 8 years ago
- [formatting] adding alternate formats to formatting config which can be used with certain probabilities. Using the formats mentioned in #125 for Russia and other countries of the former USSR — committed to openvenues/libpostal by albarrentine 7 years ago
@jqnatividad a very worthwhile effort ✊. As someone who grew up close to the poverty line in low income communities of color, and a current resident of Crown Heights in Brooklyn, which is facing an unprecedented affordable housing crisis, I thank you.
The issue with secondary unit numbers without a phrase has also come up occasionally in some of my recent voting rights work, where voter file addresses may be concatenated by machine or transcribed and have no preceding unit phrase. It’s been relatively rare in my data sets, but is on my radar nonetheless.
As mentioned above, most of the secondary unit information in libpostal is generated randomly, and always with preceding phrases like “#” or “Apt”. This was primarily to prevent introducing certain systematic sources of error into the model. To illustrate, let’s say we generate three basic types of units: numeric only e.g. “12”, single letters like “A”, and combinations of the two like “12A” or “A12.” In the latter case of the combined number/letter, it should be pretty easy for the model to tell that it’s a unit number when following a street type (in English anyway, there are lots of different formats in other languages). However, if it were a single letter and that letter happened to be “E”, “N”, “S”, or “W”, then the concatenated unit would be virtually indistinguishable from a post-directional i.e. “123 Main St E” could be either “Unit E” at “123 Main St” or shorthand for “123 Main Street East”, and libpostal would effectively be training on both answers with no real way to disambiguate. Though less common, the same goes for numeric-only units as you can find examples where “Road 12” is the entire street name, and it may be difficult for the model to disambiguate between “12 Road 12” and “12 Main Road 12” (or doing so might inflate the size of the model, which is already large).
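To make the ambiguity concrete, here is a small sketch that lists the plausible readings of a bare token that follows a street type. It is only an illustration of the cases described above, not how libpostal generates or labels its training data; the function name and the reading labels are made up for this example.

```python
# Single letters that double as English post-directionals.
DIRECTIONALS = {"e", "n", "s", "w"}


def readings_for_trailing_token(token):
    """List plausible readings of a bare token that follows a street type."""
    token = token.lower()
    if token.isdigit():
        # "12 Main Road 12": a unit, or part of a name such as "Road 12".
        return ["unit", "part of street name"]
    if len(token) == 1 and token.isalpha():
        if token in DIRECTIONALS:
            # "123 Main St E": "Unit E" or "Main Street East".
            return ["unit", "post-directional"]
        return ["unit"]
    has_digit = any(ch.isdigit() for ch in token)
    has_alpha = any(ch.isalpha() for ch in token)
    if token.isalnum() and has_digit and has_alpha:
        # "12A" or "A12" is hard to mistake for anything else.
        return ["unit"]
    return []


for tok in ("E", "A", "12", "12A"):
    print(tok, readings_for_trailing_token(tok))
```

Any token with more than one reading is a case where unprefixed units would give the model two conflicting answers for the same surface form.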
That said, the majority of these cases should be solved when the v1.1 release is available, which has been on the backburner in order to finish the address deduping release as part of the lieu deduping/batch geocoding project.
Is the housing advocacy work public in some way? That’s something I would personally contribute to if it were. If you want to email me the details it’s [first name][last name]@gmail.com. Can probably train something custom for it in the shorter term.
Hi @albarrentine, I’ve been using libpostal in OpenRefine and it’s great!
I’m currently working with a housing advocacy project in Brooklyn to protect affordable housing stock in NYC, and we got a lot of housing data with apartment numbers.
The data is fairly clean, and as you might expect, there are a lot of permutations for apartment number.
One pattern is giving libpostal problems:
I was really counting on libpostal to extract the unit numbers, which it does very well for patterns with prefixes like “Apt”, “Unit”, “#”, etc., but it doesn’t work for the patterns above, where an undecorated unit number sits between “road” and “city”.
I’d say that both schemes are used in ex-USSR, with older one (postcode-country-city street-house-flat) being more popular among older people, and new one (street-house-flat postcode-city country) being more popular among younger people. The same goes about Surname-Name and Name-Surname ordering. Russian is really flexible about word order, which means there are addresses that aren’t unambiguously parseable when you remove commas from them.
Basically, you can take each component, shuffle all words in it, then shuffle all components and still get an address a human will be able to read. And you can get an address in all possible forms, ‘I live at Sosnovyi Bor, Geroev, 400’ - ‘Я живу в Сосновом Бору на проспекте Героев в доме 400’.
Postcodes are six digits long in the ex-USSR, so if you see a six-digit number and the language is from the ex-USSR, the chances that it is a postcode are extremely high. The first three digits are the city number (or another administrative region when all you have is a bunch of villages), and the last three are the local post office number. The last three can be zeros if the sender doesn’t know them exactly. The postcode was machine-readable in the USSR (and still is; machines just got better at reading handwriting) and was written in a predefined spot on the envelope.
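The six-digit rule is easy to sketch as a heuristic. The snippet below is only an illustration of the description above; the function name, the returned keys, and the sample address are invented for the example and have nothing to do with libpostal’s internals.

```python
import re

# Exactly six ASCII digits, not embedded in a longer run of digits.
_POSTCODE = re.compile(r"(?<![0-9])([0-9]{3})([0-9]{3})(?![0-9])")


def find_postcode(address):
    """Return the first six-digit postcode candidate, split into its halves."""
    match = _POSTCODE.search(address)
    if match is None:
        return None
    return {
        "postcode": match.group(0),
        "region_prefix": match.group(1),  # city or other administrative region
        "office": match.group(2),         # local post office number
        "office_unknown": match.group(2) == "000",
    }


# Made-up address, used only to exercise the pattern.
print(find_postcode("123000, улица Примерная, 5"))
```

A house number like “400” is too short to match, and a longer digit run such as a phone number is skipped by the lookarounds.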
For NLP in Russian there’s a library pymorphy2.
It’s useful in two ways - first, it may help you understand what this word is about:
Second, it can help generate datasets: you can randomly change forms and it will give you the correct spelling 😃
Stripping last letter and last two letters and last three letters can get you tokens that are more useful and contain less form information.
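A minimal sketch of that suffix-stripping idea follows. The function name and the `min_stem` cutoff (which keeps very short words from being stripped to nothing) are assumptions added for the example.

```python
def stem_variants(token, max_strip=3, min_stem=3):
    """Return the token plus copies with the last 1..max_strip letters removed."""
    variants = [token]
    for n in range(1, max_strip + 1):
        if len(token) - n < min_stem:
            break  # do not strip below the minimum stem length
        variants.append(token[:-n])
    return variants


# Two case forms of the same adjective end up sharing a stripped variant.
a = stem_variants("сосновом")
b = stem_variants("сосновый")
print(sorted(set(a) & set(b)))
```

Different inflected forms of one word then overlap on at least one variant, which is what makes the stripped tokens useful as features.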
Back in 2014 I wrote a rule-based address tagger for a data set of offices of companies that would soon undergo a government tax review. (The data was acquired by NextGIS as part of https://github.com/nextgis/skoroproverka).
Government dataset: addr-norep-nocount.txt.gz
Tagger that worked reasonably well for me: https://github.com/Komzpa/addresstagger/blob/master/tokenize-address.py (have a look at the large “if token in” section)
There is also the StreetMangler project, which aims at aligning street names in Russian to “natural word order”: https://github.com/AMDmi3/streetmangler. It maintains a list of normalized street names and a statistical matcher.
Hope this helps 😃
Hey Mike. Yes, apartment numbers were not part of the training data in the master version of libpostal (used by the web API), mainly because OSM addresses don’t include much in the way of sub-building information.
There is a new model I’ve been working on that handles apartment number parsing quite well. The Pelias team has recently integrated an early version of that model into their work.
If you don’t mind compiling libpostal locally, the model used by Pelias can be found at: https://libpostal.s3.amazonaws.com/mapzen_sample/parser_full.tar.gz. To use that (doesn’t require switching branches or anything, it’s the same model in master trained on new data), just unpack the contents of the tarball into $DATA_DIR/libpostal/address_parser where $DATA_DIR is whatever you passed in during configure, default is /usr/local/share.
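For anyone scripting that step, here is a rough sketch in Python. It assumes the tarball has already been downloaded and that its contents sit at the top level of the archive (not checked here); the function name `install_parser_model` is made up for the example.

```python
import tarfile
from pathlib import Path


def install_parser_model(tarball, data_dir="/usr/local/share"):
    """Unpack a parser tarball into <data_dir>/libpostal/address_parser."""
    target = Path(data_dir) / "libpostal" / "address_parser"
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball, "r:gz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: use the safer extraction filter.
            archive.extractall(target, filter="data")
        else:
            archive.extractall(target)
    return target


# Example (path is an assumption; writing to /usr/local/share may need root):
# install_parser_model("parser_full.tar.gz")
```

Pass the same directory you gave to configure as `data_dir` if you did not use the default.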
I discussed using the intermediate version with @straup. That’s still possible for the web API using the aforementioned steps (can easily be written into a docker file or whatever).
The next release (when parser-data is ready to merge into master) will be able to parse sub-building information in residential, commercial, and university addresses in at least 35 languages.