There are many things involved in Searching for Places. It’s not the easiest thing to do because, as it turns out, the world is a complex place.
In our new series, Meaningful Geocoding, we’re going to share the principles and practices of geocoding that we’re incorporating as we build Pelias, our open source geocoder that powers our service Mapzen Search entirely on open data.
There are a lot of ways that people search for places. Sometimes they search for places by name, like a business or a landmark. But not every place has a name, and names can be just about anything.
But nearly every place has an Address. They’re long, they’re complex, they’re compound, and they can act as unique identifiers for a place. There are systems and patterns embedded in addresses providing incremental hinting about location, and we can use these patterns to return the best results, even if the addresses themselves don’t always make the most sense.1
With that in mind, here are some principles that we’re taking to heart as we build Mapzen Search and Pelias.
Heterogeneous and Unambiguous
When someone performs a search on a geocoder, the places that come back are ideally:
Heterogeneous results match what the user entered for all matching places, big and small. That is, if the user only provides “New York” it’s perfectly fine to return New York the city, New York the county, and New York the state.
Unambiguous results match what the user entered. In other words: don’t return what the user didn’t ask for. If the user entered “New York, NY”, don’t return “Village of New York Mills, Oneida County, NY” or “West New York, NJ”.
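As a rough illustration of these two properties, here is a minimal sketch in Python. The `Place` record, its fields, and the tiny `PLACES` list are all hypothetical and are not Pelias data structures; real matching involves far more than exact name comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str    # e.g. "New York"
    layer: str   # e.g. "city", "county", "state" (hypothetical labels)
    region: str  # two-letter state abbreviation (hypothetical field)


def search_places(query: str, places: list[Place]) -> list[Place]:
    """Return every place, big or small, whose name equals the queried name."""
    parts = [part.strip().lower() for part in query.split(",")]
    name = parts[0]
    region = parts[1] if len(parts) > 1 else None
    return [
        place for place in places
        # Unambiguous: exact name match only, so "West New York" is excluded.
        if place.name.lower() == name
        # If the user supplied a region, respect it.
        and (region is None or place.region.lower() == region)
    ]


# Illustrative sample data only.
PLACES = [
    Place("New York", "city", "NY"),
    Place("New York", "county", "NY"),
    Place("New York", "state", "NY"),
    Place("New York Mills", "village", "NY"),
    Place("West New York", "town", "NJ"),
]
```

With this sample data, searching for "New York" yields the city, the county, and the state, while "New York, NY" never yields New York Mills or West New York.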
Principles of Addressing
Users are, for the most part, rational individuals who provide input in a coherent fashion. Really! People’s searches take the form: what they’re looking for, where.
The where is almost always organized hierarchically. That is, users enter 30 W 26th St, New York, NY 10010, not St NY W 30 10010 York New 26th or 30 W 26th St, New York, 10010 NY.
Format details vary by country, but after centuries of addressing correspondence, societies worldwide have done a pretty good job of establishing standards so that addresses are interpreted one way.2
On occasion, users (especially on mobile devices) leave out certain bits of information for the sake of efficiency but, even then, they will not rearrange the order. For example, I might enter an address as 30 W 26th St 10010, leaving out the city and state.
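Because the order of the parts is predictable, even a very small parser can lean on it. The sketch below is a deliberately naive, US-only illustration; the function name `parse_us_address` and both regular expressions are assumptions made for this example and are not how Pelias parses input.

```python
import re

# Hypothetical patterns: house number, street, then an optional 5-digit ZIP.
STREET = re.compile(r"^(\d+)\s+(.+?)(?:\s+(\d{5}))?$")
# Two-letter state, optionally followed by a 5-digit ZIP, in that order.
REGION = re.compile(r"^([A-Za-z]{2})(?:\s+(\d{5}))?$")


def parse_us_address(text):
    """Parse 'number street[, city[, state [zip]]]'; return None if out of order."""
    parts = [part.strip() for part in text.split(",")]
    street_match = STREET.match(parts[0])
    if street_match is None:
        return None  # input does not start with a house number
    parsed = {"number": street_match.group(1), "street": street_match.group(2)}
    if street_match.group(3):
        parsed["postcode"] = street_match.group(3)
    if len(parts) >= 2:
        parsed["city"] = parts[1]
    if len(parts) >= 3:
        region_match = REGION.match(parts[2])
        if region_match is None:
            return None  # e.g. "10010 NY": components are in an unexpected order
        parsed["state"] = region_match.group(1).upper()
        if region_match.group(2):
            parsed["postcode"] = region_match.group(2)
    return parsed
```

The full address and the shortened mobile-style input both parse, while the scrambled inputs are rejected rather than guessed at.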
Address points are like connective tissue for geodata. Businesses are at addresses. People’s homes are at addresses. They don’t live in latitudes and longitudes, but that’s where all our geo tools live. Because of this, precise full address geocoding is the bread and butter of commercial place search, not to mention necessary to support end-user navigation. If an address isn’t searchable in the system, someone can’t drive there. Let’s do our damnedest to get them there.
Every geocoder can only be as good as the data behind it; it needs a model of the world to tell you where the place you’re looking for is. An address geocoder requires a lot of data that someone went out and collected.
There are two ways of precision geocoding street addresses and they’re defined by the data they use: point geocoding and interpolated geocoding. They both require a lot of data, but point geocoding only works when the source data already contains an exact point matching the address, which is a monumental effort to collect.3
Conversely, interpolated geocoding is used when the source data contains just the line segments of the road network and corresponding address number (block) ranges. It’s usually a lot easier to use an existing map of the road network containing the start and stop segments for addresses than it is to go out and collect each and every address point. In this case, an interpolation geocoder will find the appropriate street and make a very educated guess at where along the line the house number resides.
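The "educated guess" is, in its simplest form, linear interpolation along the segment. The following sketch assumes a single straight segment whose two ends correspond to the lowest and highest house numbers of the block range; the function name and parameters are hypothetical, and it ignores real-world details such as odd/even sides of the street and curved roads.

```python
def interpolate_address(number, low, high, start, end):
    """Estimate the (lat, lon) of a house number along one straight segment.

    low/high are the block range limits; start/end are the (lat, lon)
    coordinates of the segment ends where low and high are located.
    """
    if not low <= number <= high:
        raise ValueError("house number is outside this block range")
    if high == low:
        return start  # a one-number range has nothing to interpolate
    # Fraction of the way along the segment, from 0.0 (low) to 1.0 (high).
    fraction = (number - low) / (high - low)
    lat = start[0] + fraction * (end[0] - start[0])
    lon = start[1] + fraction * (end[1] - start[1])
    return (lat, lon)
```

For example, house number 50 on a block ranging from 0 to 100 lands exactly halfway between the two ends of the segment.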
When data is available, most geocoders implement both approaches, using point data when possible and falling back to interpolation when needed.
Known Address Data Points
Users often enter a full street+city+state with a known point address number that must be resolved as unambiguously as possible.
Example: 30 W 26th St, New York, NY
In this case, the address number is a known point (with a latitude/longitude) in the data.
A geocoder should return:
- 30 W 26th St, New York, NY
Because we can be certain this is an exact match, no other results should be returned, including:
- no other address numbers on W 26th St in New York
- no ‘30 W 26th St’ in another city
- no '30 E 26th St, New York, NY’
- no businesses on W 26th St, New York, NY
Users can enter an address number which is unknown to the system but still falls within the known block ranges of the source data. The input is most likely valid, but our address point data doesn’t have the exact house number.
Example: 87 W 26th St, New York, NY
In this case the address number is not a known address number in the point data, so the lat/lon returned is interpolated along the known line segments. The same rules apply as before: don’t return anything else but the unambiguous address.
A geocoder should return:
- 87 W 26th St, New York, NY
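The two cases so far, a known point and an interpolated position, can be combined into a point-first lookup with an interpolation fallback. This is only a sketch: the `POINTS` and `RANGES` tables, the `geocode` function, and the coordinates in them are made up for illustration and do not come from any real dataset.

```python
# Hypothetical point data: (house number, street) -> (lat, lon).
POINTS = {(30, "W 26th St"): (40.7440, -73.9903)}

# Hypothetical block ranges: street -> list of (low, high, start, end).
RANGES = {"W 26th St": [(1, 99, (40.7437, -73.9890), (40.7450, -73.9920))]}


def geocode(number, street):
    """Prefer an exact address point; otherwise interpolate within a block range."""
    point = POINTS.get((number, street))
    if point is not None:
        return {"coords": point, "method": "point"}
    for low, high, start, end in RANGES.get(street, []):
        if low <= number <= high:
            fraction = (number - low) / (high - low) if high > low else 0.0
            coords = (
                start[0] + fraction * (end[0] - start[0]),
                start[1] + fraction * (end[1] - start[1]),
            )
            return {"coords": coords, "method": "interpolated"}
    return None  # outside every known block range
```

Either way, exactly one result comes back for the address, and the `method` field lets a caller tell how precise it is.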
Address Beyond Known Range
Users can enter an unknown house number that’s outside all known block ranges in the source data.
Example: 99999 W 26th St, New York, NY
In this case there is no address number 99999 and the block ranges don’t go that high. There are two ways to deal with this scenario:
- return the last block range for W 26th St, New York, NY
- return the middle of the line segments for W 26th St, New York, NY representing the street in general, letting the user know that you’re only returning the street because the house number wasn’t found.
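Both options can be sketched in a few lines. The function below is hypothetical: it assumes block ranges are sorted by house number, treats the street as a straight line between its two ends, and represents "the last block range" by the midpoint of that block, none of which is prescribed by the text above.

```python
def geocode_beyond_range(number, blocks, strategy="street"):
    """Fallback for a house number outside every known block range.

    blocks: list of (low, high, start, end) tuples sorted by low, where
    start/end are the (lat, lon) ends of that block's line segment.
    """
    if any(low <= number <= high for low, high, _, _ in blocks):
        raise ValueError("house number is inside a known block range")
    if strategy == "last_block":
        low, high, start, end = blocks[-1]
        label = f"[{low}-{high}]"
    else:
        # Straight-line simplification: midpoint between the street's two ends.
        start, end = blocks[0][2], blocks[-1][3]
        label = "street"
    midpoint = ((start[0] + end[0]) / 2, (start[1] + end[1]) / 2)
    # The flag lets the caller tell the user the house number wasn't found.
    return {"match": label, "coords": midpoint, "house_number_found": False}


# Illustrative data with made-up coordinates.
BLOCKS = [
    (1, 99, (0.0, 0.0), (0.0, 2.0)),
    (100, 199, (0.0, 2.0), (0.0, 4.0)),
]
```

Whichever strategy is chosen, the `house_number_found` flag keeps the result honest about what was actually matched.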
Users sometimes don’t know a house number so they just enter the street.
Example: W 26th St, New York, NY
There are several ways to deal with this. One approach is to return all the block ranges available in data:
- [1-99] W 26th St, New York, NY
- [100-199] W 26th St, New York, NY
Alternatively, the result returned can be just the lat/lng of the midpoint of the line representing the street.
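Here is a small sketch of both approaches. The helper names are hypothetical, and the midpoint calculation treats coordinates as planar, which is a simplification that is only reasonable over short distances.

```python
import math


def block_range_labels(street, ranges):
    """Format each (low, high) block range as a result label."""
    return [f"[{low}-{high}] {street}" for low, high in ranges]


def polyline_midpoint(points):
    """Return the point halfway along a polyline, measured by length."""
    pairs = list(zip(points, points[1:]))
    lengths = [math.dist(a, b) for a, b in pairs]
    remaining = sum(lengths) / 2
    for (a, b), length in zip(pairs, lengths):
        if length > 0 and remaining <= length:
            fraction = remaining / length
            return (
                a[0] + fraction * (b[0] - a[0]),
                a[1] + fraction * (b[1] - a[1]),
            )
        remaining -= length
    return points[-1]  # single-point or zero-length line
```

Measuring along the line, rather than averaging the end points, keeps the result on the street itself when the street bends.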
Partial Street Geocoding
While rare (at least in the US/CA), it’s possible that a street input can be ambiguous within a city. This normally occurs when multiple streets within the same city share a name but differ in directional or street type.
New York is notorious for partial street geocodes.4
Example: 30 26th St, New York, NY
In this example, the user didn’t supply a directional, like East 26th Street or West 26th Street. When directionals are present in the data but not supplied by the user, a geocoder should return results on all matching streets:
- 30 W 26th St, New York, NY
- 30 E 26th St, New York, NY
If, however, the supplied house number is only applicable to one or the other, a geocoder should only return the address that’s physically possible. For example, 601 is within the known block ranges of W 26th St but not E 26th St (geographically this address would be in the East River).
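A minimal sketch of this rule: try each directional that exists in the data, then keep only the streets whose known range contains the house number. The `STREET_RANGES` table and its limits are invented for illustration (chosen so that 601 exists only on the west side), and `expand_directionals` is a hypothetical helper.

```python
# Hypothetical overall house number ranges per street; not real data.
STREET_RANGES = {
    "W 26th St": (1, 699),
    "E 26th St": (1, 499),
}


def expand_directionals(number, street, ranges):
    """Return every physically possible address for a street given without a directional."""
    if street in ranges:
        candidates = [street]  # the input already matches a street as-is
    else:
        candidates = [
            f"{directional} {street}"
            for directional in ("N", "S", "E", "W")
            if f"{directional} {street}" in ranges
        ]
    # Drop streets whose known range cannot contain this house number.
    return [
        f"{number} {candidate}"
        for candidate in candidates
        if ranges[candidate][0] <= number <= ranges[candidate][1]
    ]
```

With these sample ranges, 30 26th St expands to both the east and west addresses, while 601 26th St resolves only to the west side.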
Example: 23 10th St, NYC
In this example, the user didn’t supply a directional or a borough, but implies that the address is within New York City (New York City is made up of five smaller “boroughs”, and every address falls within one of them). In this case there are multiple 10th Streets in Manhattan (East and West), as well as within the other boroughs. When there are multiple matching areas contained within the search and valid address ranges for all of them, a geocoder should return:
- 23 E 10th St, New York, NY
- 23 W 10th St, New York, NY
- 23 10th St, Brooklyn, NY
- 23 10th St, Staten Island, NY
- 38-23 10th St, Queens, NY 5
Example: 23 10th, New York, NY
In this example, the user supplied neither the directional nor the street type. A geocoder should return:
- 23 10th Ave, New York, NY
- 23 E 10th St, New York, NY
- 23 W 10th St, New York, NY
There is room for flexibility in this example for business rules to help make the decision. For example, if there is point data for some of the addresses but not others, it’s perfectly acceptable to only return results for which there is point data.
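One way such a business rule might look, as a sketch: keep only the candidates backed by point data, unless none are. The candidate dictionaries and the `has_point` flag are hypothetical and exist only for this example.

```python
def prefer_point_data(candidates):
    """Keep candidates backed by address points; if there are none, keep them all."""
    with_points = [c for c in candidates if c["has_point"]]
    return with_points or candidates


# Illustrative candidates for the input "23 10th, New York, NY".
CANDIDATES = [
    {"label": "23 10th Ave, New York, NY", "has_point": False},
    {"label": "23 E 10th St, New York, NY", "has_point": True},
    {"label": "23 W 10th St, New York, NY", "has_point": True},
]
```

Falling back to the full list when no candidate has point data avoids returning nothing for an input that is still plausibly valid.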
Addresses aren’t the only places people search for. In the next installment we’ll examine some of the rules for when people search for towns, states, or countries, and how to give unambiguous results, even when the search could mean many things.
If cracking the code of how we organize the world sounds interesting to you, we’d love to work with you. Our project is 100% open, so it would be great to have you as an open source contributor, and we’re hiring another person to join our Search team to work on geocoding full time.
Of course, searching for named places can take advantage of geocoding, and geocoding can use search technologies. This is what we’re working on: crafting a scalable approach that helps solve both of these problems. ↩
For instance, Germany’s addresses follow the pattern of “StreetName HouseNumber”. Even so, that structure can be parsed out and helps localize results further. A thorough analysis of log files for any geocoder would find inputs that don’t match commonly accepted patterns like this, but such instances are extremely rare. ↩
In fact, collecting point address data is part of the genesis of Google’s Street View. But you don’t have to drive every street to get good address point data. While it takes a lot of effort to create and collect point address data, there’s a remarkable open data project, OpenAddresses, which is building a massive worldwide directory of openly available and openly licensed address point datasets. If you like to hack on open data, contributing to OpenAddresses by hunting down and cleaning up open address data makes for a lot of fun and a big difference. ↩
Addresses in NYC can be ludicrous for a whole host of reasons. There are directional prefixes (north, south, east, and west). There are streets with the same name and valid addresses in different boroughs. We’ll be looking into how to deal with the weird addresses in one of the upcoming posts of this series. ↩
The borough of Queens, NY has a whole host of address anomalies, starting with hyphenated addresses. The unique house number in this example is 23, but because the cross-street is 38th Ave, the full address is 38-23 10th St, Queens, NY. ↩