If you’ve read the first introductory blog post about Who’s On First or the talk I did at FOSS4G introducing the Spelunker much of what follows will be familiar. It was a short talk so rather than get lost in the technical details I tried instead to focus on some of the principles, and statements of bias, that influence the project. Why we’re doing this, rather than how, particularly for people who might not have read those first two blog posts.
My talk was titled “Mapping with Bias” and this is what I said.
No one has any idea what they depict and that’s led to some hilarious speculation about what they “are” ranging from a hockey stick (I am from Canada) to the Lexus car logo to an e-cigarette.
…a pressure measurement instrument used to measure fluid flow velocity.
It goes on to describe “flow velocity” as:
…a vector field which is used to mathematically describe the motion of a continuum.
Who’s On First is not really a pitot tube but I like the idea that there might be an instrument to measure the motion – the velocity – of people’s understanding of place.
Who’s On First is a gazetteer. A gazetteer is a “big bag of places” in which every place has a stable and permanent identifier, supporting metadata and pointers to other IDs in the gazetteer for places with which it has a relationship.
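As a rough illustration of that definition, here is a minimal sketch of a “big bag of places” in Python. The class names, field names and ID values are all made up for the example and are not the actual Who’s On First schema.

```python
from dataclasses import dataclass, field


@dataclass
class Place:
    """One gazetteer record: a stable ID, metadata, and pointers to related IDs."""
    id: int                                           # stable, permanent identifier
    name: str
    placetype: str
    properties: dict = field(default_factory=dict)   # supporting metadata
    related_ids: list = field(default_factory=list)  # IDs of related places


class Gazetteer:
    """The "big bag of places", keyed by ID."""

    def __init__(self):
        self._places = {}

    def add(self, place):
        # An ID is permanent: once assigned it is never handed to another place.
        if place.id in self._places:
            raise ValueError(f"ID {place.id} is already assigned")
        self._places[place.id] = place

    def get(self, place_id):
        return self._places[place_id]

    def related(self, place_id):
        """Resolve pointers to other records, skipping IDs we do not hold."""
        place = self._places[place_id]
        return [self._places[i] for i in place.related_ids if i in self._places]


# Hypothetical IDs, for illustration only.
gazetteer = Gazetteer()
gazetteer.add(Place(id=1, name="California", placetype="region"))
gazetteer.add(Place(id=2, name="San Francisco", placetype="locality", related_ids=[1]))
gazetteer.add(Place(id=3, name="The Tenderloin", placetype="neighbourhood", related_ids=[2, 1]))
```

Everything else (names, geometries, opinions) hangs off the ID, which is why the ID has to stay stable.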
Over 15,000 words have been written about Who’s On First so far because it turns out that gazetteers are a pretty complicated subject.
Rather than trying to squeeze all the details in to a 20-minute talk I am going to focus instead on some of the first principles motivating the project and governing our day-to-day work.
The toxic trinity of “geo” has always been the unholy union of: licensing and coverage and quality.
The aim of Who’s On First is to tackle all three at the same time.
If and when we are forced to “pick two” we will choose licensing and coverage, so that in the end there is always something left over that people can improve as time and circumstances permit.
These decisions make Who’s On First both ambitious and daunting so I have always felt that it is important for us to have a governing bias with which to negotiate the complexities and the quicksand that the project will inevitably yield.
In many ways, these are principles to help us understand what the project is not and to help us understand how things should be even if the technology doesn’t always work as well as we imagine it should, yet.
A gazetteer is a pretty brainy project. Gazetteers are one of those things that don’t seem important at all until they are, at which point they suddenly take on an outsized importance. This means we need to design things in such a way that our work can outlast people’s reluctance.
We need to build something with the patience and the stamina, conceptually and financially, to sit quietly in the corner and be ready to be of service when you are and not before.
What follows are six “umbrella” ideas that we keep in mind as we work towards that goal.
The first is the idea that Who’s On First is a gazetteer of consensual hallucinations.
But there is a good reason that, for example, California alone has seven state planes: There are actual tax dollars, and services like emergency responders, that depend on being able to precisely and accurately locate a thing in a world where latitude can not be neatly subdivided in to equal units across the surface of the globe.
It is worth remembering that coordinate space is one of the truly great abstractions. Being able to reduce the problem, in so many cases, to fit a Cartesian grid has made some pretty amazing things possible.
It’s not just Flickr. There is a long and growing list of companies – really, all companies – whose services are implicitly built around social rather than administrative notions of place. And social notions of place are messy and inexact and complicated. This is the space where Who’s On First sits.
Who’s On First is, by design, not a gazetteer of unitary perspective.
Or put another way, Who’s On First is not a gazetteer of geometries.
One of our earliest decisions was that each record in Who’s On First would contain multiple geometries.
As a rule every place in Who’s On First should contain a so-called “ground truth” geometry. A ground truth geometry is like Benoit Mandelbrot’s map of England which, by definition, means it will always grow in size and detail.
Some places might even have two ground truth geometries, one clipped to the coastline and another that includes territorial waters. These geometries are especially good for reverse geocoding but the salient point in all of this is that the geometries themselves, as often as not, enforce the biases of their use.
There will inevitably be disputed geometries. These are different from places that have been classified as disputed, places like Kashmir or the Golan Heights. These are places where the stakes are not so high. Places where even though there may be officially recognized boundaries people still bicker over the details. Effectively all neighbourhoods, everywhere.
We may disagree on where The Tenderloin starts and stops, for example, but we all agree that The Tenderloin exists.
The purpose and the value of Who’s On First is in giving those notions of place a collective proof. In giving them a mass and weight and a gravity in the universe that other people and products can orbit.
Another core principle of Who’s On First is that every record shares a common set of ancestors.
Hierarchies, in particular administrative hierarchies, vary wildly from country to country. We used to say that all locations in Who’s On First share a common hierarchy but I think that was often more confusing than not.
It is an articulation that lends itself to the idea, incorrectly, that there is a single comprehensive hierarchy which encodes all the relationships between places. That is not what Who’s On First tries to do.
Instead we have said that there are five common placetypes – continent, country, region, locality and neighbourhood – and that every record in Who’s On First, regardless of its specific placetype has at least one of the common placetypes as an ancestor.
This acts as a baseline for a global dataset, both on a conceptual and a practical level. It is important to us that, within reason, we not impose any single architectural approach or set of technical requirements in order to be able to use Who’s On First.
Five “database” columns for encoding a global hierarchy seems like a reasonable trade-off in 2016. If you need to include Brooklyn, NY (which is technically a borough) in your dataset then you’ll need to add a sixth column but that’s your business. Otherwise you can hopefully make do with New York City.
Importantly, unknown place types are not a fatal error. They are left to the needs and discretions of people using Who’s On First for whatever they need to use it for, without sacrificing a common ground where all of these projects can still comfortably hold hands.
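To make the “five columns” idea concrete, here is a small sketch in Python. It assumes a record’s ancestors are available as a plain dictionary mapping placetype to ID; that shape, the function names and the ID values are assumptions for the example rather than the real Who’s On First data format.

```python
# The five common placetypes named above, ordered from largest to smallest.
COMMON_PLACETYPES = ("continent", "country", "region", "locality", "neighbourhood")


def to_common_columns(hierarchy):
    """Flatten a {placetype: id} hierarchy into the five common columns.

    Placetypes outside the common set (for example "borough") are ignored
    rather than treated as errors; missing ancestors become None.
    """
    return tuple(hierarchy.get(placetype) for placetype in COMMON_PLACETYPES)


def has_common_ancestor(hierarchy):
    """True if at least one of the five common placetypes is present."""
    return any(hierarchy.get(placetype) is not None for placetype in COMMON_PLACETYPES)


# A hypothetical place in Brooklyn. The IDs are invented for illustration.
brooklyn_hierarchy = {
    "continent": 1,
    "country": 2,
    "region": 3,
    "locality": 4,   # New York City
    "borough": 5,    # not one of the five common placetypes
}

columns = to_common_columns(brooklyn_hierarchy)
```

The borough is simply dropped from the five-column view, and anyone who needs it can add a sixth column of their own without affecting anyone else.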
There is also a related discussion about places having multiple hierarchies but we don’t have time for that tonight. Suffice it to say that places can and do have multiple hierarchies for much the same reasons that a place might have multiple geometries.
Who’s On First is not a linear scorched-earth view of the world.
Places change. The physical boundaries of the USA changed 141 times between the years 1789 and 1959. The entire notion of what Yugoslavia meant changed three times in the 20th century before finally atomizing in to seven countries, by 2008.
Ultimately there is a much larger question about how an individual, or worse a community, decides whether an event constitutes a simple update versus a fundamental change. This is the realm of hard philosophical questions and those are things we are not going to try to answer.
We can provide breadcrumbs, though. Every record in Who’s On First has both a superseded_by and a supersedes property that are used to signal that a change has occurred but not necessarily why. That part is left up to you.
These properties act as a kind of linked-list for places indicating, for example, that the Kingdom of Yugoslavia was superseded by the Federal People’s Republic of Yugoslavia in 1946, and so on.
This decision means two things:
- That there might be multiple entries for the “same” place in Who’s On First and consumers of the data need to account for this fact.
- That if you have been using the first iteration of a place in Who’s On First its meaning and semantics won’t suddenly change when there is a legitimate reason to create a second iteration.
We do this as a way to foster confidence in the robustness and durability of Who’s On First identifiers. The past is complicated territory and though it is not the focus of our daily work we want to try and make sure that it is always welcome.
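The linked-list behaviour can be sketched in a few lines of Python. This is an illustration only: the dictionary layout, the key names `supersedes` and `superseded_by` as used here, and the ID values are assumptions for the example, and a record is allowed more than one successor because a place can split.

```python
# Hypothetical records keyed by made-up IDs.
records = {
    101: {"name": "Kingdom of Yugoslavia",
          "supersedes": [], "superseded_by": [102]},
    102: {"name": "Federal People's Republic of Yugoslavia",
          "supersedes": [101], "superseded_by": []},
}


def current_versions(records, place_id):
    """Follow superseded_by pointers to the records that have no successor.

    Returns a sorted list because one place may be superseded by several.
    The older records are never modified or removed along the way.
    """
    seen = set()
    stack = [place_id]
    current = []
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue  # guard against accidental cycles in the data
        seen.add(pid)
        successors = records[pid]["superseded_by"]
        if successors:
            stack.extend(successors)
        else:
            current.append(pid)
    return sorted(current)


latest = current_versions(records, 101)
```

Anyone holding the older ID can still look it up and get exactly what they always got; walking the chain forward is something a consumer opts in to.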
Who’s On First is a gazetteer of signal fires.
It’s probably obvious by now but it bears repeating: The world is full of complex and contradictory opinions. We do not want to try and settle those debates. We can not settle those debates.
For almost as long as we’ve had the notion of place itself people have had the benefit of complete sentences and entire paragraphs and even book-length arguments to make sense of the nature and meaning and value of place.
And still we don’t agree so I don’t know why anyone can imagine that a bag of key/value pairs will do better at answering any of these questions.
Obviously there are a few instances where Who’s On First needs to assert some degree of editorial opinion but as a rule we try to do this as infrequently and as transparently as possible.
When there is genuine debate about something we leave it to the consumers of the data to interpret. We want to signal that there is debate about something rather than try to gloss over the awkward bits.
Finally, the data is not the database.
I mentioned at the beginning that Who’s On First was designed to “outlast people’s reluctance”.
What this means, in concrete terms, is that at its core Who’s On First is a gigantic bag of plain-text files. The failure scenario for updating a Who’s On First record should always be the ability to edit it using nothing more than a text editor. You shouldn’t have to do that but when everything else breaks you still can do that.
The point is not that Who’s On First doesn’t play with databases but that it should be able to play nicely with all the databases. The point is that the demands Who’s On First places on its users should be as universal as possible across platforms and concerns.
Sometimes this makes getting things set up a little harder than we’d like but it’s 2016 and we’ve all gotten pretty good at processing text files at scale and feeding them in to databases.
Despite all the advances we’ve made over the years it turns out that the simplest, most universal and accessible thing is still plain-old, plain-vanilla, plain-text files on disk.
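As a sketch of what “feeding text files in to a database” can look like, the following Python walks a directory of plain-text records and indexes them in SQLite. It assumes each record is a JSON file with `id` and `name` keys; the file layout, table name and function name are assumptions for the example, not a description of the actual Who’s On First tooling.

```python
import json
import sqlite3
from pathlib import Path


def load_records(root, conn):
    """Index every *.json file under `root` into a SQLite table.

    The files on disk remain the source of truth; the table is only an
    index and can be dropped and rebuilt at any time.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS places "
        "(id INTEGER PRIMARY KEY, name TEXT, body TEXT)"
    )
    count = 0
    for path in sorted(Path(root).rglob("*.json")):
        text = path.read_text(encoding="utf-8")
        record = json.loads(text)
        # Keep the original text so nothing is lost in translation.
        conn.execute(
            "INSERT OR REPLACE INTO places (id, name, body) VALUES (?, ?, ?)",
            (record["id"], record.get("name"), text),
        )
        count += 1
    conn.commit()
    return count
```

Calling `load_records("data", sqlite3.connect("places.db"))` would build the index, and swapping SQLite for any other database only changes the few lines that talk to `conn`.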
This focus – of demanding a high degree of portability and durability in our work – is very much influenced by the early systems designs for the Unix operating system, and Multics before it, and more recently the Unicode project.
But that is the work.
Thank you. If you’d like a sticker send up a flare.
- Bound Print (France); Purchased for the Museum by the Advisory Council; 1921-6-559-19
- Paper Construction, Buck Rogers, 25th Century Featuring Buddy and Allura in “Strange adventures in the Spider Ship”; Attacked by the Giant Reptile, ca. 1935; Collection of Smithsonian Institution Libraries
- Figure, ca. 1960; porcelain, enamel; Gift of Ludmilla Shapiro; 1993-13-2
- cosmonauts and rocket Figure, 1960–70; Made by Gzhel Porcelain Factory; porcelain, enamel; Gift of Ludmilla Shapiro; 1993-13-1
- Figure; biscuit; Gift of Eleanor and Sarah Hewitt; 1931-88-89-a/d
- Triton Figure, 19th century; Manufactured by Sèvres Porcelain Manufactory (France); France; biscuit porcelain; Gift of Eleanor and Sarah Hewitt; 1931-88-88-a/d
- Ink Plot; computer ink plot; 1981-19-1