The Search team recently closed out another successful milestone. We integrated the Mapzen Who’s on First gazetteer into our underlying Pelias engine. For more details on why and how we did that, check out our previous post. As tradition dictates in the software engineering realm, we held a post-mortem after-party! It’s a long meeting dedicated to celebrating what we did right, and understanding and learning from everything else. For more details on the benefits and goals of post-mortem meetings, check out this article.
To prepare for this after-party, we made a list of questions grouped by general themes to help drive discussion, then we wholeheartedly dove into each grouping. The notes and takeaways of post-mortems are typically kept private, if they are ever looked at again. Mapzen isn’t a typical company, however, and we believe in the power of being open across almost everything we do, so why should our post-mortem findings be any different?
We’re sharing our takeaways in order to generate discussion with our users and contributors. We want to be transparent so the community can understand our plans toward improvements and help us celebrate the things we already do really well. We would also love to get feedback from the community on what’s working and what could be better. We’re talking about process and communication here, not just geocoding features. To give you a preview of how post-mortem meetings typically go:
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way… – A Tale of Two Cities, Charles Dickens
So let’s dive right into our notes, shall we?! We’ve included the questions along with corresponding groupings for context.
Do we feel like we accomplished what we set out to accomplish from the beginning of the milestone? Did what we were building conceptually change at any point?
Overall we do feel like we accomplished our goals and didn’t redefine the scope to any great extent. There were a few things we ended up pushing out of the milestone, but they didn’t have a huge impact on the milestone goals. Tackling those things separately makes more sense anyway, in a more incremental fashion.
We ended up postponing the ability to return geometries from the /place endpoint and held off on making decisions about how to handle multiple administrative hierarchies for places.
It was hard to stay on task because there were heaps of other things we wanted to work on, and in the end we had to choose what not to do. However, we don’t feel like we took a lot off the table.
We also thought we would be able to sunset the use of Geonames in our service, but it proved to be a key dataset to many users, so we spent many additional cycles refactoring our Geonames importer and making sure that data was still available and better than ever. It was an unexpected twist and we dealt with it as best we could. In the future, we should determine the viability of sunsetting any feature or dataset more concretely before assuming it would be ok to remove it.
In addition to our actual goals, we ended up refactoring MOST of our code! There were parts that hadn’t been touched in years and we did a lot of work to bring them all up to our current quality standards, along with proper tests… hooray for testing!!! This felt great and was necessary, but did have a significant impact on our timeline. In some cases, we weren’t being honest with ourselves when estimating tasks, so it felt like some things dragged out beyond expectation. That made us sad.
The awesome side-effect of the Winter of Refactoring is that it significantly raised team ownership and competency in ALL parts of our codebase. Just a few short months ago we welcomed two new members to the team, and as expected, learning the ropes in a large distributed codebase takes time. With each team member rotating through the various components and all of us participating in frequent deep-dive sessions, we were able to get everyone up to the same level of code ownership in a much shorter period of time.
Reflect on size of iterations, effort estimation, scope and determine if we know enough now to make good decisions going forward?
Our biggest failure in this project was assuming the feature would be straightforward and easy enough to do without a major upfront spec. We didn’t break the large milestone into smaller bite-sized pieces and worked in a way that didn’t allow for incremental releases. As things came up that would inevitably postpone the release, we were never in a release-ready state and had to keep putting off rolling any changes to production. The BIG release became bigger and scarier the more work went into it. Let’s never do that again!!!!
Looking back, we could’ve released just the Who’s on First importer first, bringing that new data into our build without changing all the other importers. Once that was out, we could’ve focused on the other existing importers, like OpenStreetMap, OpenAddresses, and Geonames, one at a time, then turned our attention to API updates to support the newly imported/updated data.
We have agreed as a team that going forward we will abide by a 2-week release cycle. If something can’t be done in 2 weeks, it should be broken up in such a way that it can span several 2-week release cycles. We hope the community of users and contributors will help keep us honest here! :)
During the integration process, we became extremely dependent on another team for some of the work. We were integrating with a brand new moving-target of a project. This added a level of complexity we didn’t account for in our estimates. Even simple things, like cross-team communication, were often complicated by timezones, since our two teams span Berlin-New York-San Francisco!
Being the first consumer of a gazetteer certainly helped identify many areas of improvement on both sides of the equation. We consider that a great success and undervalued byproduct of this milestone. The team was also in agreement that the level and turnaround time of the support coming from the Who’s on First team was really great and very much appreciated.
Daily stand-ups? Too (in)frequent? Too long/short? Too (un)structured?
There was shared sentiment that we tend to get sidetracked during our daily standups. We’ve agreed to do our best to stay on track and stick to the traditional standup format of “what did you do yesterday?”, “what will you do today?”, “what are, if any, your current roadblocks?”. We’ll hold stand-up after-parties (you can tell we really like parties and like to keep them going as long as possible) to discuss things at a greater depth. Those not directly involved in after-party discussions can choose to stay or go, without judgement.
Did everyone feel connected / supported / informed enough to perform their assigned tasks?
Generally, knowledge sharing and support from teammates have been at an all-time high! We’ve gotten into the regular habit of doing deep-dive sessions, where one person shares their screen and walks the others through a single feature they are working on, or we triage a reported issue in the same way. These weekly, if not more frequent, sessions have really elevated our ability to understand what everyone is working on beyond the superficial. They have also allowed us to rotate through various parts of the code with more ease. We’ve probably at least doubled the bus factor numbers across all parts of the codebase. More deep-dives, FTW!
Stats and Insight
We’ve focused so much of our attention on improving search results and building out features that we’ve neglected the supporting utilities and charting/logging. It’s become evident that we need a proper dashboard that we can look at every stand-up. More insight into service performance, parameter usage, confidence levels of results, etc., would increase our confidence that we’re making the right decisions. It would also allow us to verify that those decisions are having the expected impact. We’ve already made this a priority for Q2 of 2016 and are working on making it a reality. We’ll do our best to make sure those dashboards are public, so the community can benefit from the insight as well.
What are the highlights and pain points of our current build process, branching strategy, merging, and deploying to various environments?
During this milestone we were working with the following stacks, where a stack consists of an Elasticsearch cluster and an API server:
development stack, with a full world index
staging stack, which is used to build production quality Elasticsearch snapshots and test them before promoting to
With only a single development server, we weren’t able to test features in a timely fashion. To keep delays during development and testing to a minimum, we’ve decided we really need two development stacks. This will allow us to make changes to the Elasticsearch index during experiments while still continuing to make API-only improvements. We agreed that we should only conduct one schema/index experiment at any given time.
We’ve also decided to use the staging stack to continuously run fresh builds. We currently only kick off a new build once a week unless something urgent comes up. We’d like to keep that stack busy with constant back-to-back builds that will only pause if something breaks or acceptance tests fail at the end of a build. Pausing after a problem gives us a chance to triage and restart the process manually once things are cleared up. This means data changes in OSM and other sources will be reflected in search much sooner: within 3 days instead of 7! Get excited!!!
We talked a bit about our current branching strategy and decided that it warranted its own meeting. Stay tuned for some of the highlights from that in the future.
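For the curious, here is a rough sketch of how the back-to-back staging builds with pause-on-failure might be modeled. This is not our actual build tooling: the `BuildLoop` class and the `run_build` and `run_acceptance_tests` callables are hypothetical names made up for illustration, and a real setup would call out to the importers and the acceptance test suite instead.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BuildLoop:
    """Runs builds back-to-back and pauses on the first problem."""

    # Hypothetical hooks: each returns True on success, False on failure.
    run_build: Callable[[], bool]
    run_acceptance_tests: Callable[[], bool]
    paused: bool = False
    history: List[str] = field(default_factory=list)

    def run(self, max_builds: int) -> int:
        """Run up to max_builds builds; return how many finished cleanly."""
        completed = 0
        while completed < max_builds and not self.paused:
            if not self.run_build():
                # The build itself broke: stop and wait for triage.
                self.history.append("build failed")
                self.paused = True
            elif not self.run_acceptance_tests():
                # The build finished but the results are not acceptable.
                self.history.append("acceptance tests failed")
                self.paused = True
            else:
                self.history.append("ok")
                completed += 1
        return completed

    def resume(self) -> None:
        """Manual restart once someone has triaged the problem."""
        self.paused = False
```

The key design choice is that nothing restarts automatically: once `paused` is set, the loop stays idle until a person calls `resume()`, which mirrors the manual triage-and-restart step described above.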
Did we capitalize on opportunities to be public about our work? Did we do a good job supporting our growing community of users and contributors?
The general sense was that it’s hard to blog about a work in progress. We did put out some great posts about What Geocoding Means and how we envision it done right. We’ll be shooting for a much more predictable blogging cycle. Ideally, we’ll have something interesting, even if small, to share with the community. Team members will take turns being on the hook for a post.
One major concern that surfaced around outreach is that external contributors need more love!!!! We’ve been lucky to have some great pull requests come through during this milestone and we dropped the ball on responding to those within a reasonable timeframe. We decided it’s really important to respond quickly to a PR, especially if we don’t think it’s something we can ever merge or it doesn’t fit in with our current roadmap.
We realized that it’s important to clearly outline our expectations for contributions: be clear about the importance of providing unit tests and including a solid description of what feature the changes are implementing/fixing.
I don’t know about you, but it felt great for the team to get closure and have some plans for what’s next. We all went out to celebrate with a round of mini-golf on Pier 25 in NYC!