TL;DR

  • Create a timeline for the day so people know what to expect
  • Set a final commit time for everyone who needs to get code to production on upgrade day
  • Jenkins helps control what code makes it out to production
  • Vagrant ensures consistent dev environments
  • Ansible makes it easy to provision your servers
  • Set a clear point of no return. If you’re not 100% comfortable with the way things are going, cut out early and try another day

Question: How do you upgrade the driver between your data store and web application layer, for a team of dozens of developers, building a product for a massive user base that runs on dozens of web servers — all without putting the brakes on code commits, and without any production downtime?

Here at HootSuite, we use MongoDB and PHP, among many other things. In between these two technologies sits a driver that, like any driver, needs to be updated from time to time. In this article, I’ll describe how we recently completed this tall task, and how we did it without stalling productivity and without downtime.

Our Senior Ops Engineer described this process as: “A semi-truck rolling down the highway at 50 miles an hour. You bring a tow truck up to one side, raise it up, replace all the wheels, drop it back down, raise the other side, replace all the wheels on that side, and drop it back down to carry on its journey without ever stopping.”

Create a timeline for the big day

In part 1 of this series, “Mongo Message Migration,” I recommended creating a plan leading up to your migration and posting it for all to see. In this instance, because you’re upgrading server software without a major change in functionality, it’s less important that everyone in the company know your plan. There is one group, however, that needs to know exactly what’s going on: your fellow engineers.

At some point you will likely need a release freeze, during which only you and the others involved in the upgrade commit code to production. Later, once the driver is updated, everyone will need to update their development environments. It helps if they know how the day will unfold, so they can plan their meetings, lunch, etc.

Here’s the timeline I used for our big day

Set a final commit time for everyone else

At HootSuite, we release to production many times a day, every day. If it’s the same on your team, you will need to halt production releases at some point, so that only your code changes are out in the wild while you’re testing.

I would highly recommend stopping production and staging releases during the day until all servers are fully upgraded and tested. If you allow people to release to production and something goes wrong with their code, it may be difficult to tell whether the problem lies with your update or theirs. Keep things simple on such a major upgrade.

Jenkins is great for controlling code that gets released

Continuous deployment at HootSuite is in large part enabled by our highly customized Jenkins build system. The PHP-Mongo library upgrade was yet another situation where our Jenkins automated deployment processes helped reduce risk by ensuring that each release happened in sync, with a clear path of escape if we needed to go back.

Vagrant keeps all your devs in order

When you allow a bunch of engineers to control their own dev environment, the results will usually vary greatly. Something as simple as a driver upgrade in their environment could become a bit of a nightmare. You need a way to standardize everyone’s development space so that everyone is seeing the same thing, and so changes like a driver upgrade are easily accomplished.

Vagrant is a tool that allows you to write a script to create a development environment. When you do this:

vagrant up

… Vagrant builds your dev environment automatically from the configuration you wrote (the Vagrantfile) and launches a VM with all of the required settings. This keeps everyone working with the same setup and configuration. Committing your Vagrantfile to your revision control system makes it easy to keep everyone up to date with the settings.

Ansible allows you to script your server updates easily

Imagine you have hundreds of web servers all serving an app to your users. Let’s assume that they’re all running exactly the same software. Now you need to upgrade a driver on every single one of them. Difficult task, right? Do you jump in and start upgrading each of them by hand? This is doable, but takes a lot of time.

With Ansible, you only need to edit a couple of files, called playbooks, written in simple declarative YAML. You write your configs, launch Ansible, and POW: it connects to all of your servers and provisions them correctly for you.
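One way to picture a rollout that never takes the whole site down is to upgrade the servers in batches and stop as soon as a batch misbehaves. Ansible playbooks offer a serial keyword for this kind of batching. The Python below is only an illustrative sketch of the idea, not Ansible itself and not the actual HootSuite tooling: the host names, the two-batch split, and the upgrade_batch callable are all assumptions made for the example.

```python
def batches(hosts, batch_count=2):
    """Split hosts into at most `batch_count` groups of near-equal size."""
    # Ceiling division; max(1, ...) keeps range() valid for an empty list.
    size = max(1, -(-len(hosts) // batch_count))
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def rolling_upgrade(hosts, upgrade_batch, batch_count=2):
    """Upgrade one batch at a time and stop at the first failing batch.

    `upgrade_batch` is a hypothetical callable that takes a list of hosts
    and returns True when the whole batch is healthy after the upgrade.
    Returns the hosts that were upgraded successfully.
    """
    upgraded = []
    for batch in batches(hosts, batch_count):
        if not upgrade_batch(batch):
            break  # leave the remaining batches on the old driver
        upgraded.extend(batch)
    return upgraded


# Hypothetical host names and a stand-in upgrade step that always succeeds.
web_servers = ["web1", "web2", "web3", "web4", "web5"]
print(batches(web_servers))
print(rolling_upgrade(web_servers, lambda batch: True))
```

Stopping at the first failed batch means the untouched servers keep serving traffic on the old driver while you decide what to do next.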

Have a clear, defined point of no return

We had a group of five people testing all of our Mongo-related features on staging once the process was underway. I defined one hour as our available testing time, and 30 minutes after that to have all critical issues dealt with before rolling out. If we didn’t meet these time criteria, all hope would be lost for the day, and we would roll back and reschedule the transition.

Thankfully, this did not occur, but it sure helped our stakeholders feel good that there was a clear plan to bail on such a major process if things weren’t going well. This is also where it came in super handy to have our releases held back. If anything went wrong, we could roll my changes back, deploy everyone else’s changes for the day, and delay our server upgrade to another day.
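The time box described above comes down to a little arithmetic: one hour of testing, then 30 minutes to clear critical issues, then a decision. Here is a minimal Python sketch of that rule. The function name, the returned strings, and the 10:00 start time are hypothetical and exist only to make the example concrete.

```python
from datetime import datetime, timedelta

# Windows taken from the plan described above.
TESTING_WINDOW = timedelta(hours=1)
FIX_WINDOW = timedelta(minutes=30)


def go_no_go(testing_started, now, open_critical_issues):
    """Return the next action for the upgrade at time `now`."""
    testing_ends = testing_started + TESTING_WINDOW
    point_of_no_return = testing_ends + FIX_WINDOW

    if now < testing_ends:
        return "keep testing"
    if open_critical_issues == 0:
        return "roll out"
    if now < point_of_no_return:
        return "fix critical issues"
    return "roll back and reschedule"


# Hypothetical example: staging tests began at 10:00.
started = datetime(2013, 1, 1, 10, 0)
print(go_no_go(started, datetime(2013, 1, 1, 10, 30), 2))
print(go_no_go(started, datetime(2013, 1, 1, 11, 15), 1))
print(go_no_go(started, datetime(2013, 1, 1, 11, 30), 1))
```

Writing the rule down this plainly, even on a whiteboard, removes any debate on the day about whether it is time to bail.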

This is not an easy process if you don’t have the right tools. I hope my learnings above give you some insight into how to make an upgrade like this run smoothly.

Cheers