So you’ve written this great new feature and you can’t wait for your customers to see it. The tests are passing and it looks good on your laptop. Now you just have to get it out in the open – but how do you do that?
The answer is the deploy system – the process of making the technology we create available to our customers in the shortest time possible, while checking that no bugs are introduced along the way. There are many possible ways to build and deploy software, each with its own merits. Focusing time and attention on an effective build system means delivering functionality to customers instead of losing time on processes that could be automated.
Over the past few months, we’ve been working on a new deploy system for Hootsuite Analytics. We can’t stress enough how important this process is – without it, we’d have great products that nobody would ever see. Having millions of customers, we want to be able to quickly and safely release new changes to all of them – this requires a deployment system that is consistent, fast and finds as many issues as possible before shipping out changes. This post describes our move to a Blue-Green deployment system for our Hootsuite Analytics product.
What exactly are we deploying?
To understand our needs, let’s briefly go through what we need to deploy. There are two main components – the web servers (we also call them web nodes) and the data processing pipeline. This post is all about the web nodes, so let’s focus on those. They are what customers directly interact with, so besides having to be up all the time, we cannot have any interrupted requests. This makes deployment a bit tricky, because we need to be sure that a server is not shut down while replying to a request.
Where we started from
First, let’s talk about our old deploy system. We inherited this from the old uberVU days, where we only had enterprise customers with pretty regular usage patterns. Thanks to this, we had different scaling needs and a limited set of requirements. To accommodate deploys, we had a simpler process based on fabric. There was a known list of web servers and on every one of them we’d run a set of commands that would basically fetch the latest code available and restart the servers. Everything was integrated with GitHub, so a push to the staging or master branch would trigger the respective deploy and so on. This is the simplified version of events but the main idea is that we rolled the latest code out to all the servers.
This approach clearly has drawbacks. First of all, requests might be dropped, so we had to be careful when deploying. Adding a new server (to cope with a traffic spike for example) would require updating config files. An emergency revert to the previous state was pretty tedious (we did keep the older versions of the code on the servers in order to do this via a symlink, but it still required some work), and so on.
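To make the symlink-based revert concrete, here is a minimal sketch of the idea. It is not our actual fabric code: the `releases/` layout, the `current` link name and the `activate_release` helper are all assumptions made for the example, and it assumes a POSIX filesystem.

```python
import os
import tempfile


def activate_release(root, release):
    """Point <root>/current at <root>/releases/<release> (hypothetical layout)."""
    target = os.path.join(root, "releases", release)
    if not os.path.isdir(target):
        raise ValueError("unknown release: %s" % release)
    link = os.path.join(root, "current")
    tmp_link = link + ".tmp"
    # Create the new link under a temporary name, then rename it over the
    # old one. The rename is atomic on POSIX, so "current" never disappears.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, link)
    return os.path.realpath(link)


def demo():
    """Deploy v2, then revert to v1, inside a throwaway directory."""
    root = tempfile.mkdtemp()
    for name in ("v1", "v2"):
        os.makedirs(os.path.join(root, "releases", name))
    activate_release(root, "v2")         # regular deploy
    return activate_release(root, "v1")  # emergency revert
```

Even in this simple form, a revert still means running the switch on every server and restarting the processes, which is the "some work" mentioned above.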
Where we’re going
To eventually scale Hootsuite Analytics to the millions of Hootsuite customers, we needed a better solution. We came up with a list of requirements:
- builds should be reproducible – no surprises between local and staging/production
- improved scalability
- deploying code in production should be as safe as possible – it would be great to first test the new version on a small set of customers
- emergency reverts should be easy
- we should catch as many errors as possible before deploying to customers
Looking at available approaches, we decided to go for Blue-Green deployment. The core idea is that there are two instances of the production environment – the current, working one (called “blue”) and the one being deployed (called “green”). The blue environment serves most of the customers, and a few of them are directed to the green one following deployment. If all our system health checks look good, we scale up the green environment and we turn off blue. Otherwise, green is destroyed and everything starts over. After we turn off blue, we do not destroy it immediately, in case an error slipped through unnoticed. Keeping the blue environment around until we’re certain the new changes did not introduce any bugs makes it easy to revert to the working state by simply re-activating it.
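The lifecycle described above can be sketched as a tiny state holder. This is only an illustration of the idea, not our tooling: the `BlueGreen` class, its method names and the cluster names are hypothetical.

```python
class BlueGreen:
    """Tracks the live cluster, the candidate and the one kept for reverts."""

    def __init__(self, live):
        self.live = live        # "blue": currently serving customers
        self.candidate = None   # "green": the version being deployed
        self.standby = None     # old blue, turned off but not destroyed

    def deploy(self, cluster):
        self.candidate = cluster

    def promote(self, healthy):
        """Switch to the candidate if health checks passed, else drop it."""
        if self.candidate is None:
            raise RuntimeError("nothing to promote")
        if not healthy:
            self.candidate = None  # green is destroyed, blue stays live
            return self.live
        # Green becomes live; the old blue is kept around for quick reverts.
        self.standby, self.live, self.candidate = self.live, self.candidate, None
        return self.live

    def revert(self):
        """Re-activate the previous live cluster."""
        if self.standby is None:
            raise RuntimeError("no standby cluster to revert to")
        self.live, self.standby = self.standby, None
        return self.live
```

For example, after `bg = BlueGreen("web-v1")`, `bg.deploy("web-v2")` and `bg.promote(healthy=True)`, calling `bg.revert()` makes `web-v1` live again without rebuilding anything.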
The main pieces of technology we decided to use to put our plan in motion are the following:
- Docker – for stateless, reproducible builds together with docker-compose for providing the needed services such as databases when developing
- Jenkins – for implementing and orchestrating all the jobs
- Asgard – for creating clusters and handling ELBs
- ELB Sticky Sessions – for making sure our customers don’t get randomly routed between blue and green
- Hubot – for making it easy for developers to trigger deploys
In a nutshell, we build a Docker image that has all the code within it, we package everything as an AMI and then we spawn machines using it. Let’s see how we achieve this, step by step.
Our new deployment pipeline
Meet our new deployment pipeline. Almost all of it is specified using the awesome Jenkins Build Flow plugin. Should any step of the flow fail, our system sends a HipChat notification to our #Deploy room. Also, the build flow plugin makes everything easy to visualize.
Triggering a deploy
Once a developer wants to merge a pull request into the testing or master branch, all she needs to do is write @AnalyticsBot merge <branch-name> in <branch-name>. This starts the Jenkins flow.
The first step of the pipeline checks that the branch that needs to be merged applies cleanly on top of the target branch.
Run tests on the merged branch
Although every pull request (PR) has its tests run and goes through code review, it is sometimes possible that tests pass on the tip of the PR but fail once it is merged with the target branch (which may have picked up some other changes in the meantime). This is very rare in our experience, but it really causes problems when it happens. To avoid this, when deploying we create a temporary branch that is the result of merging the PR branch on top of the target branch, and re-run all tests on it.
Note: the target branch stays the same, nothing has been merged there yet.
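A rough sketch of this step, assuming plain git commands driven from Python. The `check_merged_branch` helper, the `deploy-tmp/` branch prefix and the `origin` remote name are assumptions for the example, not our actual Jenkins job.

```python
import subprocess


def run_git(args):
    """Run a git command; raises CalledProcessError if git fails."""
    subprocess.run(["git"] + list(args), check=True)


def check_merged_branch(pr_branch, target_branch, run_tests, git=run_git):
    """Merge the PR into a temporary branch and run the tests there."""
    tmp = "deploy-tmp/" + pr_branch  # hypothetical naming scheme
    git(["fetch", "origin"])
    # -B creates (or resets) the temporary branch at the tip of the target.
    git(["checkout", "-B", tmp, "origin/" + target_branch])
    # A conflicting merge makes git exit non-zero, which aborts the flow.
    git(["merge", "--no-ff", "--no-edit", "origin/" + pr_branch])
    # The target branch itself has not been touched at this point.
    return run_tests()
```

The `git` parameter exists only so the command sequence can be exercised without a real repository, for instance by passing `calls.append` and inspecting the recorded arguments.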
Building the web node AMI
With all tests passing, we start working on the bundle that we want to deploy. For this, we launch an EC2 instance (we call it the bake instance). In it, we build the Docker image needed to run the web code. When all is done, we package everything into an AMI and store that image.
Creating the green cluster
Using Asgard and the AMI we’ve just created, we start a new, green cluster. Right now, the cluster is running and is capable of accepting requests, but it is not subscribed to any ELB.
Run live tests against the green cluster
Besides the regular, “offline” tests we also have a set of live tests that are designed to be executed against a running server. They mimic the requests that our front-end performs when customers interact with the website. These tests increase our confidence in the deployment system, because we also get to test the actual running service before customers get to interact with it.
To achieve this, we have a separate test ELB for every environment (staging/production). We subscribe the green cluster to this ELB (so the test runner can reach it), we run the tests and finally we unsubscribe the cluster.
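The subscribe, test, unsubscribe sequence maps naturally onto a context manager, so the cluster leaves the test ELB even when a test blows up. The sketch below uses an in-memory `FakeELB` stand-in rather than a real AWS or Asgard client; every name in it is hypothetical.

```python
from contextlib import contextmanager


class FakeELB:
    """In-memory stand-in for a load balancer (not a real AWS client)."""

    def __init__(self):
        self.clusters = set()

    def register(self, cluster):
        self.clusters.add(cluster)

    def deregister(self, cluster):
        self.clusters.discard(cluster)


@contextmanager
def subscribed(elb, cluster):
    """Keep the cluster behind the ELB only for the duration of the block."""
    elb.register(cluster)
    try:
        yield elb
    finally:
        # Runs on success and on failure alike.
        elb.deregister(cluster)


def run_live_tests(elb, cluster, tests):
    """Run each test callable while the cluster is reachable via the ELB."""
    with subscribed(elb, cluster):
        return all(test() for test in tests)
```

Here each test is just a callable returning True or False; in practice it would issue the same HTTP requests the front-end makes.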
Merge the code into the target branch
By now, we are pretty confident that these changes work. They have passed all the tests we’ve thrown at them. This step simply merges the code into the target branch and publishes these changes (by pushing to GitHub).
If the target branch is master, at this point the GitHub PR is automatically closed (by GitHub) and the PR branch is deleted (by another job). If the target branch is testing, we have an extra job that adds an In Testing label to the PR.
Providing the list of required steps
The green cluster is live, but there are no requests routed to it at the moment. We still need to perform a few steps. To make onboarding easier, after every deploy we display a notification with the command needed to list the next steps. Through that command, our bot helpfully lists what needs to be done.
Switching to the new cluster
Now we simply have to route customers to the new cluster. We do this by subscribing the green cluster to the appropriate ELB. The cluster starts out small, so it will only handle a fraction of the requests. If no errors show up, we can scale up the green cluster and remove the blue one from the ELB. All the interactions with AWS (starting clusters, scaling them, subscribing to ELBs and so on) are handled by Asgard. Also, thanks to Connection Draining, we won’t drop requests.
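As a back-of-the-envelope illustration of why a small green cluster only sees a fraction of the traffic, assume the ELB spreads new sessions roughly evenly across registered instances. That is a simplification (sticky sessions and uneven load will skew it), and the `green_share` helper is hypothetical.

```python
def green_share(blue_instances, green_instances):
    """Approximate fraction of traffic reaching green, assuming even spread."""
    total = blue_instances + green_instances
    if total == 0:
        raise ValueError("no instances registered")
    return green_instances / total
```

With nine blue instances and one green instance, `green_share(9, 1)` gives 0.1, so roughly one request in ten would reach the new code under that assumption.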
Finally, if all is still well, we can destroy the blue cluster. That’s it, the changes are now fully and safely deployed!
We have a shortcut for our staging environment – everything up to (not including) destroying the blue cluster happens automatically.
After switching to the new system, we started to notice that building the AMI took close to 20 minutes. We were fine with the overhead of the new deploy system, but we felt that we could improve this particular operation.
We ended up bringing the AMI creation step’s time somewhere south of 10 minutes. We achieved this by “caching” some steps (we now run them when building the base image that the bake machine starts from) and by eagerly starting the bake machine (i.e. before it is actually needed).
Another issue we had was that changes to the data pipeline had to also go through this process, and it meant waiting for unnecessary steps related only to deploying the web node. To get around this, we separated what jobs are triggered based on what folders are changed in a pull request. Thus, if we only have changes that don’t affect the web node, steps such as building the AMI are skipped.
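The folder-based job selection can be sketched like this. The folder prefixes and job names are made up for the example; the real mapping lives in our Jenkins configuration.

```python
# Hypothetical repository layout and job names, for illustration only.
WEB_PREFIXES = ("web/",)
PIPELINE_PREFIXES = ("pipeline/",)


def jobs_for(changed_files):
    """Pick the jobs to run based on which folders a pull request touches."""
    jobs = ["run_tests"]  # tests always run
    if any(path.startswith(WEB_PREFIXES) for path in changed_files):
        # Only web changes need a new AMI and a green cluster.
        jobs += ["build_ami", "create_green_cluster", "run_live_tests"]
    if any(path.startswith(PIPELINE_PREFIXES) for path in changed_files):
        jobs.append("deploy_pipeline")
    return jobs
```

A pull request touching only `pipeline/etl.py` would then yield `["run_tests", "deploy_pipeline"]` and skip the AMI build entirely.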
We’ve built a deployment system that we feel is easy to use (the developer only has to trigger the deploy and switch the clusters – and all interactions are performed through HipChat) and covers all the requirements we outlined. Thanks to it, we can now easily deploy changes with high confidence. And if things go wrong, we are able to catch on to them early and quickly revert to the last working state.
Furthermore, thanks to Docker and the way in which we’ve structured the build flow jobs, nothing is tied in any way to a particular programming language. Thanks to this, if tomorrow we were to drop Python and rewrite everything in Scala, we could still use the same deployment pipeline to publish the changes to our customers!
Of course, this is just the first step in our quest for 100% automated continuous delivery. Ideally, we want to reach a state where developer interactions are no longer needed. This iteration puts us on the right track, and we’ll continue working to improve this process!
Mihnea is part of the Hootsuite Analytics backend team in Bucharest, Romania. Whenever he’s not working on features of the new Analytics product, he’s looking out for ways to improve the tooling and infrastructure behind the product. You can follow him on Twitter @mihneadb.