This is an adaptation of a talk I gave at our weekly departmental Lightning Talks. It is the story of how we migrated our link shortener, Ow.ly, to new infrastructure.
Once upon a time, ow.ly was hosted on a single instance, and all was well. At least that’s what I have been told — it’s a story lost to the mists of startup history and engineers reminiscing over drinks.
Ow.ly grew into a successful product, expanding as most LAMP stacks do. Memcache was added to improve performance, load balancers were placed in front of hosts running Apache+PHP, and the MySQL database was moved to separate hosts.
At the beginning of 2013, after four years of growth, ow.ly had become a pool of web servers fronted by a trio of load balancers, with a quartet of memcache hosts and four copies of the 2.8TB database spread between regions. With the exception of the database growing beyond 1TB (the maximum volume size in most clouds), ow.ly ran mostly on autopilot during this time.
During 2012, our Operations team began implementing a “hybrid cloud” approach to our infrastructure, spread across several locations, to increase overall stability. One of these locations is in San Jose, California. It has great network connectivity and is close to our partners. It was time to move ow.ly to this new infrastructure.
Our first task was to normalize the instances. As a rapidly growing startup, we sometimes find surprises hidden under rugs in our infrastructure that were perfectly acceptable choices when we were smaller, but need to be updated to our current practices.
Ow.ly was built using Golden Images, on an older Ubuntu Linux, using the web server as the base image to create other classes of server. Today we use Ansible to create source-controlled server images, we use Kerberos for login authentication, we use StatsD + Graphite for gathering statistics, and our base image uses a newer operating system.
We recreated the ow.ly hosts using our current image and best practices, and we re-learned a great lesson that we all knew, but chose to ignore — change only one variable at a time. Changing the cloud technology, network architecture, operating system, and application at the same time creates a serious case of “it used to work” that frustrates troubleshooting. Next time we’ll iterate updates first in the same region, then move once tested.
To begin the migration, we first pre-positioned data, on hosts built from our new server templates, in each region that would be operational post-migration. This process took approximately two weeks, covering both the bulk data copy and the MySQL slave catch-up. Once done, we had shiny new slave databases running current operating systems and versions of Percona MySQL.
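The post does not say how we decided a slave had caught up, so the following is only a minimal sketch of one possible check. It assumes a row of `SHOW SLAVE STATUS` has already been fetched into a dict; the function name and the five-second threshold are hypothetical.

```python
def replica_caught_up(status, max_lag_seconds=5):
    """Return True when a replica is replicating and close to its master.

    `status` is assumed to be one row of SHOW SLAVE STATUS as a dict.
    The threshold is an illustrative choice, not a value from the post.
    """
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    # Seconds_Behind_Master is NULL (None) when replication is not running.
    lag = status.get("Seconds_Behind_Master")
    if not (io_ok and sql_ok) or lag is None:
        return False
    return int(lag) <= max_lag_seconds


# Example: a replica that is still working through two weeks of backlog.
still_copying = {
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "Yes",
    "Seconds_Behind_Master": 86400,
}
print(replica_caught_up(still_copying))
```

A check like this would be polled until it returns True before moving on to the next step.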
Shadow deploy & limited dry run
Next we did a shadow deploy of the entire ow.ly service to the new region. We cloned the pre-positioned slave database to create a local master that could serve as a test database and be thrown away once testing had completed.
Once we were confident that the application was running correctly, we pointed it at the real database and we implemented a sneaky dry-run by transparently redirecting our own 250+ employees to the newly deployed stack. We could also have moved a percentage of live users, but we decided that the internal employee feedback loop would give faster notification of issues.
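How the employee redirect was implemented is not covered here. As an illustration only, one common approach is to route by source address: requests from office networks go to the new stack, everyone else stays on the old one. The network range below is a documentation-only placeholder and the names are hypothetical.

```python
import ipaddress

# Hypothetical office range (a reserved documentation block, not a real one).
EMPLOYEE_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def choose_stack(client_ip, employee_networks=EMPLOYEE_NETWORKS):
    """Return "new" for requests from employee networks, otherwise "old"."""
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in employee_networks):
        return "new"
    return "old"


print(choose_stack("203.0.113.7"))   # inside the placeholder office range
print(choose_stack("198.51.100.1"))  # any other visitor
```

The same function could pick by a hash of the client address instead, which would give the percentage-of-live-users variant mentioned above.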
Go… but be ready to Not Go
Here we go! We redirected our East Coast load balancers to forward to our new West Coast load balancers and left the new stack talking to the existing East Coast database. This gave us the option to immediately pull the kill-switch and revert to the now-idle original hosts. The downside was a minimum 160ms increase in response time, as we were crossing the continent twice. We only ran ow.ly in this mode for about an hour, long enough to confirm all components could handle the load.
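The 160ms figure can be read as two cross-country round trips: one between the load balancers and one between the application and the database. The sketch below assumes a round trip of about 80ms, a value inferred from halving 160ms rather than one we measured, and shows why the number is a minimum: every additional database query in a request adds another round trip.

```python
def added_latency_ms(cross_country_rtt_ms, db_round_trips=1):
    """Estimate extra response time while the stack spans both coasts.

    One round trip is spent on the East-to-West load balancer hop, plus
    one per database query sent back to the East Coast master.
    """
    lb_hop = cross_country_rtt_ms
    db_hops = cross_country_rtt_ms * db_round_trips
    return lb_hop + db_hops


print(added_latency_ms(80))                    # single query per request
print(added_latency_ms(80, db_round_trips=3))  # a chattier request
```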
Change the Source Of Truth
Finally, we moved the master database to the new stack, updating our source of truth for the system. New slave databases were re-pointed to the new master, and the old databases were finally decommissioned.
Commit to the Change and Clean up
After all systems had been changed over, we finally made the user-facing change to DNS records and pointed end users to the new load balancers directly. The ow.ly service also hosts a vanity URL service for HootSuite Enterprise clients, which we couldn’t update immediately, so we left the existing load balancers in place for six months to allow our amazing support team to communicate and coordinate the change.
Today, ow.ly continues to run within our West Coast infrastructure. We’ve grown way beyond early expectations, serving more than a billion shortened URLs, tens of millions of clicks a day, tens of thousands of SQL queries per second, and having painfully crossed the barrier of 32-bit IDs in tables (>4.2 billion rows!).
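For readers wondering where the 32-bit barrier sits, the arithmetic is short. The sketch assumes an unsigned 32-bit ID column, which is what the figure above suggests; the actual column types are not stated here.

```python
UINT32_MAX = 2**32 - 1  # largest unsigned 32-bit value
INT32_MAX = 2**31 - 1   # largest signed 32-bit value
UINT64_MAX = 2**64 - 1  # largest unsigned 64-bit value


def fits_unsigned_32bit(row_id):
    """Return True if row_id can be stored in an unsigned 32-bit column."""
    return 0 <= row_id <= UINT32_MAX


print(UINT32_MAX)                           # 4294967295
print(INT32_MAX)                            # 2147483647
print(fits_unsigned_32bit(UINT32_MAX + 1))  # False
```

Once the next ID no longer fits, the column has to be widened to a 64-bit type, which on a table of that size is a slow and painful change.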
To summarize the steps we went through:
- Pre-position data
- Shadow deploy & limited dry run
- Go, but be ready to Not Go
- Change Source Of Truth
- Commit to the change
- Clean up
I hope this helps you with your future migration!