Upgrading a session storage system that services 5000 requests/second without downtime is no easy task. As part of continued efforts to harden our systems at Hootsuite, we moved from a legacy Memcached and MySQL session storage to a multiple availability zone Redis setup using Amazon ElastiCache. We ran into some issues along the way, but in the end we successfully migrated millions of sessions to the new system without any downtime. This post describes our strategy, the gotchas, and step-by-step of how we made this happen.

image04

User Sessions

Sessions are a critical service for all of Hootsuite’s users. When a user visits Hootsuite.com, they are assigned a unique identifier – a session id. We store temporary data for that session, such as whether they’re a member or not, in a record identified by that session id. On every request we get the session id from a cookie, and then look up the data associated with that id. If we can’t look up the data for that session id, then we don’t know who the user is and so will log them out immediately. Because session data is needed for every request, a robust session service is crucial.

Our Old System

During the early days of Hootsuite, sessions were stored using Memcached and MySQL Galera Cluster. Galera Cluster for MySQL is a multi-primary configuration that uses synchronous replication across its nodes. A write on any node will be replicated to all others before it is committed. The benefits of the multi-primary setup is that if one node goes down, the other nodes can continue to service reads and writes since they are also primaries. Memcached was used to cache the sessions to reduce read load on MySQL.

Original Session Storage Diagram
Session storage with Memcached and MySQL Galera Cluster

We had two main problems with this setup:

  1. If a Memcached node went down, the huge increase in reads on the database could cause MySQL to go down too
  2. Certain conditions could cause writes across all MySQL nodes to lock, freezing the cluster. Resolving this issue required manual intervention.

If any of these issues occurred, all of Hootsuite.com went down because everyone was essentially logged out and could not log back in.

New System Criteria

We needed a solution that would store sessions without the headaches caused by Galera. Our requirements were that the session store must be:

  1. resilient to failure, favouring availability during system degradation
  2. performant, since it would be called on every request
  3. scalable for future growth

In the end we chose, Redis and Amazon ElastiCache for our session storage. ElastiCache Redis perfectly met our system criteria:

  1. Resilient to failure: Used in thousands of production deploys, Redis has a proven record of stability. If it runs low on memory, it just evicts older records faster. Additionally, with ElastiCache we can place replica nodes in multiple availability zones and use a single endpoint that will seamlessly switch to a replica if a primary goes down.
  2. Performant: Redis runs in memory and we configure it without disk flushes. Its read and write complexity is O(1), so it performs the same regardless of data size.
  3. Scalable: Because our sessions expire, we simply set an expiry on our records in Redis and our memory usage reaches an equilibrium, currently at 5GB. If we need to scale further, we can replicate to a node with more memory and then promote it to primary.

Redis and Amazon ElastiCache

Redis is key-value store that stores all its data in RAM. Redis implements replication using a primary and replica where:

  • Writes go to the primary. Reads can go to either the primary or replica.
  • Writes are replicated to the replicas by asynchronous/non-blocking replication which means that a write is “committed” without waiting to be replicated
  • The primary can have multiple replicas, which in turn can have other replicas connected to them

Amazon ElastiCache is a service that provides an easy way to build a highly resilient Redis setup with primary/secondary in multiple zones with automated failover. It sounds great, but we wanted to see the failure modes for ourselves. We talked to the Amazon engineering team, and they were able to explain and simulate most of the data recovery scenarios.

When a node failure occurs, ElastiCache automatically detects and replaces failed nodes while keeping all the data intact. Working in collaboration with the AWS engineering team, we asked them to purposely fail our primary node and let the automatic replica promotion kick in while we timed the whole process. We noted that it took less than 60 seconds for the replica node to be successfully promoted as primary and start serving client requests. It also gives us the option to manually failover to read-replicas, and the failover happens in seconds. During a failover it even propagates DNS from the failed primary to the new primary so we don’t have to change our application’s endpoint. Finally, the biggest advantage of using ElastiCache was being able to have read-replicas in multiple availability zones.

The overview for the new session storage system is shown in the figure below:

New storage system
New storage system

Implementing in Production Without Downtime

After we decided upon our technology, we needed to test and implement it in production without downtime. Building the system and then switching over to it all at once is a recipe for disaster. Instead we needed a way to slowly test the system in production and then revert back to the old system if there were any issues.

At Hootsuite we rely heavily on “dark launching”, checking new code into production and then turning on and off execution of that code through a management console for a certain percentage of requests or users.

Dark Launch Stage 1

In the first stage, we wanted to test reads and writes to Redis but still rely on the old setup. Starting at 1% of requests and ramping up to 100%, we mirrored traffic on both MySQL + Memcached and Redis. On every read to MySQL + Memcached we issued the same read request to Redis, and the same for writes. During this testing we gained a lot of knowledge about how the system performed on production traffic.

First run
First run

Which PHP Client: phpredis vs. Predis

We started out using phpredis, a PHP extension, since it was already in the codebase. We noticed a high number of timeouts, 10s per second, during our peak traffic of 5300 requests/sec. Session service timeouts are unacceptable, because it means a user’s request fails or they’re suddenly logged out.

As an experiment we tried Predis, a PHP-only library. However, this somewhat predictably caused worse timeouts and almost crashed the web servers when run at near 100% traffic.

We switched back to phpredis and added retries for failed attempts. This helped, but at peak traffic we were still seeing timeouts. We needed another solution.

Twemproxy

One theory we had about the timeouts was that there was just too much network activity being sent during peak load, causing some packets to drop. Every single request would open up a new TCP connection to Redis and we were servicing 5300 operations/second. To reduce the network traffic, we decided to try a connection pool.

Twemproxy (pronounced too-em-proxy) is a Redis and memcached connection pool written by engineers at Twitter. It runs on each of our web servers, and the PHP workers connect via localhost to the daemon, which pools the requests and sends them along a single connection to the destination Redis server.

When we tried out Twemproxy during peak loads, we eliminated almost all timeouts. There were just a few cases where suddenly a single server would experience 100s of timeouts within 30 seconds.

Turn off Logging

We turned up logging on Twemproxy to try and diagnose the issue, but it actually exacerbated the problem. When we turned off logging completely, we experienced no more timeouts. Twemproxy usually runs without logging turned on however we had initially deployed it with logging since we wanted to be able to debug. We didn’t dig deeper but we correlated the timeouts with I/O delay on our AWS boxes so our theory was that Twemproxy was getting delays on log writes causing it to operate slower and hit timeouts on requests.

Dark Launch Stage 2

Finally, we had everything in place. For the next stage we continued writing to both setups but reads would come from Redis. We continued to write to MySQL + Memcached, so if we needed to switch back to the old system it would still have all the current sessions. We ran using this setup for a week without issues until all unexpired sessions for the week were in Redis.

Second try
Second try

Dark Launch Stage 3

After Stage 2 surfaced no further issues, we were confident that we were ready to switch over to Redis 100%. We turned off writes to Memcache/MySQL and were relying solely on Redis and Twemproxy. Success!

Final attempt!
Final attempt!

The final system overview is shown in figure below:

Final system overview
Final system overview

End Results

  • Since we launched Redis 9 months ago, we have had 100% uptime (knocks on wood)
  • With Twemproxy we haven’t seen any timeouts
  • Adding Twemproxy to our technology stack later helped us with connection pooling for Memcached as well

It took some time and a lot of trial and error getting here, but we’re really pleased with the results!

About the Authors

image07

Mishra is an Operations Engineer at Hootsuite. He works closely with the Operations Engineering and Platform teams to build, deploy, and maintain cloud infrastructure at Hootsuite. He enjoys playing soccer, DJing, and loves coding in his free time. Follow Mishra on Twitter @anubhavm.

 

 

image00

 

Luke is a Software Engineer on the Platform Team at Hootsuite. He’s currently working on splitting Hootsuite’s PHP monolith into Scala micro services so we can scale even further. When he’s not at his laptop you can probably find Luke hanging off a cliff rock climbing somewhere. Follow him on Twitter at @lkysow.