The DataLab team at Hootsuite is responsible for storing the constant firehose of data that our 10 million users generate for product intelligence analyses. All useful insights stem from a robust storage-and-retrieval system that we can rely on. Recently, we migrated our primary storage back-end from Apache Hadoop to Amazon's Simple Storage Service (S3) in order to scale our ability to analyze data. What follows is a short retelling of the journey we took and the lessons we learned while handling the nozzle.

[Image: Hydrant]

Logstash as a Centralized Data Publisher

Our user data is funnelled from our web servers to a three-node Logstash cluster. Logstash was designed for this very purpose, and we throw application, system, and service logs at it. The most important characteristic of Logstash here is that it does not store the logs itself; it collects them and passes them along through separate output modules. This critical cluster acts as a centralized data publisher and decouples the producers on the web servers from the consumers, which only need to make direct connections to independent data sinks such as a cloud-hosted Hadoop Distributed File System (HDFS) cluster, Amazon S3 buckets, and a scalable Elasticsearch cluster.


output {
  # Hadoop listens over ZeroMQ
  zeromq {
    address => ...
    mode => ...
    topology => ...
    sockopt => ...
  }

  s3 {
    access_key_id => ...
    secret_access_key => ...
    bucket => ...
    endpoint_region => ...
    canned_acl => ...
  }

  elasticsearch {
    bind_host => ...
    host => ...
    port => ...
    cluster => ...
    max_inflight_requests => ...
    node_name => ...
  }
}

Elasticsearch is Great for Search

We store the last two weeks of data in Elasticsearch for hot full-text searches. This is generally enough to give our Customer Happiness team a powerful view into the application for debugging purposes. However, this becomes less and less scalable as the dataset grows: our Elasticsearch cluster takes in over 500 GB per day (and growing) with a replication factor of two. When the Kibana queries start firing off one after another, Elasticsearch becomes merciless towards your soon-humbled compute resources.
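
To give a sense of what those hot searches look like, here is a minimal sketch in Python using the elasticsearch-py client. The host, index pattern, and field names are illustrative; Logstash's default daily indices and @timestamp field are assumed.

# Sketch of a full-text search over the two-week hot window.
# Host, index pattern, and fields are assumptions, not our exact setup.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch.internal:9200"])

response = es.search(
    index="logstash-*",
    body={
        "query": {
            "bool": {
                "must": [{"match": {"message": "PHP Fatal error"}}],
                "filter": [{"range": {"@timestamp": {"gte": "now-14d"}}}],
            }
        },
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": 50,
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_source"]["@timestamp"], hit["_source"]["message"])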

The Value of Data Depreciates Over Time

Our continuous integration pipeline allows us to release over ten times a day. Is it necessary to know that a PHP error bubbled up several months ago when we have already deployed the fix? Beyond a two-week timeframe, we did not see a good tradeoff between Elasticsearch's accessibility and its maintenance cost. That being said, we still had to store those valuable logs somewhere for later data mining.
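
One straightforward way to enforce a window like this, assuming Logstash's default daily logstash-YYYY.MM.DD indices, is a small cleanup job along these lines (a sketch, not our exact tooling):

# Sketch of a daily job that drops Elasticsearch indices older than 14 days.
# Assumes Logstash's default daily index naming; the host is hypothetical.
from datetime import datetime, timedelta
from elasticsearch import Elasticsearch

RETENTION_DAYS = 14

es = Elasticsearch(["http://elasticsearch.internal:9200"])
cutoff = datetime.utcnow() - timedelta(days=RETENTION_DAYS)

for index in es.indices.get(index="logstash-*"):
    # Index names look like logstash-2015.06.01
    day = datetime.strptime(index.split("-", 1)[1], "%Y.%m.%d")
    if day < cutoff:
        es.indices.delete(index=index)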

Early Wins with Hadoop

No one can deny that Hadoop has a great ecosystem of diverse developer support. Its open-source offering of a distributed file system underneath a MapReduce engine really motivated us to pick it as our long-term storage platform. We then set up Apache Hive, which our product analysts could use for ad hoc queries in a very accessible SQL-like language.
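
The sketch below shows the flavour of ad hoc query this enabled, submitted from Python with the PyHive client. The connection details, table, and column names are made up for illustration.

# Sketch of an ad hoc HiveQL query submitted through PyHive.
# Host, table, and columns are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive.internal", port=10000, username="analyst")
cursor = conn.cursor()

# A typical summarization: error counts per service per day.
cursor.execute(
    """
    SELECT service, to_date(event_time) AS dt, COUNT(*) AS errors
    FROM logs
    WHERE level = 'ERROR'
    GROUP BY service, to_date(event_time)
    ORDER BY dt
    """
)

for service, dt, errors in cursor.fetchall():
    print(service, dt, errors)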

Hive has a Sting

Writing a MapReduce application is somewhat of an art. Hive abstracts this process by decomposing a straightforward HQL query into separate map and reduce operations. While convenient, it does not always produce the most efficient MapReduce jobs for a particular operation. As we piled more and more jobs into our ETL process, we quickly ran into the limits of our cluster's resources. Routine jobs would take almost a day to complete, which cut into our ability to run ad hoc queries. There were two options at that point: either scale out our cluster with more powerful nodes or use a faster alternative for the repetitive summarization and aggregation tasks. We chose the latter by leveraging a job service built on top of Apache Spark. Spark offers a great abstraction called Resilient Distributed Datasets (RDDs) for manipulating data at scale. In practice, in-memory distributed computation reduced the runtime of our ETL jobs by up to 90% compared to Hive.
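
As a rough sketch of what that shift looked like, the same kind of daily aggregation can be expressed directly against the raw logs with PySpark's RDD API (the HDFS path and the JSON layout of each log line are assumptions):

# Sketch of a Hive-style daily aggregation rewritten as a PySpark RDD job.
# The input path and log schema are illustrative.
import json
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("daily-error-counts")
sc = SparkContext(conf=conf)

lines = sc.textFile("hdfs:///logs/2015/06/*")

counts = (
    lines.map(json.loads)
    .filter(lambda event: event.get("level") == "ERROR")
    .map(lambda event: ((event["service"], event["@timestamp"][:10]), 1))
    .reduceByKey(lambda a, b: a + b)  # shuffled in memory rather than through MapReduce
)

for (service, day), total in counts.collect():
    print(service, day, total)

sc.stop()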

[Image: Spark]

S3 Performs Great with Spark

While we were burning rubber with Spark, HDFS began to have trouble keeping up. The Spark workers would slam the Hadoop cluster with data requests that would take down entire data nodes. Nodes also needed to be painfully respun from time to time when Amazon decided to retire their hardware. This created a high operational maintenance cost even with excellent service monitoring and server orchestration tools like Sensu and Ansible.

In the end, we decided to write a custom Logstash output module that piped our data directly to an S3 bucket and redirected queries to the corresponding AWS endpoint. S3 was now responsible for storage, replication, and responding to queries from our Spark service. Since making the switch, we have not noticed any availability or durability issues. The only disadvantage was a modest increase in network latency compared to Hadoop, since our Hadoop and Spark clusters had sat in relatively close proximity to each other. That was not overly concerning considering the massive benefits that S3 provided.
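
Pointing the Spark jobs at S3 was then mostly a matter of swapping the input path for an S3 URI and supplying credentials. A hedged sketch, where the bucket, prefix, and the s3n:// scheme are illustrative and depend on the Hadoop S3 connector in use:

# Sketch of the same aggregation reading straight from the S3 bucket that the
# Logstash s3 output writes to. Bucket, prefix, and credential handling are
# assumptions; credentials can also come from IAM roles.
import json
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("daily-error-counts-s3")
sc = SparkContext(conf=conf)

hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "...")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "...")

lines = sc.textFile("s3n://example-log-bucket/logs/2015/06/*")

error_counts = (
    lines.map(json.loads)
    .filter(lambda event: event.get("level") == "ERROR")
    .map(lambda event: (event["service"], 1))
    .reduceByKey(lambda a, b: a + b)
)

print(error_counts.collect())
sc.stop()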

Conclusion

The transition from Hadoop to S3 made a lot of sense to us given the way we structured our data and collection pipeline. With better infrastructure in place, our team can glean insight from our user data faster than ever before. There are many other platforms out there that we are still excited to explore, such as distributed linearly-scalable databases, streaming compute engines, and time-series databases. If you have worked or are working with data infrastructure, we would love to hear about your architecture experiences below or on Twitter.

About the Author
Po is a software engineer on the DataLab team at Hootsuite. He works closely with the Operations team and Product Analysts to create data systems for intelligent product insights. Follow him on Twitter at @poliu2s.