Lately, there’s been a giant demand for data and statistics that can be used to help make business decisions. If you’re working as a developer or engineer on a team that’s tasked with gathering this data, you’ve probably noticed that this demand has translated into more information and events being tracked, and more pressure on the systems you use to extract, transfer, and load your data. You may have even run into a situation where your current systems start to buckle under the increased load, and reliability and stability begin to degrade.

This year, the Data Lab team at Hootsuite ran into this problem and came to the conclusion that using Logstash in our data pipeline wasn’t working: we needed to migrate to something else. This blog post outlines what we did to make our transition smoother, and will hopefully provide some useful tips if your team is looking to change part of your extract-transform-load (ETL) system.

Our Data Lab team recorded almost a 40% increase in the number of events we were tracking over a one-year period

First Steps

The first step was to analyze our current setup to determine the problem. When you’re comparing candidates to replace your old product, it’s key to be able to clearly articulate both what you need but aren’t currently accomplishing, and what you’re achieving now and want to keep doing. In our case, the biggest problem with Logstash was that it was no longer reliable enough for the increasingly important role our data was playing, and it couldn’t be used to record data from front-end sources in our dashboard. We weren’t able to manage our cluster effectively and were plagued by downtime. Worse still, data seemed to go missing for a period after each outage, and we suspected that even in a healthy state we were losing a consistent percentage of data. Missing and inaccurate data was our biggest problem with Logstash, and therefore our top priority to resolve when making the migration. On the other hand, what we did like about Logstash was that it had evolved organically from the product’s logging and wasn’t overly complex. As a smaller team, we needed a replacement that stayed simple: we couldn’t devote an exorbitant number of hours to getting the new pipeline up, or to maintaining it once we moved on to new projects.

By analyzing our current problem, we came up with the following criteria:

  • The new product had to be easy to manage
  • It had to have minimal downtime, and a way to recover data when downtime did occur
  • It had to be more reliable when sending data from the dashboard, without slowing down actions on the dashboard

With these in mind, we were able to weigh our options for a Logstash replacement more effectively. We decided on Kinesis because we were confident we could manage the system, it stores 24 hours of data at a time, and it sends messages over TCP rather than UDP.
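For context, sending an event to Kinesis looks roughly like the sketch below. This is a minimal example using boto3; the stream name, region, and event fields are placeholders rather than our actual setup.

```python
# A minimal sketch of putting one event onto a Kinesis stream with boto3.
# The stream name, region, and event fields here are illustrative only.
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def send_event(event: dict, stream_name: str = "etl-events") -> None:
    """Send a single tracked event to Kinesis.

    Because the Kinesis API runs over HTTPS (TCP), a failed delivery raises
    an exception we can log and retry, instead of vanishing silently the way
    a dropped UDP packet would.
    """
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event.get("user_id", "unknown")),
    )
```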

The best decision our team made during the whole process was to create a small proof of concept with the replacement before diving into replacing our entire legacy system: it’s so much better to find out if your choice is unsuitable for what you’re trying to accomplish before you’ve made the full replacement instead of after. By making a proof of concept, you can start to compare the new system to your requirements earlier and effectively measure if you’re actually getting what you need from the system.

We built our proof of concept by adding new logging for front-end events in our dashboard, a big request from our team and something we hadn’t been able to achieve with Logstash. If, like us, your team has a requirement that can’t be met by your current system, it’s a great feature to implement in your proof, since it clearly shows the new system doing something the old one can’t. If you don’t have one specific feature that needs implementing, but rather need overall performance improvements (like being able to handle more load), it may be more useful to reimplement an existing section of your system so that you can make direct performance comparisons between the old and new implementations.
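To make that concrete, here is a sketch of what a dashboard-side endpoint for front-end events could look like: it accepts a small batch of events from the browser and forwards them to Kinesis. The framework (Flask), stream name, and field names are assumptions for the sake of the example, not our actual implementation.

```python
# A sketch of a dashboard endpoint that receives front-end events and forwards
# them to Kinesis. Flask, the stream name, and the event fields are all
# illustrative choices, not the real Hootsuite setup.
import json

import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)
kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM_NAME = "frontend-events"  # hypothetical stream name

@app.route("/track", methods=["POST"])
def track():
    # The browser sends a small batch of events per request so we don't slow
    # down dashboard actions with one round trip per click.
    events = request.get_json(force=True).get("events", [])
    if events:
        kinesis.put_records(
            StreamName=STREAM_NAME,
            Records=[
                {
                    "Data": json.dumps(e).encode("utf-8"),
                    "PartitionKey": str(e.get("memberId", "anonymous")),
                }
                for e in events
            ],
        )
    return jsonify({"accepted": len(events)}), 202
```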

Making the Change

When building our new system, the most important and beneficial decision we made was to keep the old and new systems running in parallel. This meant the new system had to accept the same inputs as the old one, and all of its outputs had to be compatible with the next steps in our processing pipeline. We were not looking to redo our entire ETL process, just one system within it. Even if your team needs to replace multiple systems, and it might seem less time consuming to overhaul them all at once, doing them one at a time saves an enormous amount of QA effort because it becomes much easier to show correctness. By running the old and new systems at the same time, and sending any data that needed to be recorded from the dashboard through both, we were able to make side-by-side comparisons of both the raw data from the dashboard and the aggregated results.
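In practice, running in parallel comes down to a dual-write step. The sketch below assumes a simple TCP input on the Logstash side and a Kinesis stream on the new side; hosts, ports, and stream names are placeholders.

```python
# A minimal sketch of dual-writing one event to both pipelines during the
# parallel run. Hostnames, ports, and the stream name are illustrative.
import json
import logging
import socket

import boto3

logger = logging.getLogger(__name__)
kinesis = boto3.client("kinesis", region_name="us-east-1")

LOGSTASH_HOST, LOGSTASH_PORT = "logstash.internal", 5000  # hypothetical
STREAM_NAME = "dashboard-events"                          # hypothetical

def record_event(event: dict) -> None:
    """Send the same event through both the old and the new pipeline."""
    payload = json.dumps(event).encode("utf-8")

    # Old path: Logstash (best effort, as before).
    try:
        with socket.create_connection((LOGSTASH_HOST, LOGSTASH_PORT), timeout=1) as sock:
            sock.sendall(payload + b"\n")
    except OSError:
        logger.warning("Logstash write failed", exc_info=True)

    # New path: Kinesis. Failures here surface as exceptions we can retry.
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=payload,
        PartitionKey=str(event.get("member_id", "unknown")),
    )
```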

Isolating the component you are replacing leads to a smoother migration and easier QA

Although we did not do this during our migration, it would have been even better if we had built up our new system in ‘vertical slices’ of minimum viable products for different events and QA’d after each slice (rather than building in ‘horizontal layers’ and doing a bulk QA at the end). That way, if we had any sort of fundamental misunderstanding of how to apply our new system, it could have been caught and corrected much earlier instead of after the entire implementation was complete. This strategy can be especially useful for teams worried they won’t have the resources to keep the entire old and new systems running in parallel for the whole migration. Although not perfectly safe, it is far more feasible to gradually strip out slices of the old system once the same slice of the new system has been developed and QA’d, so that only a small component of each system is running in parallel at any one time.

Checking your Results

An important part of our migration process, and unfortunately one we underestimated, was QAing the data we processed through each system. This is the time to perform a sanity check on the two datasets, identify the differences, and understand what the exact impact on stakeholders will be. Some differences might be expected: we saw a steady baseline of about a 2% increase in data from the new Kinesis system, and fewer outages, which helped show that the new system was meeting the requirements the old one was failing. However, some of our differences were completely out of left field. The most striking was that our new system was reporting almost 30% more scheduled messages being sent out daily than the old one. These differences required a large time investment from our team to resolve, because they weren’t an immediately apparent logging mistake by one of the systems. Instead, they required digging through the raw data and comparing specific fields rather than just the aggregates we needed. In the case of our message discrepancy, we had to not only compare the raw data from the two systems but also collaborate with other teams to gather more information when there was no clear answer as to which dataset was correct.
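As a rough illustration of what field-level QA can look like, here is a sketch that compares per-day, per-event-type counts from the two pipelines, assuming each system’s raw events have been loaded into a pandas DataFrame with ‘event_type’ and ‘timestamp’ columns (the column names are assumptions for the example).

```python
# A sketch of a field-level comparison between the raw data from the old and
# new pipelines. Column names ('event_type', 'timestamp') are assumptions.
import numpy as np
import pandas as pd

def compare_daily_counts(logstash_df: pd.DataFrame, kinesis_df: pd.DataFrame) -> pd.DataFrame:
    """Count events per day and event type in each dataset and report the % difference."""
    def daily_counts(df: pd.DataFrame, label: str) -> pd.Series:
        return (
            df.assign(day=pd.to_datetime(df["timestamp"]).dt.date)
              .groupby(["day", "event_type"])
              .size()
              .rename(label)
        )

    merged = pd.concat(
        [daily_counts(logstash_df, "logstash"), daily_counts(kinesis_df, "kinesis")],
        axis=1,
    ).fillna(0)
    merged["pct_diff"] = (
        (merged["kinesis"] - merged["logstash"])
        / merged["logstash"].replace(0, np.nan)
        * 100
    )
    # The biggest discrepancies (like our 30% scheduled-message gap) float to the top.
    return merged.sort_values("pct_diff", ascending=False)
```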

To help your team QA successfully, make sure you store your raw data and not just aggregates. Once you have the raw data, it’s a lot easier to pinpoint what went wrong, especially if you have some sort of visualization tool. We didn’t have a way to visualize our raw data, and some of our errors required a lot of different queries to track down. Many times, once we found the root issue, we realized that if the data had been visualized the difference would have been highlighted almost immediately.
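Even a quick plot of the comparison table from the previous sketch would have caught most of these. Here is a minimal example with matplotlib; any dashboarding tool would do the same job.

```python
# A quick way to eyeball one event type from the comparison table above.
# Uses matplotlib directly; a proper dashboarding tool would work just as well.
import matplotlib.pyplot as plt
import pandas as pd

def plot_event_type(merged: pd.DataFrame, event_type: str) -> None:
    """Plot old vs. new daily counts for a single event type."""
    subset = merged.xs(event_type, level="event_type")
    subset[["logstash", "kinesis"]].plot(marker="o")
    plt.title(f"Daily '{event_type}' events: Logstash vs. Kinesis")
    plt.ylabel("events per day")
    plt.xlabel("day")
    plt.show()
```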

Conclusion

The strategies we found most beneficial, and that we would definitely use again the next time we need to migrate a system in our ETL pipeline, are: having a clear understanding of what your current system lacks and what your new system needs to accomplish; creating a proof of concept to show your new choice is capable of what you need before you get bogged down in reimplementing the system; running your old and new systems in parallel while QAing them; and having the ability to compare low-level datasets rather than just aggregates. The biggest thing we missed that would have helped our migration was better planning. We didn’t really understand how to budget our efforts, and ended up backloading our work into the QA process. What we should have done instead was focus more on the MVP for our new system and plan the QA as we went, rather than leaving a very time-intensive process to the end. Hopefully you’ll be able to learn from what we did right and what we missed during our migration so that your own migrations go smoothly and successfully!

About the Author

Alexandra is a co-op software developer at Hootsuite on the Data Lab team and spent a portion of her term helping with the migration from Logstash to Kinesis. She loves tackling challenging problems and uses that enthusiasm as an avid maker who counts knitting and sewing as some of her favourite hobbies.