Software-as-a-service organizations live and die on service reliability, performance, and uptime. Hootsuite monitors these key metrics for all our services, and effective monitoring is especially vital as we steadily migrate to a service-oriented architecture. In this post, we outline how our Analytics team monitors Hootsuite services.

Latency

If you’ve used an API service before, you’ve probably run into these issues:

  • A service you depend on unexpectedly stops responding
  • Unusually slow response times

We need to be prepared for these scenarios, and the first step is monitoring. Afterwards, we’ll be able to define clear expectations for our API responses.

Long Running API Requests

We’ve been happily using StatsD to collect metrics in our system for some time. Shown below is a graph of maximum versus average response times over a 2-hour timeframe; next to it, we’ve plotted the number of requests over the same timeframe:

[Screenshot: maximum vs. average response times, with request counts, over a 2-hour timeframe]

The averages graph can be very deceptive: you wouldn’t expect those 15-second spikes when most requests respond in under 20ms. You might think: there are only a couple of long requests in a 2-hour timeframe. Surely they can’t be that relevant, since most responses come in under 20ms. Maybe it’s the busiest time of the day? Maybe it’s the busiest day of the week or month?

Monitor your service latency so that you can answer those questions clearly: when IS the busiest time of the day or the busiest day of the week? Keep in mind that the maximum response time is per minute: for each minute, the graph in Graphite displays only the highest value encountered in that minute.
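To see why the average hides those spikes, here is a minimal Python sketch. The sample values are invented for illustration (999 requests at 20ms plus a single 15-second request), and the summarize helper is our own name, not part of StatsD or Graphite.

```python
import math
import statistics


def summarize(latencies_ms):
    """Return the mean, 90th percentile, and maximum of latencies in ms."""
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: the smallest sample with at least
    # 90% of all samples at or below it.
    rank = max(1, math.ceil(0.9 * len(ordered)))
    return {
        "mean": statistics.mean(ordered),
        "p90": ordered[rank - 1],
        "max": ordered[-1],
    }


# Hypothetical minute of traffic: 999 fast requests, one very slow one.
samples = [20] * 999 + [15000]
summary = summarize(samples)

if __name__ == "__main__":
    print(summary)
```

With these made-up numbers, the mean stays just under 35ms and the 90th percentile sits at 20ms, while the maximum reports the full 15 seconds.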

Collecting Data in StatsD

From the reference manual:

The statsd server collects all timers under the stats.timers prefix, and will calculate the lower bound, mean, 90th percentile, upper bound, and count of each timer for each period (by the time you see it in Graphite, that’s usually per minute). The *upper bound* is the highest value statsd saw for that stat during that time period.

Seeing a 15-second response time on the graph doesn’t mean that it was the only long-running request in that period: the upper bound only reports the slowest one.
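For context, a timer reaches StatsD as a small UDP datagram of the form name:value|ms. The sketch below shows one way to send such a timer from Python using only the standard library. The metric name api.example.request, the helper names, and the host are assumptions for illustration; port 8125 is the usual StatsD default, so adjust it to your setup.

```python
import socket
import time


def format_timer(name, elapsed_ms):
    """Build a StatsD timer datagram: '<name>:<value>|ms'."""
    return f"{name}:{int(round(elapsed_ms))}|ms"


def send_timer(name, elapsed_ms, host="127.0.0.1", port=8125):
    """Send a timer over UDP without waiting for any reply."""
    payload = format_timer(name, elapsed_ms).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    start = time.perf_counter()
    time.sleep(0.05)  # stand-in for the real API call
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Hypothetical metric name; StatsD would file it under stats.timers.
    send_timer("api.example.request", elapsed_ms)
```

Every request sends its own timer, and StatsD reduces all the timers of a flush period to the summary values quoted above. That reduction is why several slow requests in the same period show up as a single upper bound.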

If you are plotting multiple series in Graphite to find the long-running requests, make sure to combine them with the maxSeries function. It preserves the maximum time found at each point across all of the series.
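As a rough illustration, the following Python sketch builds a Graphite render URL that wraps a wildcard of timer series in maxSeries. The host graphite.example.com and the metric pattern stats.timers.api.*.upper are placeholders we made up; substitute your own Graphite host and metric paths.

```python
from urllib.parse import urlencode


def max_latency_url(base_url, series_pattern, window="-2h"):
    """Return a render URL plotting the per-point maximum across series."""
    target = f"maxSeries({series_pattern})"
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url}/render?{query}"


# Hypothetical host and metric pattern, for illustration only.
url = max_latency_url("http://graphite.example.com", "stats.timers.api.*.upper")

if __name__ == "__main__":
    print(url)
```

Swapping maxSeries for averageSeries in the same URL is a quick way to compare how much of the spike an average across series would hide.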

Resolution of the Data Stored in Graphite

If you choose to plot the series for longer periods of time, make sure you understand how Graphite stores older data points. If you put the rule shown below in storage-schemas.conf, then you should expect a resolution of one minute for activity in the last two days, 5 minutes for activity in the last 21 days, and 15 minutes for anything older, up to five years:

[stats]
pattern = ^stats.*
retentions = 1m:2d,5m:21d,15m:5y
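To make the retention rule concrete, here is a small Python sketch that parses a retentions string and reports which resolution applies to a window reaching back a given amount of time. It is a simplified reading of the format (single-letter units only, a year counted as 365 days), and the function names are ours rather than anything shipped with Graphite.

```python
import re

# Seconds per unit; a simplification that covers only single-letter units.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(spec):
    """Convert a value such as '5m' or '21d' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", spec.strip())
    if match is None:
        raise ValueError(f"unsupported time spec: {spec!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def parse_retentions(line):
    """Parse '1m:2d,5m:21d' into a list of (precision_s, duration_s)."""
    archives = []
    for part in line.split(","):
        precision, duration = part.split(":")
        archives.append((to_seconds(precision), to_seconds(duration)))
    return archives


def resolution_for_age(archives, age_seconds):
    """Return the precision of the first archive that reaches back far enough."""
    for precision, duration in archives:
        if age_seconds <= duration:
            return precision
    return None  # older than every archive: the data is gone


archives = parse_retentions("1m:2d,5m:21d,15m:5y")

if __name__ == "__main__":
    for age in ("1d", "7d", "30d"):
        print(age, resolution_for_age(archives, to_seconds(age)), "seconds")
```

For the rule above, a window reaching back one day is served at 60-second resolution, one reaching back a week at 300 seconds, and one reaching back 30 days at 900 seconds.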

Histograms in Graphite

By now it should be easy to start detecting long-running requests using StatsD and Graphite, but what about histograms? In a default Graphite installation, you don’t have many UI options, but you can still get an idea of the percentage of requests that take more than 2 seconds (in red below).
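If you have the raw request durations at hand, the same question can be answered outside Graphite. The Python sketch below groups latencies into buckets and computes the share of requests slower than 2 seconds. The bucket edges and the sample data are assumptions chosen for illustration, not values from our production systems.

```python
from collections import Counter


def bucket_label(latency_ms, edges_ms):
    """Return the label of the first bucket whose upper edge fits the value."""
    for edge in edges_ms:
        if latency_ms <= edge:
            return f"<= {edge} ms"
    return f"> {edges_ms[-1]} ms"


def histogram(latencies_ms, edges_ms=(20, 200, 2000)):
    """Count requests per latency bucket."""
    return Counter(bucket_label(value, edges_ms) for value in latencies_ms)


def share_over(latencies_ms, threshold_ms=2000):
    """Return the percentage of requests slower than the threshold."""
    if not latencies_ms:
        return 0.0
    slow = sum(1 for value in latencies_ms if value > threshold_ms)
    return 100.0 * slow / len(latencies_ms)


# Invented sample: mostly fast requests with a small slow tail.
samples = [15] * 950 + [150] * 40 + [2500] * 7 + [15000] * 3
counts = histogram(samples)
slow_percent = share_over(samples)

if __name__ == "__main__":
    print(dict(counts), slow_percent)
```

In this invented sample, 10 of the 1,000 requests land in the slowest bucket, so 1% of requests take longer than 2 seconds.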

[Screenshot: Graphite graph with requests taking more than 2 seconds shown in red]

Key Takeaways

What are the key takeaways when monitoring service performance?

  • Measuring average response times is a bad way of assessing service performance
  • Maximum response time data offers up much more useful information
  • StatsD + Graphite is a match made in heaven (with some tweaking)

Further Reading

Accurate Accounting with StatsD and Graphite

About the Author

Valentin is a Software Engineer on our Hootsuite Analytics team. He is passionate about using technology to solve real problems. He was inspired to write this post after viewing an article entitled “How NOT To Measure Latency”. When he isn’t dancing with data, he can be found on Twitter @ValentinZberea.