During my time as a member of Hootsuite’s Customer Happiness team, I’ve come to understand our techniques for debugging a large-scale web application. I want to share the five best techniques I’ve learned for debugging issues on a web application as massive as Hootsuite, though they apply to web applications of any size.
2. Check Any Error Logs Your Application Maintains
After gathering information from reproducing the issue in the browser, another valuable resource is any aggregated logging your application may have. Here at Hootsuite we use the ELK stack (Elasticsearch, Logstash, Kibana) for our log aggregation. Using Kibana, I can search for any errors, exceptions, or logs that we have captured. Kibana can help refine the scope of an issue by providing details such as how many users are affected, when the issue first appeared, and whether there are any patterns in its occurrences. Recently we experienced a minor issue relating to our PDF renderers, and logging proved to be an excellent resource for narrowing down where the issue was occurring. With Kibana, I was able to quickly find that the servers experiencing the issue were only active on Mondays, which narrowed the issue down to the servers that are automatically spun up to handle the load of the weekly reports we generate. Having a timeline and a scope for an issue can be just as valuable as knowing what parts of the codebase are related to the issue.
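To make the "scope and timeline" idea concrete, here is a minimal sketch of the kind of summary you might compute from error records exported from a log aggregator. It is not how Kibana works internally, and the record fields (`timestamp`, `user_id`) and the function name `summarize_errors` are assumptions made up for this illustration.

```python
from collections import Counter
from datetime import datetime

# Index matches datetime.weekday(): Monday is 0.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def summarize_errors(records):
    """Summarize scope and timeline for a list of error records.

    Each record is assumed to be a dict with an ISO 8601 "timestamp"
    and a "user_id" (hypothetical field names for this sketch).
    """
    if not records:
        return {"affected_users": 0, "first_seen": None, "by_weekday": {}}

    times = [datetime.fromisoformat(r["timestamp"]) for r in records]
    return {
        # How many distinct users hit the error?
        "affected_users": len({r["user_id"] for r in records}),
        # When did the error first appear?
        "first_seen": min(times),
        # Is there a pattern, e.g. errors only on one day of the week?
        "by_weekday": dict(Counter(WEEKDAYS[t.weekday()] for t in times)),
    }


sample = [
    {"timestamp": "2016-05-02T09:15:00", "user_id": 101},
    {"timestamp": "2016-05-02T09:40:00", "user_id": 102},
    {"timestamp": "2016-05-09T10:05:00", "user_id": 101},
]
summary = summarize_errors(sample)
```

If every occurrence lands on the same weekday, as in the sample data, that is a strong hint to look at whatever only runs on that day.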
3. Reproduce the Issue on a Development Environment Using a Debugger
Not every issue is reproducible in a development environment. For example, the state an account is in can be extremely complex and not easily replicated. However, when an issue can be reproduced on your development machine, profilers and debuggers can be tremendously helpful. At Hootsuite we use Xdebug for our PHP stack. Xdebug lets the user link a browser to a listening IDE (in our case PhpStorm or IntelliJ IDEA) and set breakpoints within the code. With this we can easily check the values of any variables and parameters, as well as step through the code and trace the path of the issue to find its true source. Debuggers are much faster than any logging and are preferable wherever possible.
4. Embed Logging to Monitor the Issue
That said, sometimes additional logging is the only way to trace an issue. Logging is a time-consuming approach, and should be implemented only after much design, thought, and manual tracing. Every push to production adds a long delay to any change you need, so it’s far better to take the time to push all the logging you need, in the appropriate locations, in one go, rather than pushing several rounds of logging as you discover you need them. After releasing the logging, it unfortunately becomes a waiting game for the issue to recur before we can study the logs, find the cause, and resolve it. Because authorizing social networks and adding them to our application involves authenticating with external parties, this is often one of the only methods available for resolving issues in that code. It is also worth noting that great thought should be put into designing good logging around your code when writing it in the first place. This helps you trace issues immediately in the future, rather than having to add logging after the fact.
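As a rough illustration of designing logging up front, the sketch below uses Python's standard `logging` module to emit one JSON object per event, so that each entry carries the context you would later search on. This is not Hootsuite's logging code: the helper names (`JsonFormatter`, `build_logger`, `log_event`) and the event and field names in the usage lines are hypothetical.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge any structured context attached via log_event().
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name, stream=None):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def log_event(logger, message, **context):
    """Log a message together with searchable key/value context."""
    logger.info(message, extra={"context": context})


# Hypothetical usage around an external authentication step.
auth_log = build_logger("social_auth")
log_event(auth_log, "auth_callback_received", network="example", member_id=42)
log_event(auth_log, "token_exchange_failed", network="example", http_status=502)
```

Logging the same identifiers at every step of a flow means a single search can reconstruct the whole path of one failed request.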
5. Check Logs and Test Code on Server
In other cases, being unable to reproduce the issue on our development environment is not due to the state of the account, but rather to the servers themselves. While issues related to server configuration are far less common (especially after we introduced Ansible to help automate our server provisioning), they can still occur. A large-scale application’s development environment can never completely mimic production, and that can make testing difficult. As we say at Hootsuite: “learning happens on production”. To check for this, we often look at the logs in Kibana to see if an error or exception linked to the problem is originating from a specific server or set of servers. If that’s the case, we can SSH in to the servers and check the access and error logs to ensure the code is communicating properly with the other servers in our system. Often these logs can immediately identify any configuration issues and help resolve the problem.
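The "specific server or set of servers" check can be sketched in a few lines. This is only an illustration under assumed inputs: each error record is a dict with a `host` field, and `fleet` is the list of all servers that could have served the request. The function names and host names are invented for the example.

```python
from collections import Counter


def error_counts_by_host(records, fleet):
    """Count errors per host, including hosts with zero errors."""
    counts = Counter(r["host"] for r in records)
    return {host: counts.get(host, 0) for host in fleet}


def suspect_hosts(records, fleet):
    """Return hosts with errors, but only when some hosts are clean.

    If every host in the fleet shows the error, the cause is probably
    in the code rather than in one server's configuration, so an
    empty list is returned.
    """
    counts = error_counts_by_host(records, fleet)
    failing = sorted(host for host, n in counts.items() if n > 0)
    return failing if 0 < len(failing) < len(fleet) else []


fleet = ["web-1", "web-2", "web-3"]
errors = [{"host": "web-2"}, {"host": "web-2"}, {"host": "web-2"}]
to_inspect = suspect_hosts(errors, fleet)
```

A non-empty result is the cue to SSH in to those particular machines and compare their logs and configuration against a healthy server.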
These are just some of the techniques I use. I’d love to hear from you about any different approaches you use, or even just to say you found these suggestions helpful, in the comments below or on Twitter.
Mackenzie is a Software Engineer on our Core User Experience team. He works closely with our Customer Support team to help resolve any technical issues our users encounter. When not coding, he can usually be found enjoying a local craft beer. Follow Mackenzie on Twitter @MackMarshallVan.