We’ve started using Elasticsearch for a few of our projects. It’s a great tool for storing and querying giant text datasets. In the words of the creators, “Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine”. We like it because it’s fast, easy to use, and very useful in many situations.

One of the things that Elasticsearch does very well is to listen to a large stream of data and index it. This is done very easily using a “river”. A river is an easy way to set up a continuous flow of data that goes into your Elasticsearch datastore. To quote the creators once more, “A river is a pluggable service running within Elasticsearch cluster pulling data (or being pushed with data) that is then indexed into the cluster.”

A river is more convenient than the classical way of manually indexing data because, once it is configured, the data is kept up to date automatically. This reduces complexity and also helps build a real-time system.

There are already a few rivers out there, among which are:

  • Twitter River Plugin for ElasticSearch (link)
  • Wikipedia River Plugin for Elasticsearch (link)

There was no GitHub river, though, so we set out to build our own. And thus the Elasticsearch GitHub river was born: https://github.com/ubervu/elasticsearch-river-github.

We’re big GitHub fans and use it extensively to manage our daily activities. Sometimes we find ourselves looking at heaps of issues that scream for attention. To avoid this pile-up, we want to understand better how our team works, how issues end up being forgotten, and how to address the really important ones first. The GitHub river was built to give us this additional insight we crave, and hopefully it will help you too.

Now let’s get down to business and use it:

Using the GitHub river

The GitHub river allows us to periodically index a repository’s events. The repos can be both public and private. If you have a private repository, you’ll need to provide authentication data.

To get a taste of the workflow possibilities this opens, let’s index some data from another open source project we like, use, and contribute to: Lettuce. We can explore it in a pretty Kibana dashboard (click on the image to get the full version).

gh-kibana-dashboard

Assuming you have Elasticsearch already installed, you’ll need to install the river. Make sure you restart Elasticsearch after this so it picks up the new plugin.

Now we can create the GitHub river:

curl -XPUT localhost:9200/_river/gh_river/_meta -d '{
    "type": "github",
    "github": {
        "owner": "gabrielfalcao",
        "repository": "lettuce",
        "interval": 3600
    }
}'

Elasticsearch will immediately start indexing the most recent 300 public events of gabrielfalcao/lettuce. The 300-event limit is imposed by the GitHub API. After one hour (the 3600-second interval configured above), the river will check again for new events.
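If you want to double-check what the river was registered with, or stop it later, the generic river endpoints of Elasticsearch should be enough. Here is a minimal sketch, assuming the node address and river name from the command above. The helper function names are our own invention for illustration and are not part of the plugin.

```shell
# Assumed values: same node address and river name as in the PUT command above.
ES_URL="localhost:9200"
RIVER_NAME="gh_river"

# Build the URL of the river's _meta document (hypothetical helper).
river_meta_url() {
    printf '%s/_river/%s/_meta' "$ES_URL" "$RIVER_NAME"
}

# Print the configuration that was stored when the river was created.
show_river() {
    curl -s -XGET "$(river_meta_url)?pretty"
}

# Remove the river definition so that it stops polling GitHub.
delete_river() {
    curl -s -XDELETE "$ES_URL/_river/$RIVER_NAME"
}
```

After sourcing the snippet, run show_river to print the stored configuration, or delete_river to stop the polling. As far as we know, deleting the river only removes its definition; the events that were already indexed stay in their own index.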

The data is accessible in the gabrielfalcao-lettuce index, where you will find a different document type for every GitHub event type.
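To peek at what landed in the index before moving on, the regular mapping and search endpoints will do. The sketch below assumes the index name mentioned above. The document type IssuesEvent in the usage note is an assumption based on GitHub's event type names, so check the mapping output for the types that actually exist. The helper names are hypothetical.

```shell
# Assumed values: node address from the earlier command, index name from the text.
ES_URL="localhost:9200"
INDEX="gabrielfalcao-lettuce"

# URL of the index mapping, which lists one entry per document type.
mapping_url() {
    printf '%s/%s/_mapping?pretty' "$ES_URL" "$INDEX"
}

# URL of a search restricted to one document type ($1), returning $2 hits.
search_url() {
    printf '%s/%s/%s/_search?pretty&size=%s' "$ES_URL" "$INDEX" "$1" "$2"
}

# Show which document types the river has created so far.
list_types() {
    curl -s -XGET "$(mapping_url)"
}

# Fetch a few documents of the given type (default: 5 hits, unsorted).
sample_events() {
    curl -s -XGET "$(search_url "$1" "${2:-5}")"
}
```

For example, list_types shows the available document types, and sample_events IssuesEvent 3 would return three documents of that type, if such a type exists in your index.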

Using Kibana to visualize the data

Great, we have some data. Now what? In order to make some sense of it, let’s get Kibana up and running. First, you need to download and extract the latest Kibana build. To access it, open kibana-latest/index.html in your favorite web browser.

What you see now is the default Kibana start screen. You could go ahead and configure your own dashboard, but to speed things up we suggest you import the dashboard we’ve set up. First, download the JSON file that defines the dashboard. Then, at the top-right corner of the page, go to Load > Advanced > Choose file and select the downloaded JSON.

That’s it! You now have a basic dashboard set up that shows some key graphs based on the GitHub data you have in Elasticsearch. Furthermore, thanks to the river and the way the dashboard is set up, you will get new data every hour and the dashboard will refresh accordingly.

Happy hacking!

A word about open source:

We’re big fans of open source. We use many great open source products and also do our best to give back as much as we can. Check out the uberVU GitHub page for some of the projects we’ve built. We’re constantly working on and contributing to our own projects or new projects. If you want to contribute to our projects but don’t know where to start, or you think that we can help a project in any way, let us know!


uberVU is part of the Hootsuite team. Read more about it here