ApacheCon Europe 2012

Rhein-Neckar-Arena, Sinsheim, Germany

5–8 November 2012

Flexible Distributed Reporting for Millions of Publishers and Advertisers powered by Hadoop & Lucene

Dragan Milosevic

Audience level:
Intermediate
Track:
Lucene, Solr & Friends

Thursday 10:15 a.m.–10:45 a.m. in Level 1 Right

Description

Hadoop and Lucene proved to be winning combination for solving big-data reporting challenges at zanox. The former is used to offline analyze and extract valuable information from billions of tracking events. The latter provides sub-second online access to extracted data while serving millions of users. Together they optimize the usage of resources where Lucene requests help improve Hadoop jobs.

Abstract

Zanox tracking system produces more than 1 billion of different events daily. The collected data is analysed by Hadoop processing jobs that extract valuable information, being used to support daily business of publishers and advertisers. Lucene based distributed reporting afterwards offers a real-time access and the post-processing of the results of Hadoop jobs. The real-time access is based on Lucene state-of-the range and term filters to efficiently select records from the given time-period and that have other needed properties. These pre-selected records are then combined, joined and summarized in order to build easy-to-use reports.

The built reporting system is able to generate reports in sub-second time mainly by using sophisticated request-routing techniques that take care of the resource consumption. Instead of propagating the received request to all searchers, the implemented request-routing is carefully selecting searchers that are the most suitable for a given task. The suitability is measured by the number of records that need to be processed and the cost of post-processing that need to be performed. Such routing is both speeding-up processing of a particular request and keeping the majority of searchers free for concurrent requests.

The analysis of already processed requests provides data about how reporting system is performing. This is used for both tuning parameters of the used routing techniques, lunching additional searchers that are optimised for processing inefficient requests, and destroying others that happened to be underutilised due to the changed user interests. This log-analysis completes search, discovery and analytics loop, which constantly improves experience that customers have while using zanox distributed reporting system.