ApacheCon Europe 2012

Rhein-Neckar-Arena, Sinsheim, Germany

5–8 November 2012

Enabling Elastic, Multi-tenant, Highly Available Hadoop on Demand

Richard McDougall

Audience level: Beginner
Track: Big Data

Tuesday 10:15 a.m.–11 a.m. in Level 2 Left

Description

Big Data and virtualization are two of the hottest trends in the industry today, yet the full potential for bringing the two together has not been realized. In this session, learn how virtualization brings the advantages of greater elasticity, isolation for multi-tenancy, and HA protection to Hadoop, while delivering performance comparable to Hadoop on physical machines.

Abstract

Big Data and virtualization are two of the hottest trends in the industry today, yet the full potential for bringing the two together has not been realized. Hadoop is the tool of choice for Big Data analytics, but operators face challenges with provisioning and scaling clusters, reliability, and low CPU utilization. The classic benefits of virtualization (elasticity, multi-tenancy, and high availability) appear well suited to meeting these challenges. This session explores the benefits and implications of virtualizing Hadoop.

We will discuss Serengeti, an open-source project that enables the rapid deployment, configuration, and management of a Hadoop cluster in a virtual environment. The Hadoop cluster can also benefit from the failover protection available to virtual machines.

We will describe how virtual machines allow the Hadoop compute and data layers to be decoupled to achieve elasticity, support multi-tenancy, and run mixed workloads.

Performance benchmarks indicate that Hadoop on virtual machines performs comparably to Hadoop on physical machines, so the benefits of virtualization outlined above come at little to no performance cost.