Chicago Big Data: Future of Data Processing with Apache Hadoop (August 21, 2012)
Personal notes from the Chicago Big Data meeting this past week at 1871 in the Merchandise Mart, a tech hub for startups that opened just over three months ago. The talk was led by Bobby Evans, a developer and tech lead at Yahoo!. Arun Murthy, founder and architect at Hortonworks, had planned to present with Bobby, but his flight was delayed and he could not make it to the meeting; the Skype connection with him ended up being too poor for him to speak. Bobby has been working with Arun for about 1.5 years.
- title of presentation was "Future of Data Processing with Apache Hadoop"
- Bobby gave an overview of Hadoop, discussed what YARN offers, and then looked down the road
- there are two main components to classic Hadoop MapReduce: JobTracker and TaskTracker
- a number of limitations exist in the current release of classic Hadoop MapReduce
- current limitations include scalability, a single point of failure, and restart is very tricky due to complex state
- the max cluster size is 4k nodes, with 40k max concurrent tasks
- coarse-grained synchronization in the JobTracker
- a failure kills all queued and running jobs, so jobs need to be resubmitted manually by users
- current limitations also include a hard partition of resources into Map and Reduce slots, with low resource utilization
- support for alternate paradigms is also lacking
- iterative applications implemented using MapReduce are 10x slower
- hacks exist for the likes of MPI and graph processing
- wire-compatible protocols are lacking
- client and cluster must be of the same version
- applications and workflows cannot migrate to different clusters
- upgrading from one version of Hadoop to another is complicated: the cluster needs to be brought down, and if even one node misses the upgrade, it is a disaster
- Bobby argued that what is needed is reliability, availability, utilization, wire compatibility, scalability, agility, and evolution
- enabling evolution is the ability of customers to control upgrades to the grid software stack
- with YARN, the two major functions of JobTracker are split apart: cluster resource management and application lifecycle management
- MapReduce becomes a "user-land" library, which Bobby later defined as code that can be hacked as needed, giving the ability to innovate much more quickly
- with YARN, MapReduce is just another application that runs on top of Hadoop
- in place of the JobTracker is the resource manager, node manager, and application master
- state becomes much simpler and scalable
- the application master is a special per-application process that encompasses the application logic, e.g. identifying where to run map tasks
- for the most part, recompiled MapReduce jobs will work, but some things still need to be ironed out
- YARN provides a generic resource model; the fixed partition Map and Reduce slot has been removed
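The JobTracker split and the generic resource model described above can be sketched as a toy model. This is not the real YARN API, just an illustration of the roles: a resource manager tracks cluster capacity, node managers contribute per-node resources, and a per-application application master requests generic containers (sized in MB of memory) rather than fixed map or reduce slots.

```python
# Toy sketch of the YARN split (not the real API): ResourceManager tracks
# capacity, NodeManagers contribute resources, and a per-application
# ApplicationMaster requests generic containers instead of fixed slots.

class NodeManager:
    def __init__(self, node_id, memory_mb):
        self.node_id = node_id
        self.free_mb = memory_mb

class ResourceManager:
    def __init__(self):
        self.nodes = []

    def register(self, nm):
        self.nodes.append(nm)

    def allocate(self, memory_mb):
        """Grant a 'container' on any node with enough free memory."""
        for nm in self.nodes:
            if nm.free_mb >= memory_mb:
                nm.free_mb -= memory_mb
                return (nm.node_id, memory_mb)
        return None  # cluster is full for this request size

class ApplicationMaster:
    """Per-application logic: decides how many containers to request."""
    def __init__(self, rm):
        self.rm = rm
        self.containers = []

    def run_tasks(self, num_tasks, memory_mb):
        for _ in range(num_tasks):
            c = self.rm.allocate(memory_mb)
            if c:
                self.containers.append(c)
        return self.containers

rm = ResourceManager()
rm.register(NodeManager("node1", 4096))
rm.register(NodeManager("node2", 2048))
am = ApplicationMaster(rm)
granted = am.run_tasks(num_tasks=3, memory_mb=2048)
print(len(granted))  # prints 3: two containers fit on node1, one on node2
```

Because containers are just resource grants, nothing here is map- or reduce-specific — which is the point of removing the fixed slot partition.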
- right now YARN errs on the side of being too generic, behaving very similarly to MapReduce 1.0
- Bobby thinks YARN enables up to 10k nodes
- in addition to utilization, there are also scalability improvements
- Bobby noted that 6k machines in 2012 are more powerful than 12k machines were in 2009
- in 2.x the resource manager will no longer be a single point of failure: its state is saved in ZooKeeper, and application masters are restarted automatically on RM restart
- application master recovery has also been added, with optional restart from an application-specific checkpoint
- MapReduce applications pick up where they left off via saved state in HDFS
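The pick-up-where-you-left-off behavior can be illustrated with a small sketch. Here a local JSON file stands in for the state the MapReduce application master persists to HDFS; everything else (task ids, the `run_job` helper) is invented for the example.

```python
# Sketch of restart-from-saved-state recovery: a local JSON file stands in
# for the application master's saved state in HDFS.
import json
import os
import tempfile

def run_job(tasks, state_path):
    """Process tasks, persisting the set of completed task ids after each one."""
    done = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = set(json.load(f))  # recover progress from the prior run
    executed = []
    for t in tasks:
        if t in done:
            continue  # finished before the restart; skip it
        executed.append(t)
        done.add(t)
        with open(state_path, "w") as f:
            json.dump(sorted(done), f)  # checkpoint after every task
    return executed

path = os.path.join(tempfile.mkdtemp(), "state.json")
first = run_job(["t1", "t2", "t3"], path)        # initial run: all three execute
second = run_job(["t1", "t2", "t3", "t4"], path)  # "restart": only t4 executes
print(first, second)  # ['t1', 't2', 't3'] ['t4']
```

The real implementation checkpoints richer state, but the shape is the same: completed work is durable, so a restarted AM re-runs only what is outstanding.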
- protocols are wire-compatible: old clients can talk to new servers, and rolling upgrades are coming soon, permitting failover to a new resource manager without downtime
- some customers with whom Bobby has worked are afraid to upgrade because code may break, but YARN enables pointing to different versions of the resource manager
- support for programming paradigms other than MapReduce include MPI, MAPREDUCE-3315, machine learning, iterative processing (Spark), realtime streaming (S4 and Storm coming soon), and graph processing (GIRAPH-13 coming soon)
- these programming paradigms are enabled by permitting the use of a paradigm-specific application master
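The paradigm-specific application master idea can be sketched as follows. The class and registry names here are invented for illustration; the point is only that the cluster knows how to launch an AM, while each paradigm supplies its own AM logic.

```python
# Sketch of "one application master per paradigm": the cluster only launches
# an AM; each paradigm (MapReduce, MPI, ...) brings its own AM logic.
class ApplicationMaster:
    def run(self, job):
        raise NotImplementedError

class MapReduceAM(ApplicationMaster):
    def run(self, job):
        return f"map/shuffle/reduce over {job}"

class MPIAM(ApplicationMaster):
    def run(self, job):
        return f"MPI ranks exchanging messages for {job}"

# Hypothetical registry mapping a paradigm name to its AM implementation.
AM_REGISTRY = {"mapreduce": MapReduceAM, "mpi": MPIAM}

def submit(paradigm, job):
    am = AM_REGISTRY[paradigm]()  # the cluster launches the paradigm's AM
    return am.run(job)

print(submit("mapreduce", "wordcount"))
print(submit("mpi", "matrix-multiply"))
```

Adding a new paradigm (Spark, Storm, Giraph) then means writing a new AM, not changing the resource manager.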
- YARN is available in hadoop-0.23.2 release and hadoop-2.0.0 (alpha) release
- YARN is coming soon to the hadoop-0.23.3 release (expected to roll out in September 2012) and the hadoop-2.1.0 (alpha) release (also coming shortly)
- performance gains of 60% in HDFS read ops/sec have been seen
- MapReduce sort runtimes have decreased 14%, with 17% higher throughput
- shuffle runtime has decreased 30%, with shuffle time decreased 50%
- small jobs run with Uber AM have seen 3.5x faster word count with 27.7x fewer resources
- Uber AM checks to determine whether jobs are small, and if so runs them differently
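Uber mode is controlled by job configuration thresholds. The property names below are the ones I believe the Hadoop 2.x line uses in mapred-site.xml (they may differ in 0.23, so treat this as a sketch rather than an authoritative reference): a job qualifies as "small" only if it stays under the map, reduce, and input-size limits.

```xml
<!-- mapred-site.xml sketch: thresholds below which the AM runs the whole
     job in its own JVM ("uber" mode) instead of requesting containers -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>1</value>
</property>
```

Running everything inside the AM's JVM avoids the per-task container startup cost, which is where the resource savings for small jobs come from.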
- MapReduce currently spawns new containers for each task, and YARN 2.x will reuse task slots/containers
- while Hadoop interfaces are the same, the code underneath is a virtual rewrite
- a lot of testing has been done, including regression testing and integration testing with HBase, Pig, Hive, Oozie, etc
- YARN still has the concept of heartbeats
- right now there is a one-week limit on how long a job can run before its security tokens expire
- using heartbeats is really the only way to make sure a given job is alive
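The heartbeat-based liveness check can be sketched in a few lines. This is a generic illustration, not YARN's actual implementation: a job is presumed dead once it has gone longer than a timeout without checking in (timestamps are passed in explicitly so the logic is deterministic).

```python
# Sketch of heartbeat-based liveness tracking: a job is presumed dead once
# it has gone longer than `timeout` seconds without a heartbeat.
class HeartbeatMonitor:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # job_id -> timestamp of last heartbeat

    def heartbeat(self, job_id, now):
        self.last_seen[job_id] = now

    def dead_jobs(self, now):
        return sorted(j for j, t in self.last_seen.items()
                      if now - t > self.timeout)

mon = HeartbeatMonitor(timeout=30)
mon.heartbeat("job_1", now=0)
mon.heartbeat("job_2", now=0)
mon.heartbeat("job_1", now=20)  # job_1 keeps checking in; job_2 goes silent
print(mon.dead_jobs(now=40))    # prints ['job_2']
```

The absence of a heartbeat is the only signal the monitor gets, which is why heartbeats remain the way to confirm a given job is alive.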