Chicago Big Data: Future of Data Processing with Apache Hadoop (August 21, 2012)
Personal notes from the Chicago Big Data meeting this past week at 1871 in the Merchandise Mart, a tech hub for startups that opened just over three months ago. The talk was led by Bobby Evans, a developer and tech lead at Yahoo!. Arun Murthy, founder and architect at Hortonworks, had planned to present with Bobby, but his flight was delayed and he could not make it to the meeting; the Skype connection with him ended up being too poor for him to speak. Bobby has been working with Arun for about 1.5 years.
- title of presentation was "Future of Data Processing with Apache Hadoop"
- Bobby gave an overview of Hadoop, discussed what YARN offers, and then looked down the road
- there are two main components to classic Hadoop MapReduce: JobTracker and TaskTracker
- a number of limitations exist in the current release of classic Hadoop MapReduce
- current limitations include scalability, a single point of failure, and restart is very tricky due to complex state
- the max cluster size is 4k nodes, with 40k max concurrent tasks
- coarse-grained synchronization in the JobTracker
- a failure kills all queued and running jobs, so jobs need to be resubmitted manually by users
- current limitations also include a hard partition of resources into Map and Reduce slots, with low resource utilization
- support for alternate paradigms is also lacking
- iterative applications implemented using MapReduce are 10x slower
- hacks exist for the likes of MPI and graph processing
- wire-compatible protocols are lacking
- client and cluster must be of the same version
- applications and workflows cannot migrate to different clusters
- upgrading from one version of Hadoop to another is complicated: the cluster needs to be brought down, and if even one node misses the upgrade, it is a disaster
- Bobby argued that what is needed is reliability, availability, utilization, wire compatibility, scalability, agility, and evolution
- enabling evolution is the ability of customers to control upgrades to the grid software stack
- with YARN, the two major functions of JobTracker are split apart: cluster resource management and application lifecycle management
- MapReduce becomes a "user-land" library, which Bobby later defined as code that can be hacked as needed, giving the ability to innovate much more quickly
- with YARN, MapReduce is just another application that runs on top of Hadoop
- in place of the JobTracker is the resource manager, node manager, and application master
- state becomes much simpler and scalable
- the application master is a special per-application process that encompasses the application logic, e.g. identifying where to run map tasks
- for the most part, recompiled MapReduce jobs will work, but some things still need to be ironed out
- YARN provides a generic resource model; the fixed partition Map and Reduce slot has been removed
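The JobTracker split and the generic resource model described above can be sketched as a toy model. This is not the real YARN API, just an illustration of the roles: a resource manager tracks cluster capacity, node managers contribute per-node resources, and a per-application application master requests generic containers (sized in MB of memory) rather than fixed map or reduce slots.

```python
# Toy sketch of the YARN split (not the real API): ResourceManager tracks
# capacity, NodeManagers contribute resources, and a per-application
# ApplicationMaster requests generic containers instead of fixed slots.

class NodeManager:
    def __init__(self, node_id, memory_mb):
        self.node_id = node_id
        self.free_mb = memory_mb

class ResourceManager:
    def __init__(self):
        self.nodes = []

    def register(self, nm):
        self.nodes.append(nm)

    def allocate(self, memory_mb):
        """Grant a 'container' on any node with enough free memory."""
        for nm in self.nodes:
            if nm.free_mb >= memory_mb:
                nm.free_mb -= memory_mb
                return (nm.node_id, memory_mb)
        return None  # cluster is full for this request size

class ApplicationMaster:
    """Per-application logic: decides how many containers to request."""
    def __init__(self, rm):
        self.rm = rm
        self.containers = []

    def run_tasks(self, num_tasks, memory_mb):
        for _ in range(num_tasks):
            c = self.rm.allocate(memory_mb)
            if c:
                self.containers.append(c)
        return self.containers

rm = ResourceManager()
rm.register(NodeManager("node1", 4096))
rm.register(NodeManager("node2", 2048))
am = ApplicationMaster(rm)
granted = am.run_tasks(num_tasks=3, memory_mb=2048)
print(len(granted))  # prints 3: two containers fit on node1, one on node2
```

Because containers are just resource grants, nothing here is map- or reduce-specific — which is the point of removing the fixed slot partition.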
- right now YARN errs on the side of being too generic, behaving very similarly to MapReduce 1.0
- Bobby thinks YARN enables up to 10k nodes
- in addition to utilization, there are also scalability improvements
- Bobby noted that 6k machines in 2012 are more powerful than 12k machines were in 2009
- in 2.x the resource manager will no longer be a single point of failure: its state is saved in ZooKeeper, and application masters are restarted automatically on RM restart
- application master recovery has also been added, with optional restart from an application-specific checkpoint
- MapReduce applications pick up where they left off via saved state in HDFS
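The pick-up-where-you-left-off behavior can be illustrated with a small sketch. Here a local JSON file stands in for the state the MapReduce application master persists to HDFS; everything else (task ids, the `run_job` helper) is invented for the example.

```python
# Sketch of restart-from-saved-state recovery: a local JSON file stands in
# for the application master's saved state in HDFS.
import json
import os
import tempfile

def run_job(tasks, state_path):
    """Process tasks, persisting the set of completed task ids after each one."""
    done = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = set(json.load(f))  # recover progress from the prior run
    executed = []
    for t in tasks:
        if t in done:
            continue  # finished before the restart; skip it
        executed.append(t)
        done.add(t)
        with open(state_path, "w") as f:
            json.dump(sorted(done), f)  # checkpoint after every task
    return executed

path = os.path.join(tempfile.mkdtemp(), "state.json")
first = run_job(["t1", "t2", "t3"], path)        # initial run: all three execute
second = run_job(["t1", "t2", "t3", "t4"], path)  # "restart": only t4 executes
print(first, second)  # ['t1', 't2', 't3'] ['t4']
```

The real implementation checkpoints richer state, but the shape is the same: completed work is durable, so a restarted AM re-runs only what is outstanding.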
- protocols are wire-compatible: old clients can talk to new servers, and rolling upgrades are coming soon, permitting failover to a new resource manager without downtime
- some customers with whom Bobby has worked are afraid to upgrade because code may break, but YARN enables pointing to different versions of the resource manager
- support for programming paradigms other than MapReduce include MPI, MAPREDUCE-3315, machine learning, iterative processing (Spark), realtime streaming (S4 and Storm coming soon), and graph processing (GIRAPH-13 coming soon)
- these programming paradigms are enabled by permitting the use of a paradigm-specific application master
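The paradigm-specific application master idea can be sketched as follows. The class and registry names here are invented for illustration; the point is only that the cluster knows how to launch an AM, while each paradigm supplies its own AM logic.

```python
# Sketch of "one application master per paradigm": the cluster only launches
# an AM; each paradigm (MapReduce, MPI, ...) brings its own AM logic.
class ApplicationMaster:
    def run(self, job):
        raise NotImplementedError

class MapReduceAM(ApplicationMaster):
    def run(self, job):
        return f"map/shuffle/reduce over {job}"

class MPIAM(ApplicationMaster):
    def run(self, job):
        return f"MPI ranks exchanging messages for {job}"

# Hypothetical registry mapping a paradigm name to its AM implementation.
AM_REGISTRY = {"mapreduce": MapReduceAM, "mpi": MPIAM}

def submit(paradigm, job):
    am = AM_REGISTRY[paradigm]()  # the cluster launches the paradigm's AM
    return am.run(job)

print(submit("mapreduce", "wordcount"))
print(submit("mpi", "matrix-multiply"))
```

Adding a new paradigm (Spark, Storm, Giraph) then means writing a new AM, not changing the resource manager.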
- YARN is available in hadoop-0.23.2 release and hadoop-2.0.0 (alpha) release
- YARN is coming soon to the hadoop-0.23.3 release (expected to roll out in September 2012) and the hadoop-2.1.0 (alpha) release (also coming shortly)
- performance gains of 60% in HDFS read ops/sec have been seen
- MapReduce sort runtimes have decreased 14%, with 17% higher throughput
- shuffle runtime has decreased 30%, with shuffle time decreased 50%
- small jobs run with Uber AM have seen 3.5x faster word count with 27.7x fewer resources
- Uber AM checks to determine whether jobs are small, and if so runs them differently
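Uber mode is controlled by job configuration thresholds. The property names below are the ones I believe the Hadoop 2.x line uses in mapred-site.xml (they may differ in 0.23, so treat this as a sketch rather than an authoritative reference): a job qualifies as "small" only if it stays under the map, reduce, and input-size limits.

```xml
<!-- mapred-site.xml sketch: thresholds below which the AM runs the whole
     job in its own JVM ("uber" mode) instead of requesting containers -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>1</value>
</property>
```

Running everything inside the AM's JVM avoids the per-task container startup cost, which is where the resource savings for small jobs come from.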
- MapReduce currently spawns new containers for each task, and YARN 2.x will reuse task slots/containers
- while Hadoop interfaces are the same, the code underneath is a virtual rewrite
- a lot of testing has been done, including regression testing and integration testing with HBase, Pig, Hive, Oozie, etc
- YARN still has the concept of heartbeats
- right now there is a one-week limit on how long a job can run before its security tokens expire
- using heartbeats is really the only way to make sure a given job is alive
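The heartbeat-based liveness check can be sketched in a few lines. This is a generic illustration, not YARN's actual implementation: a job is presumed dead once it has gone longer than a timeout without checking in (timestamps are passed in explicitly so the logic is deterministic).

```python
# Sketch of heartbeat-based liveness tracking: a job is presumed dead once
# it has gone longer than `timeout` seconds without a heartbeat.
class HeartbeatMonitor:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # job_id -> timestamp of last heartbeat

    def heartbeat(self, job_id, now):
        self.last_seen[job_id] = now

    def dead_jobs(self, now):
        return sorted(j for j, t in self.last_seen.items()
                      if now - t > self.timeout)

mon = HeartbeatMonitor(timeout=30)
mon.heartbeat("job_1", now=0)
mon.heartbeat("job_2", now=0)
mon.heartbeat("job_1", now=20)  # job_1 keeps checking in; job_2 goes silent
print(mon.dead_jobs(now=40))    # prints ['job_2']
```

The absence of a heartbeat is the only signal the monitor gets, which is why heartbeats remain the way to confirm a given job is alive.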