Thursday, November 5, 2009

What Hadoop is good at

Hadoop is getting more popular these days. Let's look at what it is good at and what it is not.

The Map/Reduce Programming Model
Map/Reduce offers a different programming model for handling concurrency than the traditional multi-thread model.

The multi-thread programming model allows multiple processing units (with different execution logic) to access a shared set of data. To maintain data integrity, the processing units coordinate their access to the shared data using locks and semaphores. Problems such as race conditions and deadlocks can easily happen but are hard to debug. This makes multi-threaded programs difficult to write and hard to maintain. (Java provides a concurrent library package to ease the development of multi-threaded programs.)
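To make the contrast concrete, here is a minimal Java sketch (my own illustration, not from any particular codebase) of the multi-thread model: two threads update a shared counter and must coordinate through a lock to keep the result correct.

import java.util.concurrent.locks.ReentrantLock;

public class SharedCounter {
    private long count = 0;
    private final ReentrantLock lock = new ReentrantLock();

    // Every thread must take the lock before touching the shared data;
    // missing this anywhere introduces a race condition that is hard to debug.
    public void increment() {
        lock.lock();
        try {
            count++;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SharedCounter counter = new SharedCounter();
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) counter.increment();
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(counter.count); // 200000 only because access is coordinated
    }
}

The correctness of the program depends on every access site remembering to coordinate, which is exactly what makes this model error-prone.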

The data-driven programming model feeds data into different processing units (with the same or different execution logic). Execution is triggered by the arrival of data. Since processing units can only access data piped to them, data sharing between processing units is prohibited upfront, so there is no need to coordinate access to data.

This doesn't mean there is no coordination of data access. Rather, the coordination is done explicitly by the graph, i.e., by defining how the nodes (processing units) are connected to each other via data pipes.
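Here is a minimal Java sketch of the data-driven style, assuming a BlockingQueue as the data pipe between two hypothetical processing units. The consumer runs only when data arrives, and it can only see what was piped to it, so no locking of shared state is needed.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class DataDrivenPipe {
    public static void main(String[] args) {
        // The pipe is the only channel between the two processing units.
        BlockingQueue<String> pipe = new ArrayBlockingQueue<>(10);

        // Producer unit: emits data into the pipe.
        new Thread(() -> {
            try {
                for (String s : new String[] {"a", "b", "c", "END"}) {
                    pipe.put(s);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).start();

        // Consumer unit: execution is triggered by the arrival of data.
        new Thread(() -> {
            try {
                String item;
                while (!(item = pipe.take()).equals("END")) {
                    System.out.println("processed " + item);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).start();
    }
}

The "graph" here is trivial (one producer, one pipe, one consumer), but the coordination lives entirely in how the units are wired together, not in locks inside the units.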

The Map/Reduce programming model is a specialized form of the data-driven programming model where the graph is defined as a "sequential" list of Map/Reduce jobs. Within each Map/Reduce job, execution is broken down into a "map" phase and a "reduce" phase. In the map phase, each data split is processed and one or more outputs are produced, each with a key attached. This key is used to route the outputs of the map phase to the second "reduce" phase, where data with the same key is collected and processed in an aggregated way.
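As a concrete illustration, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API (the class names are the usual textbook ones, not anything specific to this post). Each mapper emits a (word, 1) pair per word, and the word acts as the key that routes map outputs to the reducers.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each data split is processed record by record; one output
    // per word is produced, with the word itself attached as the key.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE); // the key routes this output to a reducer
            }
        }
    }

    // Reduce phase: all map outputs with the same key are collected together
    // and processed in an aggregated way (here, summed).
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note how context.write in the map phase and the grouped Iterable in the reduce phase correspond directly to the key-routing and aggregation described above.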

Note that in the Map/Reduce model, parallelism happens only within a job, and execution between jobs is done in a sequential manner. As different jobs may access the same set of data, knowing that jobs are executed serially eliminates the need to coordinate data access between jobs.

Designing an application to run on Hadoop is a matter of breaking the algorithm down into a number of sequential jobs and then exploiting data parallelism within each job. Not all algorithms fit into the Map/Reduce model. For a more general approach to breaking an algorithm down into parallel form, please visit here.
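For example, a driver that chains two jobs might look like the sketch below; the stage names and the intermediate staging path are hypothetical, and the application-specific mapper/reducer classes are elided.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStageDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]); // hypothetical staging directory
        Path output = new Path(args[2]);

        // Stage 1: data parallelism happens inside this job only.
        Job stage1 = Job.getInstance(conf, "stage-1");
        // ... set mapper/reducer classes for stage 1 here ...
        FileInputFormat.addInputPath(stage1, input);
        FileOutputFormat.setOutputPath(stage1, intermediate);
        if (!stage1.waitForCompletion(true)) System.exit(1); // block until done

        // Stage 2 starts only after stage 1 has fully completed, so the two
        // jobs never access the intermediate data concurrently.
        Job stage2 = Job.getInstance(conf, "stage-2");
        // ... set mapper/reducer classes for stage 2 here ...
        FileInputFormat.addInputPath(stage2, intermediate);
        FileOutputFormat.setOutputPath(stage2, output);
        System.exit(stage2.waitForCompletion(true) ? 0 : 1);
    }
}

The blocking waitForCompletion calls are what make the between-job execution strictly serial.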

Characteristics of Hadoop Processing
A detailed explanation of the Hadoop implementation can be found here. Basically, Hadoop has the following characteristics ...
  • Hadoop is "data-parallel", but "process-sequential". Within a job, parallelism happens within the map phase as well as within the reduce phase, but these two phases cannot run in parallel: the reduce phase cannot start until the map phase is fully completed.
  • All data being accessed by the map process needs to be frozen (updates cannot happen) until the whole job is completed. This means Hadoop processes data in chunks in a batch-oriented fashion, making it not very suitable for stream-based processing where data flows in continuously and immediate processing is needed.
  • Data communication happens via a distributed file system (HDFS). Latency is introduced because extensive network I/O is involved in moving data around (i.e., three copies of the data need to be written synchronously). This latency is not an issue for batch-oriented processing, where throughput is the primary concern, but it means Hadoop is not suitable for online access where low latency is critical.
Given the above characteristics, Hadoop is NOT good at the following ...
  • Perform online data access where low latency is critical (Hadoop can be used together with HBase or a NoSQL store to deliver low-latency query responses)
  • Perform random, ad-hoc processing of a small subset of data within a large data set (Hadoop is designed to scan all data in parallel)
  • Process small data volumes (for data volumes below the hundred-GB range, many more mature solutions exist)
  • Perform real-time, stream-based processing where data arrives continuously and immediate processing is needed (to keep the overhead small enough, data typically needs to be batched for at least 30 minutes, so you won't be able to see results for the current data until 30 minutes have passed)

4 comments:

Anonymous said...

"The reduce phase cannot be started after the map phase is completed"

I think you meant:
"The reduce phase cannot be started until the map phase is completed"

Ricky Ho said...

You are right! Thanks for pointing it out!

Ismael Juma said...

Hi,

See the following for a paper on introducing pipelining to Hadoop:

http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html

Best,
Ismael

Alex said...

Hi Ricky,

Thanks for your article. Despite the fact that it was written quite a long time ago, I still find it interesting.

You mentioned that there are better solutions than Hadoop for processing small data sets (streams) in real time with low latency. Could you mention some of them, please?

Thanks.

Alex
