Small data processing tasks can be accomplished by a single MapReduce job, but complex tasks need to be broken down into simpler subtasks, each handled by an individual MapReduce job.
Chaining MapReduce jobs in a sequence:
Although two jobs can be executed manually one after the other, it's more convenient to automate the execution sequence. You can chain MapReduce jobs to run sequentially, with the output of one job becoming the input to the next.
Chaining MapReduce jobs is analogous to Unix pipes:

mapreduce-1 | mapreduce-2 | mapreduce-3 | ...
Chaining MapReduce jobs involves calling the driver of one MapReduce job after another. The driver of each job has to create a new JobConf object and set its input path to be the output path of the previous job. At the end of the chain, you can delete the intermediate data generated at each step.
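To make this concrete, here is a minimal driver sketch (assuming the old mapred API) that chains two jobs; the class name SequentialChainDriver, the job names, and the use of the built-in IdentityMapper/IdentityReducer are just placeholders for your own classes and paths.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SequentialChainDriver {
    public static void main(String[] args) throws Exception {
        Path input        = new Path(args[0]);
        Path intermediate = new Path(args[1]);  // output of job 1, input of job 2
        Path output       = new Path(args[2]);

        // First job in the chain (IdentityMapper/IdentityReducer stand in
        // for your real Mapper and Reducer classes)
        JobConf conf1 = new JobConf(SequentialChainDriver.class);
        conf1.setJobName("mapreduce-1");
        conf1.setMapperClass(IdentityMapper.class);
        conf1.setReducerClass(IdentityReducer.class);
        FileInputFormat.setInputPaths(conf1, input);
        FileOutputFormat.setOutputPath(conf1, intermediate);
        JobClient.runJob(conf1);                // blocks until job 1 finishes

        // Second job: its input path is the first job's output path
        JobConf conf2 = new JobConf(SequentialChainDriver.class);
        conf2.setJobName("mapreduce-2");
        conf2.setMapperClass(IdentityMapper.class);
        conf2.setReducerClass(IdentityReducer.class);
        FileInputFormat.setInputPaths(conf2, intermediate);
        FileOutputFormat.setOutputPath(conf2, output);
        JobClient.runJob(conf2);

        // Delete the intermediate data once the chain has finished
        FileSystem.get(conf2).delete(intermediate, true);
    }
}

Because JobClient.runJob() blocks until the submitted job completes, the second JobConf can safely take the first job's output directory as its input path.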
Chaining MapReduce jobs with complex dependencies:
Sometimes the subtasks of a complex data processing task don't run sequentially, and their MapReduce jobs are therefore not chained in a linear fashion. For example, mapreduce1 may process one data set while mapreduce2 independently processes another data set. The third job, mapreduce3, performs an inner join of the first two jobs' output. It's dependent on the other two and can execute only after both mapreduce1 and mapreduce2 are completed, but mapreduce1 and mapreduce2 aren't dependent on each other.
Hadoop has a mechanism to
simplify the management of such (nonlinear) job dependencies via the
Job and JobControl classes. A Job object is a representation of a MapReduce job. You
instantiate a Job object by passing a JobConf object to its constructor. In addition
to holding job configuration information, Job also holds dependency information,
specified through the addDependingJob() method. For
Job objects x and y,
x.addDependingJob(y)
means x will not start
until y has finished. Whereas Job objects store the configuration and dependency
information, JobControl objects do the managing and monitoring of the job execution. You can
add jobs to a JobControl object via the addJob() method.
After adding all the jobs and their dependencies, run the JobControl object, typically in its own thread, to submit and monitor the jobs for execution. JobControl provides methods such as allFinished() and getFailedJobs() to track the execution of the various jobs within the batch.
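A minimal sketch of wiring up the mapreduce1/mapreduce2/mapreduce3 dependency described above might look like this; conf1, conf2, and conf3 are assumed to be fully configured JobConf objects, and the five-second polling interval is an arbitrary choice.

import java.io.IOException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class DependencyChainDriver {
    public static void runChain(JobConf conf1, JobConf conf2, JobConf conf3)
            throws IOException, InterruptedException {
        Job job1 = new Job(conf1);   // processes the first data set
        Job job2 = new Job(conf2);   // independently processes the second data set
        Job job3 = new Job(conf3);   // joins the outputs of job1 and job2
        job3.addDependingJob(job1);  // job3 will not start until job1 has finished...
        job3.addDependingJob(job2);  // ...and until job2 has finished

        JobControl control = new JobControl("dependency-chain");
        control.addJob(job1);
        control.addJob(job2);
        control.addJob(job3);

        // Run the JobControl in its own thread so the driver can poll its state
        Thread controlThread = new Thread(control);
        controlThread.start();
        while (!control.allFinished()) {
            Thread.sleep(5000);
        }
        System.out.println("Failed jobs: " + control.getFailedJobs());
        control.stop();              // let the JobControl thread exit
    }
}

Once allFinished() returns true, getFailedJobs() tells you which jobs, if any, did not complete successfully.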
You can think of chaining
MapReduce jobs using the pseudo-regular expression:
[MAP | REDUCE]+
where a reducer REDUCE
comes after a mapper MAP, and this [MAP | REDUCE] sequence can repeat itself
one or more times, one right after another.
The analogous expression for a job using
ChainMapper and ChainReducer would be
MAP+ | REDUCE | MAP*
The job runs multiple
mappers in sequence to preprocess the data, and after running reduce it can optionally
run multiple mappers in sequence to postprocess the data. The beauty of this
mechanism is that you write the pre- and postprocessing steps as standard mappers. You can
run each one of them individually if you want.
Let’s look at the
signature of the ChainMapper.addMapper() method to understand in detail how to add each
step to the chained job. The signature and function of ChainReducer.setReducer()
and ChainReducer.addMapper() are analogous:
public static <K1,V1,K2,V2> void addMapper(JobConf job,
    Class<? extends Mapper<K1,V1,K2,V2>> klass,
    Class<? extends K1> inputKeyClass,
    Class<? extends V1> inputValueClass,
    Class<? extends K2> outputKeyClass,
    Class<? extends V2> outputValueClass,
    boolean byValue,
    JobConf mapperConf)
This method has eight arguments. The first and last are the global and local JobConf objects, respectively. The second argument, klass, is the Mapper class that will do the data processing. The four arguments inputKeyClass, inputValueClass, outputKeyClass, and outputValueClass are the input/output class types of the Mapper class. The seventh argument, byValue, controls whether key/value pairs are passed to the next step in the chain by value (a copy, which is always safe) or by reference (more efficient, but only safe if the mapper doesn't alter or reuse them after emitting).
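As a hedged sketch of how these calls fit together, the driver below composes a chained job of the shape MAP+ | REDUCE | MAP*. Map1, Map2, Reduce, and Map3 are hypothetical classes (not defined in this post) that would implement the old-API Mapper and Reducer interfaces with the key/value types shown; only the ChainMapper/ChainReducer wiring itself comes from the API discussed above.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(ChainDriver.class);
        job.setJobName("chain-job");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Preprocessing mappers (the MAP+ part of the chain)
        ChainMapper.addMapper(job, Map1.class,
                LongWritable.class, Text.class,   // Map1 input key/value types
                Text.class, Text.class,           // Map1 output key/value types
                true, new JobConf(false));        // byValue = true, local JobConf

        ChainMapper.addMapper(job, Map2.class,
                Text.class, Text.class,
                LongWritable.class, Text.class,
                true, new JobConf(false));

        // The single REDUCE step
        ChainReducer.setReducer(job, Reduce.class,
                LongWritable.class, Text.class,
                Text.class, Text.class,
                true, new JobConf(false));

        // Postprocessing mapper (the optional MAP* part of the chain)
        ChainReducer.addMapper(job, Map3.class,
                Text.class, Text.class,
                LongWritable.class, Text.class,
                true, new JobConf(false));

        JobClient.runJob(job);
    }
}

The key/value types must line up from step to step: each step's output types are the next step's input types, and passing byValue as true is the safe default when you aren't sure whether a step reuses its key/value objects.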
Please see the next blog post, which demonstrates the usage of chaining in MapReduce jobs.
For any query please drop a comment.