Small data processing tasks can be accomplished by a single MapReduce
job, but complex tasks need to be broken down into simpler subtasks,
each of which is accomplished by an individual MapReduce job.
Chaining MapReduce jobs in a sequence:
Two jobs can be executed manually one after the other, but it’s more convenient to automate the execution
sequence. You can chain MapReduce jobs to run sequentially, with the output of
one MapReduce job being the input to the next.
Chaining MapReduce jobs is analogous to Unix pipes:
    mapreduce-1 | mapreduce-2 | mapreduce-3 | ...
Chaining MapReduce jobs involves calling the driver of one MapReduce job after another. The
driver for each job has to create a new JobConf object and set
its input path to the output path of the previous job. Once the whole chain
has finished, you can delete the intermediate data generated at each step.
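The path wiring described above can be sketched in plain Java. This is only a model with no Hadoop dependency: the ChainPlan class, its methods, and the tmpDir/<jobName>-out layout are hypothetical names chosen for illustration. In a real driver, each step would correspond to its own JobConf whose input and output paths are set to these values before the job is run.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of how a sequential chain wires its paths (hypothetical, no Hadoop types). */
final class ChainPlan {

    /** One step of the chain: where the job reads from and where it writes to. */
    static final class Step {
        final String name;
        final String inputPath;
        final String outputPath;

        Step(String name, String inputPath, String outputPath) {
            this.name = name;
            this.inputPath = inputPath;
            this.outputPath = outputPath;
        }
    }

    /**
     * Builds the steps so that each job's input path is the previous job's output path.
     * Intermediate outputs go under tmpDir (assumed layout: tmpDir/<jobName>-out);
     * only the last job writes to finalOutput.
     */
    static List<Step> plan(List<String> jobNames, String input, String finalOutput, String tmpDir) {
        List<Step> steps = new ArrayList<>();
        String currentInput = input;
        for (int i = 0; i < jobNames.size(); i++) {
            boolean last = (i == jobNames.size() - 1);
            String out = last ? finalOutput : tmpDir + "/" + jobNames.get(i) + "-out";
            steps.add(new Step(jobNames.get(i), currentInput, out));
            currentInput = out; // the next job reads what this job wrote
        }
        return steps;
    }

    /** Intermediate directories that can be deleted once the whole chain has finished. */
    static List<String> intermediatePaths(List<Step> steps) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < steps.size() - 1; i++) {
            paths.add(steps.get(i).outputPath);
        }
        return paths;
    }
}
```

For a chain of three jobs, plan() yields two intermediate directories and one final output, and intermediatePaths() lists exactly the directories to clean up at the end.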
Chaining MapReduce jobs with complex dependencies:
Sometimes the subtasks of
a complex data processing task don’t run sequentially, and their MapReduce jobs are
therefore not chained in a linear fashion. For example, mapreduce1 may process one
data set while mapreduce2 independently processes another data set. The third
job, mapreduce3, performs an inner join of the first two jobs’ output. It’s dependent
on the other two and can execute only after both mapreduce1 and
mapreduce2 are completed. But mapreduce1 and mapreduce2 aren’t
dependent on each other.
Hadoop has a mechanism to
simplify the management of such (nonlinear) job dependencies via the
Job and JobControl classes. A Job object is a representation of a MapReduce job. You
instantiate a Job object by passing a JobConf object to its constructor. In addition
to holding job configuration information, Job also holds dependency information,
specified through the addDependingJob() method. For Job objects x and y,
    x.addDependingJob(y)
means x will not start until y has finished. Whereas Job objects store the configuration and dependency
information, JobControl objects do the managing and monitoring of the job execution. You can
add jobs to a JobControl object via the addJob() method.
After adding all the jobs
and dependencies, run the JobControl object in its own thread to submit and
monitor the jobs. JobControl implements Runnable, so its run() method is normally
invoked by wrapping the object in a Thread and calling start(). JobControl has methods
like allFinished() and getFailedJobs() to track the execution of the various jobs within the
batch.
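To make the dependency rule concrete, here is a small plain-Java model of the mapreduce1/mapreduce2/mapreduce3 example. It is a sketch only: ModelJob and executionOrder() are hypothetical names, not Hadoop's Job or JobControl classes, and the model simply treats every ready job as finishing in one pass instead of submitting anything to a cluster.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java model of addDependingJob() semantics (hypothetical class, not Hadoop's Job). */
final class ModelJob {
    final String name;
    private final List<ModelJob> dependencies = new ArrayList<>();

    ModelJob(String name) {
        this.name = name;
    }

    /** Mirrors x.addDependingJob(y): this job will not start until y has finished. */
    void addDependingJob(ModelJob y) {
        dependencies.add(y);
    }

    /** A job is ready once every job it depends on is in the finished set. */
    boolean isReady(Set<ModelJob> finished) {
        return finished.containsAll(dependencies);
    }

    /**
     * Returns job names in an order that respects the dependencies, by repeatedly
     * marking every ready job as finished. Throws if no progress can be made,
     * which happens for a cycle or a dependency that was never added to the list.
     */
    static List<String> executionOrder(List<ModelJob> jobs) {
        Set<ModelJob> finished = new LinkedHashSet<>();
        while (finished.size() < jobs.size()) {
            List<ModelJob> readyNow = new ArrayList<>();
            for (ModelJob job : jobs) {
                if (!finished.contains(job) && job.isReady(finished)) {
                    readyNow.add(job);
                }
            }
            if (readyNow.isEmpty()) {
                throw new IllegalStateException("cyclic or missing dependency");
            }
            finished.addAll(readyNow);
        }
        List<String> order = new ArrayList<>();
        for (ModelJob job : finished) {
            order.add(job.name);
        }
        return order;
    }
}
```

If mapreduce3 declares a dependency on both mapreduce1 and mapreduce2, the model places it after the other two even when it is listed first, while mapreduce1 and mapreduce2 become ready in the same pass because they don't depend on each other.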
You can think of chaining MapReduce jobs in terms of the pseudo-regular expression
    [MAP | REDUCE]+
where a reducer REDUCE comes after a mapper MAP, and this [MAP | REDUCE] sequence can repeat itself
one or more times, one right after another.
The analogous expression for a job using ChainMapper and ChainReducer would be
    MAP+ | REDUCE | MAP*
The job runs multiple
mappers in sequence to preprocess the data, and after running reduce it can optionally
run multiple mappers in sequence to postprocess the data. The beauty of this
mechanism is that you write the pre- and postprocessing steps as standard mappers. You can
run each one of them individually if you want.
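The two pseudo-regular expressions above can be turned into real regular expressions to check whether a given sequence of stages fits each shape. This is a minimal sketch: ChainShape and its method names are hypothetical, and encoding each stage as a single character ('M' or 'R') is just a convenience so that java.util.regex can be used.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Checks stage sequences against the two shapes discussed in the text (hypothetical helper). */
final class ChainShape {
    // One character per stage: 'M' for a mapper, 'R' for a reducer.
    private static final Pattern SEQUENCE_OF_JOBS = Pattern.compile("(MR)+"); // [MAP | REDUCE]+
    private static final Pattern ONE_CHAINED_JOB = Pattern.compile("M+RM*");  // MAP+ | REDUCE | MAP*

    /** Encodes a list of "MAP"/"REDUCE" stage names as a string of 'M' and 'R'. */
    static String encode(List<String> stages) {
        StringBuilder sb = new StringBuilder();
        for (String stage : stages) {
            if (stage.equals("MAP")) {
                sb.append('M');
            } else if (stage.equals("REDUCE")) {
                sb.append('R');
            } else {
                throw new IllegalArgumentException("unknown stage: " + stage);
            }
        }
        return sb.toString();
    }

    /** True if the stages form one or more complete map-then-reduce jobs in a row. */
    static boolean isSequenceOfJobs(List<String> stages) {
        return SEQUENCE_OF_JOBS.matcher(encode(stages)).matches();
    }

    /** True if the stages fit a single job: mappers, exactly one reducer, optional mappers. */
    static boolean fitsOneChainedJob(List<String> stages) {
        return ONE_CHAINED_JOB.matcher(encode(stages)).matches();
    }
}
```

A sequence such as MAP, MAP, REDUCE, MAP fits a single chained job, whereas MAP, REDUCE, MAP, REDUCE contains two reducers and therefore needs two jobs run one after the other.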
Let’s look at the
signature of the ChainMapper.addMapper() method to understand in detail how to add each
step to the chained job. The signature and function of ChainReducer.setReducer()
and ChainReducer.addMapper() are analogous.
    public static <K1,V1,K2,V2> void addMapper(JobConf job,
                        Class<? extends Mapper<K1,V1,K2,V2>> klass,
                        Class<? extends K1> inputKeyClass,
                        Class<? extends V1> inputValueClass,
                        Class<? extends K2> outputKeyClass,
                        Class<? extends V2> outputValueClass,
                        boolean byValue,
                        JobConf mapperConf)
This method has eight
arguments. The first and last are the global and local JobConf objects,
respectively. The second argument (klass) is the Mapper class that will do the data
processing. The four arguments inputKeyClass, inputValueClass, outputKeyClass, and
outputValueClass are the input/output class types of the Mapper class.
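One practical consequence of these four class arguments is that, in a chain, the key/value classes a mapper outputs have to be acceptable as input to the step that follows it. The sketch below models that check in plain Java. MapperStep and typesLineUp() are hypothetical names, ordinary JDK classes such as Long and String stand in for Hadoop's Writable types, and this is not how Hadoop itself validates a chain.

```java
import java.util.List;

/** Plain-Java model of the four class arguments passed per mapper (hypothetical, no Hadoop types). */
final class MapperStep {
    final Class<?> inputKeyClass;
    final Class<?> inputValueClass;
    final Class<?> outputKeyClass;
    final Class<?> outputValueClass;

    MapperStep(Class<?> inputKeyClass, Class<?> inputValueClass,
               Class<?> outputKeyClass, Class<?> outputValueClass) {
        this.inputKeyClass = inputKeyClass;
        this.inputValueClass = inputValueClass;
        this.outputKeyClass = outputKeyClass;
        this.outputValueClass = outputValueClass;
    }

    /** True when every step's output key/value classes can be consumed by the next step. */
    static boolean typesLineUp(List<MapperStep> chain) {
        for (int i = 0; i + 1 < chain.size(); i++) {
            MapperStep current = chain.get(i);
            MapperStep next = chain.get(i + 1);
            if (!next.inputKeyClass.isAssignableFrom(current.outputKeyClass)) {
                return false;
            }
            if (!next.inputValueClass.isAssignableFrom(current.outputValueClass)) {
                return false;
            }
        }
        return true;
    }
}
```

For example, a step that outputs (String, Integer) can feed a step that takes (String, Integer) or (String, Number), but not one that expects a Long key.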
Please see the next blog post, which demonstrates the usage of chaining in MapReduce jobs.
If you have any questions, please drop a comment.