Before starting with Map-Reduce framework, its very important to know few concepts, I’ve mentioned few here, more I will be adding soon. This can also serve for quick revision for those who already know. 🙂
- Map reduce framework works by breaking the computation into two parts or phases. Map and reduce phase. For both map/reduce phase Key-Value pair is input and Key-Value pair is output. During the map phase, programmer defined map function runs on the input Key-Value pairs and generate intermediate Key-Value pairs, and intermediate key value pairs are consumed by user defined reduce function, which again generates Key-Value pairs in output.
We can say : K1-V1 to Map -> K2,V2 (map output) -> Reduce -> K3, V3 final output.
- The intermediate output (o/p from map functions K2-V2 pairs) are processed by Hadoop MapReduce Framework before they are sent to the reducer. Before they are passed on to reducer framework sorts and groups them based on the key.
- Map function is a good place to drop the non matching or bad records. (This can also be done by writing custom RecordReader)
What is a Map Reduce Job?
In Map reduce job is a set of instructions or computation or processing which client wants to perform on the input data.
Map reduce job consists of three things:
- Map-Reduce Program (i.e Mapper, Reducer, Job classes)
- Input Data
- Configuration information.
Hadoop runs the given job by splitting into various “Tasks” which are executed in parallel on the cluster.
‘Nodes’ of Map-Reduce:
There are two types of daemons which are required to run any MR job.
- Job Tracker (Master process)
- Task Tracker (Slave process)
Job Tracker manages the job execution on the cluster by scheduling the tasks to run on the task trackers. Task trackers are the actual work horses of the system on which computation will happen.
Any task tracker can run two kind of task
- Map task
- Reduce task
All the task trackers report to job-tracker periodically by sending their status (progress) messages, so that JobTracker can keep track of the overall progress of the given job.
If any task tracker goes down, then JobTracker reschedules the ‘failed’ task on other task tracker.
What is a split ?
Input data for a single map task is known as input split. Hadoop runs a map task for the given split and run the programmer defined Map function for each input record of the split.
By default HDFS block size is 64 MB. Typically input split size for the job is also kept as 64 MB.
Generally they are matched to guarantee that data required by the map function is processed locally.
For example HDFS block size = 32 MB and input split size = 128MB , that means we need four blocks of data for a single map task, since hadoop doesn’t write all the blocks on the same node, means map task might have to copy 3 blocks from other nodes, which leads to network traffic.
If split size = block size , say for example both are 64 MB, that means its guaranteed that for the map task data will be found locally.
Hadoop tries to its maximum to run the map task locally (on the node where input data is stored) , this helps to reduce cluster bandwidth wastage. Reduce function cannot take advantage of data locality as they are fed from each mapper.
Map functions outputs (K2-V2 key value pairs) are written on the local hard disk of the machine and not on HDFS, as its intermediate output and its discarded once reduce function is over. If we write the intermediate output to the HDFS with replication then we are wasting lot of disk space + network bandwidth.
Reduce function’s outputs are written on HDFS unlike map outputs.
Generally if we have ‘m’ maps and ‘r’ reducers then output of each map will be fed to each reducers. Hence m*r distinct copy operation will happen.
What if my map function finished processing but reducer could not copy output ?
In this scenario, JobTracker will re-run the map task on some other task tracker. And map function will re-create the output.
Number of Map tasks = Number of input splits = Number of blocks ( if size of split = size of HDFS block)
Number of reducer = ? No, its not decided by splits, it be specified on command prompt or in configuration xml or through API in driver class. (more on this later)
What is a combiner?
A combiner can be thought of as ‘Map side reducer’ or ‘local reducer’ .
To optimize the network bandwidth, Hadoop provides facility to specify the combiner. Combiner runs on each map’s output and output of combiner becomes the input of the reducer. Signature and the interface to be implemented for combiner is same as that of Reducer. Many times we can use Reducer class as is for combiner.
Its important to know check if we can use the combiner for the MR job, this can drastically improve performance as on the reducer side there will be less work to be done for sorting and grouping.
Hadoop doesn’t provide any guarantee that our combiner will run or not. Or if it runs how many times it will run. So, we need to make sure that our job doesn’t rely on combiner.
When we have more than one reducer, its important to distribute the load equally and more importantly, we need to make sure that all the values for any given key goes to a single reducer.
For example :
We have 10 records, which have the same key : a, it is useless if we send some of the records in one reducer and some in other reducers, all 10 records have to be reduce-ed in a single function.
By default MR framework uses default HashPartitioner. We can over-ride the partioner by writing our own partitioner. (Generally the case with secondary sorting and reduce side joins).
Map only jobs:
For some jobs, which can be executed in parallel and output can be generated only by Map functions, those kind of jobs are known as Map only job. We can make map only job by specifying number of reducers = 0 (on command prompt while running the job or through API during writing driver class).
For the map only jobs, output of the map functions is written and replicated on HDFS.