Search This Blog

Intermediate data spill in Mapreduce



As we know ,Mapreduce has two stages one is Map and second is Reduce. Map stage is responsible for filtering data and preparing the data and Reduce stage is responsible for aggregate operations and Join operations. Map output is written to disk and this operation is called spilling.
In this article, we are discussing important things happen in data spilling after map stage.


Map output is first written to buffer and buffer size is decided by io.sort.mb property .By default, it will be 100 MB.

When buffer reaches certain threshold ,It will start spilling   buffer data to disk. This threshold is decided by io.sort.spill.percent.

Before writing data onto Hard disk ,data is divided into partitions with respect to reducers.

On each Partition ,in-memory sort will be performed by key.

once per every three spills combiner will be run on sorted data if combiner function is specified.
These number of spills is decided by min.num.spills.for.combine.
after combiner function is performed, data is written to hard disk.

after completing writing of certain number of spills ,data will be merged into single file.
This number of spills is decided by io.sort.factor
By default, It is 10.

Below is picture that depicts the flow, hope it makes you understand better.

Data Flow while spilling map output