Hadoop ecosystem research papers

The Hadoop ecosystem has many tools, and almost every tool is an implementation of a research paper. Many of these papers were written by Google employees. In this article, I would like to gather most of these papers in one place.

As you already know, Hadoop has two core modules: HDFS and MapReduce. These are open source implementations of Google's GFS and MapReduce.

Below are the two papers.

1. GFS (The Google File System).

2. MapReduce: Simplified Data Processing on Large Clusters.

Apache Hive is a data warehouse built on top of Hadoop. It is an implementation of the paper Hive: A Petabyte Scale Data Warehouse Using Hadoop.

Apache Pig is a platform for analyzing large data sets using the data flow language Pig Latin. It is an implementation of the paper Pig Latin: A Not-So-Foreign Language for Data Processing.

Apache HBase is an open source implementation of Google's paper Bigtable: A Distributed Storage System for Structured Data.

Apache Spark is an implementation of the paper Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.

Apache Tez is an implementation of the paper Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications.

Apache Crunch is an implementation of Google's FlumeJava paper (FlumeJava: Easy, Efficient Data-Parallel Pipelines).

Apache ZooKeeper is an implementation of the paper ZooKeeper: Wait-free Coordination for Internet-scale Systems.

YARN is described in the paper Apache Hadoop YARN: Yet Another Resource Negotiator.

Apache Storm is described in the paper Storm @Twitter.

I hope these papers are useful to you.

Intermediate data spill in MapReduce

As we know, MapReduce has two stages: Map and Reduce. The Map stage is responsible for filtering and preparing the data, and the Reduce stage is responsible for aggregations and joins. Map output is written to disk, and this operation is called spilling.
In this article, we discuss the important things that happen while spilling data after the map stage.
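To make the two stages concrete, below is a minimal sketch of the classic word-count job against the Hadoop 2.x mapreduce API. The class names (WordCountSketch and its inner classes) are illustrative, not from any particular distribution: the mapper prepares the data by emitting (word, 1) pairs, and the reducer aggregates them by summing.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Map stage: filter and prepare -- split each line into words
    // and emit a (word, 1) pair for every word found.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce stage: aggregate -- sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```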

Map output is first written to a memory buffer whose size is decided by the io.sort.mb property. By default, it is 100 MB.

When the buffer reaches a certain threshold, Hadoop starts spilling the buffer data to disk in the background. This threshold is decided by io.sort.spill.percent, which defaults to 0.80 (80%).

Before the data is written to the hard disk, it is divided into partitions, one per reducer.
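For reference, the default assignment of records to partitions is simply a hash of the key modulo the number of reducers. Below is a sketch equivalent to Hadoop's built-in HashPartitioner (the class name SketchHashPartitioner is mine); it guarantees that all values for the same key land on the same reducer.

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the default hash partitioning: each map output record
// goes to one of the numReduceTasks partitions based on its key.
public class SketchHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then
        // take the remainder modulo the number of reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```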

Within each partition, an in-memory sort by key is performed.

If a combiner function is specified, it is run on the sorted data of each spill before the spill is written to the hard disk.
When the spill files are later merged, the combiner is run again, but only if there are at least a certain number of spills; that number is decided by min.num.spills.for.combine and is 3 by default.
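A combiner is often just the reducer class itself, which is safe whenever the reduce function is commutative and associative (a sum, for example). Below is a driver sketch that wires up a combiner, reusing the illustrative word-count classes from the sketch above; the input and output paths are placeholders taken from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountSketch.TokenizerMapper.class);
        // The combiner runs on sorted spill data on the map side,
        // shrinking what is written to disk and shuffled.
        job.setCombinerClass(WordCountSketch.IntSumReducer.class);
        job.setReducerClass(WordCountSketch.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```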

Once the map task has finished writing its spills, the spill files are merged into a single partitioned and sorted output file.
The maximum number of spill files merged at once is decided by io.sort.factor.
By default, it is 10.
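To pull the tuning knobs together, here is a sketch that sets them all, spelled out at their default values. Note that this article uses the classic Hadoop 1.x property names; the code below uses their Hadoop 2.x equivalents (io.sort.mb became mapreduce.task.io.sort.mb, io.sort.spill.percent became mapreduce.map.sort.spill.percent, min.num.spills.for.combine became mapreduce.map.combine.minspills, and io.sort.factor became mapreduce.task.io.sort.factor).

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: the spill-related tuning properties, set to their defaults.
public class SpillTuningSketch {
    public static Configuration spillDefaults() {
        Configuration conf = new Configuration();
        // Size of the in-memory map output buffer (io.sort.mb).
        conf.setInt("mapreduce.task.io.sort.mb", 100);
        // Fill fraction of the buffer that triggers a background
        // spill to disk (io.sort.spill.percent).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        // Minimum number of spill files before the combiner runs
        // again during the merge (min.num.spills.for.combine).
        conf.setInt("mapreduce.map.combine.minspills", 3);
        // Maximum number of spill files merged in one pass
        // (io.sort.factor).
        conf.setInt("mapreduce.task.io.sort.factor", 10);
        return conf;
    }
}
```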

Below is a picture that depicts the flow; I hope it helps you understand it better.

Data flow while spilling map output

Big Data quotations

I would like to share some big data quotations from famous people around the world.

Hope you enjoy reading them.

Below are some quotations.

Without big data, you are blind and deaf and in the middle of a freeway. -- Geoffrey Moore

The world is one big data problem. -- Andrew McAfee

For every two degrees the temperature goes up, check-ins at ice cream shops go up by 2%. -- Andrew Hogue, Foursquare

Data scientists are the new rock stars. -- DJ Patil

The value of Big Data lies not in the technology itself, but in the real-world problems it can solve. -- Jeff Hammerbacher

Data scientists have the skills and expertise to transform the planet for the better. -- Jeremy Howard

Big data could know us better than we know ourselves. -- Dan Gardner

Today a street stall in Mumbai can access more information, maps, statistics, academic papers, price trends, and futures markets than a U.S. president could only a few decades ago. -- Juan Enriquez

We have reached a tipping point in history: today more data is being manufactured by machines, servers, and cell phones than by people. -- Michael E. Driscoll

Data is the new oil. -- Clive Humby

In God we trust; all others must bring data. -- W. Edwards Deming

Please also share your favorite big data quotation.

Environment for practising Hadoop

Most Hadoop aspirants face a common problem: how to set up all the Hadoop components so that they can practice ecosystem tools like Hive, Pig, Oozie, and so on. The leading Hadoop vendors already have a solution for this problem:
they provide virtual machines with all the Hadoop components pre-installed, which you can use directly.
These VMs need to be set up on top of your existing operating system, and in this article I am going to cover how to do that.
Below are the prerequisites for this setup.

1). RAM: 4 GB or more

2). Hard disk: 50 GB of free space

3). Virtualization technology

Most machines come with virtualization technology enabled by default. You can check in the BIOS settings: you should see a virtualization technology option under advanced settings, though the exact location varies by vendor. If it is disabled, please enable it.

4). A 64-bit operating system

In this article, I am talking about the VMs from the leading vendors, Cloudera and Hortonworks.


1. Download Oracle VirtualBox

Download Oracle VirtualBox. This will enable you to create virtual machines on top of your existing operating system.

2. Download a Hadoop VM image

Download a VM image that already has all the Hadoop components set up, so you need not worry about installing them yourself.
The leading vendors provide VMs for exactly this purpose:
download either the Cloudera QuickStart VM or the Hortonworks Sandbox.
For both, download the image built for Oracle VirtualBox.
If you use different virtualization software, choose the matching version instead.

3. Extract the VM image to a folder

4. Install VirtualBox

If you are on Windows, you can just double-click the installer and it will take care of everything; the installation is straightforward.

5. Set up the virtual machine

    5.1 Click on New:
        specify a name,
        select Type as Linux,
        select Version as Other Linux (64-bit).

    5.2 Allocate around 2 GB of RAM to the virtual machine.

    5.3 Point to the virtual image file extracted in step 3.

You will now see one more VM added to the list.

6. Start the virtual machine

Click on the Start button; it will boot your virtual machine, and you can use any Hadoop component right away.

All the best, and happy Hadooping.

Worldwide big data job market

This weekend I did some research on big data jobs worldwide. For this analysis, I took big data job postings from most of the major job portals for January 2015. I would like to share some of the insights I got; they might help big data job seekers.

Among countries, the United States tops the list for Jan 2015, holding 15 percent of postings. Australia (7%) and Argentina (7%) take the next places, and India (4%) is in ninth place.


In India, Karnataka tops the list with around 50 percent of the Jan 2015 postings. Maharashtra (16%) and Andhra Pradesh (8%) take the next places.

Below are the job titles/positions most frequently posted in Jan 2015.

Data Scientist
Data Analyst
Big Data Architect
Big Data Engineer
Business Analyst
Senior Data Scientist
Data Engineer
Hadoop Developer
Big Data Hadoop / NoSQL Developer

The Data Scientist position tops the list with around 10 percent of postings.

In the US, California tops the list with around 25% of postings, and New York (8%) and Texas (6%) take the next places.

I hope this article helps you understand the big data job market.

Good luck with your big data job search.