
Analysis of Cloudera customer case studies

Cloudera is one of the leading big data vendors in the world, providing Hadoop support and training services to its customers. I have always wondered how big data customers decide on a vendor.
So, for fun, I downloaded all of Cloudera's customer case studies and analyzed them to generate a report about what they say.

I have categorized the case study text into four topics.

1). What are the most commonly mentioned benefits of Cloudera?
2). What were customers mostly using before Hadoop? Of course, some business use cases are new ones.
3). What are the most used technologies among Cloudera customers?
4). Which domains do Cloudera customers mostly come from?

Domains:

Cloudera customers are from domains such as health care, BFSI, digital marketing, education and security.

Most of the customers are from the following domains.

Health Care
Digital marketing

It is good to know that the second-largest group of customers is from the health care domain; something good is happening for humans with big data. An interesting use case is real-time monitoring of kids in hospitals: it seems a national children's hospital has improved patient care for kids using Hadoop.

Migration:

Many customers did not mention what they were using before Hadoop, and some have fresh use cases for it. Some customers mentioned the technologies below as their old technology stack.

Data Warehouse

Many customers say their RDBMS could not deal with big data. They mentioned they were using Oracle, SQL Server, DB2 and MySQL.

Most used technology:

Cloudera customers say they are using the batch processing, real-time, ETL and visualization tools of the Hadoop ecosystem.

The most mentioned technologies are:

I was surprised to learn that most customers say they are using Hive, Flume, MapReduce and HBase.

Benefits:

Let us see why customers are choosing Cloudera over others.

It seems Hadoop itself is a cost-effective solution. Though Cloudera is relatively costlier than MapR and Hortonworks, customers still say it is cost-effective; I think they are comparing it against data warehouses and other solutions. Cloudera is also well known for its training and support services.

Hope this is useful for you. Please check the MapR analysis here.

Analysis of MapR customer case studies

For fun, I wanted to analyze vendors like MapR, Hortonworks and Cloudera, so I started with MapR. I am interested to know what benefits customers are getting, or expecting, from a vendor:

The tools they are implementing.

The domains of the customers.

Any migration they are doing.

And other benefits.

I downloaded the customer case studies from their official website and analyzed all of them to come up with the reports below.

Domains:

MapR customers are from almost all domains, but most of them are from digital marketing and health care. Digital marketing has the largest number of customers, followed by the health care domain.

Digital marketing.
Health Care.

Most used tools:

Many customers in the MapR customer list have use cases for real-time analytics; it seems MapR-DB is able to attract many customers. Customers are also using Storm, Kafka and Spark.
Below are the most used tools among MapR customers: MapR-DB is number one, followed by MapReduce and Kafka.

Benefits:

I would like to know why and how customers choose vendors. Most MapR customers mentioned the benefits below; it seems MapR has established a unique brand with them.

NFS Support
Better performance

NFS support is the unique feature customers get from MapR.

Migration:

It is always interesting to know what customers were using before Hadoop, though many use cases are fresh ones. Many customers mentioned that they migrated their technology stack from an RDBMS to Hadoop; Oracle and SQL Server are the most mentioned technologies.

Hope this is useful for you. Please check the Cloudera analysis here.

Hadoop ecosystem books to read

Even though Hadoop is a 10-year-old technology, you will still find relatively few resources for learning it. There are different reasons for that. One of them is that Hadoop is a rapidly changing technology, and many people may not have tried all of its features; sometimes it is also not ready for enterprise use cases.

I would like to put a list of Hadoop ecosystem books in one place in this article.

Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die

This is a very good book on predictive analytics for beginners. It helps you build an understanding of predictive analytics.

Hadoop: The Definitive Guide:

The first and foremost book everybody should read on Hadoop is Hadoop: The Definitive Guide. It covers not only HDFS and MapReduce but also MapReduce abstractions like Cascading, Hive and Pig. Apart from that, it covers both the development and administration aspects of the technology, and the latest edition includes the latest features, Spark among them. It also covers the certification syllabus for both Hortonworks and Cloudera. The book is well organized and offers quality content and examples.
However, if you want a complete understanding of a specific tool, you may have to check another book on that tool: this book covers all features of HDFS and MapReduce but only the core features of the other ecosystem tools.

Hadoop in Practice:

This book covers in-depth topics of HDFS and MapReduce with very good coding examples.
Apart from HDFS and MapReduce, it also covers SQL tools like Hive, Impala and Spark SQL.

Hadoop Operations

This is a very good book on administration operations. It covers installation and configuration of the Hadoop daemons; operating system and network details are also covered as part of the cluster planning topics. It is a small book but offers quality content on administration.
For administration, you may want to refer to this book along with Hadoop: The Definitive Guide.

Ecosystem tools:

This is a very good book on Apache Hive. It covers almost all topics of Hive. The best part is that it explains even the most difficult features of Hive in an understandable way. If you want to master UDFs and UDAFs, you can depend on it.

This small book covers Apache Pig. The author has very good experience with Apache Pig, but the editorial work was not done properly; there is scope for improvement in this book.

Cascading is a most useful tool in the Hadoop ecosystem, and it has very good documentation on its home page. To learn more practical applications and the different analytical capabilities of the Cascading framework, this book is very useful.

Why and where an RDBMS is not relevant in big data applications, and how HBase addresses the problems of big data, are well covered in this book. It is useful for both development and administration of HBase.

Below are other books available on utility tools like Sqoop, Oozie and Flume. I have not read these books, but we do not have other options for them as of now.

Apache Sqoop Cookbook :

Apache Oozie: The Workflow Scheduler for Hadoop

Using Flume: Flexible, Scalable, and Reliable Data Streaming

Security Books :

The following are some security books on Hadoop.

Hadoop Security by Ben Spivey

This book gives a good theoretical explanation of security.

Kerberos is a very important service in the Hadoop ecosystem for network security.

The following book is one of the best for learning Kerberos.

Kerberos: The Definitive Guide

Prerequisite for learning Apache Hadoop

Apache Hadoop is a framework used for processing large data sets on commodity hardware.
It has two core modules, HDFS and MapReduce: HDFS is used for data storage and MapReduce for data processing. Hadoop has become the de facto standard for processing large data sets.
As it is widely used in companies nowadays, everybody is trying to learn Apache Hadoop.
In this article, I will discuss the prerequisites for learning Hadoop.

The Hadoop ecosystem has many tools and is growing fast day by day. The main tools are HDFS and MapReduce; both are written in Java. Many technologies are built on top of MapReduce, for example Apache Hive, Apache Pig, Cascading and Apache Crunch, and these are also developed in Java. Apache HBase is a NoSQL database. Apart from these, we also have smaller tools like Apache Sqoop and Apache Oozie, which are written in Java too.

The skill set below is required for learning Apache Hadoop.

Programming Language:

Most of the technologies in the Hadoop ecosystem are written in Java, but Hadoop also supports several other programming languages. We can use AWK and sed as part of Streaming, C/C++ as part of Pipes, and Python for data processing, again via Streaming.
Java is the most widely used language in Hadoop; Python is also often used, and Scala has gained ground after the success of Spark.

SQL:

The Apache Hive query language is almost the same as ANSI SQL. Apache Pig has many commands similar to SQL; for example, order by, group by and joins are also available in Apache Pig. The same operations are available in Cascading, though as Java classes, and HBase too has some commands similar to SQL. Not only Hadoop ecosystem tools but many other big data tools provide a SQL interface so that people can learn them easily; Cassandra's query language is also close to SQL.

Operating System:

You need good OS skills. Most of the time, Unix-based operating systems are used in production, so if you know any Unix-based OS, your day-to-day life will become easier. If you also know shell scripting, you can achieve good productivity.

Other skills:

Apache Sqoop has simple commands, and one can learn it easily. Apache Oozie applications are written as XML files, almost every technology comes with a REST API, and some REST APIs give JSON output, so familiarity with XML and JSON helps. As all of these tools are built for parallel computing, it is better to have an understanding of different parallel computing technologies. Last but not least, one needs good debugging and troubleshooting skills to resolve day-to-day issues; otherwise you may spend several days on a single problem.
Feel free to contact me if you have any other questions.

Introduction to Apache Knox

The Hadoop ecosystem has many tools, as you already know; some of them are HDFS, Hive, Oozie and Falcon. All of these tools provide a REST API so that other tools can communicate with them, and every tool has a hostname and port number as part of its REST API URL. From a security standpoint, it is not good practice to expose internal hostnames and port numbers: somebody might try to attack using them.

To address this problem, we have a security tool called Apache Knox. Apache Knox is a REST API based gateway that provides perimeter security for all Hadoop services.

Apache Knox hides the REST API URLs of all Hadoop services from external Hadoop clients; they only use the REST API provided by Apache Knox. Knox delegates external client requests to the corresponding Hadoop services, and before delegating a request, it applies all the security services configured on the cluster.

Below are some more important points about Apache Knox.
  • A demo LDAP server is available for Apache Knox by default.
  • Kerberos is optional for Apache Knox but can easily be integrated with it.
  • External clients need not remember the REST API URLs of all Hadoop services.
  • It provides an audit log.
  • It provides authorization, including service-level authorization.
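To make the idea of URL hiding concrete, here is a minimal sketch in plain Java (not the Knox API) of how a direct service URL maps to a gateway URL. The hostnames, the port 8443, and the topology name "default" are illustrative assumptions.

```java
// Sketch: a direct WebHDFS URL exposes the NameNode host and port, e.g.
//   http://<namenode-host>:50070/webhdfs/v1/<path>?op=...
// while through Knox the client only ever sees the gateway host:
//   https://<knox-host>:8443/gateway/<topology>/webhdfs/v1/<path>?op=...
public class KnoxUrlSketch {
    public static String toGatewayUrl(String knoxHost, String topology,
                                      String servicePathAndQuery) {
        // Only the Knox host and topology are visible to external clients;
        // Knox resolves the internal host and port itself.
        return "https://" + knoxHost + ":8443/gateway/" + topology + servicePathAndQuery;
    }

    public static void main(String[] args) {
        System.out.println(toGatewayUrl("knox.example.com", "default",
                                        "/webhdfs/v1/tmp?op=LISTSTATUS"));
    }
}
```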


Difference between Apache Hive and Apache Pig

MapReduce follows a key-value programming model. It has two core stages, Map and Reduce.
Both Map and Reduce take key-value pairs as input and produce key-value pairs as output. To write MapReduce applications, we need to know a programming language like Java.
A MapReduce application has a map program, a reduce program and a driver program to run them, and we need to create a jar containing these programs to process the data.

MapReduce has a lengthy development time and may not be suitable for situations like ad hoc querying. That is one of the reasons there are so many abstractions available for MapReduce,
for example Cascading, Apache Crunch, Apache Hive and Apache Pig. All of these hide the key-value complexity from the developer. We will now discuss the differences between Apache Hive and Apache Pig.
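The key-value model described above can be illustrated with a toy, single-process word count in plain Java (no Hadoop classes; the class and method names are my own): the map stage emits (word, 1) pairs, and a shuffle/reduce stage groups them by key and sums each group.

```java
import java.util.*;

public class KeyValueModel {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map stage: each input line produces a list of (word, 1) pairs.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                if (!word.isEmpty())
                    mapped.add(new AbstractMap.SimpleEntry<>(word, 1));

        // Shuffle + Reduce stage: group pairs by key and sum the values,
        // producing one (word, count) pair per distinct word.
        Map<String, Integer> reduced = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped)
            reduced.merge(kv.getKey(), kv.getValue(), Integer::sum);
        return reduced;
    }
}
```

In real MapReduce these stages run distributed across a cluster, but the key-value flow is the same.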

Apache Hive vs Apache Pig

Types of Data they support

Apache Hive :  

Hive is a scalable data warehouse on top of Apache Hadoop. As data is available in tables, it only supports structured data; processing semi-structured data is difficult, and processing unstructured data is very difficult.

Apache Pig :

Pig is a platform for processing large data sets. Its query language is called Pig Latin, which can process structured, semi-structured and unstructured data.

Programming model

Apache Hive: The Hive query language is a declarative programming language; it is not easy to build complex business logic with it.

Apache Pig: Pig Latin is a procedural, data flow language; you can easily write complex business logic with it.


Metadata sharing

Apache Hive: Hive has a component called HCatalog that provides a cross-platform schema.
It also has a REST API called WebHCatalog, so you can integrate other tools with Apache Hive.
Teradata and Aster Data are already integrated with Apache Hive, and even Pig can process data using WebHCatalog.

Apache Pig: It does not have any such feature, because it is a processing platform, not a storage platform.


Debugging

Apache Hive: We can debug Hive queries, but it is not that easy.

Apache Pig: Pig Latin is a data flow language designed with debugging in mind, so we can easily debug Pig Latin scripts.


Ease of learning

Both can be learned easily. Hive is almost the same as SQL, and Pig Latin also looks similar to SQL.

One can easily learn Hive and start writing queries to process data.

Industry Adoption

Apache Hive : It is more widely used in the industry than Apache Pig. 

Ad hoc Querying

Both can be used for ad hoc querying; Hive is more suitable than Pig if the data is structured.

Complex Business logic

If you have to develop applications with a lot of business complexity, it is better to use Apache Pig rather than Hive.

For the same reason, Pig is more widely used in research applications than Hive.

Let me know if you want to compare these two for any other use-case.

Error Categories in Apache Pig

When you are working with Apache Pig, you might see error codes along with an error description.
I would like to discuss the categories of those error codes so that it becomes easier to progress toward error resolution.

Apache Pig categorizes error codes into four groups: INPUT, BUG, USER ENVIRONMENT and REMOTE ENVIRONMENT.

If an error code is between 100 and 1999 (inclusive), it falls into the INPUT group.

For example:

Error code 1000 is thrown if the Pig Latin script input is not parseable, and error code 1005 is thrown when we try to describe a relation which does not have an input schema.

If an error code is between 2000 and 2999, it falls under the BUG group; all of these are runtime errors.

For example, error code 2009 is thrown when a copy operation fails.

If an error code is between 3000 and 4999, it falls under the USER ENVIRONMENT group.

For example, error code 4002 is thrown when the program fails to read data from a file because there is a problem in the user environment.

If an error code is between 5000 and 6999, it falls under the REMOTE ENVIRONMENT group.

For example, error code 6002 is thrown when the cluster runs out of memory.
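The four ranges above can be summed up in a small sketch; the class and method names here are my own, not part of Apache Pig.

```java
// Sketch of Pig's error-code grouping: maps a numeric error code to the
// group its range belongs to, per the ranges described above.
public class PigErrorGroups {
    public static String classify(int errorCode) {
        if (errorCode >= 100  && errorCode <= 1999) return "INPUT";
        if (errorCode >= 2000 && errorCode <= 2999) return "BUG";
        if (errorCode >= 3000 && errorCode <= 4999) return "USER ENVIRONMENT";
        if (errorCode >= 5000 && errorCode <= 6999) return "REMOTE ENVIRONMENT";
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        System.out.println(classify(1005)); // an unparseable/describe error -> INPUT
    }
}
```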

Hadoop Eco system research papers

As you know, the Hadoop ecosystem has many tools, and every tool is an implementation of a research paper. Of course, most of these research papers were written by Google employees. I would like to put most of these papers in one place in this article.

As you already know, Hadoop has two core modules, HDFS and MapReduce. These two are open source implementations of the Google systems GFS and MapReduce.

Below are their links .

1. GFS (The Google File System).

2. MapReduce: Simplified Data Processing on Large Clusters.

Apache Hive is a data warehouse created on top of Hadoop. It is an implementation of the paper Hive: A Petabyte Scale Data Warehouse Using Hadoop.

Apache Pig is a platform for analyzing large data sets using the data flow language Pig Latin. It is an implementation of the paper Pig Latin: A Not-So-Foreign Language for Data Processing.

Apache HBase is an open source implementation of Google's BigTable paper.

Apache Spark is an implementation of the paper Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.

Apache Tez is an implementation of the paper Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications.

Apache Crunch is an implementation of Google's FlumeJava paper.

Apache ZooKeeper is an implementation of the paper ZooKeeper: Wait-free Coordination for Internet-scale Systems.

YARN is an implementation of the paper Apache Hadoop YARN: Yet Another Resource Negotiator.

Apache Storm is an implementation of the paper Storm @Twitter.

Hope these papers are useful to you.

Intermediate data spill in Mapreduce

As we know, MapReduce has two stages: Map and Reduce. The Map stage is responsible for filtering and preparing the data, and the Reduce stage is responsible for aggregate and join operations. Map output is written to disk, and this operation is called spilling.
In this article, we discuss the important things that happen while data is spilled after the map stage.

Map output is first written to a buffer, and the buffer size is decided by the io.sort.mb property. By default, it is 100 MB.

When the buffer reaches a certain threshold, it starts spilling the buffer data to disk. This threshold is decided by io.sort.spill.percent.

Before the data is written to disk, it is divided into partitions with respect to the reducers.

Within each partition, an in-memory sort is performed by key.

If a combiner function is specified, it is run on the sorted data before each spill is written to disk. The min.num.spills.for.combine property decides how many spill files must exist before the combiner is also run again during the merge.

After a certain number of spill files have been written, they are merged into a single file.
The number of spills merged at once is decided by io.sort.factor.
By default, it is 10.
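As a back-of-the-envelope sketch, the point at which the first spill starts can be computed from the two properties above. The helper below is my own, assuming the Hadoop defaults of io.sort.mb = 100 and io.sort.spill.percent = 0.80.

```java
// Rough sketch: the byte threshold at which the map output buffer starts
// spilling, given io.sort.mb (in MB) and io.sort.spill.percent.
public class SpillThreshold {
    public static long thresholdBytes(int ioSortMb, double spillPercent) {
        return (long) (ioSortMb * 1024L * 1024L * spillPercent);
    }

    public static void main(String[] args) {
        // With the defaults (100 MB buffer, 0.80), spilling begins at ~80 MB.
        System.out.println(thresholdBytes(100, 0.80));
    }
}
```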

Below is a picture that depicts the flow; hopefully it helps you understand it better.

Data Flow while spilling map output

Big Data quotations

I would like to share some big data quotations from famous people around the world.

Hope you enjoy reading them.

Below are some quotations .

Without big data, you are blind and deaf and in the middle of a freeway. --Geoffrey Moore

The world is one big data problem. --Andrew McAfee

For every two degrees the temperature goes up, check-ins at ice cream shops go up by 2%. --Andrew Hogue, Foursquare

Data scientists are the new rock stars. --DJ Patil

The value of Big Data lies not in the technology itself, but in the real-world problems it can solve. --Jeff Hammerbacher

Data scientists have the skills and expertise to transform the planet for the better. --Jeremy Howard

Big Data could know us better than we know ourselves. --Dan Gardner

Today a street stall in Mumbai can access more information, maps, statistics, academic papers, price trends and futures markets than a U.S. president could just a few decades ago. --Juan Enriquez

We have reached a tipping point in history: today more data is being manufactured by machines, servers and cell phones than by people. --Michael E. Driscoll

Data is the new oil.

In God we trust; all others must bring data. --W. Edwards Deming

Also, please share your favorite big data quotation.

Environment for practising Hadoop

Most Hadoop aspirants face a common problem: how to set up all the Hadoop components so that they can practice ecosystem components like Hive, Pig and Oozie. Leading Hadoop providers already have a solution for this problem.
They have created virtual machines with all Hadoop components installed, and you can use them directly.
But these need to be set up on top of your existing operating system. In this article I am going to cover how to set up these VMs.
Below are the prerequisites for this setup.

1). RAM: 4 GB or more

2). Hard disk: 50 GB of free space

3). Virtualization technology

By default, most machines come with virtualization technology enabled. You can go to the BIOS settings to check; you should see a virtualization technology option under the advanced settings (this might vary by vendor). If you see it disabled, please enable it.

4). A 64-bit operating system

In this article I am talking about the vendors below.


1. Download Oracle VirtualBox

Download Oracle VirtualBox.
This will enable you to create virtual machines on top of your existing operating system.

2. Download the virtual machine image

Download a VM image that has all Hadoop components already set up, so you need not worry about installing them.
Leading vendors provide such VMs for this purpose.
Download either the Cloudera QuickStart VM or the Hortonworks Sandbox.
For both, download the VM built for Oracle VirtualBox.
If you use different virtualization software, choose the corresponding version.

3. Extract the VM image to a folder

4. Install VirtualBox

If you are on Windows, you can just double-click the installer and it takes care of everything; it is straightforward.

5. Set up the virtual machine

    5.1 Click on New
        Specify a name
        Select Type as Linux
        Select Version as Other Linux (64-bit)

    5.2 Allocate around 2 GB of RAM to the virtual machine.

    5.3 Point to the virtual image file extracted in the step above.

Now you will see one more VM added.

6. Start the virtual machine

Click on the Start button; it will start your VM, and you can use any Hadoop component now.

All the best. Happy Hadooping.

World wide big data job market

This weekend I did some research on big data jobs worldwide. For this analysis I took big data job postings from most of the major job portals for January 2015. I would like to share some of the insights I got; this might help big data job seekers.

Below are the top recruiting countries in the world for Jan 2015.

United States

The US tops this list, holding 15 percent of the jobs. Australia (7%) and Argentina (7%) take the next places, and India (4%) is in ninth place.


In India, below are the top states which recruited for big data in Jan 2015.

Karnataka
Maharashtra
Andhra Pradesh
Tamil Nadu
Uttar Pradesh
West Bengal

Karnataka tops this list with around 50 percent of the jobs. Maharashtra (16%) and Andhra Pradesh (8%) take the next places.

Below are the job titles/positions most recruited for in Jan 2015.

Data Scientist
Data Analyst
Big Data Architect
Big Data Engineer
Business Analyst
Senior Data Scientist
Data Engineer
Hadoop Developer
Big Data Hadoop / NoSQL Developer

The Data Scientist position tops the list with around 10 percent of the jobs.

In the US, below are the top states recruiting for big data.

California
New York
Texas

California tops this list with around 25% of the jobs, and New York (8%) and Texas (6%) take the next places.

Hope this article helps you to understand big data job market.

Good Luck for your big data job.

Word Count using Cascading

In this post we learn how to write a word count program using Cascading, and we run it in local mode on Windows within Eclipse. Cascading is a platform for developing big data applications on Hadoop. It has many benefits over other MapReduce-based tools, and it uses plumbing terminology (taps, pipes, etc.) for developing applications.

Assume we have input data in c:\data\in\data.txt:

This is a Hadoop input file
Hadoop is a bigdata technology

1. Define input details

We define input details like path and schema using a source tap; Cascading reads the data from the source tap. Here we use a FileTap to read data from the local file system; it takes a file path and a scheme (schema, or columns).

String inputPath = "c:\\data\\in\\data.txt";
Tap srctap = new FileTap( new TextLine( new Fields("line" )) , inputPath );

2. Convert lines into words.

Now we convert each line into words by applying a regex function.

RegexSplitGenerator splits a line into words using a whitespace delimiter. We use it within an Each pipe; "start" is the name of the pipe.

Pipe words=new Each("start",new RegexSplitGenerator("\\s+"));

3. Apply group by

We use the GroupBy pipe class to group the words.

Pipe group=new GroupBy(words);

We apply this GroupBy to our last pipe, words.

4. Calculate the count of words.

Count count=new Count();
Pipe wcount=new Every(group, count);

We use the Count aggregator of Cascading to generate the count of each word, applying it with an Every pipe on our last pipe, group. For our sample input, the output is:

a 2
bigdata 1
Hadoop 2
file 1
input 1
is 2
technology 1
This 1

5. Declare the output path.

Now we have generated the counts, and we have to write that output to a path.

String outputPath = "c:\\data\\out";
Tap sinkTap =new FileTap(  new TextLine( new Fields("word" ,"count")), outputPath, SinkMode.REPLACE );

Within the FileTap we have declared the columns word and count, the output path, and the sink mode as REPLACE: we are asking Cascading to replace any existing data at the output path.

6. Set properties.

We set the main class using the AppProps class.

Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, WordCount.class );

7. Create a flow

We create a flow by connecting the source tap, the sink tap and the last operation we performed, then call complete() to run it.

We use LocalFlowConnector to run the program in local mode.

LocalFlowConnector flowConnector = new LocalFlowConnector();
Flow flow = flowConnector.connect( "wordcount", srctap, sinkTap, wcount );
flow.complete();

Below are jar files required for running this program in eclipse.


Below is the complete program for the same.

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.local.LocalFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.local.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.local.FileTap;
import cascading.tuple.Fields;

/** Word count example in Cascading */
public class WordCount {
    public static void main(String[] args) {
        String inputPath = "c:\\data\\in\\data.txt";
        String outputPath = "c:\\data\\out";
        // Source and sink taps define where data is read from and written to.
        Tap srctap = new FileTap( new TextLine( new Fields("line") ), inputPath );
        Tap sinkTap = new FileTap( new TextLine( new Fields("word", "count") ), outputPath, SinkMode.REPLACE );
        // Split each line into words, group by word, and count each group.
        Pipe words = new Each("start", new RegexSplitGenerator("\\s+"));
        Pipe group = new GroupBy(words);
        Count count = new Count();
        Pipe wcount = new Every(group, count);
        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, WordCount.class );
        LocalFlowConnector flowConnector = new LocalFlowConnector( properties );
        Flow flow = flowConnector.connect( "wordcount", srctap, sinkTap, wcount );
        flow.complete(); // run the flow
    }
}


Word Count in Pig Latin

In this post, we learn how to write a word count program using Pig Latin.

Assume we have data in the file like below.

This is a hadoop post
hadoop is a bigdata technology

and we want to generate output with the count of each word, for example (hadoop,2) and (post,1).

Now we will see, step by step, how to generate the same using Pig Latin.


1. Load the data from HDFS

Use the LOAD statement to load the data into a relation.
The AS keyword is used to declare column names; since we don't have multiple columns, we declare only one column, named line. Note that input is a reserved word in Pig Latin, so we name the relation lines.

lines = LOAD '/path/to/file/' AS (line:chararray);

2. Convert the sentences into words.

The data we have is in sentences, so we have to convert it into words using the
TOKENIZE function.

If we have a delimiter like space, we can specify it as
(TOKENIZE(line,' '));

The output will be a bag of words per line, like this:

({(This),(is),(a),(hadoop),(post)})
({(hadoop),(is),(a),(bigdata),(technology)})

but we have to convert it into multiple rows, one word per row.

3. Convert columns into rows

We have to convert every line of data into multiple rows; for this Pig has a function called FLATTEN.

Using the FLATTEN function, the bag is converted into tuples, meaning the array of strings is converted into multiple rows.

words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line,' ')) AS word;

Then the output is one word per row:

(This)
(is)
(a)
(hadoop)
(post)
...

4. Apply GROUP BY

We have to count each word's occurrences, and for that we have to group the identical words.

grouped = GROUP words BY word;

5. Generate the word count

wordcount = FOREACH grouped GENERATE group, COUNT(words);

We can print the word count on the console using DUMP.

DUMP wordcount;

The output will be like this:

(a,2)
(is,2)
(This,1)
(post,1)
(hadoop,2)
(bigdata,1)
(technology,1)

Below is the complete program for the same.

lines = LOAD '/path/to/file/' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line,' ')) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);

You may check the same word count using Hive.