Environment for practising Hadoop

Most Hadoop aspirants face a common problem: how to set up all the Hadoop components so that they can practice ecosystem tools like Hive, Pig, and Oozie. Leading Hadoop vendors already have a solution for this.
They have created virtual machines with all the Hadoop components installed, which you can use directly.
These VMs, however, need to be set up on top of your existing operating system. In this article I am going to cover how to set up these VMs.
Below are the prerequisites for this setup.


1). RAM: 4 GB or more


2). Hard disk: 50 GB of free space


3). Virtualization technology

By default, most systems ship with virtualization technology enabled. You can check this in the BIOS settings: look for a virtualization technology option under the advanced settings (the exact location varies by vendor). If it is disabled, please enable it.


4). A 64-bit operating system


In this article I cover the following vendors:


Cloudera
Hortonworks


1. Download Oracle VirtualBox


Download Oracle VirtualBox.
This will enable you to create virtual machines on top of your existing operating system.


2. Download the Hadoop VM image


Download a VM image that already has all the Hadoop components set up, so you need not worry about installing them yourself.
Leading vendors provide such VM images for this purpose.
Download either the Cloudera QuickStart VM or the Hortonworks Sandbox.
For both, download the image built for Oracle VirtualBox.
If you use different virtualization software, choose the corresponding version.


3. Extract the VM image to a folder

4. Install VirtualBox


If you are on Windows, you can just double-click the installer and it will take care of everything; the installation is straightforward.




5. Set up the virtual machine

    5.1 Click on New.
        Specify a name.
        Select Type as Linux.
        Select Version as Other Linux (64-bit).


    5.2 Allocate around 2 GB of RAM to the virtual machine.




    5.3 Point to the virtual image file extracted in the step above.




Now you will see one more VM added.






6. Start the virtual machine


Click the Start button; it will start your virtual machine, and you can use any Hadoop component now.




All the best. Happy Hadooping.


Worldwide big data job market

This weekend I did some research on big data jobs worldwide. For this analysis I took big data job postings from most of the major job portals for January 2015. I would like to share some of the insights I got, which might help big data job seekers.

Below are the top countries that recruited for big data worldwide in Jan 2015.

United States
Australia
Argentina
Canada
Belgium
China
Brazil
India
France

The US tops this list, holding 15% of the jobs. Australia (7%) and Argentina (7%) take the next places, and India (4%) is in ninth place.

 



In India, below are the top states that recruited for big data in Jan 2015.


Karnataka
Maharashtra
Andhra Pradesh
Tamil Nadu
Uttar Pradesh
Delhi
Haryana
Gujarat
West Bengal

Karnataka tops this list with around 50% of the jobs. Maharashtra (16%) and Andhra Pradesh (8%) take the next places.













Below are the job titles/positions most frequently filled in Jan 2015.

Data Scientist
Data Analyst
Big Data Architect
Big Data Engineer
Business Analyst
Senior Data Scientist
Data Engineer
Hadoop Developer
Big Data Hadoop / NoSQL Developer


The Data Scientist position tops the list with around 10% of the jobs.

In the US, below are the top states recruiting for big data.

California
New York
Texas
Washington
Virginia
Illinois
Massachusetts
Maryland
Arizona

California tops this list with around 25% of the jobs, and New York (8%) and Texas (6%) take the next places.



















Hope this article helps you understand the big data job market.

Good luck with your big data job search.

Word Count using Cascading

In this post we learn how to write a word count program using Cascading, and we run it in local mode on Windows within Eclipse. Cascading is a platform for developing big data applications on Hadoop. It has many benefits over other MapReduce-based tools. Cascading uses plumbing terminology (taps, pipes, etc.) to describe applications.

Assume we have input data in c:\data\in\data.txt

This is a Hadoop input file
Hadoop is a bigdata technology 



1. Define input details

We define input details like the path and schema using a source tap. Cascading reads the data from the source tap. Here we use a FileTap to read data from a local file; it takes a file path and a scheme (the schema, or columns).

String inputPath = "c:\\data\\in\\data.txt";
Tap srctap = new FileTap( new TextLine( new Fields("line" )) , inputPath );


2. Convert lines into words

Now we convert each line into words by applying a regex split function.

This
is
a
Hadoop
input
file
Hadoop
is
a
bigdata
technology

The regex splits line data into words using whitespace as the delimiter. We use the RegexSplitGenerator function within an Each pipe; "start" is the name given to the pipe.

Pipe words=new Each("start",new RegexSplitGenerator("\\s+"));



3. Apply GroupBy

We use the GroupBy pipe class to group identical words together.

Pipe group=new GroupBy(words);

We apply this GroupBy to our previous pipe, words.



4. Calculate Count of words.

Count count=new Count();
Pipe wcount=new Every(group, count);

We use Cascading's Count aggregator here to generate the count of each word, applying it to our previous pipe, group, via an Every pipe.

a 2
bigdata 1
Hadoop 2
file 1
input 1
is 2
technology 1
This 1


5. Declare output path.

Now we have generated the counts, and we have to write that output to a path.

String outputPath = "c:\\data\\out";
Tap sinkTap =new FileTap(  new TextLine( new Fields("word" ,"count")), outputPath, SinkMode.REPLACE );

Within the FileTap we have declared the columns word and count, the output path, and the sink mode as REPLACE.

We are asking Cascading to replace any existing data at the output path.

6. Set properties.

We set the application's main class using the AppProps class.


Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, WordCount.class );


7. Create a flow

We create a flow by connecting the source tap, the sink tap, and the last pipe we created.

We use a LocalFlowConnector to run the program in local mode.

LocalFlowConnector flowConnector = new LocalFlowConnector();
Flow flow = flowConnector.connect( "wordcount", srctap, sinkTap, wcount );
flow.complete();

Below are the jar files required for running this program in Eclipse.

[Image: jars.png, showing the list of required jar files]

Below is the complete program for the same.

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.local.LocalFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.local.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.local.FileTap;
import cascading.tuple.Fields;

/** Word count example in Cascading. */
public class WordCount {

    public static void main(String[] args) {
        String inputPath = "c:\\data\\in\\data.txt";
        String outputPath = "c:\\data\\out";

        // Source tap: read lines from the local input file.
        Tap srctap = new FileTap(new TextLine(new Fields("line")), inputPath);
        // Sink tap: write (word, count) pairs, replacing any existing output.
        Tap sinkTap = new FileTap(new TextLine(new Fields("word", "count")), outputPath, SinkMode.REPLACE);

        // Split each line into words on whitespace.
        Pipe words = new Each("start", new RegexSplitGenerator("\\s+"));
        // Group identical words together.
        Pipe group = new GroupBy(words);
        // Count the occurrences within each group.
        Count count = new Count();
        Pipe wcount = new Every(group, count);

        Properties properties = new Properties();
        AppProps.setApplicationJarClass(properties, WordCount.class);

        // Connect the source tap, sink tap, and final pipe into a flow and run it in local mode.
        LocalFlowConnector flowConnector = new LocalFlowConnector();
        Flow flow = flowConnector.connect("wordcount", srctap, sinkTap, wcount);
        flow.complete();
    }
}











Word Count in Pig Latin

In this post, we learn how to write a word count program using Pig Latin.

Assume we have data in the file like below.

This is a hadoop class
hadoop is a bigdata technology

and we want to generate the count of each word as output, like below.
 

(a,2)
(is,2)
(This,1)
(class,1)
(hadoop,2)
(bigdata,1)
(technology,1)


Now we will see, step by step, how to generate this using Pig Latin.

 

1. Load the data from HDFS


Use the LOAD statement to load the data into a relation.
The AS keyword is used to declare column names; since our data has no separate columns, we declare a single column named line. Note that input is a reserved keyword in Pig, so we name the relation lines instead.

lines = LOAD '/path/to/file/' AS (line:chararray);
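
If you dump this relation at this point, each record is a single-field tuple holding one whole line of the file:

DUMP lines;

(This is a hadoop class)
(hadoop is a bigdata technology)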

2. Convert the sentences into words


The data we have is in sentences, so we have to convert it into words using the TOKENIZE function.


(TOKENIZE(line));
(or)
If we have a delimiter like space, we can specify it as:
(TOKENIZE(line,' '));


Output will be like this:


({(This),(is),(a),(hadoop),(class)})
({(hadoop),(is),(a),(bigdata),(technology)})


But we have to convert it into multiple rows like below:


(This)
(is)
(a)
(hadoop)
(class)
(hadoop)
(is)
(a)
(bigdata)
(technology)

 
3. Convert columns into rows

That is, we have to convert every line of data into multiple rows; for this, Pig has a function called FLATTEN.


Using the FLATTEN function, the bag is converted into tuples, meaning the array of strings is converted into multiple rows.


words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line,' ')) AS word;


Then the output is like below:


(This)
(is)
(a)
(hadoop)
(class)
(hadoop)
(is)
(a)
(bigdata)
(technology)


4. Apply GROUP BY


We have to count each word's occurrences; for that, we first group all identical words together.


Grouped = GROUP words BY word;
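
For our sample data, the Grouped relation pairs each distinct word with a bag of its occurrences. Dumping Grouped at this point would show something like the sketch below (the ordering of the tuples may differ):

(a,{(a),(a)})
(is,{(is),(is)})
(This,{(This)})
(class,{(class)})
(hadoop,{(hadoop),(hadoop)})
(bigdata,{(bigdata)})
(technology,{(technology)})

This is why COUNT(words) in the next step gives the number of occurrences of each word: it simply counts the tuples in each bag.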



5. Generate the word count
 
wordcount = FOREACH Grouped GENERATE group, COUNT(words);






We can print the word count on the console using DUMP.


DUMP wordcount;


Output will be like below.
 

(a,2)
(is,2)
(This,1)
(class,1)
(hadoop,2)
(bigdata,1)
(technology,1)


Below is the complete program for the same.

lines = LOAD '/path/to/file/' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line,' ')) AS word;
Grouped = GROUP words BY word;
wordcount = FOREACH Grouped GENERATE group, COUNT(words);
DUMP wordcount;
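
To try it out, you can save the script to a file (the name wordcount.pig below is chosen just for illustration) and run it with the -f option, the same way the scripts in the macro post below are run:

pig -f /path/to/wordcount.pig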

You may also check out the same word count using Hive.

Writing Macro in Pig Latin

We can also develop more reusable scripts in Pig Latin using macros. A macro is a kind of function written in Pig Latin.

We will learn about macros in this post.

We will take sample employee data like below.

eno,ename,sal,dno
10,Balu,10000,15
15,Bala,20000,25
30,Sai,30000,15
40,Nirupam,40000,35

Using the above data, I would like to get the employees who belong to department number 15.

First we will write the Pig Latin code without using a macro.


1. Example without Macro

Write the below code in a file called filterwithoutmacro.pig.

emp = load '/data/employee' using PigStorage(',') as (eno,ename,sal,dno);
empdno15 = filter emp by $3==15;
dump empdno15;

Run the Pig Latin code from the file:

pig -f /path/to/filterwithoutmacro.pig

Now we will create a macro for the filter logic.

2. Same example with a macro

DEFINE is the keyword used to create a macro; the definition also has a RETURNS clause.

The relation/variable named in the RETURNS clause should be the last relation assigned within the macro body.

2.1 Create a macro

DEFINE myfilter(relvar,colvar) returns x{
$x = filter $relvar by $colvar==15;
}

The above macro takes two inputs: a relation variable (relvar) and a column variable (colvar).

The macro filters the relation, keeping the rows where colvar equals 15.

2.2 Usage of the macro

We can use the myfilter macro like below.

emp = load '/data/employee' using PigStorage(',') as (eno,ename,sal,dno);
empdno15 = myfilter(emp, dno);
dump empdno15;
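
On the sample data above, and assuming the header row (eno,ename,sal,dno) is not actually present in the data file, the dump would print roughly:

(10,Balu,10000,15)
(30,Sai,30000,15)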

We can write the macro definition and the macro usage in the same file, and run the file with the -f option.

pig -f /path/to/myfilterwithembeddedmacro.pig



3. Same example with an external macro

The macro code can even live in a separate file, so that we can reuse it in different Pig Latin scripts.

To use an external macro file in Pig Latin code, we use the IMPORT statement.

3.1 Write the above macro in a separate file called myfilter.macro

--myfilter.macro
DEFINE myfilter(relvar,colvar) returns x{
$x = filter $relvar by $colvar==15;
}

3.2 Import the macro file in another Pig Latin script file.


IMPORT '/path/to/myfilter.macro';
emp = load '/data/employee' using PigStorage(',') as (eno,ename,sal,dno);
empdno15 = myfilter(emp, dno);
dump empdno15;


And we run the Pig Latin script file using the -f option:

pig -f /path/to/myfilterwithexternalmacro.pig


So what is the use of a macro?

We can use the macro as many times as we wish, and on different inputs too.

For example, with the above data, suppose I want the details of the employee whose employee number is 15.

Then we can write Pig Latin code like below.



IMPORT '/path/to/myfilter.macro';
emp = load '/data/employee' using PigStorage(',') as (eno,ename,sal,dno);
eno15 = myfilter(emp, eno);
dump eno15;


So, to conclude: we can write highly reusable scripts in Pig Latin using macros.

Also visit the official Pig documentation on macros.

To some extent, we can also write reusable scripts in Pig Latin using parameter substitution, as sketched below.
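
Here is a minimal sketch of parameter substitution; the script name filterbyparam.pig and the parameter name dnum are chosen here just for illustration:

-- filterbyparam.pig (hypothetical file name)
emp = load '/data/employee' using PigStorage(',') as (eno,ename,sal,dno);
-- $dnum is replaced with the value supplied on the command line before the script runs
filtered = filter emp by dno == $dnum;
dump filtered;

The value of $dnum is supplied at run time with the -param option:

pig -param dnum=15 -f /path/to/filterbyparam.pig

Unlike a macro, the substitution happens once per run, so this suits scripts whose inputs change between runs rather than logic you want to reuse several times within one script.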