
7 reasons why we should use Hadoop Streaming with Unix


Hadoop Streaming is a utility that ships with Hadoop and lets you use any executable program as the mapper or reducer for big data analysis.
We can use languages like Java, Python, PHP, Scala, Perl and many more. It also supports Unix commands and shell scripts.
I have used Hadoop Streaming with Unix commands and shell scripts extensively, and I enjoyed it for several reasons.

I would like to share the benefits of Hadoop Streaming with Unix.

1. Availability

If you want to use a tool other than Java MapReduce, you will probably reach for Hive or Pig. If you go for Hive or Pig, you have to install and manage them separately (a vendor Hadoop distribution usually includes them by default). Hadoop Streaming with Unix, on the other hand, needs no separate installation.


2. Learning

You need not learn a new tool like Hive or Pig if you do not have a serious requirement. You can leverage your existing Unix skills for data analysis on Hadoop.

3. Less development time

To develop a Java MapReduce application, you have to compile your code, unit test it, package it, export a jar file and finally run it. Unlike Java MapReduce, you can quickly develop streaming applications with Unix by writing the mapper and reducer commands directly in the -mapper and -reducer options.
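
As a rough sketch of that workflow: before submitting a job, the mapper and reducer commands can be tried on a small sample with a plain pipe, because Streaming feeds input lines to the mapper, sorts the mapper output by key and feeds it to the reducer. The sample data, the `cut` and `uniq` commands, and the jar and HDFS paths in the closing comment are placeholders made up for illustration.

```shell
#!/bin/sh
# Hypothetical mapper and reducer, written as plain Unix commands.
mapper='cut -f1'     # keep the first tab-separated field as the key
reducer='uniq -c'    # count how often each (sorted) key occurs

# Imitate the map -> sort -> reduce flow locally with a pipe.
# stdin: sample input, $1: mapper command, $2: reducer command
simulate_streaming() {
    sh -c "$1" | sort | sh -c "$2"
}

# Made-up sample input: "<fruit><TAB><quantity>"
printf 'apple\t1\nbanana\t2\napple\t3\n' | simulate_streaming "$mapper" "$reducer"

# On a cluster the same strings would go into the streaming options, e.g.
#   hadoop jar /path/to/hadoop-streaming.jar \
#       -input /data/in -output /data/out \
#       -mapper 'cut -f1' -reducer 'uniq -c'
# (jar location and HDFS paths above are placeholders)
```

There is nothing to compile or package here; changing the job means editing two strings.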

4. Quick conversion

Because development time is short, we can quickly convert data from one format to another. I have used it heavily for converting data from text to sequence files and from sequence files to text. We can use the -inputformat and -outputformat options of Hadoop Streaming for this.
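
A minimal sketch of what such conversion jobs might look like. The helper functions are hypothetical and only print the command instead of running it. The jar path and HDFS directories are placeholders, and the format class names are the old `org.apache.hadoop.mapred` ones as I remember them, so please check them against your Hadoop version.

```shell
#!/bin/sh
# Placeholder: the streaming jar location differs between installations.
STREAMING_JAR="${STREAMING_JAR:-/path/to/hadoop-streaming.jar}"

# Print (not run) a map-only job that reads text and writes a sequence file.
# $1: HDFS input dir, $2: HDFS output dir
text_to_seq_cmd() {
    printf 'hadoop jar %s -input %s -output %s -mapper cat -reducer NONE -outputformat %s\n' \
        "$STREAMING_JAR" "$1" "$2" \
        org.apache.hadoop.mapred.SequenceFileOutputFormat
}

# Print (not run) the reverse job: sequence file in, plain text out.
seq_to_text_cmd() {
    printf 'hadoop jar %s -input %s -output %s -mapper cat -reducer NONE -inputformat %s\n' \
        "$STREAMING_JAR" "$1" "$2" \
        org.apache.hadoop.mapred.SequenceFileAsTextInputFormat
}

# Hypothetical HDFS paths; review the printed command before running it.
text_to_seq_cmd /data/text /data/seq
seq_to_text_cmd /data/seq /data/text_again
```

The mapper is just `cat` and there is no reducer, so the only work the job does is reading in one format and writing in the other.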

5. Testing data

For the same reason, we can quickly test input and output data by using Hadoop Streaming with Unix commands or a shell script.
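
For example, a quick data check could be sketched like this, assuming a made-up rule that every record must have exactly two tab-separated fields with a numeric second field. The function names and sample lines are hypothetical; on a cluster the mapper would live in a small script file shipped with the job.

```shell
#!/bin/sh
# Mapper: tag each line as valid or invalid according to the made-up rule
# "exactly 2 tab-separated fields, second field all digits".
check_mapper() {
    awk -F'\t' '{ if (NF == 2 && $2 ~ /^[0-9]+$/) print "valid"; else print "invalid" }'
}

# Reducer: count the lines per tag (input must be sorted by tag).
count_reducer() {
    uniq -c
}

# Made-up sample with two good and two bad records, checked locally.
printf 'apple\t1\nbroken line\nbanana\tx\ncherry\t7\n' | check_mapper | sort | count_reducer
```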


6. Simple business requirement

For simple business requirements, such as simple filtering and simple aggregation, we can always use Hadoop Streaming with Unix.
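
To illustrate, here is a small sketch of one filter and one aggregation, tried locally on made-up "city, amount" records. The function names and the data are hypothetical. `grep` plays the mapper of a map-only filtering job, and `awk` plays a reducer that sums amounts per key.

```shell
#!/bin/sh
# Made-up sample records: "<city><TAB><amount>"
sample() {
    printf 'pune\t10\ndelhi\t5\npune\t7\ndelhi\t1\nagra\t4\n'
}

# Filtering: keep only the records whose key starts with "pune".
filter_mapper() {
    grep '^pune'
}

# Aggregation: sum the amounts per key. Like a streaming reducer,
# it expects its input to arrive sorted by key.
sum_reducer() {
    awk -F'\t' '
        $1 != key { if (NR > 1) print key "\t" total; key = $1; total = 0 }
        { total += $2 }
        END { if (NR > 0) print key "\t" total }'
}

sample | filter_mapper
sample | sort | sum_reducer
```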

7. Performance

Finally, I have read somewhere that Hadoop Streaming with Unix performs better than Java MapReduce, Hive and Pig. I have not tested this myself, though.

So try Hadoop Streaming with Unix if you have any of the above requirements.

For more details on how to use Hadoop Streaming with Unix, read it.

Happy Hadooping Friends.

Parameter substitution in Pig

Earlier I discussed writing reusable scripts using Apache Hive; now let us see how to achieve the same functionality using Pig Latin.
Pig has an option called -param, which lets us write dynamic scripts.

Assume we have a file called numbers with the data below.
12
23
34
12
56
34
57
12
If we want to list the numbers equal to 12, we write Pig Latin code like below.


numbers = load '/data/numbers' as (number:int);

specificNumber = filter numbers by number==12;

Dump specificNumber;


Usually we write the above code in a file. Let us assume we have written it in a file called numbers.pig

And we run the code from the file using


pig -f /path/to/numbers.pig


Later, if we want to see only the numbers equal to 34, we change the second line to


specificNumber = filter numbers by number==34;


and we re-run the code using the same command.
But it is not good practice to touch the code in production, so we can make this script dynamic by using the -param option of Pig.
Whatever values we want to decide at run time, we make dynamic. Since we now want to decide the number to filter on when running the job, we can write the second line like below.


specificNumber = filter numbers by number==$dynanumber;


and we run the code like below.


pig -param dynanumber=12 -f numbers.pig
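
Once the number is a parameter, the same script can be driven from a small shell wrapper. This is only a sketch: the `run_filter` function and the `PIG` variable are names made up for this example, and it assumes numbers.pig is in the current directory.

```shell
#!/bin/sh
# Command used to launch Pig. Set PIG="echo pig" to print the
# commands instead of running them (handy for a preview).
PIG="${PIG:-pig}"

# Run numbers.pig for one value of the dynanumber parameter.
# $1: number to filter on
run_filter() {
    $PIG -param dynanumber="$1" -f numbers.pig
}

# Example (not executed here): run the same script for two values.
#   for n in 12 34; do run_filter "$n"; done
```

The script file itself is never edited; only the value passed on the command line changes.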


Assume we also want to pass the path when running the script. Now we write the code like below.


numbers = load '$path' as (number:int);

specificNumber = filter numbers by number==$dynanumber;

Dump specificNumber;


And run it like below


pig -param path=/data/numbers -param dynanumber=34 -f numbers.pig


If you feel this command is hard to read, we can put all these dynamic values in a file like below
## dyna.params (file name)


path = /data/numbers

dynanumber = 34


Then you can run the script with the -param_file option like below.


pig -param_file dyna.params -f numbers.pig
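
If the values come from somewhere else (a scheduler, another script), the parameter file can also be generated just before the run. A minimal sketch, where `write_params` is a hypothetical helper and the path and number are the example values used in this post:

```shell
#!/bin/sh
# Write a Pig parameter file.
# $1: file to create, $2: input path, $3: number to filter on
write_params() {
    cat > "$1" <<EOF
# generated parameters for numbers.pig
path = $2
dynanumber = $3
EOF
}

write_params dyna.params /data/numbers 34
cat dyna.params

# Then, on a machine with Pig installed (not executed here):
#   pig -param_file dyna.params -f numbers.pig
```

As far as I remember, Pig also has a -dryrun option that performs the substitution and writes out the resulting script without running it, which is useful for checking the values before a production run.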


As far as I know, Apache Hive has no direct equivalent of this parameter file option.

So what benefits do we gain from this feature?

1.       We can avoid hard coding in Pig scripts.
2.       We make scripts more reusable and dynamic.
3.       We get better productivity from reusable scripts.

Happy Hadooping friends.