Parameter substitution in Pig

Earlier I discussed writing reusable scripts in Apache Hive; now let us see how to achieve the same functionality in Pig Latin.
Pig has an option called -param, and using it we can write dynamic scripts.

Assume we have a file called numbers that holds one integer per line.
If we want to list the numbers equal to 12, we write Pig Latin code like below.

numbers = load '/data/numbers' as (number:int);

specificNumber = filter numbers by number==12;

Dump specificNumber;

Usually we write the above code in a file. Let us assume we have written it in a file called numbers.pig.

And we run the code from the file using

pig -f /path/to/numbers.pig

Later, if we want to see only the numbers equal to 34, we change the second line to

specificNumber = filter numbers by number==34;

and re-run the code using the same command.
But it is not good practice to touch code in production, so we can make this script dynamic by using the -param option of Pig.
Whatever values we want to decide at run time, we make dynamic. Here we want to decide the number to filter on when the job runs, so we write the second line like below.

specificNumber = filter numbers by number == $dynanumber;

and we run code like below.

pig -param dynanumber=12 -f numbers.pig
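
Before running a parameterized script for real, it can help to see what Pig will execute after substitution. The snippet below is a small sketch using Pig's -dryrun option, which substitutes the parameters without executing the script. The scratch directory and the check for pig on the PATH are my own additions for illustration. As far as I know, the substituted script is saved next to the original with a .substituted suffix, but please verify that on your version.

```shell
# Scratch directory for the example script (hypothetical location).
workdir="$(mktemp -d)"

# The quoted 'EOF' stops the shell from expanding $dynanumber itself.
cat > "$workdir/numbers.pig" <<'EOF'
numbers = load '/data/numbers' as (number:int);
specificNumber = filter numbers by number == $dynanumber;
dump specificNumber;
EOF

# -dryrun only performs the substitution; the job is not run.
if command -v pig >/dev/null 2>&1; then
    pig -dryrun -param dynanumber=12 -f "$workdir/numbers.pig"
else
    echo "pig not found on PATH; script written to $workdir/numbers.pig"
fi
```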

Suppose we also want to supply the input path when running the script. Then we write the code like below.

numbers = load '$path' as (number:int);

specificNumber = filter numbers by number == $dynanumber;

dump specificNumber;

And run it like below

pig -param path=/data/path -param dynanumber=34 -f numbers.pig

If you feel this command is hard to read, we can specify all these dynamic values in a file like below.
## dyna.params (file name)

path = /data/numbers

dynanumber = 34

Then you can run the script with the -param_file option like below.

pig -param_file dyna.params -f numbers.pig
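
To put the pieces together, here is a rough sketch that writes both the script and the parameter file and then calls Pig with -param_file. The scratch directory and the check for pig on the PATH are assumptions of mine for illustration, not part of the original setup. Lines starting with # in the parameter file are comment lines.

```shell
# Scratch directory for the example files (hypothetical location).
workdir="$(mktemp -d)"

# The parameterized script; the quoted 'EOF' keeps $path and $dynanumber literal.
cat > "$workdir/numbers.pig" <<'EOF'
numbers = load '$path' as (number:int);
specificNumber = filter numbers by number == $dynanumber;
dump specificNumber;
EOF

# The parameter file: one "name = value" pair per line.
cat > "$workdir/dyna.params" <<'EOF'
# input location
path = /data/numbers
# number to filter on
dynanumber = 34
EOF

# Run only if Pig is installed; otherwise just report where the files are.
if command -v pig >/dev/null 2>&1; then
    pig -param_file "$workdir/dyna.params" -f "$workdir/numbers.pig"
else
    echo "pig not found on PATH; files written to $workdir"
fi
```

To filter on another number or read another path, only dyna.params changes; numbers.pig stays untouched.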

Hive has no dedicated parameter-file option like this; with -hiveconf, every value is passed on the command line.

So what benefits do we gain from this feature?

1. We can avoid hard coding in Pig scripts.
2. Of course, we make scripts more reusable and dynamic.
3. We get better productivity from reusable scripts.

Happy Hadooping, friends.

Reusable scripts in Hive

If I want to see the top ten rows of a table (users) in Hive, I will write a query like below

select * from users limit 10;

I will save it to a file in Unix, say topn.q,

and run the query like below

hive -f topn.q

The problem with the above script is this.

The table name and the number are hard coded, so if we have the same requirement for a different table or a different number, we have to write a new script or modify the existing one.
For this reason we should write reusable scripts.
We can achieve this with the help of hiveconf in Hive.
hiveconf is handy for substituting variables in a Hive script at run time.
Let us learn how to avoid hard coding in the above script by using hiveconf.

Change the above script to the one below.

select * from ${hiveconf:tablename} limit ${hiveconf:number};

Save the above script in a file, for example dynatopN.q.

Now we can pass the table name and the number when running the query, like below.

hive -hiveconf tablename=users -hiveconf number=10 -f dynatopN.q

We can also change the table name and the number like below

hive -hiveconf tablename=movies -hiveconf number=20 -f dynatopN.q
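
Since the hiveconf values are dropped into the query as plain text, it may be worth checking them before they reach Hive. The following is only a sketch: valid_topn_args and run_topn are hypothetical helper names of mine, and the rule that a table name is made of letters, digits, and underscores is an assumption (it would reject a name like db.users).

```shell
# Hypothetical helper: succeed only if both arguments look safe to substitute.
valid_topn_args() {
    table="$1"
    count="$2"
    # Table name: letters, digits, underscores; must not be empty or start with a digit.
    case "$table" in
        ''|[0-9]*|*[!A-Za-z0-9_]*) return 1 ;;
    esac
    # Row count: one or more digits, nothing else.
    case "$count" in
        ''|*[!0-9]*) return 1 ;;
    esac
    return 0
}

# Hypothetical wrapper around the hive command shown above.
run_topn() {
    if ! valid_topn_args "$1" "$2"; then
        echo "usage: run_topn <table> <number>" >&2
        return 2
    fi
    hive -hiveconf tablename="$1" -hiveconf number="$2" -f dynatopN.q
}
```

With this in place, run_topn movies 20 runs the same command as above, while run_topn movies twenty stops with a usage message and never calls hive.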

We should rarely touch production code, so it is better to use hiveconf in production scripts as well.

To achieve the same in Pig, we use the -param option while running the script and the $ symbol inside the script.
As the number of parameters passed at run time grows, such scripts become hard to maintain in Hive. Pig provides one more option, -param_file, where you specify a file in which all parameter names and values are maintained.
So Pig is more flexible than Hive in this respect.
This approach is also recommended for production scripts. Once a query has run successfully, we should avoid touching it as much as possible.