Writing Macro in Pig Latin

We can write more reusable scripts in Pig Latin by using macros. A macro is a kind of function written in Pig Latin.

We will learn about macros in this post.

We will use sample emp data like below.

eno,ename,sal,dno
10,Balu,10000,15
15,Bala,20000,25
30,Sai,30000,15
40,Nirupam,40000,35

Using the above data, I would like to get the employees who belong to department number 15.

First we will write Pig Latin code without using a macro.


1. Example without Macro

Write the below code in a file called filterwitoutmacro.pig.

emp = load '/data/employee' using PigStorage(',') as (eno,ename,sal,dno);
empdno15 = filter emp by $3 == 15;
dump empdno15;

Run the Pig Latin code from the file.

pig -f /path/to/filterwitoutmacro.pig

Now we will create a macro for the filter logic.

2. Same example with a macro

DEFINE is the keyword used to create a macro. The definition also has a RETURNS clause.

The relation named in the RETURNS clause must be assigned inside the macro body, where it is referenced with a $ prefix (for example $x). By convention it is the last relation in the macro code.

2.1 Create a macro

DEFINE myfilter(relvar,colvar) returns x {
$x = filter $relvar by $colvar == 15;
};

The above macro takes two inputs: a relation (relvar) and a column (colvar).

The macro keeps the rows whose colvar equals 15.

2.2 Usage of the macro

We can use the myfilter macro like below.

emp = load '/data/employee' using PigStorage(',') as (eno,ename,sal,dno);
empdno15 = myfilter(emp, dno);
dump empdno15;

We can write the macro definition and the macro usage in the same file and run the file with the -f option.

pig -f /path/to/myfilterwithembeddedmacro.pig



3. Same example with an external macro

The macro code can also live in a separate file, so that we can use it in different Pig Latin scripts.

To use an external macro file in Pig Latin code, we use the IMPORT statement.

3.1 Write the above macro in a separate file called myfilter.macro

--myfilter.macro
DEFINE myfilter(relvar,colvar) returns x {
$x = filter $relvar by $colvar == 15;
};

3.2 Import the macro file in another Pig Latin script file


IMPORT '/path/to/myfilter.macro';
emp = load '/data/employee' using PigStorage(',') as (eno,ename,sal,dno);
empdno15 = myfilter(emp, dno);
dump empdno15;


Then we run the Pig Latin script file using the -f option.

pig -f /path/to/myfilterwithexternalmacro.pig


So what is the use of a macro?

We can use a macro as many times as we wish, and on different inputs.

For example, with the above data, suppose I want the details of the employee whose employee number is 15.

Then we can write Pig Latin code like below.



IMPORT '/path/to/myfilter.macro';
emp = load '/data/employee' using PigStorage(',') as (eno,ename,sal,dno);
eno15 = myfilter(emp, eno);
dump eno15;
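
For readers who think in Java, the reuse shown above is roughly what a generic helper method gives you: the relation and the column are arguments, while the comparison value 15 stays fixed inside the body. The sketch below is only an analogy and says nothing about how Pig expands macros internally. The names MacroAnalogy, Emp, myFilter and sampleEmp are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

class MacroAnalogy {

    // Hypothetical row type mirroring the sample emp data (eno, ename, sal, dno).
    static final class Emp {
        final int eno;
        final String ename;
        final int sal;
        final int dno;

        Emp(int eno, String ename, int sal, int dno) {
            this.eno = eno;
            this.ename = ename;
            this.sal = sal;
            this.dno = dno;
        }
    }

    // Plays the role of the myfilter macro: relvar is the input list,
    // colvar selects the column, and 15 is hard-coded as in the macro body.
    static <T> List<T> myFilter(List<T> relvar, ToIntFunction<T> colvar) {
        List<T> x = new ArrayList<>();
        for (T row : relvar) {
            if (colvar.applyAsInt(row) == 15) {
                x.add(row);
            }
        }
        return x;
    }

    // The four sample rows from the top of this post.
    static List<Emp> sampleEmp() {
        return Arrays.asList(
                new Emp(10, "Balu", 10000, 15),
                new Emp(15, "Bala", 20000, 25),
                new Emp(30, "Sai", 30000, 15),
                new Emp(40, "Nirupam", 40000, 35));
    }

    public static void main(String[] args) {
        List<Emp> emp = sampleEmp();
        // Same helper, two different columns, like myfilter(emp, dno) and myfilter(emp, eno).
        for (Emp e : myFilter(emp, r -> r.dno)) {
            System.out.println("dno=15: " + e.ename);
        }
        for (Emp e : myFilter(emp, r -> r.eno)) {
            System.out.println("eno=15: " + e.ename);
        }
    }
}
```

With the sample rows, filtering on dno keeps Balu and Sai, and filtering on eno keeps only Bala.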


That concludes the post: we can write highly reusable scripts in Pig Latin using macros.

Also see the official Pig documentation on macros.

To some extent, we can also write reusable scripts in Pig Latin using parameter substitution.




User Defined Functions in Hive

Hive has three types of functions. The first is the single-row function, which operates on one row at a time.
The second is the multi-row function, which operates on multiple rows at a time. The third is the table-generating function, which generates multiple rows out of a single row.
Hive has a good number of built-in functions in these categories. You can list all of them using

show functions;

If you want to understand one particular function, like concat, you can use

describe function concat;

It displays a small help page for the concat function.

However, sometimes you may need to write your own function if you do not find a suitable built-in one.

These custom functions can be of three types:

1. Single row function (UDF = User Defined Function)
2. Multi row function (UDAF = User Defined Aggregate Function)
3. Table generating function (UDTF = User Defined Table generating Function)

In this post, we learn how to develop a UDF in Hive.

Assume we have a table emp with data like below.

eno,ename,sal,dno

10,Balu,100000,15
20,Bala,200000,25
30,Sai,200000,35
40,Nirupam,300000,15

In this post we develop a custom function which prepends Hi to the employee name.

Below are the steps for the same.


1. Write a UDF by extending the UDF class using Eclipse

To develop a UDF, we extend the UDF class from hive-exec.jar and define an evaluate method in it. Hive locates evaluate by name, so the method is defined rather than overridden.

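import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class HiPrepender extends UDF {

    // Hive finds this method by name and calls it once per row.
    public Text evaluate(Text column) {
        // Return null for null or empty names instead of a bare "Hi ".
        if (column != null && column.getLength() > 0) {
            return new Text("Hi " + column.toString());
        }
        return null;
    }

}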
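
Before packaging the jar, it can help to check the null and empty-string handling without any Hive or Hadoop jars on the classpath. The sketch below repeats the same rule on plain String values. HiPrependerLogic and prependHi are hypothetical names used only for this check and are not part of the UDF itself.

```java
class HiPrependerLogic {

    // Same rule as evaluate(): null or empty input yields null,
    // anything else gets the "Hi " prefix.
    static String prependHi(String column) {
        if (column != null && column.length() > 0) {
            return "Hi " + column;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(prependHi("Balu"));
        System.out.println(prependHi(""));
        System.out.println(prependHi(null));
    }
}
```

Running it prints Hi Balu followed by null twice, which is what the UDF returns for a normal name, an empty name and a NULL column.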

For this you need three jar files on the classpath:

hadoop-core*.jar
hive-exec*.jar
apache-commons*.jar



2. Create a jar file for the above program

File -> Export -> JAR file -> specify file path for jar -> Next -> do not select main class -> Finish

Assume you created a jar file named hiprepender.jar.

3. Transfer the jar file to the Unix box using FileZilla/WinSCP, if you are not already on it

If you are on another operating system like Windows, you have to transfer the jar to the machine from which you run Hive queries.

Assume you have transferred your jar file to the /root directory.


4. From the Hive prompt, add the jar file to your classpath

hive> add jar /root/hiprepender.jar;

5. Create a temporary function

create temporary function prependhi as 'HiPrepender';

Here HiPrepender is the class name we wrote in the first step.

6. Use the custom function

select prependhi(ename) from emp;

You will get output like below.

Hi Balu
Hi Bala
Hi Sai
Hi Nirupam

In coming articles we will learn about UDAFs and UDTFs.

Loading data into Hive Table

We can load data into a Hive table in three ways. Two of them are DML operations of Hive; the third uses an hdfs command. If we have data in an RDBMS such as Oracle, MySQL, DB2 or SQL Server, we can import it using the Sqoop tool, but we are not discussing that here.

To practice the below commands, create a table called employee with the below data.

eno,ename,salary,dno

11,Balu,100000,15
12,Radha,120000,25
13,Nitya,150000,15
14,Sai Nirupam,120000,35



1. Using Insert Command

We can load data into a table using the Insert command in two ways: one uses the Values clause and the other uses a query.

1.1 Using Values

Using the Values clause, we can append more rows of data to an existing table.
For example, to the above employee table we can add an extra row 15,Bala,150000,35 like below.

Insert into table employee values (15,'Bala',150000,35);

After this you can run a select command to see the newly added row.

1.2 Using Queries

You can also insert query output into a table. For example, assume you have an emp table; from it, you can load data into the employee table like below.

Insert into table employee Select * from emp where dno=45;

After this you can again run a select query to see the inserted rows.

2. Using Load

You can load data into a Hive table using the Load statement in two ways: from the local file system (LFS) to a Hive table, and from HDFS to a Hive table.

2.1 From LFS to Hive Table

Assume we have data like below in an LFS file called /data/empnew.csv.

15,Bala,150000,35

Now we can use the load statement like below.

Load data local inpath '/data/empnew.csv' into table emp;

2.2 From HDFS to Hive Table

If we do not use the local keyword, Hive treats the path as an HDFS path. Note that in this case the file is moved, not copied, into the table's directory.

Load data inpath '/data/empnew.csv' into table emp;

After these two statements you can run a select query to see the rows loaded into the table.

3. Using HDFS command



Assume you have data in a local file. You can simply upload it using hdfs commands.

Run the describe command to get the location of the table like below.

describe formatted employee;

It displays the location of the table. Assume you got the location as /data/employee; you can upload data into the table using one of the below commands.

hadoop fs -put /path/to/localfile /data/employee

hadoop fs -copyFromLocal /path/to/localfile /data/employee

hadoop fs -moveFromLocal /path/to/localfile /data/employee








Managed table and External table in Hive

There are two types of tables in Hive: managed tables and external tables.
The difference shows when you drop a table. For a managed table, Hive deletes both the data and the metadata; for an external table, Hive deletes only the metadata.
Now we will learn a few things about these two.

1. Table Creation

By default, a table is a managed table.
If you want to create an external table, you use the external keyword.

For example, assume you have an emp.csv file under the directory /data/employee.

To create a managed table we use the normal syntax like below.

create table managedemp(col1 datatype,col2 datatype, ....) row format delimited fields terminated by 'delimiter character'
location '/data/employee';

But to create an external table, we use the external keyword like below.

create external table externalemp(col1 datatype,col2 datatype, ....) row format delimited fields terminated by 'delimiter character'
location '/data/employee';

2. Differentiation

How do you check whether an existing table is a managed or an external table?

To check that, we use the describe command like below.

describe formatted tablename;

It displays the complete metadata of a table. You will see one row called Table Type, which displays either MANAGED_TABLE or EXTERNAL_TABLE.

For example, if it is a managed table, you will see

Table Type:             MANAGED_TABLE

If it is an external table, you will see

Table Type:             EXTERNAL_TABLE


3. Drop

As I already said, if you drop a managed table both data and metadata are deleted; if you drop an external table only the metadata is deleted. An external table is therefore a way to protect data against accidental drop commands.

You can check this with the below process.

Use the describe formatted tablename command; it gives location details like below.

Location :hdfs://namenodeip:portno/data/employee

After dropping the table, if you run
hadoop fs -ls hdfs://namenodeip:portno/data/employee
you should get a No such file or directory error in the case of a managed table,
or the contents of that directory in the case of an external table.

One last point: try to use external tables in your project, and once you drop one, do not forget to remove its directory if you do not need the data anymore.

Word Count In Hive


In this post I am going to discuss how to write a word count program in Hive.

Assume we have data in our table like below

This is a Hadoop Post
and Hadoop is a big data technology

and we want to generate word count like below

a 2
and 1
big 1
data 1
Hadoop 2
is 2
Post 1
technology 1
This 1

Now we will learn how to write the query for the same.


1. Convert sentences into words

The data we have is in sentences. First we have to convert it into words, using space as the delimiter. For this we use the split function of Hive.

split(sentence, ' ')


2. Convert the array into rows

Now we have an array of strings like this
[This,is,a,Hadoop,Post]
but we have to convert it into multiple rows like below

This
is
a
Hadoop
Post

That is, we have to convert every line of data into multiple rows. For this Hive has a function called explode, which is a table-generating function.

SELECT explode(split(sentence, ' ')) AS word FROM texttable

We then use the above output as an intermediate table.

(SELECT explode(split(sentence, ' ')) AS word FROM texttable)tempTable

After the second step you should get output like below (sorted here for readability).

a
a
and
big
data
Hadoop
Hadoop
is
is
Post
technology
This


3. Apply group by


After the second step it is straightforward: we apply group by to count word occurrences.

select word, count(1) as count from
(SELECT explode(split(sentence, ' ')) AS word FROM texttable) tempTable
group by word;
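
To make the three steps easier to follow, here is a minimal plain-Java sketch of the same pipeline: split each sentence on spaces, flatten the arrays into one stream of words (the role explode plays), then group by word and count. It is an illustration of the logic only and does not run inside Hive. WordCountSketch and wordCount are hypothetical names, and the sorted TreeMap is a choice made here for stable output, not something Hive guarantees.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class WordCountSketch {

    // Mirrors: select word, count(1) from (explode(split(sentence, ' '))) group by word
    static Map<String, Long> wordCount(List<String> sentences) {
        return sentences.stream()
                // split(sentence, ' ') followed by explode: one element per word
                .flatMap(s -> Arrays.stream(s.split(" ")))
                // group by word, count(1); TreeMap keeps the keys sorted
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // The two sample rows of the (assumed) texttable.
        List<String> texttable = Arrays.asList(
                "This is a Hadoop Post",
                "and Hadoop is a big data technology");
        wordCount(texttable).forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```

For the two sample sentences this gives a count of 2 for a, is and Hadoop, and 1 for every other word, matching the expected result at the top of the post.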

Cascading for your next hadoop project



Cascading is a platform for developing data applications on Hadoop. It can process all types of data: structured, unstructured and semi-structured. It can be used for most business analytics requirements. It is written in Java on top of MapReduce, and it also has variants supporting Python, Ruby, Clojure and Scala.
In this article, I would like to share a few benefits of using Cascading in your big data projects.




1. Need not think in terms of keys and values


The biggest problem with using MapReduce is that, apart from the business logic, you have to think in terms of keys and values.
MapReduce is a very low-level API. I feel that, most of the time, developing data applications using MapReduce is like studying mechanical engineering in order to learn driving. That is the reason MapReduce-based tools like Hive and Pig are widely adopted, and Cascading can be used for the same reason. You need not think in terms of the key-value programming paradigm; you can focus on business logic.



2. Pure java


When you use MapReduce tools like Hive or Pig and want to build complex business logic, you again have to depend on UDFs, which require a programming language like Java or Python. So rather than using Hive and Java, or Pig and Java, for your project, you can depend on a single tool like Cascading and write your entire code in one programming language, such as Java.


3. Rapid application development


In MapReduce, you write a separate program for the mapper, a separate program for the reducer and one driver program, so you write more lines of code.
In Cascading, you write only the business logic, so you have fewer lines of code. As you also have built-in functions, you can rapidly develop data applications. In MapReduce there is no concept of built-in analytical functions, and you end up writing a lot of code.



4. Customizable


Though it is built on top of MapReduce, it allows you to customize the API as per user requirements.


5. Easy Integration


We have many technologies in the big data space, like Hadoop, Hive, Sqoop, Oozie, Cassandra, HBase, Solr, Elasticsearch, Teradata and Splunk, and RDBMS systems like Oracle, MySQL and Postgres. Fortunately, Cascading makes it easy to integrate with all of them.




6. Proven in production


It is being used by many companies including Twitter.



7. Very good documentation


Cascading provides good documentation in the form of tutorials and a user guide.
You can easily start learning it; it might not take more than one week to start your own application.





8. Testable code



Last but not least, if you go for Hive or Pig you may not be able to test your code, whereas Cascading is suitable for test-driven development.
You can confidently deliver quality applications using Cascading.

With all these benefits, I think you can easily consider Cascading for your next Hadoop project.


Learning Redis on Windows

Redis (remote dictionary server) is a key-value store and a kind of NoSQL database. There are many key-value data stores, and Redis is a widely used one. It is also known as a data structure server because it supports many data structures, such as sets and lists, which can also be used as stacks and queues. I explored it to implement a real-time analytics project along with Apache Storm, and you can find other use cases of Redis here. In this tutorial I would like to share how to learn and practice Redis on Windows.


1. Download Redis from here and install it by running the .exe file


Remember, it is not an official version. It may only be useful for practicing on Windows.


2. Start redis server



C:\Program Files\Redis>redis-server.exe conf/redis.conf


3. Start redis-cli



C:\Program Files\Redis>redis-cli.exe
redis 127.0.0.1:6379>


4. Practice Redis like below


redis 127.0.0.1:6379> set name "balu"
OK
redis 127.0.0.1:6379> get name
"balu"
redis 127.0.0.1:6379>


For more Redis commands, visit the Redis website.
You can also practice Redis here without installing it.
Enjoy practicing Redis on Windows.