Difference between Apache Hive and Apache Pig

MapReduce follows a key-value programming model. It has two core stages, Map and Reduce.
Both Map and Reduce take key-value pairs as input and produce key-value pairs as output. To write MapReduce applications, we need to know a programming language such as Java.
A MapReduce application consists of a map program, a reduce program, and a driver program that runs the two. We need to package these programs into a jar to process the data.
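
To make the key-value model concrete, here is a minimal sketch of a word count written as plain Python functions. It only imitates the map, group-by-key, and reduce steps in memory; it is not Hadoop code, and the names map_phase, reduce_phase, and run_job are made up for this illustration.

```python
from collections import defaultdict


def map_phase(key, line):
    """Map: take a (line_number, line) pair and emit (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_phase(word, counts):
    """Reduce: take a (word, [1, 1, ...]) pair and emit (word, total)."""
    yield word, sum(counts)


def run_job(lines):
    """Driver: run map, group the emitted values by key, then run reduce."""
    grouped = defaultdict(list)
    for line_number, line in enumerate(lines):
        for word, count in map_phase(line_number, line):
            grouped[word].append(count)

    result = {}
    for word in sorted(grouped):
        for out_key, out_value in reduce_phase(word, grouped[word]):
            result[out_key] = out_value
    return result


if __name__ == "__main__":
    print(run_job(["hive and pig", "pig latin and hive"]))
```

Even for a task this small, the developer has to think in terms of keys and values at every step. That is the complexity the abstractions discussed in this post try to hide.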

MapReduce has a lengthy development time and may not be suitable for situations like ad hoc querying. That is one of the reasons so many abstractions are available on top of MapReduce,
for example Cascading, Apache Crunch, Apache Hive, and Apache Pig. All of these hide the key-value complexity from the developer. We will now discuss the differences between Apache Hive and Apache Pig.



Apache Hive vs. Apache Pig






Types of Data they support


Apache Hive :

Hive is a scalable data warehouse on top of Apache Hadoop. Because data is organized in tables, it is designed primarily for structured data. Processing semi-structured data is difficult, and processing unstructured data is very difficult.

Apache Pig :

Pig is a platform for processing large data sets. Its language is called Pig Latin. Pig Latin can process structured, semi-structured, and unstructured data.



Programming model


Apache Hive : The Hive query language (HiveQL) is a declarative language. You describe the result you want, and it is not easy to express complex business logic in it.

Apache Pig : Pig Latin is a procedural (imperative-style) data flow language. You write the processing step by step, so complex business logic is easier to express.
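
The difference between the two styles can be sketched in Python as an analogy. In the sketch below, an in-memory SQLite query stands in for the declarative style of HiveQL, and a chain of named intermediate results stands in for the step-by-step style of Pig Latin. Neither function is Hive or Pig code, and the ORDERS data and function names are assumptions made up for this example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer, order amount).
ORDERS = [("alice", 120.0), ("bob", 40.0), ("alice", 30.0), ("carol", 75.0)]


def declarative_totals(orders, minimum):
    """Declarative style: state the result wanted; the engine plans the steps."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    rows = conn.execute(
        "SELECT customer, SUM(amount) AS total FROM orders "
        "GROUP BY customer HAVING SUM(amount) >= ? ORDER BY customer",
        (minimum,),
    ).fetchall()
    conn.close()
    return rows


def dataflow_totals(orders, minimum):
    """Step-by-step style: each named intermediate result feeds the next step."""
    # Step 1: group the order amounts by customer.
    grouped = defaultdict(list)
    for customer, amount in orders:
        grouped[customer].append(amount)
    # Step 2: compute one total per customer.
    totals = [(customer, sum(amounts)) for customer, amounts in grouped.items()]
    # Step 3: keep only customers at or above the minimum.
    filtered = [row for row in totals if row[1] >= minimum]
    # Step 4: order the output by customer name.
    return sorted(filtered)


if __name__ == "__main__":
    print(declarative_totals(ORDERS, 70))
    print(dataflow_totals(ORDERS, 70))
```

Both functions return the same rows. In the step-by-step version every intermediate result has a name, so extra logic can be inserted between any two steps, which is the property that makes a data flow language convenient for complex business logic.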


Integration


Apache Hive : Hive has a component called HCatalog that provides a schema shared across platforms.
It also has a REST API called WebHCat, so external tools can integrate with Apache Hive.
Teradata and Aster Data have already been integrated with Apache Hive. Even Pig can read and write Hive tables through HCatalog.

Apache Pig : It does not have any such feature, because it is a processing platform, not a storage platform.



Debugging


Apache Hive : We can debug Hive queries, but it is not that easy.

Apache Pig : Pig Latin is a data flow language designed with debugging in mind. Diagnostic operators such as DUMP, DESCRIBE, EXPLAIN, and ILLUSTRATE let us inspect each step, so we can easily debug Pig Latin scripts.


Learning


Both can be learned easily. HiveQL is almost the same as SQL. Pig Latin also borrows SQL-like keywords.

Anyone who knows SQL can quickly learn Hive and start writing queries to process data.

Industry Adoption


Apache Hive : It is more widely used in the industry than Apache Pig. 


Adhoc Querying


Both can be used for ad hoc querying. Hive is more suitable than Pig if the data is structured.


Complex Business logic


If you have to develop applications with complex business logic, it is better to use Apache Pig than Hive.

Pig is more widely used than Hive in research applications for the same reason.

Let me know if you want to compare these two for any other use-case.






