Hadoop Lessons: Difference between Apache Hive and Apache Pig

MapReduce follows key-value programming model. It has two core stages Map and Reduce.
Both Map and Reduce have key-value as an input and key-value as an output. To write Map Reduce applications ,we need to know one programming language like Java.
These MapReduce applications will have a Map program , a reduce program and a driver program to run map and reduce programs.We need to create a jar containing these programs to process the data.

This Mapreduce has lengthy development time and may not be suitable for situations like adhoc querying. That is one of the reasons there are so many abstractions available for Mapreduce.
For example Cascading, Apache Crunch, Apache Hive and Apache Pig etc...All of these hide key-value complexity for developer. We will now discuss differences between Apache Hive and Apache Pig.

VS

Types of Data they support

Apache Hive :

Hive is a scalable data warehouse on top of Apache Hadoop. As data is available in tables it only supports structured data . processing semi structured data is difficult and processing unstructured data is very very difficult.

Apache Pig :

Pig is a platform for processing large data sets. Its query language is called Pig latin. Pig latin can process structured ,semi structured and unstructured data.

Programming model

Apache Hive : Hive query language is declarative programming language. It is not easy to build complex business logic.

Apache Pig : Pig Latin is an imperative programming language , You can easily write complex business logic.

Integration

Apache Hive : Hive has a component called HCatalog that provides cross platform schema.

It also has Rest API called WebHCatalog. So You can integrate any tool with Apache Hive.

Already Teradata, Aster Data got integrated with apache Hive. Even Pig can process data using WebHCatalog.

Apache Pig : It does not have any such feature. Because it is processing platform not a storage platform.

Debugging

Apache Hive : We can debug hive queries but not that easy.

Apache Pig : Pig Latin is a data flow language It is designed keeping debugging feature in mind.

So We can easily debug Pig Latin scripts.

Learning

Both can be easily learned . Hive is almost same as SQL. Pig Latin also looks like SQL .

One can easily learn hive and start writing queries to process data.

Industry Adoption

Apache Hive : It is more widely used in the industry than Apache Pig.

Adhoc Querying

Both can be used for adhoc querying Hive is more suitable than Pig if it is structured data.

Complex Business logic

If you have to develop applications that have so much business complexity. It is better to use Apache Pig rather than using Hive.

Pig is widely used in research applications than Hive for the same reason.

Let me know if you want to compare these two for any other use-case.

Technology

Search This Blog

Difference between Apache Hive and Apache Pig

VS