
How to create a new Hive table using select query output

In this article, we will learn how to create a new table in Hive from the output of another query.

We use the create table as select statement to create a new table from a select query's output.

Syntax is :

create table [table-name] as [select-query]

Practise the steps below to understand this feature better.

1) Create a Hive table called employee using this article.

2) Run a select query to get the deptno-wise employee count on the employee table.

select deptno,count(name) from employee group by deptno;



If you want to persist the above query output in a Hive table, you have to use a create table as select (CTAS) query.

3) Create a new table called deptnowiseempcount using the above select query output.

create table deptnowiseempcount as
select deptno,count(name) from employee group by deptno;




Table deptnowiseempcount is a managed table.

CTAS has some restrictions, listed below.


  • The new table cannot be an external table.

  • The new table cannot be a partitioned table.

  • The new table cannot be a bucketed table.


4) The query below fails because it tries to create an external table using a CTAS statement.

 create external table deptnowiseempcount as select deptno,count(name) from employee group by deptno;



The query fails with the error: CREATE-TABLE-AS-SELECT cannot create external table.
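If you need the query output in an external table, one workaround (a minimal sketch, assuming the same employee table; the table name and HDFS path below are hypothetical) is to create the external table first and then insert the query output into it:

create external table deptnowiseempcount_ext(deptno int, empcount bigint)
row format delimited fields terminated by ','
location '/user/hdfs/deptnowiseempcount_ext';   -- hypothetical HDFS path

insert into table deptnowiseempcount_ext
select deptno, count(name) from employee group by deptno;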





Creating a new Hive table with the same schema as another Hive table.

In this article we will learn how to create a table with the same schema as another table.

We use the like clause in Apache Hive to copy the schema from another table.

Syntax is : create table [new-table-name] like [old-table-name];

Practise the steps below to understand better.

1) Create a new table called employee in Hive using this article.

2) Check the schema of Hive tables employee using describe command.

describe employee;



3) Create a new table called employee_india with the same column names and data types as the employee table.

create table employee_india like employee;



4) Now check column names and their data types of table employee_india.

describe employee_india;


We can copy a schema not only from an existing table, but also from an existing view. The syntax is the same; we just have to replace the old table name with the view name.
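For example, assuming a view named employee_view exists (a hypothetical name), the statement below creates an empty table with the view's columns:

create table employee_report like employee_view;   -- employee_view is a hypothetical view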

How to change the value of a property in Apache Hive?


This article discusses various ways to change a property value in Apache Hive.


1) Changing the property value from the Hive prompt or Beeline prompt using the set command.

This property value applies only for the current session. If we quit the prompt, we will lose the property value.

The code below changes execution engine to mr.

set hive.execution.engine=mr;
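To verify the change, run set with just the property name; it prints the current value:

set hive.execution.engine;
-- prints hive.execution.engine=mr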






2) Change the property value using the hiveconf option.

This property value applies only to the query we are running.

The code below changes execution engine to mr using hiveconf option.

hive -hiveconf hive.execution.engine=mr -e "select count(*) from employee"
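Beeline accepts the same option (a sketch; the JDBC URL is an assumption for your cluster):

beeline -u jdbc:hive2://localhost:10000 --hiveconf hive.execution.engine=mr -e "select count(*) from employee"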





3) Modifying configuration files to change property values for the entire cluster.

Hive has the configuration files hive-site.xml, hive-env.sh, hiveserver2-site.xml and hivemetastore-site.xml. We can modify them if we want to change a property value permanently.
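For example, to make mr the permanent execution engine, the property below can be added to hive-site.xml (a sketch; choose the value your cluster needs):

<property>
  <name>hive.execution.engine</name>
  <value>mr</value>
</property>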

If you are using CDH, you can change the property value using Cloudera Manager.

If you are using HDP, you can change the property value using Apache Ambari.

We will learn how to change the property using Apache Ambari.

Click on Hive ---> Click on Configs ---> Enter the property name in the filter box.

The picture below shows how we search for the property hive.execution.engine in Ambari.



Change the property value ---> Click on Save ---> Restart the Hive service.

The picture below shows how to change execution engine in Ambari.



The picture below shows the Save button and the Restart button in Apache Ambari for the Hive service.


Once you restart the Hive service, the change becomes permanent; in this case, all subsequent queries will use MR as their execution engine.


Loading compressed data into Hive table.

In this article, We will learn how to load compressed data (gzip and bzip2 formats) into Hive table.

1) Create a file called employee_gz on the local file system and convert it into a gz format file using the gzip command.

Sample : employee data.


Balu,300000,10,2014-02-01
Radha,350000,15,2014-02-05
Nitya,325000,15,2015-02-06
Bubly,350000,25,2015-05-01
Pandu,300000,35,2014-06-01
Nirupam,350000,40,2016-01-01
Sai,400000,25,2015-05-02
Bala,400000,20,2016-10-10

Example :
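(A sketch of the compression step, assuming the sample data above is saved in a local file named employee_gz.)

gzip employee_gz    # replaces employee_gz with employee_gz.gz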



2)

Create a Hive table called employee_gz without any location option.

The code below creates the Hive table.

create table employee_gz(name string,salary int,deptno int,DOJ date)
row format delimited fields terminated by ',';





3)

Load data from the local file employee_gz.gz into the Hive table employee_gz.

The code below loads GZ compressed data in /home/hdfs/employee_gz.gz into hive table employee_gz.

load data local inpath '/home/hdfs/employee_gz.gz' into table employee_gz;

Hive recognizes the compressed data and loads it into the table. We need not specify that it is in gzip format.



Hive also uncompresses the data automatically while running a select query.

If we remove local from the hive query, data will be loaded into the Hive table from an HDFS location.

4) Check in HDFS whether the Hive table's data is stored in GZ format.
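A quick way to check (assuming the table data lives under the default warehouse path; your path may differ):

hdfs dfs -ls /user/hive/warehouse/employee_gz    # the loaded file keeps its .gz extension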

Now we will check how to load bzip2 format data into a Hive table.

5) Create a local file called employee_bz2 and compress it into bzip2 format.
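(A sketch of the compression step, assuming the sample data is saved in a local file named employee_bz2.)

bzip2 employee_bz2    # replaces employee_bz2 with employee_bz2.bz2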



6) Create a new table called employee_bz2.

The code below creates a hive table called employee_bz2.

create table employee_bz2(name string,salary int,deptno int,DOJ date)
row format delimited fields terminated by ',';



7) Load bzip2 format data into Hive table.

load data local inpath '/home/hdfs/employee_bz2.bz2' into table employee_bz2;


We can load gzip and bzip2 format data into Hive tables like normal text files. We do not need to specify any format in the query.

Step by step guide to create an ORC table in Apache Hive

In this article, we will learn how to create a Hive table with the ORC data format.

1)

Create a normal table. You can check this article to create a normal table in Hive.

2)
Create an ORC table. Use stored as orc to create an ORC table.

create table employee_orc(name string,salary int,deptno int,DOJ date)
row format delimited fields terminated by ','
STORED AS orc;





3)

Load data from the normal table into the ORC table. The data will be implicitly converted into the ORC data format.

INSERT INTO TABLE employee_orc SELECT * FROM employee;






4)

Create an ORC table with tblproperties. You can specify custom values for ORC format properties.
The code below changes the ORC compression to NONE; its default value is ZLIB.

create table employee_orc(name string,salary int,deptno int,DOJ date)
row format delimited fields terminated by ','
STORED AS orc tblproperties ("orc.compress"="NONE");
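To verify the format and the compression setting, the describe formatted command lists the input format and the table properties; orc.compress should show NONE here:

describe formatted employee_orc;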





Step by step guide for using execution engines in Apache Hive.


In this article, We will learn how to use different execution engines in Apache Hive.

1) Create a table called employee to run the next queries. You can check how to create a table in Hive here.

2)
Check the execution engine configured in hive-site.xml.

grep -C2 -i execution.engine /etc/hive/conf/hive-site.xml

The execution engine defined in hive-site.xml is the default execution engine. The image below shows that tez is the execution engine defined in hive-site.xml.





3)
Test the default execution engine.

The image below shows the Tez execution engine started by a Hive query.





4)
Change the execution engine in the beeline prompt.

The image below shows the mr execution engine being set before running the query.

set hive.execution.engine=mr;
select count(*) from employee;



5)
Most of the latest Hadoop distributions provide TEZ as their default execution engine. If you want to change the execution engine at the cluster level, you need to modify the hive.execution.engine property value using either Ambari (HDP) or Cloudera Manager (CDH), depending on your Hadoop distribution.

6)

We can even check the execution engine used by a Hive query from the Resource Manager user interface.




Step by step guide for joins in Apache Hive

This article discusses how to use different joins in Hive in a step by step manner.

1) Create a department table with the data below.

Check How to create a table in Apache Hive here.


10,IT
15,Administration
20,Engineering
35,Sales
40,Others
45,HR

create external table department(deptno int,deptname string) 
row format delimited fields terminated by ','
location '/user/hdfs/department';



2)
Describe the employee table and the department table.

describe employee;

describe department;





3)

Check data from both the tables. Both tables have deptno as a common column.

select * from employee;

select * from department;



4)

Retrieve data from both tables where the deptno values match in both tables.

For example, deptno 10 in the employee table is matched with deptno 10 in the department table, but deptno 25 is not.

select employee.name, employee.salary, employee.deptno, employee.doj, department.deptno, department.deptname
from employee join department
on employee.deptno = department.deptno;





The functionality above is called an inner join. We can also specify the inner join explicitly, like below.

select employee.name, employee.salary, employee.deptno, employee.doj, department.deptno, department.deptname
from employee inner join department
on employee.deptno = department.deptno;

We can also specify aliases for the table names, like below.

select e.name, e.salary, e.deptno, e.doj, d.deptno, d.deptname
from employee e inner join department d
on e.deptno = d.deptno;



5)

If we want to see even the unmatched rows from the employee (left side) table, then we can use a left outer join like below.

select e.name, e.salary, e.deptno, e.doj, d.deptno, d.deptname
from employee e left outer join department d
on e.deptno = d.deptno;


The output above displays NULL for the deptno 25 row as it does not have any matching data in the department table.

6)
If we want to see even the unmatched rows from the department (right side) table, then we can use a right outer join like below.

select e.name, e.salary, e.deptno, e.doj, d.deptno, d.deptname
from employee e right outer join department d
on e.deptno = d.deptno;



The output above displays NULL for the deptno 45 row as it does not have any matching data in the employee table.

7)

If we want to see the unmatched rows from both tables, we need to use a full outer join like below.

select e.name, e.salary, e.deptno, e.doj, d.deptno, d.deptname
from employee e full outer join department d
on e.deptno = d.deptno;


The full outer join output contains the NULL rows from both the left outer and right outer joins.

We have learnt how to use the joins below:

Inner join
Left outer join
Right outer join
Full outer join


Step by step guide for working on bucketed table in Apache Hive

In this article we will learn how to work with bucketed tables.


1)
Create a bucketed table in Hive. The clustered by clause needs to be used for a bucketed table.
The code below creates a bucketed table and distributes rows into two buckets based on the deptno column.

create table employee_buckets(name string,salary int,deptno int,DOJ date)
clustered by (deptno) into 2 buckets
row format delimited fields terminated by ',' ;



2)

Load data into the bucketed table from an existing table. We need to set the hive.enforce.bucketing property to true before loading the data.

The code below loads data into the employee_buckets table from the employee table. Check How to create a table in Apache Hive.

set hive.enforce.bucketing=true;
insert into table employee_buckets select * from employee;




3)
Verify how the bucketed table's data is organized in the HDFS directory.

The image below shows:

Two files are created in the HDFS directory, matching the number of buckets we specified.

The deptno data is divided between these files, and all rows for a given deptno value are stored in a single file, that is, in one bucket.
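A quick way to list the bucket files (assuming the default warehouse location; your path may differ):

hdfs dfs -ls /user/hive/warehouse/employee_buckets    # expect two files: 000000_0 and 000001_0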


Benefits of a bucketed table:

Joins on bucketed tables are faster, because the common column is compared bucket to bucket, avoiding a full table data comparison.
As per the above example, the deptno data in bucket 000000_0 of employee_buckets would be compared against bucket 000000_0 of a similarly bucketed dept_buckets table.


Step by step guide for working on partitioned table in Apache Hive

We have seen how to create a table in Hive. Now we will see how to create a partitioned table and how to perform operations on it.

1) A partitioned table requires an extra clause called partitioned by in the create table syntax.
The command below creates a partitioned table called employee_partitions.

create table employee_partitions(name string,salary int,deptno int,DOJ date) partitioned by(joinyear string) row format delimited fields terminated by ',' ;



2)

Run the show partitions command to see whether any partitions are available in the table. Since the table was just created and no data has been loaded yet, it will not show any partitions.

show partitions employee_partitions;




3)
Run a select query on the normal table created in the how to create a Hive table article and insert the output into the employee_partitions table.

The query below retrieves 2014 employees from the employee table and inserts them into the employee_partitions partition for 2014.

insert into table employee_partitions partition (joinyear='2014') select name,salary,deptno,doj from employee where year(doj)=2014;





4)
Now run the show partitions command and you will see one partition displayed.



5)

Check the Hive table's HDFS location using the describe formatted command. The Location row displays the HDFS path of the table.

describe formatted employee_partitions;

6)

Check how the partitioned data is stored in HDFS. You will find the partitioned data stored under subdirectories of the table's HDFS directory.

hdfs dfs -ls hdfs://apslabc1/user/hdfs/data/employee_partitions





7) In step 3, we loaded one partition's data into the employee_partitions table. If we want to load data for multiple partitions at one time, we have to use the query below. This is also called dynamic loading of partitions.

set hive.exec.dynamic.partition.mode=nonstrict;
insert into table employee_partitions partition (joinyear) select name,salary,deptno,doj,year(doj) from employee;



8)

Now check all partitions.

show partitions employee_partitions;




9)

If we want to specify a location at the time of creating a partitioned table, the location should already contain data organized as per the partitions.

The output below shows partition folders; such a directory can be used in the location option.



10)

We can load data into a partitioned table from the local file system using the load data statement, like below.
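(A minimal sketch, assuming a local file named employee_2015 that holds only rows with a 2015 joining date; the file name and path are assumptions.)

load data local inpath '/home/hdfs/employee_2015' into table employee_partitions partition (joinyear='2015');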


Hope the above queries help you. Happy Hadooping!!!




Step by step guide for creating a table in Apache Hive

This article covers how to create a table, how to check the schema of the table, and how to run a Hive query.


1)
Create a local file named employee on the Unix file system, like below.

Sample : employee data.


Balu,300000,10,2014-02-01
Radha,350000,15,2014-02-05
Nitya,325000,15,2015-02-06
Bubly,350000,25,2015-05-01
Pandu,300000,35,2014-06-01
Nirupam,350000,40,2016-01-01
Sai,400000,25,2015-05-02
Bala,400000,20,2016-10-10




2)
Create a directory called employee in HDFS.

hdfs dfs -mkdir /user/hdfs/employee





3)
Upload the above file (step 1) to the HDFS directory employee created above.

hdfs dfs -put employee /user/hdfs/employee






4) Confirm the local file employee is uploaded into HDFS directory employee.
hdfs dfs -ls /user/hdfs/employee





5)
Open the Hive prompt or the Beeline prompt. The Hive prompt is outdated; try to use Beeline.
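(A sketch of connecting with Beeline; the JDBC URL and user name are assumptions for your cluster.)

beeline -u jdbc:hive2://localhost:10000 -n hdfs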





6)
Create a table called employee in Hive using the hive prompt.

create table employee(name string,salary int,deptno int,DOJ date)
row format delimited fields terminated by ',' location '/user/hdfs/employee';





7)

Check that the table is created in Hive.
show tables;



8)

Check the columns (schema) of the table employee.
describe employee;


9)

Display any two rows in the table employee.

select * from employee limit 2;