Search This Blog

Step by step guide for working on bucketed table in Apache Hive

In this article we will learn how to work on bucketed table.


1)
 Create a bucketed table in Hive. clustered by clause needs to be used for bucketed table.
The code below creates bucketed table and organizes deptno column into two buckets.

create table employee_buckets(name string,salary int,deptno int,DOJ date)
clustered by (deptno) into 2 buckets
row format delimited fields terminated by ',' ;



2)

Load data into bucketed tables from existing table. We need to set hive.enforce.bucketing property to true before loading the data.

The code below loads data into employee_buckets table from employee table. Check How to create a table in Apache Hive.

set hive.enforce.bucketing=true;
insert into table employee_buckets select * from employee;




3) 
 Verify How bucketed table's data is organized  in HDFS directory.

The image below shows  : 
                                      Two files are created in HDFS direcctory as per number of buckets we have mentioned.

deptno columns data is devided into files and Entire data of one deptno column is stored in one file that is in one bucket.


Benefits of Bucketed table :

Joins on bucketed table will be faster as common column is compared from one bucket to another bucket avoiding full table data comparison.
As per above example, deptno data in bucket 000000_0 of employee_buckets will be compared against bucket 000000_0 of dept_buckets table .