
Step by step guide for using execution engines in Apache Hive


In this article, we will learn how to use the different execution engines in Apache Hive.

1) Create a table called employee to run the queries below. You can check how to create a table in Hive here.

2)
Check the execution engine configured in hive-site.xml.

grep -C2 -i execution.engine /etc/hive/conf/hive-site.xml

The execution engine defined in hive-site.xml is the default execution engine. The image below shows that tez is the execution engine defined in hive-site.xml.
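
For reference, the relevant entry in hive-site.xml looks roughly like the snippet below (the value shown is illustrative; on your cluster it may be mr, tez, or spark):

<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>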





3)
Test the default execution engine.
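
Any query that launches a job will do; for example, a count query against the employee table created in step 1:

select count(*) from employee;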

The image below shows the Tez execution engine being started by the Hive query.





4)
Change the execution engine at the beeline prompt.

The image below shows the mr execution engine being set before running the query.

set hive.execution.engine=mr;
select count(*) from employee;



5)
Most of the latest Hadoop distributions provide Tez as their default execution engine. If you want to change the execution engine at the cluster level, you need to modify the hive.execution.engine property value using either Ambari (HDP) or Cloudera Manager (CDH), depending on your Hadoop distribution.

6)

We can also check which execution engine a Hive query used from the Resource Manager user interface.




Step by step guide for joins in Apache Hive

This article discusses how to use the different joins in Hive, step by step.

1) Create a department table with the data below (the commands for staging this data in HDFS are sketched after the create statement).

Check how to create a table in Apache Hive here.


10,IT
15,Administration
20,Engineering
35,Sales
40,Others
45,HR

create external table department(deptno int,deptname string) 
row format delimited fields terminated by ','
location '/user/hdfs/department';
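
Before querying the table, the sample rows need to be present in the table's HDFS location. A minimal sketch, assuming the rows above were saved to a local file named department:

hdfs dfs -mkdir /user/hdfs/department
hdfs dfs -put department /user/hdfs/department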



2)
Describe the employee table and the department table.

describe employee;

describe department;





3)

Check the data in both tables. Both tables have deptno as a common column.

select * from employee;

select * from department;



4)

Retrieve data from both tables where the deptno value matches in both tables.

For example, deptno 10 in the employee table is matched with deptno 10 in the department table, but deptno 25 is not.

select employee.name, employee.salary, employee.deptno, employee.doj, department.deptno, department.deptname
from employee join department
on employee.deptno = department.deptno;





The functionality above is called an inner join. We can also write the inner join explicitly, like below.

select employee.name, employee.salary, employee.deptno, employee.doj, department.deptno, department.deptname
from employee inner join department
on employee.deptno = department.deptno;

We can also specify aliases for the table names, like below.

select e.name, e.salary, e.deptno, e.doj, d.deptno, d.deptname
from employee e inner join department d
on e.deptno = d.deptno;



5)

If we also want to see unmatched rows from the employee (left side) table, then we can use a left outer join like below.

select e.name, e.salary, e.deptno, e.doj, d.deptno, d.deptname
from employee e left outer join department d
on e.deptno = d.deptno;


The output above displays NULL for the deptno 25 row, as it does not have any matching data in the department table.

6)
If we also want to see unmatched rows from the department (right side) table, then we can use a right outer join like below.

select e.name, e.salary, e.deptno, e.doj, d.deptno, d.deptname
from employee e right outer join department d
on e.deptno = d.deptno;



The output above displays NULL for the deptno 45 row, as it does not have any matching data in the employee table.

7)

If we want to see unmatched rows from both tables, we need to use a full outer join like below.

select e.name, e.salary, e.deptno, e.doj, d.deptno, d.deptname
from employee e full outer join department d
on e.deptno = d.deptno;


The full outer join output contains the NULL rows from both the left outer and right outer joins.

We have learned how to use the joins below:

Inner joins
Left outer join
Right outer join
Full outer join


Step by step guide for working on bucketed table in Apache Hive

In this article we will learn how to work with a bucketed table.


1)
Create a bucketed table in Hive. The clustered by clause needs to be used for a bucketed table.
The code below creates a bucketed table and organizes the deptno column into two buckets.

create table employee_buckets(name string,salary int,deptno int,DOJ date)
clustered by (deptno) into 2 buckets
row format delimited fields terminated by ',' ;



2)

Load data into the bucketed table from an existing table. We need to set the hive.enforce.bucketing property to true before loading the data.

The code below loads data into the employee_buckets table from the employee table. Check how to create a table in Apache Hive.

set hive.enforce.bucketing=true;
insert into table employee_buckets select * from employee;




3)
Verify how the bucketed table's data is organized in the HDFS directory.

The image below shows that two files are created in the HDFS directory, matching the number of buckets we specified.

The deptno column's data is divided across the files, and all the data for one deptno value is stored in one file, that is, in one bucket.
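
To verify, you can list the table's directory in the Hive warehouse. The path below assumes the HDP default warehouse location; on other setups it is often /user/hive/warehouse:

hdfs dfs -ls /apps/hive/warehouse/employee_buckets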


Benefits of a bucketed table:

Joins on bucketed tables will be faster, as the common column is compared bucket to bucket, avoiding a full table data comparison.
As per the above example, deptno data in bucket 000000_0 of employee_buckets would be compared against bucket 000000_0 of a similarly bucketed dept_buckets table.


Step by step guide for working on partitioned table in Apache Hive

We have seen how to create a table in Hive. Now we will see how to create a partitioned table and how to perform operations on it.

1) A partitioned table requires an extra clause called partitioned by in the create table syntax.
The command below creates a partitioned table called employee_partitions.

create table employee_partitions(name string,salary int,deptno int,DOJ date)
partitioned by (joinyear string)
row format delimited fields terminated by ',' ;



2)

Run the show partitions command to see whether any partitions are available in the table. As the create command above does not include the location option, it will not show any partitions yet.

show partitions employee_partitions;




3)
Run a select query on the normal table created in the how to create a Hive table article and insert the output into the employee_partitions table.

The query below retrieves the employees who joined in 2014 from the employee table and inserts them into the employee_partitions partition for 2014.

insert into employee_partitions partition (joinyear='2014') select name,salary,deptno,doj from employee where year(doj)=2014;





4)
Now run the show partitions command again and you will see one partition displayed.
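
The command is the same as in step 2:

show partitions employee_partitions;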



5)

Check the Hive table's HDFS location using the describe formatted command. The Location row displays the HDFS path of the table.

describe formatted employee_partitions;

6)

Check how the partitioned data is stored in HDFS. You will find that the partitioned data is stored under subdirectories of the table's HDFS directory.

hdfs dfs -ls hdfs://apslabc1/user/hdfs/data/employee_partitions





7) In step 3, we loaded data for one partition into the employee_partitions table. If we want to load data for multiple partitions at one time, we have to use the query below. This is also called dynamic partitioning.

set hive.exec.dynamic.partition.mode=nonstrict;
insert into employee_partitions partition (joinyear) select name,salary,deptno,doj,year(doj) from employee;



8)

Now check all partitions.

show partitions employee_partitions;




9)

If we want to specify a location at the time of creating the partitioned table, the location should already contain data organized per partition.

The output below shows such partition folders, and a directory like this can be used in the location option.
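
An illustrative sketch of such a layout (the base path reuses the one from step 6; the exact folders depend on your data):

hdfs dfs -ls /user/hdfs/data/employee_partitions
/user/hdfs/data/employee_partitions/joinyear=2014
/user/hdfs/data/employee_partitions/joinyear=2015
/user/hdfs/data/employee_partitions/joinyear=2016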



10)

We can also load data into the partitioned table from the local file system using the load data statement, like below.
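
A minimal sketch (the local file path and partition value are assumptions):

load data local inpath '/tmp/employee_2015' into table employee_partitions partition (joinyear='2015');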


Hope the above queries help you. Happy Hadooping!




Step by step guide for creating a table in Apache Hive

This article covers how to create a table, how to check the schema of the table, and how to run a Hive query.


1)
Create a local file named employee on the Unix file system with the sample data below (one way to create it is sketched after the data).

Sample employee data:


Balu,300000,10,2014-02-01
Radha,350000,15,2014-02-05
Nitya,325000,15,2015-02-06
Bubly,350000,25,2015-05-01
Pandu,300000,35,2014-06-01
Nirupam,350000,40,2016-01-01
Sai,400000,25,2015-05-02
Bala,400000,20,2016-10-10
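
For example, using cat (any text editor works just as well):

cat > employee

Paste the rows above and press Ctrl+D to save the file.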




2)
Create a directory called employee in HDFS.

hdfs dfs -mkdir /user/hdfs/employee





3)
Upload the file from step 1 to the HDFS directory employee created above.

hdfs dfs -put employee /user/hdfs/employee






4) Confirm that the local file employee was uploaded into the HDFS directory employee.
hdfs dfs -ls /user/hdfs/employee





5)
Open the hive prompt or the beeline prompt. The hive CLI is outdated, so try to use beeline.
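
For example, beeline can connect to HiveServer2 like this (host, port, and user are assumptions; adjust them for your cluster):

beeline -u jdbc:hive2://localhost:10000 -n hdfs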





6)
Create a table called employee in Hive from the prompt.

create table employee(name string,salary int,deptno int,DOJ date)
row format delimited fields terminated by ',' location '/user/hdfs/employee';





7)

Check that the table is created in Hive.
show tables;



8)

Check the columns (schema) of the table employee.
describe employee;


9)

Display any two rows in the table employee.

select * from employee limit 2;




HDFS Commands examples

This article covers simple and useful HDFS command examples. Apache Hadoop is a data processing tool at web scale. Hadoop contains three modules:

HDFS                          ---   Data storage system

MAPREDUCE           ---    Data processing framework

YARN                         ---    Resource management system for Hadoop


We will discuss some commands to learn how to interact with the Hadoop Distributed File System (HDFS). All HDFS file system commands start with hdfs dfs. Most Hadoop distributions (CDH, HDP) come with a standard hdfs user.

1)

Change the current user from root to the hdfs user. Usually the hdfs user is a passwordless user.
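
For example, from a root shell:

su - hdfs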









2).



Create a file on the local file system using the cat command.
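
A minimal example (the file name helloworld and its content are illustrative):

cat > helloworld
hello world

Press Ctrl+D to finish writing the file.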




3).


Create a directory in HDFS using the mkdir command.
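
For example (a relative path like this resolves to /user/hdfs/helloworld when run as the hdfs user):

hdfs dfs -mkdir helloworld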





4).


Upload the local file helloworld to the HDFS directory helloworld using the put command.
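
For example:

hdfs dfs -put helloworld helloworld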



5).


Check that the file is loaded into the HDFS directory helloworld using the ls command.
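
For example:

hdfs dfs -ls helloworld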







6).


Read the HDFS file content using the cat command.
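
For example:

hdfs dfs -cat helloworld/helloworld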







7).


Rename the HDFS file helloworld to helloworldfile using the mv command.
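
For example:

hdfs dfs -mv helloworld/helloworld helloworld/helloworldfile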









8)

Copy helloworldfile to another HDFS directory using the cp command.
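
For example (the target directory helloworld2 is illustrative):

hdfs dfs -mkdir helloworld2
hdfs dfs -cp helloworld/helloworldfile helloworld2/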












9)

Check the size of a file in HDFS using the du command.
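
For example (-h prints a human-readable size):

hdfs dfs -du -h helloworld/helloworldfile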





10).


Check the replication factor of a file in HDFS using the ls command. The number in the second column of the ls output is the replication factor of the file.
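
For example:

hdfs dfs -ls helloworld/helloworldfile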




11).

Change the replication factor of a file in HDFS using the setrep command. The example below changes the replication factor from 3 to 2.
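
A sketch of the command (the path is illustrative):

hdfs dfs -setrep 2 helloworld/helloworldfile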







Enabling debug logs in Apache Hadoop and Hortonworks Data Platform


To troubleshoot Hadoop issues we need debug logs to see more low-level errors. By default, debug logs are not enabled in Hortonworks Data Platform or in plain Hadoop.
In this post, we will discuss how to enable debug logs in HDP and in plain Hadoop.

1. Modify /var/lib/ambari-server/resources/stacks/HDP/2.0.6/hooks/after-INSTALL/templates/hadoop-env.sh.j2


In HDP, we need to add the line below to hadoop-env.sh.j2 to enable debug logs for HDFS services.


export HADOOP_ROOT_LOGGER=DEBUG,console


If you are using plain Hadoop, you can directly add the above line to the /etc/hadoop/conf/hadoop-env.sh file on all nodes and restart the HDFS daemons on all nodes.
If you are using Hortonworks Data Platform, you need to follow the steps below.

2. Restart Ambari agents on all nodes


We need to restart the Ambari agent on all master nodes, data nodes, and edge nodes, if any. Run the command below to restart it.

service ambari-agent restart

Or we can stop and start the Ambari agent.

service ambari-agent stop
service ambari-agent start


3. Restart Ambari server


We also need to restart the Ambari server on the first master. We can run the command below.

service ambari-server restart

Or we can stop and start the Ambari server.

service ambari-server stop
service ambari-server start

4. Restart HDFS services


In the Ambari UI, click on HDFS, click the Service Actions drop-down, and select the Restart All option. It will ask for confirmation; once you confirm, it will restart all HDFS daemons.

Once debug logs are enabled, you can check them in the name node logs and data node logs. Debug logs consume a lot of space, so you need to disable them once you have collected the required logs. To disable them, remove the added export command and restart the above-mentioned services.

Fixing HDFS issues

The fsck command scans all files and directories in HDFS for errors and abnormal conditions. It should be run by the administrator periodically; the name node also runs such checks and fixes most issues automatically over time.

Below is the command syntax; it needs to be run as the hdfs user.

hdfs fsck <path>

We can specify the root (/) directory to check the entire HDFS for errors, or we can specify a particular directory to check for errors within it.

The fsck report contains:

Under-replicated, over-replicated, mis-replicated, and corrupt blocks.

The total number of files and directories in HDFS.

The default replication factor and the average block replication.

The number of data nodes and the number of racks.

Finally, it displays the file system status as HEALTHY or CORRUPT.

The final fsck status needs to be HEALTHY. If it is CORRUPT, it needs to be fixed by the administrator, although most issues will be fixed by the name node automatically over a period of time.

Below is a sample fsck output.


hdfs fsck /

Total size:    466471737404 B (Total open files size: 27 B)

 Total dirs:    917
 Total files:   2042
 Total symlinks:                0 (Files currently being written: 3)
 Total blocks (validated):      4790 (avg. block size 97384496 B) (Total open file blocks (not validated): 3)
  ********************************
  CORRUPT FILES:        9
  MISSING BLOCKS:       9
  MISSING SIZE:         315800 B
  CORRUPT BLOCKS:       9
  ********************************
 Minimally replicated blocks:   4781 (99.81211 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       4274 (89.227554 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     2.0885177
 Corrupt blocks:                9
 Missing replicas:              4280 (29.944729 %)
 Number of data-nodes:          3
 Number of racks:               1
FSCK ended at Sun Mar 20 12:52:45 EDT 2016 in 244 milliseconds


The filesystem under path '/' is CORRUPT






Under-replicated and over-replicated blocks


dfs.replication in hdfs-site.xml specifies the required number of replicas for a block on the cluster. If the number of replicas is less than that, the blocks are called under-replicated blocks; this might happen when data nodes go down. If the number of replicas is higher than that, they are called over-replicated blocks; this might happen when crashed data nodes come back online.

Under- and over-replicated blocks can be addressed with the setrep command, or the name node will fix them after some time.

hdfs dfs -setrep -w 3 /path


  1. If you have 2 replicas but 3 are required, set the replication factor to 3; likewise, if you have 4 replicas but 3 are required, also set the replication factor to 3.
  2. Run the balancer; sometimes it can also fix the problem.
  3. Copy the under/over-replicated file to a different location, remove the original file, and rename the copied file back to the original name. Be careful with this trick: if you remove the under/over-replicated file, jobs using that file might fail.
After the replication factor is set, use the hdfs dfs -ls command on the file, which also displays its replication factor.

Corrupted blocks

We should delete corrupted files, and we can set an appropriate replication factor after that.
We need to use the hdfs fsck / -delete command to delete corrupted files.

We can check corrupted blocks using the hdfs fsck / -list-corruptfileblocks command.

Missing blocks


Find out which node has the missing blocks and check whether the data node is running; if possible, try restarting the data node. We can check data node status from the active name node UI or run the jps command on all data nodes to check whether the data node process is running.

The administrator has to run the fsck command regularly to check the Hadoop file system for errors and take the necessary actions to avoid data loss.

The fsck command has several options. Some of them are:

-files

     It displays the files under the given path.

hdfs@cluster10-1:~> hdfs fsck / -files
/user/oozie/share/lib/sqoop/commons-io-2.1.jar 163151 bytes, 1 block(s):  OK

-blocks

       It displays block information.

hdfs@cluster10-1:~> hdfs fsck / -files -blocks
/user/oozie/share/lib/sqoop/oozie-sharelib-sqoop-4.0.0.2.1.2.0-402.jar 7890 bytes, 1 block(s):  OK
0. BP-18950707-10.20.0.1-1404875454485:blk_1073742090_1266 len=7890 repl=3

-locations
               It displays the data node addresses where the blocks are stored.

hdfs@cluster10-1:~> hdfs fsck / -files -blocks -locations
/user/oozie/share/lib/sqoop/sqoop-1.4.4.2.1.2.0-402.jar 819248 bytes, 1 block(s):  OK
0. BP-18950707-10.20.0.1-1404875454485:blk_1073742091_1267 len=819248 repl=3 [10.20.0.1:50010, 10.20.0.1:50010, 10.20.0.1:50010]

-delete
               It deletes corrupted files. We need to run it when we find corrupted blocks in the cluster.


-openforwrite

                      It displays files opened for writing.

-list-corruptfileblocks

                                  It displays only the corrupt file blocks for the given path.

hdfs fsck / -list-corruptfileblocks
Connecting to namenode via http://cluster10-1:50070
The filesystem under path '/' has 0 CORRUPT files


Checking specific information


If we want to see a specific type of entry in the fsck report, we need to use the grep command on the fsck output.
If we want to see only under-replicated blocks, we need to grep like below.

hdfs fsck / -files -blocks -locations |grep -i "Under replicated"

/data/output/_partition.lst 297 bytes, 1 block(s):  Under replicated BP-18950707-10.20.0.1-1404875454485:blk_1073778630_38021. Target Replicas is 10 but found 4 replica(s).

We can replace "Under replicated" with "corrupt" to see corrupt files.

hdfs@cluster10-1:~> hdfs fsck / -files -blocks -locations|grep -i corrupt
Connecting to namenode via http://cluster10-2:50070
/apps/hbase/data/corrupt <dir>
/data/output/part-r-00004: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778646
/data/output/part-r-00008: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778648
/data/output/part-r-00009: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778649
/data/output/part-r-00010: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778650
/data/output/part-r-00016: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778654
/data/output/part-r-00019: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778659
/data/output/part-r-00020: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778660
/data/output/part-r-00021: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778661
/data/output/part-r-00026: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778663

Status: CORRUPT
  CORRUPT FILES:        9
  CORRUPT BLOCKS:       9
 Corrupt blocks:                9

The filesystem under path '/' is CORRUPT


The above command displays complete information. If we want only the file path, we need to use awk.

hdfs fsck / -files -blocks -locations |grep -i "Under replicated"|awk -F " " '{print $1}'
Connecting to namenode via http://cluster10-1:50070
/data/output/_partition.lst

As we have discussed, we can set the replication factor using the setrep command to fix under-replicated blocks. When there are many under-replicated blocks, it is difficult to run the setrep command on each file manually.
To avoid manual setting, write all under-replicated file paths to a file and write a shell script that sets the replication factor for all of them, as sketched below.

hdfs fsck / -files -blocks -locations |grep -i "Under replicated"|awk -F " " '{print $1}' >>underreplicatedfiles
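
A minimal sketch of such a script, assuming underreplicatedfiles contains one HDFS path per line and the target replication factor is 3:

while read f
do
  hdfs dfs -setrep 3 "$f"
done < underreplicatedfiles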



Happy Hadooping.








Your Hadoop job might fail due to invalid gzip files

We can use compressed data formats like gzip, bzip2, and LZO in Hadoop. Gzip is not splittable and is suitable for small files. bzip2 is splittable, and although LZO is not splittable by default, we can make it splittable. I have seen many people using gzip files in Hadoop. If you have any corrupt gz files, your job will fail, so it is a good idea to test gz files for errors before submitting a job.




The gzip command has a -t option that tests an existing gz file; if no errors are found in the file, no message is printed.

hdfs@cluter10-1:~> gzip -t test.txt.gz

hdfs@cluter10-1:~> 


If any errors are found, they are printed to the console.

hdfs@cluter10-1:~> gzip -t test.gz

gzip: test.gz: not in gzip format
hdfs@cluter10-1:~> 

We can also check files stored in HDFS for errors.

hdfs@cluter10-1:~> hdfs dfs -cat /data/sample/test.txt.gz|gzip -t


You might have many gz files in a folder, and it is time-consuming to test them individually.
We can write a small script to check all gz files in a directory for errors.

The first line takes all the files from the given directory, using awk to pick the 8th field after splitting on spaces, and passes each file to the for loop.

for i in `hdfs dfs -ls gz/*|awk -F" " '{print $8}'`
do
echo "checking $i"
hdfs dfs -cat $i|gzip -t
done

The third line prints the file that is currently being checked, and the fourth line actually checks the file for errors. If your Hadoop job fails with file-related errors and you have gz files, then this article is useful.

The error below might be related to invalid gz files.


createBlockOutputStream java.io.EOFException: Premature EOF: no length prefix available

Hope this small script is useful for you.

Happy Hadooping.





Analysis of Cloudera customer case studies

Cloudera is a leading big data vendor. It provides Hadoop support and training services to its customers. I have always wondered how big data customers decide on a vendor.
So, for fun, I downloaded all of Cloudera's customer case studies and analyzed them to generate a report about what they say.

I categorized the case study text into four topics.

1) What are the most commonly mentioned benefits of Cloudera?
2) What were customers mostly using before Hadoop? Of course, some business use cases are brand new.
3) What are the most used technologies among Cloudera customers?
4) In which domains are customers mostly using Cloudera services?

Domains:

Cloudera customers come from domains such as health care, BFSI, digital marketing, education, and security.

Most of the customers are from the three domains below.

BFSI
Health Care
Digital marketing

It is good to know that the second-largest group of customers is from the health care domain; something good is happening for people thanks to big data. An interesting use case is real-time monitoring of kids in hospitals. It seems National Children's Hospital has improved patient care for kids using Hadoop.

Migration:

Many customers did not mention what they were using before Hadoop, and some customers have brand-new use cases for Hadoop. Some customers mentioned the technologies below as their old technology stack.

RDBMS
Data Warehouse


Many customers say their RDBMS could not deal with big data. Customers mentioned they were using Oracle, SQL Server, DB2, and MySQL.

Most used technologies:

Cloudera customers say they are using the batch processing, real-time, ETL, and visualization tools of the Hadoop ecosystem.

I was surprised to learn that the most mentioned technologies are Hive, Flume, MapReduce, and HBase.


Benefits:

Let us see why customers are choosing Cloudera over others.

It seems Hadoop itself is a cost-effective solution. Though Cloudera is relatively costlier than MapR and Hortonworks, customers still say it is cost-effective; I think they are comparing it against data warehouses and other solutions. Cloudera is also well known for its training and support services.

Hope this is useful for you. Please check about MapR here.
Hope this is useful for you. Please check about MapR here.