
Enabling and disabling ACLs in HDFS


The ACL commands setfacl and getfacl provide advanced permission management in HDFS. In this article we will learn how to enable and disable ACLs in HDFS using Apache Ambari.

ACLs are disabled by default. To enable them, we need to add or modify the ACL property dfs.namenode.acls.enabled.

1)

Search the HDFS configs for the dfs.namenode.acls.enabled property in Ambari. You will get no results if the property is not defined yet.

Go to HDFS ----> Configs ----> enter dfs.namenode.acls.enabled in the filter box





2)

We need to add the ACL property dfs.namenode.acls.enabled if it is not already present.

Go to HDFS ----> Configs ----> Advanced ----> Custom hdfs-site ----> Add Property ----> enter dfs.namenode.acls.enabled=true ----> click Add
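
For reference, outside Ambari this same setting lives in hdfs-site.xml. A minimal snippet, in case you manage the file directly instead of through Ambari (Ambari-managed clusters should always use the UI so the change is not overwritten):

<property>
    <name>dfs.namenode.acls.enabled</name>
    <value>true</value>
</property>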





3)

Click the Save button and give the new configuration version a name, for example "acl property added".

4)

Restart the services that show the restart symbol.

The following picture shows the restart symbol for the HDFS, YARN and MapReduce2 services.





5)

Search for the property as in step 1 and confirm that it has been added.

6)

If the property is already there, just change its value from false to true to enable ACLs in HDFS.

7)

If ACLs are already enabled, you can disable them by setting dfs.namenode.acls.enabled to false using the same steps above.
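
Once ACLs are enabled and the affected services are restarted, you can manage fine-grained permissions with setfacl and getfacl. A quick sketch, assuming a hypothetical directory /data/shared and a user named analyst:

hdfs dfs -mkdir -p /data/shared
hdfs dfs -setfacl -m user:analyst:r-x /data/shared     # grant read and execute access to analyst
hdfs dfs -getfacl /data/shared                         # verify the ACL entries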



Creating HDFS policy in Ranger user interface


Apache Ranger is a policy-based security tool for the Hadoop ecosystem. Ranger provides security policies for tools like HDFS, YARN, Hive, Knox, HBase and Storm. In this article we will learn how to create an HDFS policy in the Apache Ranger UI.


1)  Create a folder in HDFS.

We will create an HDFS directory /user/hdfs/ranger to test Ranger HDFS policies, using the hdfs user.

hdfs dfs -mkdir /user/hdfs/ranger
hdfs dfs -ls /user/hdfs/ranger






2) Try to access the same directory as the hive user.

If we try to access the directory /user/hdfs/ranger as the hive user, we get a permission denied error.

[hive@datanode1 ~]$  hdfs dfs -ls /user/hdfs/ranger
ls: Permission denied: user=hive, access=EXECUTE, inode="/user/hdfs/ranger":hdfs:hdfs:drwx------

We will grant the hive user access to this directory /user/hdfs/ranger using a Ranger policy.

3) Enable HDFS plugin in Ranger

If the HDFS plugin is not enabled, we need to enable it from Ambari.

Go to the Ambari UI ----> click Ranger ----> click Configs ----> click Ranger Plugin ----> set the HDFS plugin to On




4) Define a new policy in Ranger UI

We will define a new policy in the Ranger UI to give the hive user read, write and execute access on the /user/hdfs/ranger directory.


Go to the Ranger UI ----> click the HDFS plugin ----> click Add Policy ----> enter the policy details ----> click Add

The policy details are shown in the picture below.
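
If you prefer scripting over the UI, Ranger also exposes a public REST API for policy management. A rough sketch of an equivalent call is shown below; the Ranger host, the admin credentials and the HDFS service name (assumed here to be hadoop_hdfs) are placeholders for your own installation, and field names can differ between Ranger versions:

curl -u admin:admin -H "Content-Type: application/json" -X POST \
 "http://rangerhost:6080/service/public/v2/api/policy" \
 -d '{ "service": "hadoop_hdfs",
       "name": "hive_access_to_ranger_test_dir",
       "resources": { "path": { "values": ["/user/hdfs/ranger"], "isRecursive": true } },
       "policyItems": [ { "users": ["hive"],
                          "accesses": [ { "type": "read", "isAllowed": true },
                                        { "type": "write", "isAllowed": true },
                                        { "type": "execute", "isAllowed": true } ] } ] }'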





5) Access the HDFS directory /user/hdfs/ranger as the hive user now.

We can now test access to the directory /user/hdfs/ranger with the hive user.


 hdfs dfs -ls /user/hdfs/ranger


Even though the hive user does not have any HDFS permissions on the directory /user/hdfs/ranger, it is still able to access the folder because of the HDFS policy defined in Ranger.

Similarly, Ranger provides centralized security policies for all Hadoop tools.

HDFS REST API example with Knox gateway and without Knox gateway


In this article, we will learn how to use the HDFS REST API both with and without the Knox gateway.

Apache Knox is a security technology that provides a common REST API in front of the REST APIs of the Hadoop ecosystem tools. Apache Knox hides the REST API details of several technologies like Hadoop, Hive, HBase and Oozie.


1) Check the folder status in HDFS using the HDFS REST API.


In this step we will learn how to use the HDFS REST API. The command below checks the status of the HDFS directory /user/hdfs/restapitest.

 curl  "http://master2:50070/webhdfs/v1/user/hdfs/restapitest?user.name=hdfs&op=GETFILESTATUS"


master2 : hostname of the active namenode

50070 : HTTP port number of the active namenode

webhdfs : the HDFS REST API is called WebHDFS; this part of the URL is fixed

v1 : the version number of WebHDFS; it is also fixed

GETFILESTATUS : returns file or folder information from HDFS

user.name : the user on whose behalf you are submitting the HDFS REST API command
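
Other WebHDFS operations follow the same URL pattern; only the op parameter (and sometimes the HTTP verb) changes. For example, with the same hypothetical host and path, a directory listing and a directory creation would look roughly like this:

curl "http://master2:50070/webhdfs/v1/user/hdfs/restapitest?user.name=hdfs&op=LISTSTATUS"
curl -X PUT "http://master2:50070/webhdfs/v1/user/hdfs/restapitest/newdir?user.name=hdfs&op=MKDIRS"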


Problems :

  • Hadoop host names and port numbers are exposed to the outside world, which makes it easier to attack the HDFS cluster.



2) Check the same folder status using the Knox REST API.

In this step we will check the status of the same HDFS directory /user/hdfs/restapitest through the Knox REST API.

The Apache Knox URL does not contain any details about the namenode hostname and port number. It just contains the word webhdfs, the user name, the directory path and the operation we are performing, like below.

curl -u admin:admin-password -i -v -k "https://datanode1:8442/gateway/default/webhdfs/v1/user/hdfs/restapitest?user.name=hdfs&doas=hdfs&op=GETFILESTATUS"


admin:admin-password : the default username and password for the default topology in Knox
default : the topology name; here it is the default topology
8442 : the Knox gateway port number, defined in the gateway.port property
datanode1 : the hostname where the Knox gateway is installed

Apache Knox finds the active namenode host and port using its topology. Apache Knox comes with a topology called default, whose information is stored in the /etc/knox/conf/topologies/default.xml file.

The picture shows the WebHDFS URLs stored in the default topology.
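
For reference, the WEBHDFS entry inside that topology file typically looks something like the snippet below; the URL points to whatever namenode host and port your cluster uses (shown here with the same hypothetical host as earlier):

<service>
    <role>WEBHDFS</role>
    <url>http://master2:50070/webhdfs</url>
</service>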





Advantages :

  • Hadoop service host names and port numbers are not exposed to the outside world, which greatly reduces the chance of external attacks.

I hope it is now clear how Knox protects our Hadoop ecosystem through its REST API.

Starting and stopping Ambari agents

In this article we will learn how to work with Ambari agents: how to start, stop and restart them, and a few more operations, from the command line.
If you are not using the root user, you have to prefix sudo to the commands listed in the steps below.

1) Check the status of the Ambari agent.

This command tells us whether the Ambari agent is running or not. If it is running, the command also gives us the PID of the Ambari agent process.

Command :

ambari-agent status
OR

sudo ambari-agent status

OR

service ambari-agent status



2) Stopping the Ambari agent

You can use any one of the following commands to stop the Ambari agent.

sudo ambari-agent stop

OR

ambari-agent stop

OR

service ambari-agent stop



If the Ambari agent is stopped, the Ambari server shows a heartbeat lost message for that node.


3) Starting the Ambari agent

You can use any one of the following commands to start the Ambari agent.

ambari-agent start

OR

sudo ambari-agent start

OR

service ambari-agent  start



Once the Ambari agent is started, the heartbeat issue in the Ambari user interface will be resolved.

4) Restarting Ambari agent

You can use any one of the following commands to restart the Ambari agent.

ambari-agent restart

OR

sudo ambari-agent restart

OR

service ambari-agent restart






5) Other options

The Ambari agent also comes with some more commands; you can check them all using the --help option.

ambari-agent --help



Starting and stopping Ambari-server

In this article we will learn how to work with the Ambari server: how to start, stop and restart it, and a few more operations, from the command line. If you are not using the root user, you have to prefix sudo to the commands listed in the steps below.

1) Check the status of Ambari server.

You can use any one of the following commands to check the status of the Ambari server.

ambari-server status

OR

sudo ambari-server status

OR

service ambari-server status

This command tells us whether the Ambari server is running or not. If it is running, the command also gives us the PID of the Ambari server process.



2) Stop Ambari server

You can use any one of the following commands to stop the Ambari server.

sudo ambari-server stop

OR

ambari-server stop

OR

service ambari-server stop



3) Start Ambari server

You can use any one of the following commands to start the Ambari server.

ambari-server start

OR

sudo ambari-server start

OR

service ambari-server start



4) Restart Ambari server

You can use any one of the following commands to restart the Ambari server.

ambari-server restart

OR

sudo ambari-server restart

OR

service ambari-server restart




5) Skip database check

While starting, the Ambari server checks the consistency of its database. If the database has any issues, the Ambari server fails to start.

We can skip the database consistency check when starting the Ambari server.

Command:

 ambari-server start --skip-database-check




6) Other options

The Ambari server comes with several other commands; you can use the --help option to list them all.

ambari-server --help



Every command also has its own options; we can use --help to see all options of a particular command.

Example :

ambari-server stop --help







How to create a Hive table for Parquet file format data?

In this article we will learn how to create a Hive table for Parquet file format data. We need to use stored as parquet to create a Hive table for Parquet format data.


1) Create a Hive table without a location.

We can create a Hive table for Parquet data without specifying a location and load data into it later.

Command :

create table employee_parquet(name string,salary int,deptno int,DOJ date)  row format delimited fields terminated by ',' stored as Parquet ;




2) Load data into the Hive table.

We can use a regular insert query to load data into the Parquet table. The data will be converted into the Parquet file format implicitly while loading.

 insert into table employee_parquet select * from employee;
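
To confirm that the table really uses the Parquet storage format, you can inspect its metadata (assuming the table above was created):

describe formatted employee_parquet;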



3) Create a Hive table with a location

We can also create a Hive table for Parquet data with a location. The specified location should contain Parquet format data.

Command :

create table employee_parquet(name string,salary int,deptno int,DOJ date)  row format delimited fields terminated by ',' 
stored as parquet location '/data/in/employee_parquet' ;



How to create a Hive table for RC file format data?


In this article we will learn how to create a Hive table for RC file format data. We need to use stored as RCFILE to create a Hive table for RC file format data.


1) Create a Hive table without a location.

We can create a Hive table for RC file data without specifying a location and load data into it later.

Command :

create table employee_rc(name string,salary int,deptno int,DOJ date)  row format delimited fields terminated by ',' stored as RCFILE ;



2) Load data into the Hive table.

We can use a regular insert query to load data into the RC file format table. The data will be converted into the RC file format implicitly while loading.

 insert into table employee_rc select * from employee;



3) Create a Hive table with a location

We can also create a Hive table for RC file data with a location. The specified location should contain RC file format data.

Command :

create table employee_rc(name string,salary int,deptno int,DOJ date)  row format delimited fields terminated by ',' 
stored as RCFILE location '/data/in/employee_rc' ;


How to create a Hive table for sequence file format data?

In this article we will learn how to create a Hive table for sequence file format data. We need to use stored as SequenceFile to create a Hive table for sequence file format data.


1) Create a Hive table without a location.

We can create a Hive table for sequence file data without specifying a location and load data into it later.

Command :

create table employee_seq(name string,salary int,deptno int,DOJ date)  row format delimited fields terminated by ',' stored as SequenceFile ;



2) Load data into the Hive table.

We can use a normal insert query to load data into the sequence file format table. The data will be converted into the sequence file format while loading.

 insert into table employee_seq select * from employee;



3) Create a Hive table with a location

We can also create a Hive table for sequence file data with a location. The specified location should contain sequence file format data.

Command :

create table employee_seq(name string,salary int,deptno int,DOJ date)  row format delimited fields terminated by ',' 
stored as SequenceFile location '/data/in/employee_seq' ;


Enabling debug logs in Ambari agents


In this article we will learn how to enable debug logs in the Ambari agent.
Logging properties for the Ambari agent are available in the ambari-agent.ini file under the /etc/ambari-agent/conf folder.
We will learn how to modify this file to enable debug logs.

1) Check current log level.

By default the Ambari agent comes with the INFO log level, which does not expose many internal calls.

Command :

grep loglevel /etc/ambari-agent/conf/ambari-agent.ini



2) Stop the Ambari agent

We need to prefix the command with sudo if we are not running it as the root user.

Command :

ambari-agent stop

OR

sudo ambari-agent stop





3) Modify ambari-agent.ini file 

Modify the ambari-agent.ini file to replace INFO with DEBUG, using the vi editor.
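
If you prefer a one-liner over vi, something like the sed command below should work; back the file up first, and note that it assumes the default loglevel=INFO entry is present:

cp /etc/ambari-agent/conf/ambari-agent.ini /etc/ambari-agent/conf/ambari-agent.ini.bak
sed -i 's/loglevel=INFO/loglevel=DEBUG/' /etc/ambari-agent/conf/ambari-agent.ini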

4) Start Ambari agent

Command :

 ambari-agent start

OR

sudo  ambari-agent start



5) Confirm the DEBUG logs in the ambari-agent.log file.

Command :

tail -f /var/log/ambari-agent/ambari-agent.log


We can also run the grep command from the first step to confirm the new log level in the configuration file.

6) Repeat the above steps on all nodes.

We have to repeat the above steps on every node if we want to collect DEBUG logs from all nodes.
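
If passwordless SSH is set up from one admin node, a small loop can save time. A rough sketch, with hypothetical host names node1, node2 and node3:

for host in node1 node2 node3; do
    ssh root@$host "sed -i 's/loglevel=INFO/loglevel=DEBUG/' /etc/ambari-agent/conf/ambari-agent.ini && ambari-agent restart"
done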

Working with databases in Apache Hive

In this article we will learn how to work with databases in Apache Hive: how to create, drop, switch and use databases.

1) Check existing databases.

We can check existing databases in Hive using the show databases command. Apache Hive comes with a database called default.

Command :

show databases;


2) Creating a new database

We can create a new database in Apache Hive using the create database command.

Command syntax:

create database [if not exists] {database-name};
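
For example, to create a database named test only if it does not already exist:

create database if not exists test;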




3) Switching databases

By default, queries run against the default database. If we want to run a query on a different database, we have to change the current database.

We use the use command to change the current database in Apache Hive.

Command :

use test;

The picture below shows switching the current database to test and creating a new table called dummy in it.
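
The commands behind that picture would look roughly like this (the dummy table's single column is just an illustration):

use test;
create table dummy(id int);
show tables;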


4) Drop database

We can drop databases using the drop database command. We need to drop all tables in the database before dropping it; otherwise we get a "database is not empty, one or more tables exist" error.

Command :

drop database test;

The picture below shows dropping the database called test after deleting its table dummy.

5) Using database name in a hive query

If we do not specify a database name in a Hive query, Hive runs it against the current database.

We have to prefix the table name with the database name if we want to run the query against a particular database.

The query below runs against the test database.

Select count(*) from test.dummy;





Enabling debug logs in Ambari server

Debug logs help us troubleshoot Ambari issues better and faster. They contain many more internal calls, which help us understand the problem better.

In this article we will learn how to enable debug logs in the Ambari server. Logging properties for the Ambari server are available in the log4j.properties file under the /etc/ambari-server/conf folder. We will learn how to modify the log4j properties to enable debug logs.

1) Check current log level in log4j.properties file

Check the log4j.rootLogger property value in the log4j.properties file.

Command:

grep  rootLogger /etc/ambari-server/conf/log4j.properties


In the above picture the rootLogger value is shown as INFO,file; we need to change it to DEBUG,file.

INFO is the default log level in the Ambari server.

We can also check the ambari-server.log file for the log level.

Command :

tail -f /var/log/ambari-server/ambari-server.log



2) Stop ambari-server

Command:

ambari-server stop

OR

service ambari-server stop


3) Modify log4j.properties file

Update the log4j.rootLogger property value to DEBUG,file using the vi editor.

Command :

vi /etc/ambari-server/conf/log4j.properties

4) Start ambari-server 


Command:

ambari-server start

OR

service ambari-server start



5) Check the DEBUG logs in the ambari-server.log file.

Command :

tail -f /var/log/ambari-server/ambari-server.log




6) Revert the log level to INFO

Please revert the log level to INFO once the debug logs have been collected, using the same steps. Debug logs take up a lot of space and can sometimes even cause service failures.


Modifying Ambari server configuration properties

In this article, we will learn how to modify Ambari server configuration properties. The Ambari server keeps most of its configuration properties in the ambari.properties file, which is present under the /etc/ambari-server/conf folder.

As an example, we will modify the PID directory property (pid.dir) in the ambari.properties file. The pid.dir property contains the directory path where the Ambari server PID file is stored. Its default value is /var/run/ambari-server.

1) Check the current value of the pid.dir property using the grep command.

grep pid.dir /etc/ambari-server/conf/ambari.properties


2) Stop the Ambari server.

If you are not running the command as the root user, you may have to prepend sudo.

Command:

ambari-server stop

OR

service ambari-server stop


3) Modify the property pid.dir using vi editor.

The picture below shows the pid.dir property pointing to /var/log/ambari-server.

Command:

vi /etc/ambari-server/conf/ambari.properties


4) Start ambari server

Command:

ambari-server start

OR

service ambari-server start

The picture below also shows the PID directory now pointing to the /var/log/ambari-server path.


You can also run the grep command from the first step to check the latest value of pid.dir.



Starting HDFS, MAPREDUCE2 and YARN processes manually

We start the HDFS, MapReduce2 and YARN services using either Cloudera Manager or Apache Ambari if we use CDH or HDP. Many times Cloudera Manager and Apache Ambari do not display the complete error message when services fail to start. We can go to the log directories and search for startup errors, but that is not easy either, as the log files are huge.

One easy way to find startup errors is to start the processes manually from the command line, so that errors show up directly on the console.

In this article we will learn how to start the HDFS, MapReduce2 and YARN processes manually.

HDFS (Hadoop Distributed File System)

HDFS has the datanode, namenode, ZooKeeper failover controller and journalnode processes.
We will learn how to start each of them manually.


Starting datanode manually

We can start the datanode manually using the hdfs datanode command. This needs to be run as the hdfs user.

Command:

hdfs datanode &
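
The user switch can be done inline; for example, if you are logged in as root:

sudo -u hdfs hdfs datanode &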


Starting Namenode manually

We can start the namenode manually using the hdfs namenode command. This also needs to be run as the hdfs user.

Command :

hdfs namenode &



Starting Zookeeper failover controller manually

We can start the ZooKeeper failover controller manually using the hdfs zkfc command.

Command :

hdfs zkfc &



Starting Journal node manually

We can start the journalnode manually using the hdfs journalnode command. This also needs to be run as the hdfs user.

Command :

hdfs journalnode &



MAPREDUCE2

MapReduce2 has the history server process. We can start it using the mapred historyserver command; it is better to run it as the mapred user.

Command :

mapred historyserver &



YARN (Yet Another Resource Negotiator)

YARN has the node manager, resource manager and app timeline server processes.

Starting resource manager manually

We can start the resource manager using the yarn resourcemanager command. It is recommended to run it as the yarn user.

Command :

yarn resourcemanager &



Starting the node manager manually

We can start the node manager manually using the yarn nodemanager command. This command also needs to be run as the yarn user.

Command :

yarn nodemanager &



Starting the app timeline server manually

We can start the app timeline server manually using the yarn timelineserver command. This command also needs to be run as the yarn user.

Command :

yarn timelineserver &





These commands are useful only for troubleshooting startup errors; Apache Ambari and Cloudera Manager may not recognize your services if you start them manually using the above commands.

Happy Hadooping !!!!