
HDFS setfacl and getfacl commands examples

In this article, we will learn the setfacl and getfacl commands in HDFS.


1) The chmod command cannot provide advanced permissions in HDFS.

The following are some use cases that chmod alone cannot handle.



  • Providing more or fewer permissions to one user within a group.

  • Providing more or fewer permissions to a specific user.



2) The ACL (Access Control List) commands setfacl and getfacl provide advanced permissions in HDFS.


3) ACLs in HDFS are disabled by default. We need to enable them by setting the property below to true.

dfs.namenode.acls.enabled

Check how to enable ACLs in Ambari.

4) The setfacl command is used to set advanced permissions in HDFS. The getfacl command is used to check the ACLs set on a file or directory in HDFS.

Type the commands below to see their usage.

hdfs dfs -setfacl

hdfs dfs -getfacl

The pictures below show the usage of these commands.






5) The getfacl command displays the ACLs set on an HDFS file or directory. The -R option displays the ACLs of a directory and all of its sub-directories and files recursively.

Example :

hdfs dfs -getfacl /data

The picture below shows the usage of the getfacl command.
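
For reference, getfacl output generally has the following shape (the owner, group and permissions shown here are just an example):

# file: /data
# owner: hdfs
# group: hdfs
user::rwx
group::r-x
other::r-x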





6) The -m option of the setfacl command modifies the ACLs of an HDFS file or directory. We can use it to add new ACL entries or modify existing ones.

For example :

Suppose the /data directory gives only read access to group members. The setfacl -m option can grant write permission to a single member of that group (hive).

The picture below shows how to use the -m option.
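
As a minimal sketch of this scenario, the commands below add an ACL entry granting the hive user read and write access on /data and then verify it:

hdfs dfs -setfacl -m user:hive:rw- /data
hdfs dfs -getfacl /data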





7) The default keyword defines default ACLs on a directory. Any sub-directories created under that directory in the future automatically receive the default ACLs.

Example :

hdfs dfs -setfacl -m default:user:sqoop:rwx /data

The picture below shows that a newly created sub-directory under /data automatically gets the default ACLs.
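
As a quick check (the sub-directory name below is just an example), a directory created under /data inherits the default entry, so its ACL should include user:sqoop:rwx:

hdfs dfs -mkdir /data/newdir
hdfs dfs -getfacl /data/newdir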



8) A + symbol in the ls command output indicates that a file or directory has an ACL defined on it.

The picture below shows the plus symbol on the /data directory, since /data has ACLs defined on it.
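
For illustration, an ACL-enabled directory shows up in a listing like this (the size and timestamp are made up):

hdfs dfs -ls /
drwxrwxr-x+  - hdfs hdfs          0 2017-09-02 13:30 /data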



9) The -k option of the setfacl command removes the default ACLs.

Example :

hdfs dfs -setfacl -k /data

The picture below shows how to remove default ACLs on /data directory in HDFS.




10) The -b option of the setfacl command removes all ACL entries except the base (user, group and other) entries.

Example :

hdfs dfs -setfacl -b /data

The picture below shows how to retain only the base ACLs using the -b option.





11) The -x option of the setfacl command removes the specified ACL entries.

Example :

hdfs dfs -setfacl -x user:hive /data

The picture below shows the removal of the hive user's permissions on the /data directory.



12) The --set option of the setfacl command replaces all existing ACL entries with the new entries specified. The base entries for user, group and other must be included in the new set.
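
For example, the command below (a sketch; adjust the entries to your needs) replaces the ACL on /data with a fresh set of entries, including the required base entries:

hdfs dfs -setfacl --set user::rwx,user:hive:rw-,group::r-x,other::r-- /data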



Limitations

1) ACLs are not allowed on snapshot directories.

2) Only 32 ACL entries per file are allowed as of now.

3) ACL information is maintained in memory by the NameNode, so a large number of ACL entries increases the load on the NameNode.


Exploring snapshots in HDFS

An HDFS snapshot is a saved, point-in-time copy of an existing directory. Snapshots are useful for restoring corrupted or deleted data. In this article we will learn how to manage HDFS snapshots.

Practice the commands below to get a practical understanding of HDFS snapshots.


1) Create a local file with sample numbers.
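
For example, the following commands create a local file called numbers containing the values 1 to 10 (any sample data works):

seq 1 10 > numbers
cat numbers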



2) Create a folder on HDFS and upload the local file to it

The following commands create a folder called numbers under the HDFS directory /user/hdfs and upload the local file numbers into it.

hdfs dfs -mkdir /user/hdfs/numbers
hdfs dfs -ls /user/hdfs/numbers
hdfs dfs -put numbers /user/hdfs/numbers







3) Try to create a snapshot on an HDFS directory

Snapshots cannot be created on a directory directly; we first need to enable snapshots on (make snapshottable) the directory.

A "Directory is not a snapshottable directory" error is thrown if snapshots are not enabled.

hdfs dfs -createSnapshot /user/hdfs/numbers





4) Allow snapshots and create snapshots

The allowSnapshot command enables snapshots on an HDFS directory.

The following commands first enable snapshots on /user/hdfs/numbers and then create a snapshot of it.


hdfs dfsadmin -allowSnapshot /user/hdfs/numbers
hdfs dfs -createSnapshot /user/hdfs/numbers


5) List snapshots using the ls command

We can list the snapshots of a directory using the ls command. Snapshots are stored under the directory's .snapshot sub-directory.

 hdfs dfs -ls /user/hdfs/numbers/.snapshot
  hdfs dfs -ls /user/hdfs/numbers/.snapshot/s20170902-133455.787

The picture below shows that the HDFS directory /user/hdfs/numbers has a file called numbers, which is also saved in the snapshot directory /user/hdfs/numbers/.snapshot/s20170902-133455.787.

If the numbers file in /user/hdfs/numbers is corrupted, we can restore it from the /user/hdfs/numbers/.snapshot/s20170902-133455.787 directory.




6) List all snapshottable directories in HDFS

The lsSnapshottableDir command lists all HDFS directories that have snapshots enabled.

hdfs lsSnapshottableDir




7) Create a snapshot with a specific name

By default, snapshots are created with a timestamp as the snapshot name. We can also give a snapshot a name at creation time.

The command below creates a snapshot called secondSS on the HDFS directory /user/hdfs/numbers.

hdfs dfs -createSnapshot /user/hdfs/numbers secondSS




8) Delete a file from the HDFS folder

The command below deletes the numbers file from /user/hdfs/numbers so that we can then restore it from a snapshot.

hdfs dfs -rm /user/hdfs/numbers/numbers



9) Restore a file from a snapshot

Files are restored from a snapshot using the HDFS cp command.

 hdfs dfs -cp /user/hdfs/numbers/.snapshot/secondSS/numbers /user/hdfs/numbers
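
The file should now be back in place, which can be confirmed with:

hdfs dfs -ls /user/hdfs/numbers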




10) Try to disable snapshots

We need to delete all snapshots before disabling snapshots on an HDFS directory.

hdfs dfsadmin -disallowSnapshot /user/hdfs/numbers
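
If the directory still has snapshots, the command fails with an error similar to:

disallowSnapshot: The directory /user/hdfs/numbers has snapshot(s). Please redo the operation after removing all the snapshots.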





11) Delete snapshots and disallow snapshots

The commands below delete the snapshot and then disable snapshots on the directory (every snapshot under .snapshot must be deleted first).

 hdfs dfs -deleteSnapshot /user/hdfs/numbers secondSS
 hdfs dfsadmin -disallowSnapshot /user/hdfs/numbers




12) Rename a snapshot

The renameSnapshot command is used to change the name of a snapshot.

  hdfs dfs -renameSnapshot /user/hdfs/numbers secondSS thirdSS



I hope this article helped you learn how to work with HDFS snapshots.

Happy Hadooping.

Enabling and disabling ACLs in HDFS


The ACL commands setfacl and getfacl provide advanced permission management in HDFS. In this article we will learn how to enable and disable ACLs in HDFS using Apache Ambari.

ACLs are disabled by default. We need to add or modify the dfs.namenode.acls.enabled property to enable ACLs in HDFS.

1)

Search the HDFS configs for the dfs.namenode.acls.enabled property in Ambari; you will get no results if the property is not defined yet.

Go to HDFS -> Configs -> enter dfs.namenode.acls.enabled in the filter box.





2)

We need to add the ACL property dfs.namenode.acls.enabled if it is not already present.

Go to HDFS -> Configs -> Advanced -> Custom hdfs-site -> Add Property -> enter dfs.namenode.acls.enabled=true -> click Add.





3)

Click the Save button and give the new configuration version a name, for example "acl property added".

4)

Restart the services that show the restart symbol.

The following picture shows the restart symbol for the HDFS, YARN and MapReduce2 services.





5)

Search for the property as in step 1 and confirm that it has been added.
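
You can also verify the effective value from the command line (assuming the client configuration on that host has been refreshed by Ambari):

hdfs getconf -confKey dfs.namenode.acls.enabled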

6)

If the property is already there, just change its value from false to true to enable ACLs in HDFS.

7)

If ACLs are already enabled, you can disable them by setting dfs.namenode.acls.enabled to false using the same steps above.



Creating HDFS policy in Ranger user interface


Apache Ranger is a policy-based security tool for Hadoop ecosystem components. Ranger provides security policies for tools like HDFS, YARN, Hive, Knox, HBase and Storm. In this article we will learn how to create an HDFS policy in the Apache Ranger UI.


1) Create a folder in HDFS.

We will create an HDFS directory /user/hdfs/ranger to test Ranger HDFS policies. We will create this directory as the hdfs user.

hdfs dfs -mkdir /user/hdfs/ranger
hdfs dfs -ls /user/hdfs/ranger






2) Try to access the same directory as the hive user.

If we try to access the directory /user/hdfs/ranger as the hive user, we get a permission denied error.

[hive@datanode1 ~]$  hdfs dfs -ls /user/hdfs/ranger
ls: Permission denied: user=hive, access=EXECUTE, inode="/user/hdfs/ranger":hdfs:hdfs:drwx------

We will provide the hive user access to the /user/hdfs/ranger directory using a Ranger policy.

3) Enable HDFS plugin in Ranger

If the HDFS plugin is not enabled, we need to enable it from Ambari.

Go to the Ambari UI -> click Ranger -> click Configs -> click Ranger Plugin -> set the HDFS plugin to On.




4) Define a new policy in Ranger UI

We will define a new policy in the Ranger UI to provide read, write and execute access to the hive user on the /user/hdfs/ranger directory.


Go to the Ranger UI -> click the HDFS plugin -> click Add Policy -> enter the policy details -> click Add.

The policy details are shown in the picture below.





5) Access the HDFS directory /user/hdfs/ranger as the hive user.

We can now test access to /user/hdfs/ranger as the hive user.


[hive@datanode1 ~]$ hdfs dfs -ls /user/hdfs/ranger


Even though the hive user does not have any HDFS permissions on /user/hdfs/ranger, it is still able to access the directory because of the HDFS policy defined in Ranger.

Similarly, Ranger provides centralized security policies for the other Hadoop tools.

HDFS REST API example with Knox gateway and without Knox gateway


In this article, we will learn how to use the HDFS REST API both with and without the Knox gateway.

Apache Knox is a security gateway that exposes a common REST API and hides the individual REST APIs of Hadoop ecosystem tools. Knox hides the REST API details of several technologies such as Hadoop, Hive, HBase and Oozie.


1) Check a folder's status in HDFS using the HDFS REST API.

In this step we will use the HDFS REST API directly. The command below checks the status of the HDFS directory /user/hdfs/restapitest.

 curl  "http://master2:50070/webhdfs/v1/user/hdfs/restapitest?user.name=hdfs&op=GETFILESTATUS"


master2 : hostname of the active NameNode.

50070 : HTTP port number of the active NameNode.

webhdfs : the name of the HDFS REST API (WebHDFS); this part of the URL is fixed.

v1 : the WebHDFS version number; it is also fixed.

GETFILESTATUS : returns file or directory information from HDFS.

user.name : the user on whose behalf the HDFS REST API request is submitted.
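
A successful GETFILESTATUS call returns a FileStatus JSON object. The response looks roughly like the following (the values shown are illustrative):

{"FileStatus":{"type":"DIRECTORY","owner":"hdfs","group":"hdfs","permission":"700","length":0, ... }}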


Problems :


  • Hadoop hostnames and port numbers are exposed to the outside world, which makes the HDFS cluster easier to attack.



2) Check the same folder status using the Knox REST API.

In this step we will check the status of the same HDFS directory /user/hdfs/restapitest through the Knox gateway.

The Apache Knox URL does not contain any details about the NameNode hostname and port number. It contains only the word webhdfs, the user name, the directory path and the operation being performed, as shown below.

curl -u admin:admin-password -i -v -k "https://datanode1:8442/gateway/default/webhdfs/v1/user/hdfs/restapitest?user.name=hdfs&doas=hdfs&op=GETFILESTATUS"


admin:admin-password : the default username and password for the default topology in Knox.
default : the topology name; Knox ships with a topology called default.
8442 : the Knox gateway port number, defined in the gateway.port property.
datanode1 : the hostname where the Knox gateway is installed.

Apache Knox determines the active NameNode hostname and port number from its topology. Knox ships with a topology called default, whose configuration is stored in the file /etc/knox/conf/topologies/default.xml.

The picture below shows the WebHDFS URLs stored in the default topology.
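
You can also inspect that entry directly from the command line; the output will look roughly like this (the hostname and port are illustrative):

grep -A 2 "WEBHDFS" /etc/knox/conf/topologies/default.xml
            <role>WEBHDFS</role>
            <url>http://master2:50070/webhdfs</url>
        </service>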





Advantages :


  • Hadoop service hostnames and port numbers are not exposed to the outside world, so the probability of external attacks is much lower.

I hope it is now clear how Knox protects the Hadoop ecosystem's REST APIs.

Starting and stopping Ambari agents

In this article we will learn how to work with Ambari agents. We will learn how to start, stop, restart and perform a few other operations on Ambari agents from the command line.
If you are not using the root user, prefix sudo to all commands listed in the steps below.

 1) Check the status of Ambari agents.

This command tells us whether the Ambari agent is running or not. If it is running, the command also gives us the PID of the Ambari agent process.

Command :

ambari-agent status
OR

sudo ambari-agent status

OR

service ambari-agent status



2) Stopping the Ambari agent

You can use any one of the following commands to stop the Ambari agent.

sudo ambari-agent stop

OR

ambari-agent stop

OR

service ambari-agent stop



If the Ambari agent is stopped, the Ambari server shows a heartbeat lost message for the node where the agent was stopped.


3) Starting the Ambari agent

You can use any one of the following commands to start the Ambari agent.

ambari-agent start

OR

sudo ambari-agent start

OR

service ambari-agent  start



Once the Ambari agent is started, the heartbeat issue for that node is resolved in the Ambari user interface.

4) Restarting Ambari agent

You can use any one of the following commands to restart the Ambari agent.

ambari-agent restart

OR

sudo ambari-agent restart

OR

service ambari-agent restart






5) Other options

The Ambari agent also supports a few more commands; you can list them all using the --help option.

ambari-agent --help



Starting and stopping Ambari-server

In this article we will learn how to work with the Ambari server. We will learn how to start, stop, restart and perform a few other operations on the Ambari server from the command line. If you are not using the root user, prefix sudo to all commands listed in the steps below.

1) Check the status of Ambari server.

You can use any one of the following commands to check the status of the Ambari server.

ambari-server status

OR

sudo ambari-server status

OR

service ambari-server status

This command tells us whether the Ambari server is running or not. If it is running, the command also gives us the PID of the Ambari server process.



2) Stop Ambari server

You can use any one of the following commands to stop the Ambari server.

sudo ambari-server stop

OR

ambari-server stop

OR

service ambari-server stop



3) Start Ambari server

You can use any one of the following commands to start the Ambari server.

ambari-server start

OR

sudo ambari-server start

OR

service ambari-server start



4) Restart Ambari server

You can use any one of the following commands to restart the Ambari server.

ambari-server restart

OR

sudo ambari-server restart

OR

service ambari-server restart




5) Skip database check

While starting, the Ambari server checks the consistency of its database. If the database has any issues, the Ambari server fails to start.

We can skip the database consistency check while starting the Ambari server.

Command:

 ambari-server start --skip-database-check




6) Other options 

The Ambari server comes with several other commands; you can use the --help option to list them all.

ambari-server --help



Every command also has its own options; we can use --help with a command to see all of its options.

Example :

ambari-server stop --help




How to create a Hive table for Parquet format data?

In this article we will learn how to create a Hive table for Parquet format data. We need to use stored as Parquet to create a Hive table for Parquet data.


1) Create a Hive table without a location.

We can create a Hive table for Parquet data without specifying a location and load data into that table later.

Command :

create table employee_parquet(name string, salary int, deptno int, DOJ date) row format delimited fields terminated by ',' stored as Parquet;




2) Load data into the Hive table.

We can use a regular insert query to load data into the Parquet table. Data is converted into the Parquet file format implicitly while loading.

 insert into table employee_parquet select * from employee;
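
To confirm that Parquet files were written, you can list the table's directory in HDFS (the path below assumes the default Hive warehouse location; adjust it for your cluster):

hdfs dfs -ls /apps/hive/warehouse/employee_parquet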



3) Create a Hive table with a location

We can also create a Hive table for Parquet data with a location. The specified location should contain Parquet format data.

Command :

create table employee_parquet(name string, salary int, deptno int, DOJ date) row format delimited fields terminated by ','
stored as parquet location '/data/in/employee_parquet';



How to create a Hive table for the RC file format?

In this article we will learn how to create a Hive table for RC file format data. We need to use stored as RCFILE to create a Hive table for RCFile data.


1) Create a Hive table without a location.

We can create a Hive table for RCFile data without specifying a location and load data into that table later.

Command :

create table employee_rc(name string, salary int, deptno int, DOJ date) row format delimited fields terminated by ',' stored as RCFILE;



2) Load data into the Hive table.

We can use a regular insert query to load data into the RC file format table. Data is converted into the RC file format implicitly while loading.

 insert into table employee_rc select * from employee;



3) Create a Hive table with a location

We can also create a Hive table for RC file data with a location. The specified location should contain RC file format data.

Command :

create table employee_rc(name string, salary int, deptno int, DOJ date) row format delimited fields terminated by ','
stored as RCFILE location '/data/in/employee_rc';


How to create a Hive table for sequence file format data?

In this article we will learn how to create a Hive table for sequence file format data. We need to use stored as SequenceFile to create a Hive table for sequence file data.


1) Create a Hive table without a location.

We can create a Hive table for sequence file data without specifying a location and load data into that table later.

Command :

create table employee_seq(name string, salary int, deptno int, DOJ date) row format delimited fields terminated by ',' stored as SequenceFile;



2) Load data into the Hive table.

We can use a normal insert query to load data into the sequence file format table. Data is converted into the sequence file format while loading.

 insert into table employee_seq select * from employee;



3) Create a Hive table with a location

We can also create a Hive table for sequence file data with a location. The specified location should contain sequence file format data.

Command :

create table employee_seq(name string, salary int, deptno int, DOJ date) row format delimited fields terminated by ','
stored as SequenceFile location '/data/in/employee_seq';