
Ambari server installation including Java and Postgres

In this article we will learn how to install Ambari server. Ambari server depends on Java and an RDBMS; by default, it installs Postgres to store Ambari data.
We will also see how to install Java and Postgres and how to configure Postgres for Ambari.

1)  Create Ambari repo

Create an Ambari repository by downloading the ambari.repo file.

Command:

wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.6.0.0/ambari.repo -O /etc/yum.repos.d/ambari.repo

This command works on CentOS/RHEL/Oracle Linux.

The picture below shows how to create the Ambari repo on CentOS/RHEL/Oracle Linux.




2)  Install the ambari-server and Postgres packages

After the first step, we will see a new repo named ambari-2.6.0.0.


We will install the ambari-server package using the yum install command. This command will also resolve the Postgres dependency and install it.
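
Command to install the ambari-server package:

yum install ambari-server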

The picture below shows ambari-server being installed along with its dependency Postgres.





3)  Ambari-server setup

Once ambari-server is installed, typing ambari-server without any arguments shows the command's options, which confirms that Ambari server was installed successfully.









After installing ambari-server with yum, we need to set up ambari-server to configure Java and Postgres.

We use the ambari-server setup command for this, and the command also configures prerequisites for Ambari and Hadoop.
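
Command:

ambari-server setup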

Disable SELinux

SELinux needs to be disabled for Hadoop installation. Ambari server checks whether it is disabled. If it is not, the setup command will disable SELinux after the user's confirmation.




Customize user account

By default ambari-server runs as the root user. If we want to change the root user to some other user, we can change it here, but it is recommended to use the root user.


JDK version

Choose the JDK version to be installed. Ambari setup shows all the available JDK versions from Oracle and also gives a custom JDK option to install another JDK such as OpenJDK. Apart from the JDK, Ambari server and Hadoop also require JCE; the ambari-server setup command installs it as well.

It is better to choose the latest JDK version from Oracle.




Accept the Oracle binary code license agreement.

Before installing the Oracle JDK, we need to accept the Oracle binary code license agreement.



Advanced DB configs

We can also configure Postgres as we wish, including custom users and schema. It is fine to go with the default database configuration, so we can simply answer no (n) at this step.




Ambari server setup complete

After all these steps, we can see that ambari-server was set up successfully.

The picture below shows ambari-server being set up successfully.




4) Start the ambari-server

Now we can start Ambari server using the ambari-server start command.
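
Command:

ambari-server start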

The picture below shows that ambari-server started successfully and is listening on port 8080, which is the default port for Ambari server.




5) Ambari server login screen from the browser

We can use the IP address of the host where Ambari server is installed, along with the port number, to access the Ambari GUI from a browser.
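
For example, with the default port, the URL looks like the following (replace the placeholder with your Ambari server's hostname or IP address):

http://<ambari-server-host>:8080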

The picture below shows the Ambari GUI in the browser.



Default login credentials for Ambari are admin/admin.


6) Install HDP by clicking Launch Install Wizard

We can log in to the Ambari GUI using the default credentials admin/admin.

Once we log in to Ambari, we are ready to create an HDP (Hortonworks Data Platform) cluster using Ambari.

We can click on Launch Install Wizard to create the HDP cluster.

The picture below shows Launch Install Wizard button.



In the next article we will see how to install HDP (Hortonworks Data Platform) using Ambari. Let me know in the comments if you have any questions about Ambari server installation.


Transferring data between Hadoop clusters using the distcp command

In this article, we will learn how to transfer data between two Hadoop clusters. The hadoop distcp command is used to transfer data between clusters.

One of the main use cases of the distcp command is to sync data between a production cluster and a backup/DR cluster. We will learn distcp with some examples.

1)

Connect to the source cluster and create a file called numbers under the /user/hdfs/distcptest directory in HDFS.

The picture below shows how to create a local file named numbers and how to upload it to the HDFS directory /user/hdfs/distcptest.
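
A minimal sketch of these steps (the file contents here are only illustrative):

echo "1 2 3 4 5" > numbers
hdfs dfs -mkdir -p /user/hdfs/distcptest
hdfs dfs -put numbers /user/hdfs/distcptest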





2)

The hadoop distcp command takes source paths and a destination path as its arguments.

source path syntax : 

 hdfs://[active-namenode-hostname]:[name-node-port-number]/path/to/hdfs/file


active-namenode-hostname :

We have to specify the active namenode hostname or IP address. If HA is not enabled, there is no active namenode; we can specify the namenode hostname or IP address directly.


Port number : We need to specify the RPC port number of the namenode. By default it is 8020.

In the same way, we need to specify the active namenode hostname or IP address and the RPC port number of the destination cluster.


We use the below source and destination paths.


Source path : hdfs://192.168.1.113:8020/user/hdfs/distcptest/numbers

Destination path : hdfs://192.168.1.115:8020/user/hdfs/target


3)

Create a directory called /user/hdfs/target in the destination cluster and run the ls command.

The picture below shows creating the /user/hdfs/target folder in the destination cluster.
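
The commands for this step look like the following:

hdfs dfs -mkdir -p /user/hdfs/target
hdfs dfs -ls /user/hdfs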




4)

The pictures below show how to run the distcp command and confirm that the file was transferred to the destination cluster.

Command run :

 hadoop distcp hdfs://192.168.1.113:8020/user/hdfs/distcptest/numbers hdfs://192.168.1.115:8020/user/hdfs/target






5)
Hardcoding the namenode IP address and port number is a bad idea, because if the active namenode at that IP address goes down, the hadoop distcp command fails.

We need to use the nameservice ID of the cluster instead.

The picture below shows how to use the nameservice ID in the distcp command. The active namenode IP address and port number are replaced with just the nameservice ID.

The picture also shows how to get the nameservice ID of a cluster.
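
As a sketch, assuming the source and destination nameservice IDs are sourcens and targetns (hypothetical names; the actual IDs can be read from the dfs.nameservices property on each cluster), the commands would look like this:

hdfs getconf -confKey dfs.nameservices
hadoop distcp hdfs://sourcens/user/hdfs/distcptest/numbers hdfs://targetns/user/hdfs/target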




6) Update or overwrite

What should we do if the destination cluster's target directory already has the same files?


We have two options to choose from in the above scenario: update or overwrite.

We can update the destination cluster's directory with the new files from the source cluster using the update option.

Or we can simply overwrite the destination cluster's files with the source cluster's files using the overwrite option.


7)  Update example

The following example shows how to update the destination cluster's directory with the new files in the source cluster's directory.

The source cluster's directory has a new file called numbersNew, so only the numbersNew file will be copied to the destination cluster's directory /user/hdfs/target.


After the copy, the destination cluster's target directory has the file numbersNew with a new timestamp.
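
A sketch of the command, using distcp's -update option with the same source and destination clusters as earlier (the exact command in the screenshots may differ slightly):

hadoop distcp -update hdfs://192.168.1.113:8020/user/hdfs/distcptest hdfs://192.168.1.115:8020/user/hdfs/target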





8) Overwrite Example 

The following pictures show how to use the overwrite option in the distcp command.
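
A sketch with the -overwrite option and the same paths as above:

hadoop distcp -overwrite hdfs://192.168.1.113:8020/user/hdfs/distcptest hdfs://192.168.1.115:8020/user/hdfs/target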






9) Multiple source files

If we need to copy multiple files from the source cluster to the destination cluster, we need to specify all the source files first and the target path last in the distcp command.

The picture below shows how to transfer multiple files from the source cluster to the destination cluster.

Source files are highlighted in the picture.
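
For example, using the two files created earlier in this article:

hadoop distcp hdfs://192.168.1.113:8020/user/hdfs/distcptest/numbers hdfs://192.168.1.113:8020/user/hdfs/distcptest/numbersNew hdfs://192.168.1.115:8020/user/hdfs/target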





10) Multiple source files with the -f option


If we have a larger number of source files, we can list all of them in a file and use that file in the distcp command.

The distcp command provides the -f option to point to an HDFS file that lists all the source paths.

The following picture shows how to use the -f option.
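
A minimal sketch, assuming a list file named srclist (a hypothetical name) that contains one full source path per line and has been uploaded to HDFS on the source cluster:

hdfs dfs -put srclist /user/hdfs/srclist
hadoop distcp -f hdfs://192.168.1.113:8020/user/hdfs/srclist hdfs://192.168.1.115:8020/user/hdfs/target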



Let me know if you have any questions on the distcp command in HDFS.



Connecting to a Hive database with and without dynamic service discovery

We will learn how to use dynamic service discovery in Hive and what the advantages of the dynamic service discovery feature are.

First we will see what issues we face if we do not use dynamic service discovery in Hive.


1) 

Hive provides two prompts to run Hive queries: the hive prompt and the beeline prompt. The hive prompt is deprecated, so we need to use the beeline prompt.

The picture below shows how to connect to the beeline prompt.
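
Command to launch the beeline prompt:

beeline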



2) 

Before running Hive queries, we need to establish a connection to the Hive database. We can establish a connection with dynamic service discovery or without dynamic service discovery.

First, we will see how to connect to the database without dynamic service discovery.


We need to know the below things to establish a connection to the Hive database.

HiveServer2 host name : The host name or IP address where HiveServer2 is running. Multiple nodes may be running HiveServer2 instances; use any one hostname or IP address.

Database name : The Hive database you want to connect to. We will see how to connect to the default database.

Port number : The HiveServer2 port number. The default value is 10000.

User name : The user name for the database. We are using hive here.

Password : The password for the database. We are using hive here.


The picture below shows how to connect to HiveServer2 without dynamic service discovery.

Connection string used : jdbc:hive2://master1:10000/default

jdbc:hive2:// is fixed for all connection strings.

User name and password used : hive and hive
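
Equivalently, the connection can be opened while launching beeline (a sketch using the same values):

beeline -u jdbc:hive2://master1:10000/default -n hive -p hive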



3) 

We will have multiple HiveServer2 instances running on the cluster. Assume we have connected to the database using the HiveServer2 instance running on the master1 host; if HiveServer2 on master1 goes down, Hive queries will fail.

The other problem is that we are also increasing the load on one HiveServer2 instance by hard coding it.


In the picture below, the first query was successful but the second query failed because HiveServer2 went down on master1.





4) 

Hive provides an advanced feature called dynamic service discovery to address the above problems.

With dynamic service discovery, rather than using a HiveServer2 host name directly, we use ZooKeeper to connect to the Hive database.

ZooKeeper will always resolve to an active HiveServer2 instance so that your queries never fail.

We need the below things to use dynamic service discovery.

The host names and port numbers where ZooKeeper is running, also called the ZooKeeper ensemble.
We can easily get this value from the hive.zookeeper.quorum property in Hive.

ZooKeeper's default port number is 2181. You can also get the ZooKeeper host names from the ZooKeeper configuration files.


Specify the service discovery mode using serviceDiscoveryMode=zooKeeper.


Specify the ZooKeeper namespace as hiveserver2. This is the value of the hive.server2.zookeeper.namespace property in Hive.


We are using the below connection string.

jdbc:hive2://datanode1:2181,master1:2181,master2:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
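
For example, the connection can be opened from beeline like this (a sketch; the quotes are needed because the connection string contains semicolons, and the user name and password hive/hive are the same as before):

beeline -u "jdbc:hive2://datanode1:2181,master1:2181,master2:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -n hive -p hive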


The picture below shows how to connect to the Hive database using dynamic service discovery.



5)

Many people make typos in the dynamic service discovery connection string. The easiest way to get the correct connection string is to take it from the Ambari GUI.

In Ambari ---> click Hive ---> click Summary ---> click the left arrow icon to copy the connection string.




Once the connection string is copied from Ambari, we can paste it directly into the terminal to avoid typos.

6)

Benefits of Dynamic Service Discovery.

High Availability  :

Dynamic service discovery always points user queries to an active HiveServer2 instance so that Hive queries never fail.


Load balancing  :

Dynamic service discovery assigns each new user connection to a HiveServer2 instance in round-robin fashion so that all HiveServer2 instances get roughly the same load.

It is always recommended to use dynamic service discovery in production to avoid HiveServer2 crashes.

Let me know if you have any questions.

Checking versions of Hadoop ecosystem tools


We need to know the versions of Hadoop technologies while troubleshooting Hadoop issues. This article shows how to check the versions of Hadoop ecosystem technologies.

1) Hadoop version

The Hadoop version can be found using the below command.

hadoop version




2) Hive version

The Hive version can be found using the below command.

hive --version



3) Pig version

The Pig version can be found using the below command.

pig -version




4) Sqoop version

The Sqoop version can be found using the below command.

sqoop version



5) Tez version 

The Tez version can be found using the below rpm command.

rpm -qa|grep -i tez



6) Zookeeper version

The Zookeeper version can be found using the below rpm command.

rpm -qa|grep -i zookeeper




7) Hortonworks Data Platform  (HDP) version 

The HDP version can be found using the below command.

hdp-select versions



8) Knox version

The Apache Knox version can be found using the below rpm command.

rpm -qa|grep -i knox



9) Ranger version

The Apache Ranger version can be found using the below rpm command.

rpm -qa|grep -i ranger

We can also check the version file for the Ranger version.



10) Checking all versions from the Ambari Web UI

Go to Admin ---> Service accounts and versions



Let me know if you want to know a particular Hadoop technology's version.


Ambari agent installation

In this article, we will learn how to install the Ambari agent on different operating systems.

1) Installing ambari agent

We use the yum command if the operating system is CentOS or RedHat, the zypper command if the operating system is SLES, and the apt-get command if the operating system is Ubuntu.

The commands below need to be run as the root user.

CentOS or RedHat :

yum install ambari-agent

SLES (SUSE Linux) :

zypper install ambari-agent

Ubuntu :

apt-get install ambari-agent


The picture below shows how to install ambari-agent on CentOS.






2) Modifying ambari-agent.ini file

We need to inform the ambari-agent of the Ambari server's hostname. The hostname property in the ambari-agent.ini file needs to be updated for this reason.
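
As a sketch, the relevant lines of the file (typically located at /etc/ambari-agent/conf/ambari-agent.ini) look like the following; replace the placeholder with your Ambari server's hostname:

[server]
hostname=<ambari-server-hostname>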

The picture below shows the hostname property in the ambari-agent.ini file.




3) Starting ambari agent

Now ambari-agent needs to be started with the ambari-agent start command.
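
Command:

ambari-agent start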





4) Check ambari agent status

Now confirm that ambari-agent is running using the ambari-agent status command.
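
Command:

ambari-agent status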



HDFS setfacl and getfacl command examples

In this article, we will learn the setfacl and getfacl commands in HDFS.


1) The chmod command cannot provide advanced permissions in HDFS.

The following are some use cases where chmod usage is not possible:



  • Providing less/more permissions to one user in a group.



  •  Providing less/more permissions to a specific user 



2) The ACL (Access Control List) commands setfacl and getfacl provide advanced permissions in HDFS.


3) ACLs in HDFS are disabled by default. We need to enable them by setting the below property to true.

dfs.namenode.acls.enabled

Check how to enable ACLs in Ambari.

4) The setfacl command is used to set advanced permissions in HDFS. The getfacl command is used to check the ACLs applied on a directory in HDFS.

Type the below commands to see their usage.

hdfs dfs -setfacl

hdfs dfs -getfacl

The pictures below show the commands' usage.






5) The getfacl command displays the ACLs available on an HDFS directory. The -R option displays the ACLs of a directory, all its sub-directories and all its files.

Example :

hdfs dfs -getfacl /data
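
And with the -R option, to include sub-directories and files:

hdfs dfs -getfacl -R /data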

The picture below shows the usage of the getfacl command.





6) The -m option of the setfacl command modifies the permissions of an HDFS directory or file. We can add or modify ACL entries on an existing file or directory.

For example :

The /data directory gives only read access to group members. The setfacl -m option can give write permission to one group member (hive).
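
A command along these lines would do that (a sketch; the exact ACL spec in the screenshot may differ):

hdfs dfs -setfacl -m user:hive:rw- /data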

The picture below shows how to use the -m option.





7) The default keyword defines default ACLs on a directory. If any sub-directories are created under that directory in the future, they will get the default ACLs automatically.

Example :

hdfs dfs -setfacl -m default:user:sqoop:rwx /data

The picture below shows that a newly created sub-directory under the /data directory gets the default ACLs automatically.



8) A + symbol in the ls command output indicates that a file or directory has an ACL defined on it.

The picture below shows the plus symbol on the /data directory, as /data has ACLs defined on it.



9) The -k option of the setfacl command removes default ACLs.

Example :

hdfs dfs -setfacl -k /data

The picture below shows how to remove the default ACLs on the /data directory in HDFS.




10) The -b option of the setfacl command removes all ACL entries except the base (user, group and others) entries.

Example :

hdfs dfs -setfacl -b /data

The picture below shows how to retain the base ACLs using the -b option.





11) The -x option of the setfacl command removes the specified ACL entries.

Example :

hdfs dfs -setfacl -x user:hive /data

The picture below shows removing user hive's permissions on the /data directory.



12) The --set option of the setfacl command replaces all existing ACLs with the new ACLs specified.
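
A minimal sketch (with --set, the ACL spec must also include the base user, group and others entries; the named entry for hive is only illustrative):

hdfs dfs -setfacl --set user::rwx,group::r--,other::---,user:hive:rw- /data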



Limitations

1) ACLs on snapshot directories are not allowed.

2) Only 32 ACL entries per file are allowed as of now.

3) ACL information is maintained in memory by the namenode, so a large number of ACLs will increase the load on the namenode.