
How to clear Hortonworks administrator (HDPCA) certification?

In Hadoop administration, we have three certifications, provided by MapR (MCCA), Cloudera (CCA), and Hortonworks (HDPCA). I have cleared the Hortonworks administrator certification (HDPCA) and have also helped many of my friends clear the HDPCA exam. In that process, I learned many things about this exam.

I would like to share a few tips on how to clear the HDPCA exam. These tips may help you clear the certification.

The exam gives you 2 hours to resolve practical tasks on an HDP 2.3 cluster. The result does not include a score or a detailed explanation of why you failed; it only tells you whether you passed or failed. So it is better to follow the few tips mentioned below.

Start early

Try to come online at least 30 minutes before your actual exam time. Hortonworks will send you the exam link 15 minutes before the exam; you need to click the link for further instructions. Once you click the link, your examiner will join in a WebEx-like environment and complete pre-checks for the exam. Pre-checks include verifying the candidate's photo identity card and checking the room environment. Coming online at least 30 minutes early gives you time to complete all these formalities.


The Hortonworks-provided cluster can be slow, so use your time wisely. The certification has both GUI-based and terminal-based tasks. If the web UI is taking time to load, switch to the terminal and work on another task.


Due to cluster slowness, you may not be able to complete all tasks in time. Talk to your examiner about the slowness; he or she may give you more time to complete your remaining tasks.

Verify answers

If you get a fast cluster, you may even complete the tasks well before the exam time is up. It is a good idea to revisit the tasks and verify how you resolved them.

Read tasks carefully

It is very important to read tasks carefully. Every task contains complete information on what is expected from you. For example, a task may state that you need to install a NodeManager on the host master1. If you do not read the task carefully, you might install the NodeManager on the host master2.

Go to master node first 

When your exam starts, you will be given gateway node access, along with credentials for all the hosts in the cluster. Do not forget to SSH to the master node of the cluster. Even though this is obvious to many people, we sometimes forget it: I myself forgot to SSH to the first master node and was trying to resolve some tasks on the gateway node itself.

Try mock exam on AWS

The mock exam on AWS is almost the same as the original exam. It is recommended to try the mock exam before taking the actual one. The mock exam gives you real exam experience and the confidence to attempt the real exam.

The following is the syllabus for the HDPCA exam, along with tutorials.


Configure a local HDP repository

Install ambari-server and ambari-agent
Ambari server installation
Ambari agent installation

Install HDP using the Ambari install wizard

Add a new node to an existing cluster

Decommission a node
Decommissioning of node managers
Decommissioning of data nodes

Add an HDP service to a cluster using Ambari


Define and deploy a rack topology script

Change the configuration of a service using Ambari

Modifying configurations to enable or disable ACLs

Configure the Capacity Scheduler

Create a home directory for a user and configure permissions

Configure the include and exclude DataNode files

Decommissioning of node managers
Decommissioning of data nodes


Restart an HDP service

View an application’s log file

Configure and manage alerts

Troubleshoot a failed job

Checking versions of Hadoop ecosystem tools

High Availability

Configure NameNode HA

Configure ResourceManager HA

Copy data between two clusters using distcp

Create a snapshot of an HDFS directory

Recover a snapshot

Configure HiveServer2 HA


Install and configure Knox

Introduction to Knox
Installing and configuring Knox
WebHDFS REST API with Knox and without Knox

Install and configure Ranger

Install and configure Ranger
Creating HDFS policies in Ranger

Configure HDFS ACLs

Importing data from RDBMS to Hadoop using Apache Sqoop

In this article, we will learn how to import data from an RDBMS to HDFS using Apache Sqoop. We will import data from Postgres to HDFS.

1) Check that Postgres is running.

We will use the service command to check the status of Postgres. The command below checks its status.

service postgresql status

2) List databases.

We will connect to Postgres and check the list of databases available. We can use the psql command to start the Postgres prompt, and the \list command to display the list of databases.
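For example, the following starts a psql session as the postgres user and lists the databases (a minimal sketch; the exact connection options depend on your Postgres setup):

sudo -u postgres psql
\list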

3) List tables in a database.

In this step, we will connect to one database and check the tables in it. The \c command connects to a database in Postgres.

We can use the \dt command to display the tables in the current database.
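For example, to connect to the hive metastore database used in the next step and list its tables:

\c hive
\dt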

4) Check the data in the table.

We will select one table and check its data before importing. We will import the data from a table called TBLS in the hive database; we can check the data in the TBLS table using the select query below.

select * from public."TBLS";

5) Importing data into HDFS using sqoop import command.  

We use the sqoop import command to transfer data from the RDBMS to HDFS. We need to use the sqoop import options below.

--connect : Takes the JDBC connection string of the RDBMS.

Syntax : jdbc:<RDBMS-name>://<database-hostname>:<database-port>/<database-name>

RDBMS-name : The name of the RDBMS. For Oracle we specify oracle, and for Postgres we specify postgresql.

database-hostname : The hostname or IP address where the RDBMS is running.

database-port : The port number on which the RDBMS is listening. Postgres's default port is 5432.

database-name : The database on the RDBMS from which we want to import the data.

--table : The table from which we want to import the data.

--username : The database user name.

--password : The database password.

--target-dir : The HDFS directory into which the data will be imported.

--m : The number of parallel copies (map tasks) used to transfer the data from the RDBMS to HDFS.

We will run the command below to transfer the data from the Postgres table TBLS to the HDFS directory /user/hdfs/sqoopout/tbls.

sqoop import --connect jdbc:postgresql://<database-hostname>/<database-name> --table TBLS --username postgres --password postgres --target-dir /user/hdfs/sqoopout/tbls -m 1

6) Check data in HDFS folder.

Now we will check the data in the HDFS directory to ensure the data was transferred successfully.
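For example (a sketch, assuming -m 1 so Sqoop writes a single part file named part-m-00000):

hdfs dfs -ls /user/hdfs/sqoopout/tbls
hdfs dfs -cat /user/hdfs/sqoopout/tbls/part-m-00000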

7) Output directory already exists error.

The directory specified in the --target-dir option is created by the sqoop import command. If the directory already exists in HDFS, the sqoop import command fails with an "Output directory already exists" error.

8) Delete the directory if it already exists in HDFS

We can use the --delete-target-dir option to import data into an HDFS directory even if the directory already exists. This option removes the existing directory and creates it again. We need to be extra careful while using it, as existing data will be removed.
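For example, re-running the earlier import with --delete-target-dir (a sketch, with the same placeholders as before):

sqoop import --connect jdbc:postgresql://<database-hostname>/<database-name> --table TBLS --username postgres --password postgres --delete-target-dir --target-dir /user/hdfs/sqoopout/tbls -m 1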

Apache Sqoop simplifies bi-directional data transfer between RDBMS systems and Apache Hadoop.
Let me know if you need any help with the above commands.

Enabling NameNode HA using Apache Ambari

In this article, we will learn how to enable high availability (HA) for the NameNode. NameNode HA uses more than one NameNode: one NameNode is active and is responsible for serving user requests, while the others are in standby mode. Standby NameNodes continuously read the active NameNode's metadata to stay in sync with it. If the active NameNode goes down, one of the standby NameNodes becomes active and serves user requests, without failing running jobs.

1) Confirm no HA is enabled for the NameNode

By default, a Hortonworks Data Platform setup includes a NameNode and a Secondary NameNode in the HDFS service. In this scenario, if the NameNode goes down, the entire cluster goes down and running jobs fail. To address these issues, we need to enable high availability (HA) for the NameNode.

With NameNode HA, failover to another NameNode happens automatically, avoiding cluster-down scenarios.

The picture below confirms we have a NameNode and a Secondary NameNode in the cluster.

We need to enable NameNode HA to have an active NameNode and a standby NameNode.

2) Click Enable NameNode HA under Service Actions

Click Enable NameNode HA to enable HA for the NameNode. This opens the NameNode HA wizard.

Go to HDFS -----> click Service Actions -----> click Enable NameNode HA

3) Getting started

We need to enter a nameservice ID in the first step. The nameservice ID resolves to the active NameNode automatically.

All Hadoop clients should use the nameservice ID rather than hard-coding the active NameNode.
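For example, if the nameservice ID is mycluster (a hypothetical value; use the ID you entered in the wizard), clients can address HDFS like this instead of naming a specific NameNode host:

hdfs dfs -ls hdfs://mycluster/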

4)  Select Hosts

The NameNode HA wizard will install:

  •     an additional NameNode
  •     3 JournalNodes
  •     2 ZooKeeper failover controllers

In this step, we need to select hosts for the additional NameNode and the JournalNodes.

5) Review

This step provides complete information about what the wizard is going to install and which configurations it is going to add or modify.

We can go back and modify things at this step if we want. Click Next to go to the next step.

6) Create checkpoint

In this step, the wizard asks us to do two things:

  •     put the NameNode into safe mode
  •     create a checkpoint for the NameNode

**** Please note that we need to run the given commands only on the specified node.
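The wizard displays the exact commands to run; they are typically of the following form, run as the hdfs user on the node the wizard names (a sketch based on the standard HDFS commands, not the wizard's verbatim output):

sudo su hdfs -l -c 'hdfs dfsadmin -safemode enter'
sudo su hdfs -l -c 'hdfs dfsadmin -saveNamespace'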

Once these commands have run successfully, the Next button will be enabled.

7) Configure components

This step performs the following:
  •     Stops all services
  •     Installs the additional NameNode on the specified host
  •     Installs JournalNodes on the specified hosts
  •     Modifies configurations with the properties required for NameNode HA
  •     Starts the JournalNodes
  •     Disables the Secondary NameNode

Click Next once all these operations are completed.

8) Initialize journal nodes

This step asks us to run the initializeSharedEdits command on the first master node.

Once the command has run on the specified node, click Next.
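The command is of this form, run as the hdfs user on the node the wizard specifies (a sketch; the wizard shows the exact command):

sudo su hdfs -l -c 'hdfs namenode -initializeSharedEdits'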

9) Start components

This step performs two things.

  •     Starts the ZooKeeper servers
  •     Starts the NameNode

Click Next once the two operations are completed.

10) Initialize metadata

This step asks us to run two commands on two master nodes.

  •     We need to run the formatZK command on the first master.
  •     We need to run the bootstrapStandby command on the second master.

**** Please note that we need to run the given commands only on the specified nodes.
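These commands typically take the following form, run as the hdfs user (a sketch; the wizard shows the exact commands and hosts):

sudo su hdfs -l -c 'hdfs zkfc -formatZK'                # on the first master
sudo su hdfs -l -c 'hdfs namenode -bootstrapStandby'    # on the second master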

11) Finalize HA setup

This step performs the following:

  •   Starts the additional NameNode on the specified node
  •   Installs ZooKeeper failover controllers on the two master nodes
  •   Starts the ZooKeeper failover controllers on the two master nodes
  •   Configures AMS
  •   Deletes the Secondary NameNode, as it is not required with NameNode HA
  •   Stops HDFS
  •   Starts all services

Click Done once the above operations are completed.

12) Confirm NameNode HA is enabled

Apache Ambari reloads automatically after enabling HA for the NameNode and displays the active NameNode, the standby NameNode, the JournalNodes, and the ZooKeeper failover controllers.

The picture below shows all of them.

Let me know if you are stuck anywhere while enabling HA for the NameNode.

Enabling Resource Manager HA using Ambari

In this article, we will learn how to enable Resource Manager (RM) high availability (HA) using Apache Ambari. With Resource Manager HA, the Hadoop cluster has two or more Resource Managers: one is active and the others are standby.

The active Resource Manager is responsible for serving user requests. If it goes down, one of the standby Resource Managers becomes active and serves user requests without failing running jobs.

By default, HDP comes with a single Resource Manager without HA. We need to install one more Resource Manager and modify/add properties to enable Resource Manager HA.

1) Confirm no HA is enabled

Go to the Ambari home page ---> click on YARN ---> Summary

If no HA is enabled for the Resource Manager, you will see only one Resource Manager, as shown in the picture below.

2) Click Enable RM HA

Click on Enable ResourceManager HA under Service Actions to initiate enabling RM HA.

Go to the Ambari home page -------> click YARN -------> click Service Actions -------> click Enable ResourceManager HA

The picture below shows the Enable ResourceManager HA option.

3) Getting started

Ambari opens a new HA wizard that walks us through enabling RM HA. We need downtime for the cluster to enable HA; 1 hour of downtime is recommended, and depending on the cluster size you can plan for more.

This is an information step; read it and click Next.

4) Select Host

We need one more Resource Manager for HA. In this step we select a node on which to install it. Click Next once a node is selected.

In the picture below, I have selected node master2 to install the additional Resource Manager.

5) Review

This is a review step. We can go back to the previous step if we want to modify anything. The picture below shows that the additional Resource Manager is going to be installed on the master2 node.

We can go back and change that node to something else if we want. Otherwise, just click Next.

Some new properties need to be added to or modified in YARN to enable Resource Manager HA. This step shows you all the properties to be modified/added.

6) Configure components

This step performs 5 operations:

  •     Ambari stops all required services to enable HA
  •     Installs the new Resource Manager on the selected node
  •     Adds/modifies YARN configurations for HA
  •     Adds/modifies HDFS configurations for RM HA
  •     Starts all services

You can click on the operations to see their logs.

Once all operations are completed, click Next.

7) Confirm active and standby

Ambari reloads after completing RM HA, and we can see two Resource Managers: one will be active and the other will be standby.

The picture below shows the two Resource Managers, confirming RM HA is enabled.
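You can also confirm the state of each Resource Manager from the command line. A minimal sketch, assuming the default RM IDs rm1 and rm2 that Ambari sets in yarn-site.xml:

yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2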

Let me know if you have any questions about enabling RM HA using Ambari.

Log files in the Hadoop ecosystem

In this article, we will learn how to check the log files of Hadoop daemons and how to read the log files of applications and jobs.

1)  Locate log directory in Apache Ambari

First, we need to know the log directory for Hadoop daemons. We can use Apache Ambari to find the log directory for a service. The default log directory for any service is /var/log/. Many companies do not use the default log directory, so it is better to look up the log directory using either Ambari or Cloudera Manager. We can even use Unix commands to find log directories.

The picture below shows log directory for HDFS service.

Click on HDFS  ----> Configs -------> type log in filter box.

The picture below shows how to locate the log directory for Apache Oozie using the Unix grep command.
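For example, one approach is to grep the service's environment file for its log directory setting (a sketch; the exact file and variable name vary by service):

grep -i log /etc/oozie/conf/oozie-env.sh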

2)  Types of log files

Log directories contain three types of files.


.log files: the logs of a running daemon are written to its .log file.

The picture below shows the logs of the running active NameNode. The tail -f command is used to follow the live logs of a daemon.
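For example, to follow the NameNode log on an HDP cluster (a sketch; the file name includes the user and hostname, so the exact path will differ on your cluster):

tail -f /var/log/hadoop/hdfs/hadoop-hdfs-namenode-master1.log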


.out files: the .out file contains the startup messages of a daemon. These messages are useful for troubleshooting startup failures.

The picture below shows the startup messages of the active Resource Manager.


Old log files have a date in their name. By default, log rotation is daily, so we will see one log file per day.

The picture below shows that old Ranger log files have a date in their name.

3) Command for applications logs

We have seen how to check the logs of Hadoop daemons; now we will learn how to check the logs of an application.

We need the application ID to check the logs of an application.

The picture below shows how to get logs of an application id application_1513741463894_0007.

Command used : yarn logs -applicationId application_1513741463894_0007

4) Command for job logs.

We can also check the logs of a Hadoop job if we have the job ID.

The picture below shows how to get the job logs for the job ID job_1513741463894_0001.

Command used : mapred job -logs [job_id]

5) Logs from Resource manager UI.

We can also get application logs from the Resource Manager UI. Go to the Resource Manager UI and click on the application ID whose logs you want to check.

The picture below shows logs link in Resource manager UI for application application_1513741463894_0007.

Let me know if you have any questions on how to check log files for any service.

Enabling rack awareness for Hadoop cluster

In this article, we will learn how to enable rack awareness in Hadoop clusters. Assume the cluster has a large number of nodes placed in more than one rack. If we enable rack awareness, all replicas of a block will not be stored in a single rack, so at least one replica of the block remains available for data processing in case of a rack failure.

The goal of rack awareness is to improve data availability and reduce cross-rack network bandwidth usage.

1) Enabling rack awareness without Apache Ambari.

In old versions of HDP, we used to enable rack awareness manually. The latest versions of Apache Ambari support rack awareness in the GUI.

Check the link on how to enable rack awareness manually. You will not usually need this, as most of the latest versions of Apache Ambari support it in the GUI.

2) Enabling rack awareness using Apache Ambari

Now we will see how to enable rack awareness using Apache Ambari. We have a five-node cluster, and by default all nodes are in default-rack.

Now we will modify rack for datanode3.

Go to Hosts in Ambari -----> click on the host whose rack you want to modify -----> go to Host Actions -----> click Set Rack

Modify the rack name to rack-1 and click OK.

Go back to the Hosts page in Ambari to see that the rack name for datanode3 has changed.

You can see that the nodes are now placed in two different racks: default-rack and rack-1.

3) Confirm rack awareness enabled

We can also confirm this with the fsck command and the hdfs dfsadmin -report command.

The picture below is the output of the command hdfs fsck /, and it shows that the number of racks is 2.
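For example (a sketch; run as the hdfs user), both commands report rack information:

hdfs fsck /
hdfs dfsadmin -report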

Let me know if you have any questions on above article.

Creating and configuring a home directory for a user in HDFS

In this article, we will learn how to create a home directory in HDFS for a new user.

Every user who wants to access HDFS should have a home directory there. Some Hadoop jobs use the user's home directory to store intermediate/temporary data, and jobs will fail if the user has no home directory.
On the local file system, a user's home directory is created under /home; on HDFS, it is created under /user.

1) Create a user on local file system 

First, we need to create a user on the local file system (i.e., the operating system) using the useradd command. The user should be created on all nodes in the cluster.

The picture below shows that the new user nirupam is created and that nirupam's home directory on the local file system is /home/nirupam.

By default, the user does not have a password; you can set one using the passwd command if you want.
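A sketch of this step (run as root on every node in the cluster):

useradd nirupam
passwd nirupam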

2) Create a directory in HDFS for new user.

We need to create a directory under /user in HDFS for the new user. This directory needs to be created as the hdfs user, since hdfs is the superuser of the Hadoop cluster.

The picture below shows that a new directory has been created for the nirupam user under the /user directory in HDFS.
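A minimal sketch of this step:

sudo -u hdfs hdfs dfs -mkdir /user/nirupam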

3) Check the owner 

As the new directory was created by the hdfs user, hdfs will be its owner. We need to change the owner of this directory to the new user.

The picture below shows the owner of the /user/nirupam directory in HDFS.

4) Change the owner

Change the owner of the new HDFS directory to the new user created on the local file system. The chown command can be used to change the owner.

The picture below shows changing the owner of the HDFS directory /user/nirupam from the hdfs user to the nirupam user.
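A sketch of the chown step (the group name nirupam is an assumption; use whatever group fits your setup):

sudo -u hdfs hdfs dfs -chown nirupam:nirupam /user/nirupam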

5) Change the permissions 

We need to change the permissions of this newly created directory so that no users other than the owner have read, write, and execute permissions.

The picture below modifies the permissions of the /user/nirupam directory to 700 so that only the owner has read, write, and execute permissions.
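A sketch of the permissions step:

sudo -u hdfs hdfs dfs -chmod 700 /user/nirupam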

6) Test the user's HDFS home directory.

We have successfully created a home directory in HDFS for the new user. Now we need to test it.

We will try to upload a file to HDFS without specifying a destination directory. The file is uploaded to the user's home directory if no destination is specified.

The picture below shows that the new file is uploaded to nirupam's home directory, as no destination directory was specified.
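A sketch of the test (test.txt is a hypothetical file name; "." resolves to the user's HDFS home directory):

su - nirupam
hdfs dfs -put test.txt .
hdfs dfs -ls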

Let me know if you have any questions.

Decommissioning of Node manager in Hadoop cluster

In this article, we will learn how to decommission node managers in Hadoop clusters.

The decommissioning process ensures that running jobs are moved to different node managers without failing them.

1) Check Ambari UI

If you are using HDP (Hortonworks Data Platform), you can check the Ambari UI to see how many node managers are present in your cluster.

The picture below shows that the cluster has 3 node managers. We would like to decommission one of them.

2) Check yarn.resourcemanager.nodes.exclude-path property 

The cluster should have the yarn.resourcemanager.nodes.exclude-path property in the yarn-site.xml file. If the property is not present, we should add it.
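The property points to the exclude file used in the next step. A sketch of the yarn-site.xml entry, assuming the exclude file path used below:

<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <value>/etc/hadoop/conf/yarn.exclude</value>
</property>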

3) Update exclude file

Update the /etc/hadoop/conf/yarn.exclude file with the hostname of the node whose node manager you want to decommission.

I have updated the file with the master2 hostname to decommission the node manager on the master2 node.
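A sketch of this step (master2 is the hostname from this example):

echo "master2" >> /etc/hadoop/conf/yarn.exclude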

4) Run refreshNodes command

Run the yarn rmadmin -refreshNodes command to initiate decommissioning of the node managers.
This command needs to be run as the yarn user.

The picture below shows the refreshNodes command being run.
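For example, running it as the yarn user:

sudo -u yarn yarn rmadmin -refreshNodes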

5) Check Ambari UI 

Log into the Ambari GUI and click on the YARN service to check the decommissioned node managers.

The picture below shows that 1 node manager is decommissioned; I have highlighted it in yellow.

Troubleshooting:

If decommissioning of node managers is not working, check the logs of the node manager you are decommissioning, the logs of the active resource manager, and also the logs of the active NameNode.