Hadoop Lessons: 2018

In Hadoop administration, We have three certifications provided by MapR (MCCA), Cloudera (CCA) and Hortonworks (HDPCA).I have cleared Hortonworks administrator certification (HDPCA) and also I helped many of my friends to clear the Hortonworks administrator (HDPCA) exam . In that process , I have learned many things about this exam .

I would like to share few tips on how to clear HDPCA exam. May be these tips will help you in clearing the certification.

Exam will provide 2 hours of time to resolve practical tasks on a HDP 2.3 cluster. Exam result does not provide any score or detailed explanation why you have failed. Exam result just provides either you passed or failed. It is better to follow few tips mentioned below.

Start early

Try to come online for exam at least 30 minutes before your actual exam time. Hortonworks will send you exam link before 15 minutes , You need to click the link for further instructions about the exam. Once you click on the link , Your examiner will join in webex like environment and will complete pre-checks for the exam. Pre-checks will include checking candidate photo identity card and checking room environment. You need to come online at least 30 minutes before to complete all these formalities.

Cluster

Hortonworks provided cluster could be slow. Use your time wisely. Certification will have GUI base tasks and terminal based tasks. If web UI is taking time to load, Go to terminal to perform another task.

Time

Due to slow cluster , You may not be able to complete all tasks in time. Talk to your examiner about cluster slowness, He could give you some more time for completing your other tasks.

Verify answers

If you get fast cluster, You can even complete the tasks much before your total exam time . It is good idea to revisit tasks to verify how did you resolve your tasks.

Read tasks carefully

It is very important to read tasks carefully. Every task will complete information on what is expected from you. For example task will mention you that you need to install node manager on the host master1. if you do not read tasks carefully , you might install node manager on the host master2.

Go to master node first

When your exam starts , you will given gateway node access. You will also be given all credentials for many hosts in the cluster. Do not forget to ssh to master node of the cluster . Even though it is known thing for many people , some times we forget about that. at least I forgot to ssh to first master node and was trying to resolve some tasks on gateway node itself.

Try mock exam on AWS

Mock exam on AWS is almost same as original exam. It is recommended to try mock exam before taking actual exam.Mock exam will give you real exam experience, Mock exam can also give you confidence to attend real exam.

The following is the syllabus for HDPCA exam along with tutorials.

Installation

Configure a local HDP repository

Install ambari-server and ambari-agent
Ambari server installation
Ambari agent installation

Install HDP using the Ambari install wizard

Add a new node to an existing cluster

Decommission a node
Decommissioning of node managers
Decommissioning of data nodes

Add an HDP service to a cluster using Ambari

Configuration

Define and deploy a rack topology script

Change the configuration of a service using Ambari

Modifying configurations to enable or disable ACLs

Configure the Capacity Scheduler

Create a home directory for a user and configure permissions

Configure the include and exclude DataNode files

Decommissioning of node managers
Decommissioning of data nodes

Troubleshooting

Restart an HDP service

View an application’s log file

Configure and manage alerts

Troubleshoot a failed job

Checking versions of Hadoop eco system tools

High Availability

Configure NameNode HA

Configure ResourceManager HA

Copy data between two clusters using distcp

Create a snapshot of an HDFS directory

Recover a snapshot

Configure HiveServer2 HA

Security

Install and configure Knox

Introduction to Knox
Installing and configuring Knox
WebHDFS REST API with knox and without knox

Install and configure Ranger

Install and configure Ranger
Creating HDFS policies in Ranger

Configure HDFS ACLs

In this article, We will learn how to import data from RDBMS to HDFS using Apache Sqoop. We will import data from Postgres to HDFS in this article.

1) Check postgres is running.

We will use the service command to check the status of Postgres. The command below checks the status of Postgres.

service PostgreSQL status

2) List databases;

We will connect to Postgres database and check the list of databases available in postgres. We can use psql command to start Postgres prompt and we can use \list command to display list of databases in postgres.

3) List tables in a database.

In this step, We will connect to one database and check tables in that dabatase. \c command will allow us to connect to a database in postgres.

We can use \dt command to display tables in current databases.

4) Check the data in the table.

We will select one table and check the data in that table before importing. We will import the data from table called TBLS in hive database, We will check the data in TBLS table using below select query .

select * from public."TBLS";

5) Importing data into HDFS using sqoop import command.

We use sqoop import command to transfer the data from RDBMS to postgres. We need to use below sqoop import options to import the data .

--connect : This option takes JDBC connection string of an RDBMS .

Syntax : jdbc:://:/

RDBMS-name : We need to specify RDBMS name here. If it is oracle , we specify oracle and if it is postgres , we will specify postgresql here.

database-hostname : We need to specify hostname or ip address where RDBMS is running.

dabatase-port : We need to use port number on which rdbms system is listening. Postgres's default port number is 5432.

database-name : We need to specify database name from RDBMS system from which we want to import the data.

--table : Takes table name from which we want to import the data .

--username : We need to specify user name of database .

--password : We need to specify password of database.

--target-dir : We need to specify HDFS directory where we want to import the data from RDBMS system.

--m : specifies number of parallel copies used to transfer the data from RDBMS system to HDFS.

We will run below command to tranfer the data from postgres table called TBLS to HDFS directory /user/hdfs/sqoopout/tbls .

sqoop import --connect jdbc:postgresql://192.168.1.113:5432/hive --table TBLS --username postgres --password postgrs --target-dir /user/hdfs/sqoopout/tbls -m 1

6) Check data in HDFS folder.

Now we will check the data in HDFS directory to ensure data is transferred successfully.

7) Output directory already exists error.

Directory path mentioned in --target-dir option will be created by sqoop import command. If directory already exists in HDFS, sqoop import command would fail with Output directory already exists error.

8)Delete directory if already exists in HDFS

We can use --delete-target-dir option to import data into HDFS directory even if directory exists in the HDFS. This --delete-target-dir option will remove existing directory and will create the diretcory again. We need to be extra careful while using this option as existing data will be removed.

Apache sqoop simplifies bi-directional data transfer between RDBMS systems and Apache Hadoop.

Let me know if you need any help on above commands.

Technology

Search This Blog

How to clear Hortonworks administrator (HDPCA) certification?