Search This Blog

Importing data from RDBMS to Hadoop using Apache Sqoop

In this article , We will learn how to import data from RDBMS to HDFS using Apache Sqoop. We will import data from Postgres to HDFS in this article.

1) Check postgres is running.

We will use service command to check the status of Postgres . The command below checks the status of postgres.

service postgresql status

2) List databases;

We will connect to Postgres database and check the  list of databases available in postgres. We can use psql command to start Postgres prompt and we can use \list command to display list of databases in postgres.

3) List tables in a database.

In this step, We will connect to one database and check tables in that dabatase. \c command will allow us to connect to a database in postgres.

We can use \dt command to display tables in current databases.

4) Check the data in the table.

We will select one table and check the data in that table before importing. We will import the data from table called TBLS in hive database, We will check the data in TBLS table using below select query .

select * from public."TBLS";

5) Importing data into HDFS using sqoop import command.  

We use sqoop import command to transfer the data from RDBMS to postgres. We need to use below sqoop import options to import the data .

--connect : This option takes JDBC connection string of an RDBMS .

Syntax : jdbc:://:/

RDBMS-name : We need to specify RDBMS name here. If it is oracle , we specify oracle and if it is postgres , we will specify postgresql here.

database-hostname : We need to specify hostname or ip address where RDBMS is running.

dabatase-port   : We need to use port number on which rdbms system is listening. Postgres's default port number is 5432.

database-name   : We need to specify database name from RDBMS system from which we want to import the data.

--table : Takes table name from which we want to import the data .

--username : We need to specify user name of database .

--password : We need to specify password of database.

--target-dir : We need to specify HDFS directory where we want to import the data from RDBMS system.

--m : specifies number of parallel copies used to transfer the data from RDBMS system to HDFS.

We will run below command to tranfer the data from postgres table called TBLS to HDFS directory /user/hdfs/sqoopout/tbls .

sqoop import --connect jdbc:postgresql:// --table TBLS  --username postgres --password postgrs --target-dir /user/hdfs/sqoopout/tbls -m 1

6) Check data in HDFS folder.

Now we will check the data in HDFS directory to ensure data is transferred successfully.

7) Output directory already exists error.

Directory path mentioned in --target-dir option will be created by sqoop import command. If directory already exists in HDFS, sqoop import command would fail with Output directory already exists error.

8)Delete directory if already exists in HDFS

We can use --delete-target-dir option to import data into HDFS directory even if directory exists in the HDFS. This --delete-target-dir option will remove existing directory and will create the diretcory again. We need to be extra careful while using this option as existing data will be removed.

Apache sqoop simplifies bi-directional data transfer between RDBMS systems and Apache Hadoop.
Let me know if you need any help on above commands.