
Transferring data between Hadoop clusters using the distcp command

In this article, we will learn how to transfer data between two Hadoop clusters. The hadoop distcp command is used to copy data between clusters.

One of the main use cases of the distcp command is to sync data between a production cluster and a backup/DR cluster. We will learn distcp with some examples.

1)

Connect to the source cluster and create a file called numbers under the /user/hdfs/distcptest directory in HDFS.

The picture below shows how to create a local file named numbers and how to upload it to the HDFS directory /user/hdfs/distcptest.
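As a rough text equivalent of this step, the sketch below creates the local file and uploads it. The file contents are an arbitrary example, and the HDFS commands are guarded so they only run where an hdfs client is installed.

```shell
# Create a small local file named "numbers" (the contents are just an example).
printf '%s\n' 1 2 3 4 5 > numbers

# Upload to HDFS only if the hdfs client is available on this machine.
if command -v hdfs >/dev/null 2>&1; then
  # Create the HDFS directory (no error if it already exists).
  hdfs dfs -mkdir -p /user/hdfs/distcptest
  # Copy the local file into HDFS; -f replaces an existing copy.
  hdfs dfs -put -f numbers /user/hdfs/distcptest/
  # Confirm the file is there.
  hdfs dfs -ls /user/hdfs/distcptest
fi
```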





2)

The hadoop distcp command takes one or more source paths and a destination path as its arguments.

source path syntax : 

 hdfs://[active-namenode-hostname]:[name-node-port-number]/path/to/hdfs/file


active-namenode-hostname :

We have to specify the hostname or IP address of the active NameNode. If HA is not enabled, the cluster has a single NameNode, so there is no active/standby distinction.

In that case we can specify that NameNode's hostname or IP address directly.
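If it is unclear which NameNode is currently active on an HA cluster, something like the following can help. This is a sketch: the NameNode ID nn1 is a placeholder for whatever IDs are listed under dfs.ha.namenodes.<nameservice> in your configuration.

```shell
# NameNode ID from the cluster's HA configuration; "nn1" is a placeholder.
NN_ID="nn1"

if command -v hdfs >/dev/null 2>&1; then
  # List the NameNode hosts known to this client's configuration.
  hdfs getconf -namenodes
  # On an HA cluster, prints "active" or "standby" for the given NameNode ID.
  hdfs haadmin -getServiceState "${NN_ID}"
fi
```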


Port number : We need to specify the RPC port number of the NameNode. By default it is 8020.

In the same way, we need to specify the active NameNode hostname or IP address and the RPC port number of the destination cluster.


We use the source and destination paths below.


Source path : hdfs://192.168.1.113:8020/user/hdfs/distcptest/numbers

Destination path : hdfs://192.168.1.115:8020/user/hdfs/target
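To make the syntax concrete, the two paths can be assembled from their parts as below. The variable names are illustrative only; the hosts, port, and paths are the ones used in this article.

```shell
# NameNode addresses and RPC port used in this article.
SRC_NN_HOST="192.168.1.113"
DST_NN_HOST="192.168.1.115"
NN_RPC_PORT="8020"

# hdfs://[namenode-host]:[rpc-port]/path/to/hdfs/file
SRC_URI="hdfs://${SRC_NN_HOST}:${NN_RPC_PORT}/user/hdfs/distcptest/numbers"
DST_URI="hdfs://${DST_NN_HOST}:${NN_RPC_PORT}/user/hdfs/target"

echo "Source path      : ${SRC_URI}"
echo "Destination path : ${DST_URI}"
```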


3)

Create a directory called /user/hdfs/target in the destination cluster and run the ls command.

The picture below shows creating the /user/hdfs/target folder in the destination cluster.
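In text form, this step might look like the following when run on a node of the destination cluster. It is only a sketch; the commands are skipped where no hdfs client is installed.

```shell
# Target directory on the destination cluster.
TARGET_DIR="/user/hdfs/target"

if command -v hdfs >/dev/null 2>&1; then
  # Create the target directory (no error if it already exists).
  hdfs dfs -mkdir -p "${TARGET_DIR}"
  # List the parent directory to confirm the new directory is present.
  hdfs dfs -ls /user/hdfs
fi
```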




4)

The pictures below show how to run the distcp command and confirm that the file was transferred to the destination cluster.

Command run :

 hadoop distcp hdfs://192.168.1.113:8020/user/hdfs/distcptest/numbers hdfs://192.168.1.115:8020/user/hdfs/target
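A small sketch of running the same command and then checking the result is shown below. Listing the destination with hadoop fs -ls is one way to confirm the copy, not necessarily the check used in the original screenshots.

```shell
SRC="hdfs://192.168.1.113:8020/user/hdfs/distcptest/numbers"
DST="hdfs://192.168.1.115:8020/user/hdfs/target"

if command -v hadoop >/dev/null 2>&1; then
  # distcp returns a non-zero exit status when the copy job fails.
  if hadoop distcp "${SRC}" "${DST}"; then
    # List the destination directory to confirm the file arrived.
    hadoop fs -ls "${DST}"
  else
    echo "distcp failed" >&2
  fi
fi
```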






5)
Hardcoding the NameNode IP address and port number is a bad idea, because if the active NameNode at that IP address goes down or fails over to standby, the hadoop distcp command fails.

On an HA cluster we should use the nameservice ID of the cluster instead.

The picture below shows how to use the nameservice ID in the distcp command. The active NameNode IP address and port number are replaced with just the nameservice ID.

The picture also shows how to get the nameservice ID of a cluster.
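A sketch of this approach follows. The nameservice IDs sourcecluster and destcluster are hypothetical, so substitute the values reported by your own clusters. It also assumes the client running distcp has both nameservices defined in its HDFS configuration; otherwise the remote ID cannot be resolved.

```shell
# Hypothetical nameservice IDs; replace with your clusters' real values.
SRC_NS="sourcecluster"
DST_NS="destcluster"

if command -v hdfs >/dev/null 2>&1; then
  # Prints the nameservice ID(s) configured for this client.
  hdfs getconf -confKey dfs.nameservices
fi

# With a nameservice ID there is no host or port in the URI.
SRC_URI="hdfs://${SRC_NS}/user/hdfs/distcptest/numbers"
DST_URI="hdfs://${DST_NS}/user/hdfs/target"

# Print the resulting command instead of running it.
echo "hadoop distcp ${SRC_URI} ${DST_URI}"
```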




6) Update or overwrite

What should we do if the destination cluster's target directory already has the same files? By default, distcp skips files that already exist at the destination.


We have two options to choose from in this scenario: update or overwrite.

With the -update option, we can update the destination cluster's directory so that only files that are missing from the destination or differ from the source are copied.

Or we can simply overwrite the destination cluster's files with the source cluster's files using the -overwrite option.


7)  Update example

The following example shows how to update the destination cluster's directory with new files from the source cluster's directory.

The source cluster's directory has a new file called numbersNew, so only the numbersNew file will be copied to the destination cluster's directory /user/hdfs/target.


The destination cluster's target directory now has the file numbersNew with a new timestamp.
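The update run could look roughly like this, assuming the source directory is /user/hdfs/distcptest as in the earlier steps. Note that with -update the contents of the source directory are copied into the target directory.

```shell
SRC_DIR="hdfs://192.168.1.113:8020/user/hdfs/distcptest"
DST_DIR="hdfs://192.168.1.115:8020/user/hdfs/target"

if command -v hadoop >/dev/null 2>&1; then
  # Copy only files that are missing from, or differ at, the destination.
  hadoop distcp -update "${SRC_DIR}" "${DST_DIR}"
  # Check which files are now in the target directory.
  hadoop fs -ls "${DST_DIR}"
fi
```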





8) Overwrite Example 

The following pictures show how to use the -overwrite option in the distcp command.
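As a sketch, an overwrite run with the same directories as before might be written as follows. The exact paths in the original screenshots are not known, so these are assumed from the earlier steps.

```shell
SRC_DIR="hdfs://192.168.1.113:8020/user/hdfs/distcptest"
DST_DIR="hdfs://192.168.1.115:8020/user/hdfs/target"

if command -v hadoop >/dev/null 2>&1; then
  # Replace files at the destination even if they already exist there.
  hadoop distcp -overwrite "${SRC_DIR}" "${DST_DIR}"
  # The listed files should show fresh modification times.
  hadoop fs -ls "${DST_DIR}"
fi
```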






9) Multiple source files

If we need to copy multiple files from the source cluster to the destination cluster, we need to specify all source files first and the target path last in the distcp command.

The picture below shows how to transfer multiple files from the source cluster to the destination cluster.

The source files are highlighted in the picture.
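For illustration, a two-file copy might look like the sketch below. It assumes the files numbers and numbersNew from the earlier steps are the ones being copied.

```shell
SRC_BASE="hdfs://192.168.1.113:8020/user/hdfs/distcptest"
DST_DIR="hdfs://192.168.1.115:8020/user/hdfs/target"

# Source files come first.
SRC_1="${SRC_BASE}/numbers"
SRC_2="${SRC_BASE}/numbersNew"

if command -v hadoop >/dev/null 2>&1; then
  # The last argument is always the destination path.
  hadoop distcp "${SRC_1}" "${SRC_2}" "${DST_DIR}"
fi
```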





10) Multiple source files with -f option


If we have a large number of source files, we can list all of them in a file and pass that file to the distcp command.

The distcp command provides the -f option, which takes the URI of a file that lists all source paths, one per line.

The following picture shows how to use the -f option.
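A minimal sketch of the -f workflow is given below. The list file name srclist and its HDFS location /user/hdfs/srclist are hypothetical choices for this example.

```shell
# Write the source paths, one per line, to a local list file.
cat > srclist <<'EOF'
hdfs://192.168.1.113:8020/user/hdfs/distcptest/numbers
hdfs://192.168.1.113:8020/user/hdfs/distcptest/numbersNew
EOF

if command -v hadoop >/dev/null 2>&1; then
  # Upload the list to HDFS (hypothetical location on the source cluster).
  hadoop fs -put -f srclist /user/hdfs/srclist
  # -f takes the URI of the list file; the destination path still comes last.
  hadoop distcp -f hdfs://192.168.1.113:8020/user/hdfs/srclist \
    hdfs://192.168.1.115:8020/user/hdfs/target
fi
```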



Let me know if you have any questions about the hadoop distcp command.