Fixing HDFS issues

The fsck command scans all files and directories in HDFS for errors and abnormal conditions. An administrator should run it periodically. fsck itself only reports problems; the name node detects and fixes most recoverable issues on its own over time.

The command syntax is below. It needs to be run as the hdfs user.

hdfs fsck <path>

We can specify the root directory (/) to check all of HDFS, or specify a directory to check only the files under it.

The fsck report contains:

Counts of under-replicated, over-replicated, mis-replicated and corrupt blocks.

The total number of files and directories under the checked path.

The default replication factor and the average block replication.

The number of data nodes and the number of racks.

The final file system status, either HEALTHY or CORRUPT.

The final fsck status should be HEALTHY. If it is CORRUPT, either the administrator has to fix it or the name node will fix most of the issues automatically over a period of time.
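For a periodic check it can help to reduce the whole report to its final status. The function below is a minimal sketch: fsck_status is a hypothetical helper name (not part of Hadoop), and it assumes the report ends with the usual "The filesystem under path ... is HEALTHY" or "... is CORRUPT" line.

```shell
#!/bin/sh
# Reads an fsck report on stdin and prints HEALTHY, CORRUPT, or UNKNOWN.
# fsck_status is a hypothetical helper name, not a Hadoop command.
fsck_status() {
    awk '
        /^The filesystem under path .* is HEALTHY/ { s = "HEALTHY" }
        /^The filesystem under path .* is CORRUPT/ { s = "CORRUPT" }
        END { print (s == "" ? "UNKNOWN" : s) }   # UNKNOWN if no status line was seen
    '
}
```

It could be used as hdfs fsck / | fsck_status, for example from a cron job that sends an alert when the result is not HEALTHY.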

Below is sample fsck output.

hdfs fsck /

Total size:    466471737404 B (Total open files size: 27 B)

 Total dirs:    917
 Total files:   2042
 Total symlinks:                0 (Files currently being written: 3)
 Total blocks (validated):      4790 (avg. block size 97384496 B) (Total open file blocks (not validated): 3)
  CORRUPT FILES:        9
  MISSING SIZE:         315800 B
 Minimally replicated blocks:   4781 (99.81211 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       4274 (89.227554 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     2.0885177
 Corrupt blocks:                9
 Missing replicas:              4280 (29.944729 %)
 Number of data-nodes:          3
 Number of racks:               1
FSCK ended at Sun Mar 20 12:52:45 EDT 2016 in 244 milliseconds

The filesystem under path '/' is CORRUPT
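To track a single number from this summary over time, we can pull one counter out of the report. This is a sketch only: fsck_counter is a hypothetical helper, and it assumes the "label: value" layout shown in the sample output above. The label must match the report exactly, including case.

```shell
#!/bin/sh
# Prints the value of one summary counter from an fsck report read on stdin.
# $1 is the label exactly as printed in the report, e.g. "Corrupt blocks".
fsck_counter() {
    awk -v label="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)                 # drop leading indentation
            prefix = label ":"
            if (index(line, prefix) == 1) {          # line starts with "label:"
                rest = substr(line, length(prefix) + 1)
                sub(/^[ \t]+/, "", rest)             # drop padding before the value
                split(rest, parts, /[ \t]+/)
                print parts[1]                       # first token after the label
                exit
            }
        }
    '
}
```

For example, hdfs fsck / | fsck_counter "Under-replicated blocks" would print only the block count, which is easy to log or compare against a threshold.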

Under-replicated and over-replicated blocks

dfs.replication in hdfs-site.xml specifies the default number of replicas required for a block on the cluster. If a block has fewer replicas than required, it is called under-replicated. This might happen when data nodes go down. If a block has more replicas than required, it is called over-replicated. This might happen when crashed data nodes come back to normal.

Under-replicated and over-replicated blocks can be addressed with the setrep command, or the name node will fix them after some time.

hdfs dfs -setrep -w 3 /path

  1. If a file has 2 replicas but requires 3, set the replication factor to 3. If it has 4 replicas but requires 3, also set the replication factor to 3.
  2. Run the balancer; sometimes this also fixes it.
  3. Copy the under/over-replicated file to a different location, remove the original, and rename the copy to the original name. Be careful with this trick: jobs using that file might fail while it is removed.

After the replication factor is set, run the hdfs dfs -ls command on the file; its output also displays the replication factor.
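In hdfs dfs -ls output, the second column of a file line is the replication factor (directories show a dash there). The sketch below lists files whose factor differs from the one we expect. replication_mismatches is a hypothetical helper, and it assumes paths contain no spaces. Note that this column shows the factor recorded for the file, so fsck is still the way to see how many replicas actually exist.

```shell
#!/bin/sh
# Reads `hdfs dfs -ls` output on stdin and prints "<replication> <path>"
# for every file whose replication factor differs from $1.
replication_mismatches() {
    awk -v want="$1" '
        # File lines start with "-" in the permission column; directories start with "d".
        $1 ~ /^-/ && $2 != want { print $2, $NF }
    '
}
```

A possible use is hdfs dfs -ls -R /path | replication_mismatches 3, which prints nothing when every file under /path already has a factor of 3.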

Corrupted blocks

We can check for corrupted blocks using the hdfs fsck / -list-corruptfileblocks command.

We should delete corrupted files and then set the appropriate replication factor. Use the hdfs fsck / -delete command to delete corrupted files. It removes the affected files themselves, so review the list before running it.

Missing blocks

Find out which node has the missing blocks and check whether its data node is running. If possible, try restarting the data node. We can check data node status from the active name node UI, or run the jps command on all data nodes to check whether the DataNode process is running.
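The jps check can be scripted. jps prints one line per Java process in the form "pid ClassName", so a data node shows up as a line ending in DataNode. The function below is a small sketch; datanode_running is a hypothetical helper name.

```shell
#!/bin/sh
# Reads jps output on stdin; exits 0 if a DataNode process is listed, 1 otherwise.
datanode_running() {
    awk '$2 == "DataNode" { found = 1 } END { exit (found ? 0 : 1) }'
}
```

On a data node host this could be run as jps | datanode_running && echo "DataNode is up" || echo "DataNode is down".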

The administrator has to run the fsck command regularly to check the Hadoop file system for errors, and has to take the necessary actions against those errors to avoid data loss.

The fsck command has several options. Some of them are:


-files: Displays the files under the given directory.

hdfs@cluster10-1:~> hdfs fsck / -files
/user/oozie/share/lib/sqoop/commons-io-2.1.jar 163151 bytes, 1 block(s):  OK


-blocks: Displays block information.

hdfs@cluster10-1:~> hdfs fsck / -files -blocks
/user/oozie/share/lib/sqoop/oozie-sharelib-sqoop- 7890 bytes, 1 block(s):  OK
0. BP-18950707- len=7890 repl=3

-locations: Displays the host names of the nodes where blocks are stored.

hdfs@cluster10-1:~> hdfs fsck / -files -blocks -locations
/user/oozie/share/lib/sqoop/sqoop- 819248 bytes, 1 block(s):  OK
0. BP-18950707- len=819248 repl=3 [,,]

-delete: Deletes corrupted files. We need to run it when we find corrupted blocks in the cluster.


-openforwrite: Displays files opened for writing.


-list-corruptfileblocks: Displays only the corrupted blocks for the given path.

hdfs fsck / -list-corruptfileblocks
Connecting to namenode via http://cluster10-1:50070
The filesystem under path '/' has 0 CORRUPT files

Checking specific information

If we want to see only a specific kind of entry in the fsck report, we can filter the report with grep.
For example, to see only under-replicated blocks, grep like below.

hdfs fsck / -files -blocks -locations |grep -i "Under replicated"

/data/output/_partition.lst 297 bytes, 1 block(s):  Under replicated BP-18950707- Target Replicas is 10 but found 4 replica(s).

We can replace "Under replicated" with "corrupt" to see corrupt files.

hdfs@cluster10-1:~> hdfs fsck / -files -blocks -locations|grep -i corrupt
Connecting to namenode via http://cluster10-2:50070
/apps/hbase/data/corrupt <dir>
/data/output/part-r-00004: CORRUPT blockpool BP-18950707- block blk_1073778646
/data/output/part-r-00008: CORRUPT blockpool BP-18950707- block blk_1073778648
/data/output/part-r-00009: CORRUPT blockpool BP-18950707- block blk_1073778649
/data/output/part-r-00010: CORRUPT blockpool BP-18950707- block blk_1073778650
/data/output/part-r-00016: CORRUPT blockpool BP-18950707- block blk_1073778654
/data/output/part-r-00019: CORRUPT blockpool BP-18950707- block blk_1073778659
/data/output/part-r-00020: CORRUPT blockpool BP-18950707- block blk_1073778660
/data/output/part-r-00021: CORRUPT blockpool BP-18950707- block blk_1073778661
/data/output/part-r-00026: CORRUPT blockpool BP-18950707- block blk_1073778663

  CORRUPT FILES:        9
 Corrupt blocks:                9

The filesystem under path '/' is CORRUPT
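The grep output above mixes file lines with unrelated matches, such as the /apps/hbase/data/corrupt directory and the summary lines. If we want just the paths of the corrupt files, a sketch like the following can help. corrupt_file_paths is a hypothetical helper, and it assumes the "path: CORRUPT blockpool ..." line format shown above.

```shell
#!/bin/sh
# Reads fsck output on stdin and prints each corrupt file path once.
corrupt_file_paths() {
    # Keep only lines of the form "<path>: CORRUPT blockpool ..." and strip
    # everything from ": CORRUPT" onward; sort -u removes duplicate paths
    # when a file has more than one corrupt block.
    sed -n 's/: CORRUPT blockpool .*$//p' | sort -u
}
```

It could be used as hdfs fsck / -files -blocks -locations | corrupt_file_paths.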

The grep commands above display complete report lines. If we want only the file path, we need to use awk.

hdfs fsck / -files -blocks -locations |grep -i "Under replicated"|awk -F " " '{print $1}'
Connecting to namenode via http://cluster10-1:50070

As discussed, we can set the replication factor with the setrep command to fix under-replicated blocks. When there are many under-replicated blocks, it is difficult to run setrep by hand on every file.
To avoid setting it manually, write the paths of all under-replicated files to a file and write a shell script that sets the replication factor for each of them.

hdfs fsck / -files -blocks -locations |grep -i "Under replicated"|awk -F " " '{print $1}' >>underreplicatedfiles
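The shell script mentioned above could look like the sketch below. It reads the underreplicatedfiles list one path per line and runs setrep on each. set_replication_from_list and the HDFS_CMD variable are hypothetical names introduced here; HDFS_CMD only exists so the loop can be dry-run with echo before touching the cluster. It assumes every path is absolute and skips any other line.

```shell
#!/bin/sh
# Sets the replication factor for every path listed in a file, one path per line.
# Usage: set_replication_from_list <list_file> <replication>
# HDFS_CMD is a hypothetical override; it defaults to the real hdfs command.
set_replication_from_list() {
    list_file=$1
    replication=$2
    # Read the list on file descriptor 3 so the hdfs command cannot consume it via stdin.
    while IFS= read -r path <&3; do
        case $path in
            /*) "${HDFS_CMD:-hdfs}" dfs -setrep "$replication" "$path" ;;
            *)  ;;  # skip blank lines and anything that is not an absolute path
        esac
    done 3< "$list_file"
}
```

A dry run would be HDFS_CMD=echo set_replication_from_list underreplicatedfiles 3, which only prints the arguments that would be passed to hdfs. Since the list is built with >>, repeated fsck runs append duplicate paths, so it may be worth passing the file through sort -u first.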

Happy Hadooping.