Enabling debug logs in Apache Hadoop and Hortonworks Data Platform


To troubleshoot Hadoop issues, we need debug logs to see more low-level errors. By default, debug logs are not enabled in Hortonworks Data Platform (HDP) or in plain Hadoop.
In this post, we will discuss how to enable debug logs in HDP and in plain Hadoop.

1. Modify /var/lib/ambari-server/resources/stacks/HDP/2.0.6/hooks/after-INSTALL/templates/hadoop-env.sh.j2


In HDP, we need to add the line below to hadoop-env.sh.j2 to enable debug logs for the HDFS services.


export HADOOP_ROOT_LOGGER=DEBUG,console


If you are using plain Hadoop, you can add the above line directly to the /etc/hadoop/conf/hadoop-env.sh file on all nodes and restart the HDFS daemons on all nodes.
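For example, on each node (a sketch; adjust the path if your configuration directory differs):

echo 'export HADOOP_ROOT_LOGGER=DEBUG,console' >> /etc/hadoop/conf/hadoop-env.sh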
If you are using Hortonworks Data Platform, you need to follow the steps below.

2. Restart Ambari agents on all nodes


We need to restart the Ambari agent on all master nodes, data nodes, and edge nodes (if any), using the command below.

service ambari-agent restart

Or we can stop and start the Ambari agent:

service ambari-agent stop
service ambari-agent start


3. Restart Ambari server


We also need to restart the Ambari server on the first master node. We can run the command below.

service ambari-server restart

Or we can stop and start the Ambari server:

service ambari-server stop
service ambari-server start

4. Restart HDFS services


In the Ambari UI, click on HDFS, open the Service Actions drop-down, and select Restart All. It will ask for confirmation; once you confirm, it restarts all HDFS daemons.

Once debug logs are enabled, you can check them in the NameNode and DataNode logs. Debug logs consume a lot of space, so disable them once you have collected the logs you need: remove the export line added above and restart the services mentioned earlier.
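To verify that debug logging is active, you can grep a daemon log for DEBUG entries. The log path below is the common HDP default and is an assumption; yours may differ:

grep DEBUG /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log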

Fixing HDFS issues

The fsck command scans all files and directories in HDFS for errors and abnormal conditions. The administrator should run it periodically; the NameNode also detects and repairs many of these issues (such as under-replicated blocks) automatically.

Below is the command syntax; it needs to be run as the hdfs user.

hdfs fsck <path>

We can specify the root directory (/) to check the complete HDFS for errors, or specify a directory to check only that subtree.
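For example, to run the check as the hdfs user against a specific directory (/data is a hypothetical path):

sudo -u hdfs hdfs fsck /data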

The fsck report contains:

Under-replicated, over-replicated, mis-replicated, and corrupt blocks.

The total number of files and directories in HDFS.

The default replication factor and the actual average block replication.

The number of data nodes and the number of racks.

Finally, the overall file system status: healthy or corrupt.

The final fsck status needs to be healthy. If it is corrupt, it must be fixed by the administrator, although the NameNode will fix most issues automatically over a period of time.

Below is sample fsck output.


hdfs fsck /

Total size:    466471737404 B (Total open files size: 27 B)

 Total dirs:    917
 Total files:   2042
 Total symlinks:                0 (Files currently being written: 3)
 Total blocks (validated):      4790 (avg. block size 97384496 B) (Total open file blocks (not validated): 3)
  ********************************
  CORRUPT FILES:        9
  MISSING BLOCKS:       9
  MISSING SIZE:         315800 B
  CORRUPT BLOCKS:       9
  ********************************
 Minimally replicated blocks:   4781 (99.81211 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       4274 (89.227554 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     2.0885177
 Corrupt blocks:                9
 Missing replicas:              4280 (29.944729 %)
 Number of data-nodes:          3
 Number of racks:               1
FSCK ended at Sun Mar 20 12:52:45 EDT 2016 in 244 milliseconds


The filesystem under path '/' is CORRUPT






Under-replicated and over-replicated blocks


dfs.replication in hdfs-site.xml specifies the required number of replicas for each block in the cluster. If a block has fewer replicas than this, it is under-replicated; this can happen when data nodes go down. If a block has more replicas than required, it is over-replicated; this can happen when crashed data nodes come back online.
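To confirm the configured default replication factor on a cluster, we can query the configuration directly:

hdfs getconf -confKey dfs.replication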

Under- and over-replicated blocks can be addressed with the setrep command, or the NameNode will fix them itself after some time.

hdfs dfs -setrep -w 3 /path


  1. If you have 2 replicas but 3 are required, set the replication factor to 3; likewise, if you have 4 replicas but 3 are required, also set the replication factor to 3.
  2. Run the balancer; in some cases this can also help.
  3. Copy the under/over-replicated file to a different location, remove the original file, and rename the copy back to the original name. Use this trick carefully: if you remove the file, jobs using it might fail.
After the replication factor is set, use the hdfs dfs -ls command on the file; its output also displays the replication factor.
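For example (illustrative output based on the sample file above; the second column is the replication factor):

hdfs dfs -ls /data/output/_partition.lst
-rw-r--r--   3 hdfs hdfs        297 2016-03-20 12:52 /data/output/_partition.lst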

Corrupted blocks

We should delete the corrupted files and can set an appropriate replication factor afterwards. We need to use the hdfs fsck / -delete command to delete corrupted files.

We can check for corrupted blocks using the hdfs fsck / -list-corruptfileblocks command.
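Since -delete is destructive, a safe order is to list the corrupt files first and delete them only after reviewing the list:

hdfs fsck / -list-corruptfileblocks
hdfs fsck / -delete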

Missing blocks


Find out which node has the missing blocks and check whether its DataNode is running; if possible, try restarting the DataNode. We can check DataNode status from the active NameNode UI, or run the jps command on each data node to verify that the DataNode process is running.
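For example, on each data node (the DataNode process appears in jps output if it is running):

jps | grep DataNode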

The administrator has to run the fsck command regularly to check the Hadoop file system for errors, and take the necessary actions to avoid data loss.
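One way to automate this is a cron entry in the hdfs user's crontab; this is only a sketch, and the report path is an assumption:

# Run fsck every night at 2 AM and save the report
0 2 * * * hdfs fsck / > /var/log/hadoop/fsck-report.log 2>&1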

The fsck command has several options. Some of them are:

-files

     It displays the files under the given path.

hdfs@cluster10-1:~> hdfs fsck / -files
/user/oozie/share/lib/sqoop/commons-io-2.1.jar 163151 bytes, 1 block(s):  OK

-blocks

       It displays block information.

hdfs@cluster10-1:~> hdfs fsck / -files -blocks
/user/oozie/share/lib/sqoop/oozie-sharelib-sqoop-4.0.0.2.1.2.0-402.jar 7890 bytes, 1 block(s):  OK
0. BP-18950707-10.20.0.1-1404875454485:blk_1073742090_1266 len=7890 repl=3

-locations
               It displays the host names of the data nodes where each block is stored.

hdfs@cluster10-1:~> hdfs fsck / -files -blocks -locations
/user/oozie/share/lib/sqoop/sqoop-1.4.4.2.1.2.0-402.jar 819248 bytes, 1 block(s):  OK
0. BP-18950707-10.20.0.1-1404875454485:blk_1073742091_1267 len=819248 repl=3 [10.20.0.1:50010, 10.20.0.1:50010, 10.20.0.1:50010]

-delete
               It deletes corrupt files. We need to run it when we find corrupted blocks in the cluster.


-openforwrite

                      Displays files opened for writing.

-list-corruptfileblocks

                                  It displays only the corrupt file blocks for the given path.

hdfs fsck / -list-corruptfileblocks
Connecting to namenode via http://cluster10-1:50070
The filesystem under path '/' has 0 CORRUPT files


Checking specific information


If we want to see only a specific type of entry in the fsck report, we can grep the report.
For example, to see only under-replicated blocks, we can grep as below.

hdfs fsck / -files -blocks -locations |grep -i "Under replicated"

/data/output/_partition.lst 297 bytes, 1 block(s):  Under replicated BP-18950707-10.20.0.1-1404875454485:blk_1073778630_38021. Target Replicas is 10 but found 4 replica(s).

We can replace "Under replicated" with "corrupt" to see corrupt files.

hdfs@cluster10-1:~> hdfs fsck / -files -blocks -locations|grep -i corrupt
Connecting to namenode via http://cluster10-2:50070
/apps/hbase/data/corrupt <dir>
/data/output/part-r-00004: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778646
/data/output/part-r-00008: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778648
/data/output/part-r-00009: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778649
/data/output/part-r-00010: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778650
/data/output/part-r-00016: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778654
/data/output/part-r-00019: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778659
/data/output/part-r-00020: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778660
/data/output/part-r-00021: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778661
/data/output/part-r-00026: CORRUPT blockpool BP-18950707-10.21.0.1-1404875454485 block blk_1073778663

Status: CORRUPT
  CORRUPT FILES:        9
  CORRUPT BLOCKS:       9
 Corrupt blocks:                9

The filesystem under path '/' is CORRUPT


The above command displays the complete line for each match. If we want only the file path, we can use awk.

hdfs fsck / -files -blocks -locations |grep -i "Under replicated"|awk -F " " '{print $1}'
Connecting to namenode via http://cluster10-1:50070
/data/output/_partition.lst

As we have discussed, we can set the replication factor with the setrep command to fix under-replicated blocks. When there are many under-replicated files, it is tedious to run setrep on each one by hand.
To avoid this, write all the under-replicated file paths to a file and use a small shell script to set the replication factor on each, as shown after the command below.

hdfs fsck / -files -blocks -locations |grep -i "Under replicated"|awk -F " " '{print $1}' >>underreplicatedfiles
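A minimal sketch of such a script, assuming the underreplicatedfiles list generated above and a target replication factor of 3:

# Set replication factor 3 on every path listed in underreplicatedfiles
while read -r f
do
  hdfs dfs -setrep -w 3 "$f"
done < underreplicatedfiles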



Happy Hadooping.








Your Hadoop job might fail due to invalid gzip files

We can use compressed data formats like gzip, bzip2, and LZO in Hadoop. Gzip is not splittable and is suitable for small files; bzip2 is splittable; and although LZO is not splittable by default, it can be made splittable. Many people use gzip files in Hadoop, and if any of your gz files are corrupt, your job will fail. It is a good idea to test gz files for errors before submitting a job.




The gzip command has a -t option that tests an existing gz file; if no errors are found, no message is printed.

hdfs@cluster10-1:~> gzip -t test.txt.gz

hdfs@cluster10-1:~> 


If any errors are found, they are printed to the console.

hdfs@cluster10-1:~> gzip -t test.gz

gzip: test.gz: not in gzip format
hdfs@cluster10-1:~> 

We can even check files stored in HDFS for errors.

hdfs@cluster10-1:~> hdfs dfs -cat /data/sample/test.txt.gz | gzip -t


If you have many gz files in a folder, it is time-consuming to test them individually.
We can write a small script to check all gz files in a directory for errors.

The first line takes all files from the given directory, using awk to pick the 8th field after splitting on spaces (the file path); each path is then passed to the for loop.

for i in $(hdfs dfs -ls gz/* | awk -F" " '{print $8}')
do
echo "checking $i"
hdfs dfs -cat "$i" | gzip -t
done

The third line prints the file currently being checked, and the fourth line actually tests the file for errors. If your Hadoop job fails with file-related errors and your input includes gz files, this check is worth running first.
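If you only want a list of the bad files, a small variation of the script (a sketch; badgzfiles is a hypothetical output file) suppresses gzip's messages and records each failing path:

for i in $(hdfs dfs -ls gz/* | awk -F" " '{print $8}')
do
  # gzip -t returns non-zero when the file is not a valid gzip archive
  if ! hdfs dfs -cat "$i" | gzip -t 2>/dev/null
  then
    echo "$i" >> badgzfiles
  fi
done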

The error below might be related to invalid gz files.


createBlockOutputStream java.io.EOFException: Premature EOF: no length prefix available

Hope this small script is useful for you.

Happy Hadooping.