
Your Hadoop job might fail due to invalid gzip files

Hadoop supports compressed data formats such as gzip, bzip2, and LZO. Gzip is not splittable, so it is suitable only for small files; bzip2 is splittable; and although LZO is not splittable by default, it can be made splittable. Many people use gzip files in Hadoop, but if any of them are corrupt, your job will fail. It is a good idea to test gz files for errors before submitting a job.

The gzip command comes with a -t option that tests an existing gz file. If no errors are found, no message is printed.

hdfs@cluter10-1:~> gzip -t test.txt.gz


If any errors are found, they are printed on the console.

hdfs@cluter10-1:~> gzip -t test.gz

gzip: test.gz: not in gzip format
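Besides printing a message, gzip -t also sets its exit status, so the result can drive shell logic. A minimal sketch (the file names here are examples, not from the original post):

```shell
# Sketch: gzip -t exits 0 for a valid archive and non-zero otherwise,
# so its exit status can be used in an if statement.
printf 'hello world' | gzip > /tmp/valid.gz    # a real gzip file
printf 'not gzip data' > /tmp/corrupt.gz       # plain text named .gz
for f in /tmp/valid.gz /tmp/corrupt.gz; do
  if gzip -t "$f" 2>/dev/null; then
    echo "$f: OK"
  else
    echo "$f: CORRUPT"
  fi
done
```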

We can even check files that are available in HDFS for errors.

hdfs@cluter10-1:~> hdfs dfs -cat /data/sample/test.txt.gz|gzip -t
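In this pipeline the exit status you see is that of gzip -t, the last command, which is exactly the test result we want. A local stand-in (using cat instead of hdfs dfs -cat, with an example file) that also enables pipefail so a failure of the cat step is not masked:

```shell
# Sketch (local stand-in for the HDFS pipeline): a pipeline's exit status
# is that of its last command, so gzip -t determines the result; pipefail
# additionally surfaces failures of the cat step itself.
set -o pipefail
printf 'hello' | gzip > /tmp/sample.gz
cat /tmp/sample.gz | gzip -t && echo "sample.gz: OK"
```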

If you have many gz files in a folder, testing them individually is time-consuming. We can write a small script to check all gz files in a directory for errors.

The first line lists all files in the given directory and uses awk to pick the eighth space-delimited field (the file path) from each line of the hdfs dfs -ls output; the resulting paths are fed to the for loop.

for i in `hdfs dfs -ls gz/*|awk -F" " '{print $8}'`
do
echo "checking $i"
hdfs dfs -cat $i|gzip -t
done

The third line prints the file currently being checked, and the fourth line actually tests it for errors. If your Hadoop job fails with file-related errors and you are using gz files, this check is worth running.
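The same idea works on a local staging directory before you upload files to HDFS; a sketch that also counts the invalid files (the directory and file names are examples):

```shell
# Sketch: check every .gz file in a local staging directory before
# uploading to HDFS, and count the invalid ones.
DIR=/tmp/gz_staging
mkdir -p "$DIR"
printf 'hello' | gzip > "$DIR/good.gz"   # valid archive
printf 'junk' > "$DIR/bad.gz"            # invalid .gz file
bad=0
for f in "$DIR"/*.gz; do
  echo "checking $f"
  gzip -t "$f" 2>/dev/null || { echo "$f is invalid"; bad=$((bad+1)); }
done
echo "$bad invalid file(s) found"
```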

The error below might be caused by invalid gz files.

createBlockOutputStream Premature EOF: no length prefix available

Hope this small script is useful for you.

Happy Hadooping.