
Prerequisites for Learning Apache Hadoop

Apache Hadoop is a framework for processing large data sets on commodity hardware.
It has two core modules, HDFS and MapReduce. HDFS is used for data storage and MapReduce is used for data processing. Hadoop has become the de facto standard for processing large data sets.
Because it is so widely used in industry, many people are now trying to learn Apache Hadoop.
In this article, I will discuss the prerequisites for learning Hadoop.

The Hadoop ecosystem has many tools and is growing fast. The main ones are HDFS and MapReduce, both written in Java. Many technologies are built on top of MapReduce, for example Apache Hive, Apache Pig, Cascading and Apache Crunch. These are also developed in Java. Apache HBase is a NoSQL database. Apart from these, there are smaller tools such as Apache Sqoop and Apache Oozie, which are written in Java as well.

The following skill set is required for learning Apache Hadoop.

Programming Language:

Most of the technologies in the Hadoop ecosystem are written in Java, but Hadoop also supports several other programming languages. AWK and sed can be used through Hadoop Streaming, and C/C++ can be used through Hadoop Pipes. Python can also be used for data processing, again through Streaming.
Java is the most widely used language in Hadoop. Python is also used often, and Scala has become popular following the success of Spark.
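To make the Streaming point concrete, here is a minimal word-count sketch in Python. Streaming passes records to a script on standard input and reads tab-separated key/value lines from its standard output. The function names (`map_lines`, `reduce_lines`, `run_stage`) are my own choices for illustration and are not part of Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one "word<TAB>1" record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive records that share a key.

    Streaming delivers reducer input sorted by key, so equal keys are
    adjacent and groupby can collect them.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_stage(stage, source=sys.stdin, sink=sys.stdout):
    """Hypothetical helper: run the "map" or "reduce" stage over a stream."""
    step = map_lines if stage == "map" else reduce_lines
    for record in step(source):
        sink.write(record + "\n")


# Local dry run: sorted() stands in for the shuffle-and-sort step that
# Hadoop performs between the map and reduce stages.
sample = ["Hadoop stores data", "Hadoop processes data"]
for record in reduce_lines(sorted(map_lines(sample))):
    print(record)
```

On a real cluster, the two stages would normally live in separate scripts (or one script that calls `run_stage` with a command-line argument) and be handed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options.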


SQL:

The Apache Hive query language is very close to ANSI SQL. Apache Pig has many operations similar to SQL: for example, order by, group by and joins are all available in Pig. The same operations are available in Cascading, although there they are exposed as Java classes. HBase also has some commands similar to SQL commands. Beyond the Hadoop ecosystem, many other big data tools provide a SQL-like interface so that people can learn them easily. Cassandra's query language (CQL), for instance, is also similar to SQL.
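As a small illustration of the SQL constructs mentioned above (join, group by, order by), the sketch below runs a query against an in-memory SQLite database using Python's standard library. SQLite is only a stand-in so the example runs anywhere. The table names and rows are made up, and the query has not been tested against Hive, though a query of this shape should look familiar to anyone writing HiveQL.

```python
import sqlite3

# In-memory database with two tiny, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept_id INTEGER, salary INTEGER);
    CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
    INSERT INTO employees VALUES ('asha', 1, 100), ('ben', 1, 80), ('chen', 2, 90);
    INSERT INTO departments VALUES (1, 'engineering'), (2, 'analytics');
""")

# Join the tables, aggregate per department, and sort by the aggregate.
QUERY = """
    SELECT d.dept_name, COUNT(*) AS headcount, SUM(e.salary) AS total_salary
    FROM employees e
    JOIN departments d ON e.dept_id = d.dept_id
    GROUP BY d.dept_name
    ORDER BY total_salary DESC
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

If you are comfortable reading and writing queries like this one, picking up the SQL-like layers of the big data tools becomes much easier.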

Operating System:

You need good operating system skills. Most of the time, Unix-based operating systems are used in production, so if you know any Unix-based OS, your day-to-day work will be easier. If you also know shell scripting, you can be much more productive.


Other Skills:

Apache Sqoop has simple commands that are easy to learn. Apache Oozie applications are defined in XML files. Almost every technology comes with a REST API, and some of these REST APIs return JSON output, so familiarity with XML and JSON helps. As all of these tools are built for parallel computing, it is better to have an understanding of different parallel computing technologies. Last but not least, you need good debugging and troubleshooting skills to resolve day-to-day issues. Otherwise, you may spend several days on a single problem.
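To show what working with JSON output from a REST API can look like, here is a small Python sketch that parses a directory listing. The sample body is hand-written, not captured from a real cluster. Its field names are modeled on the shape that WebHDFS documents for a directory listing, so treat them as an assumption and check the actual response from your own cluster. The `summarize_listing` helper is a hypothetical name of mine.

```python
import json

# Hand-written sample response, not output from a real cluster.
SAMPLE_RESPONSE = """
{
  "FileStatuses": {
    "FileStatus": [
      {"pathSuffix": "logs", "type": "DIRECTORY", "length": 0},
      {"pathSuffix": "part-00000", "type": "FILE", "length": 2048},
      {"pathSuffix": "part-00001", "type": "FILE", "length": 1024}
    ]
  }
}
"""


def summarize_listing(body):
    """Return (file_names, total_bytes) for the FILE entries in a listing."""
    entries = json.loads(body)["FileStatuses"]["FileStatus"]
    files = [entry for entry in entries if entry["type"] == "FILE"]
    names = [entry["pathSuffix"] for entry in files]
    total = sum(entry["length"] for entry in files)
    return names, total


names, total = summarize_listing(SAMPLE_RESPONSE)
print(names, total)
```

In practice the body would come from an HTTP call rather than a string constant, but the parsing step stays the same.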
Feel free to contact me if you have any other questions.