
Cascading for your next Hadoop project



Cascading is a platform for developing data applications on Hadoop. It can process structured, semi-structured, and unstructured data, and it can be used for most business analytics requirements. It is written in Java on top of MapReduce, and there are also versions supporting Python, Ruby, Clojure, and Scala.
In this article, I would like to share a few benefits of using Cascading in your big data projects.




1. No need to think in terms of keys and values


The biggest problem with MapReduce is that you have to think in terms of keys and values on top of your business logic.
MapReduce is a very low-level API. Most of the time, I feel that developing data applications directly in MapReduce is like studying mechanical engineering in order to learn to drive. That is why MapReduce-based tools like Hive and Pig are so widely adopted, and Cascading can be used for the same reason: you do not need to think in the key-value programming paradigm, so you can focus on business logic.
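To make the key-value overhead concrete, here is a minimal sketch of a word count written in the map, shuffle, and reduce style. It is plain in-memory Java, not the Hadoop or Cascading API, and the class and method names (KeyValueWordCount, map, shuffle, reduce) are hypothetical names chosen for illustration. Notice how much of the code is about producing and regrouping (key, value) pairs rather than about the counting itself.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration only: no Hadoop or Cascading classes are used.
class KeyValueWordCount {

    // "Map" step: turn one line of text into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // "Shuffle" step: group all values by key, which a framework would do for you.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" step: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: wires the three steps together.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {big=2, data=1, pipes=1}
        System.out.println(wordCount(List.of("big data", "big pipes")));
    }
}
```

The only business logic here is "split into words and count them"; everything else is key-value plumbing.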



2. Pure Java


When you use MapReduce-based tools like Hive or Pig and need to build complex business logic, you again have to depend on UDFs, which require a programming language like Java or Python. So rather than using Hive plus Java or Pig plus Java in your project, you can depend on a single tool like Cascading and write your entire code in one programming language, Java.


3. Rapid application development


In MapReduce, you write a separate program for the mapper, a separate program for the reducer, and one driver program, so you end up writing more lines of code.
In Cascading, you write only the business logic, so you have fewer lines of code. Because it also has built-in functions, you can develop data applications rapidly. MapReduce has no concept of built-in analytical functions, so you end up writing a lot of code.
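As a rough analogy for what "only business logic plus built-in functions" looks like, here is a word count written as a chain of ready-made operations. This sketch uses the standard Java streams API as a stand-in; it is not Cascading's API, and the class name PipelineStyleWordCount is hypothetical. The point is only the shape of the code: split, group, and count are each one line, with no separate mapper, reducer, or driver.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration using java.util.stream, not the Cascading API.
class PipelineStyleWordCount {

    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                // split every line into lowercase words
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                // drop empty tokens produced by leading whitespace
                .filter(word -> !word.isEmpty())
                // group identical words and count them with a built-in collector
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // prints {big=2, data=1, pipes=1}
        System.out.println(wordCount(List.of("big data", "big pipes")));
    }
}
```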



4. Customizable


Although it is built on top of MapReduce, it allows you to customize the API to fit your requirements.


5. Easy integration


The big data space has many technologies, such as Hadoop, Hive, Sqoop, Oozie, Cassandra, HBase, Solr, Elasticsearch, Teradata, and Splunk, as well as RDBMS systems like Oracle, MySQL, and Postgres. Fortunately, Cascading makes it easy to integrate with all of them.




6. Proven in production


It is used in production by many companies, including Twitter.



7. Very good documentation


Cascading provides good documentation in the form of tutorials and a user guide.
You can start learning easily, and it might not take more than a week to start building your own application.





8. Testable code



Last but not least, with Hive or Pig you may not be able to test your code, whereas Cascading is also suitable for test-driven development.
You can confidently deliver quality applications using Cascading.
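One way to picture this: when business logic lives in ordinary Java methods, it can be checked without a cluster. The sketch below is a hedged illustration in plain Java, not Cascading's own test tooling. The rule itself (flagging an order line whose amount exceeds a threshold) and the names OrderRules and isLargeOrder are hypothetical, invented for this example.

```java
// Hypothetical business rule kept in plain Java so it can be tested in isolation.
class OrderRules {

    // Parses a line of the assumed form "orderId,amount" and reports
    // whether the amount is strictly greater than the threshold.
    static boolean isLargeOrder(String csvLine, double threshold) {
        String[] fields = csvLine.split(",");
        if (fields.length != 2) {
            throw new IllegalArgumentException("expected 2 fields: " + csvLine);
        }
        double amount = Double.parseDouble(fields[1].trim());
        return amount > threshold;
    }

    // Tiny assertion helper so the checks run without any test framework.
    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        check(isLargeOrder("A1, 250.0", 100.0), "250 should exceed 100");
        check(!isLargeOrder("A2,99.5", 100.0), "99.5 should not exceed 100");
        check(!isLargeOrder("A3,100.0", 100.0), "equal amounts are not large");
        System.out.println("all checks passed");
    }
}
```

In a real project the same checks would normally live in a unit test framework such as JUnit, written before the rule itself if you follow test-driven development.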

With all these benefits, I think you can easily consider Cascading for your next Hadoop project.

