Reusable scripts in Hive

If I want to see the top ten rows of a table (users) in Hive,
I will write a query like the one below:

select * from users limit 10;

I will save it to a file, say topn.q,

and run it like below:

hive -f topn.q

The problem with the above script:

1. The table name and the number of rows are hard-coded, so if we have the same requirement on a different table or with a different number, we have to write a new script or modify the existing one.

Because of this, we should write reusable scripts.
We can achieve that with the help of hiveconf in Hive.
hiveconf is handy for substituting variables in a Hive script at runtime.
Let us learn how to avoid hard-coding in the above script by using hiveconf.

Change the above script to the following:

select * from ${hiveconf:tablename} limit ${hiveconf:number};

Save the above script in a file, for example dynatopN.q.

Now we can pass the table name and the number when running the query:

hive -hiveconf tablename=users -hiveconf number=10 -f dynatopN.q

We can even change the table name and the number:

hive -hiveconf tablename=movies -hiveconf number=20 -f dynatopN.q
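The same substitution also works with an inline query passed via -e, which is handy for a quick check (this assumes the users table exists). Note the single quotes: they stop the shell from expanding ${hiveconf:...} so that Hive performs the substitution itself.

```shell
hive -hiveconf tablename=users -hiveconf number=10 \
  -e 'select * from ${hiveconf:tablename} limit ${hiveconf:number};'
```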

We should rarely touch production, so it is better to use hiveconf in production scripts as well.

To achieve the same in Pig, we use the -param option while running the script and the $ symbol inside the script.
If the number of parameters we pass at runtime keeps growing, such scripts become hard to maintain in Hive. Pig, however, provides one more option, -param_file, with which you can specify a file where all the parameter names and values are maintained.
So Pig is more flexible than Hive here.
This approach is also recommended for production scripts: once a query has run successfully, we should avoid touching it as much as possible.
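As a rough sketch of the Pig side (the script name dynatopN.pig, the relation names, and the comma-delimited storage are illustrative assumptions, not from the original post):

```
-- dynatopN.pig: $tablename and $number are substituted at run time
raw_data = LOAD '$tablename' USING PigStorage(',');
top_rows = LIMIT raw_data $number;
DUMP top_rows;
```

Run it with inline parameters, or keep them in a file and point -param_file at it:

```shell
pig -param tablename=users -param number=10 -f dynatopN.pig

# topn.params contains one name=value pair per line:
#   tablename=users
#   number=10
pig -param_file topn.params -f dynatopN.pig
```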

Big data related free courses from Coursera

If you are interested in learning big data, the courses below are good for getting started.

It is conducted by an IIT Delhi professor.
By the end of this course, you will know what web intelligence is,
what its different applications are,
what role big data plays in web intelligence, and what algorithms Google uses on big data.
You will also learn what Hadoop is and what technologies are available in the Hadoop ecosystem.
The good part is that you will be asked to solve some problems.
You need some basics of probability, statistics, and RDBMS.

This one is taught by University of Washington professor Bill Howe.
It covers how to retrieve data and how to classify it.
It even covers the machine learning topics of supervised learning and unsupervised learning.
Visualization, which is also a part of data science, is covered as well.

This one is taught by Stanford University professor Andrew Ng.
Machine learning is about making a machine think and act like a human; in this course you learn how to do that with the help of big data.
You will also learn different machine learning techniques such as supervised learning and unsupervised learning.

This course is similar to the one above, but it covers supervised learning in depth and only a little unsupervised learning.

The good things about these classes are that they are free and that you get practical examples and guest lectures from industry technology leaders.
You will be given homework and practical assignments, and you may also get a statement of accomplishment depending on the course and your performance.

Datasets for practicing Hadoop

To practise Hadoop, you can use the ways below to get big data (GBs of it), so that you can get a real feel for the power of Hadoop.

1. You can get the quarterly full data set of Stack Exchange and use it while practising Hadoop; it contains around 10 GB of data.

2. Different rating data sets have been collected that you can also use for practicing Hadoop.

If you have Hadoop installed on your machine, you can use the following two commands to generate data.

3. hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar randomwriter /random-data

  This generates 10 GB of data per node under the folder /random-data in HDFS.

4. hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar randomtextwriter /random-text-data

  This generates 10 GB of textual data per node under the folder /random-text-data in HDFS.

The path of hadoop-examples.jar may change depending on your Hadoop installation.
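If 10 GB per node is more than your machine can handle, the amount of generated data can be tuned. As an assumption (the property names have changed across Hadoop versions, so check the documentation for your release), the 0.20-era randomwriter honours test.randomwrite.* settings passed with -D:

```shell
# Ask randomwriter for roughly 1 GB in total instead of 10 GB per node.
# The property name is an assumption for 0.20-era Hadoop; newer releases
# use mapreduce.randomwriter.* keys instead.
hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar randomwriter \
  -D test.randomwrite.total_bytes=1073741824 /random-data
```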

5. Amazon provides many public data sets that you can use.

6. Check the answers to the same question on Stack Overflow.

7. The University of Waikato makes many data sets available for practicing machine learning.

8. See the answers to a similar question on Quora.

If you know of any free data sets, please share them in the comments.