(Article) First Steps with Pig and Hadoop
Today I wanted to do some practical work and try out Pig, the system described in Yahoo’s paper “Pig Latin: A Not-So-Foreign Language for Data Processing” (PDF). Pig is a dataflow programming environment for processing large files, built on MapReduce / Hadoop. I will describe what I did step by step, so you can replicate my results if you feel like it. The basic steps to get started are well documented on the Apache website.
1. First I started one of Eric Hammond’s fantastic Ubuntu EC2 AMIs: ami-0757b26e
2. I logged in via ssh and configured the work environment:
- apt-get install sun-java6-jdk ant ant-optional subversion
- mv trunk pig
- export JAVA_HOME=/usr/lib/jvm/java-6-sun
- export PIGDIR=~/pig
- cd pig
- ant
- cd tutorial
- ant
- tar -xzf pigtutorial.tar.gz
Now the build should have been successful and we can try out some basic Pig commands. The PigTutorial is a good starting point.
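Before moving on, it can help to confirm that the files the later steps rely on are really there. The following is only a minimal sketch: `report_missing` is a hypothetical helper (not part of Pig), and the three paths are inferred from the commands above, assuming `pig.jar` ends up directly in `$PIGDIR` and that the tutorial archive unpacks into a `pigtmp` directory next to it.

```shell
#!/bin/sh
# report_missing (hypothetical helper): print every path argument that does
# not exist and return non-zero if at least one was missing.
report_missing() {
    rc=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f"
            rc=1
        fi
    done
    return $rc
}

# Paths inferred from the setup steps; PIGDIR falls back to ~/pig if unset.
PIGDIR=${PIGDIR:-$HOME/pig}
report_missing "$PIGDIR/pig.jar" \
               "$PIGDIR/tutorial/pigtutorial.tar.gz" \
               "$PIGDIR/tutorial/pigtmp" \
    || echo "build or unpack step may not have completed"
```

If nothing is printed, all three paths exist and the tutorial scripts should be ready to run.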
First: Tutorial Pig Script in Local Mode
The Query Phrase Popularity script [...] processes a search query log file from the Excite search engine and finds search phrases that occur with particular high frequency during certain times of the day.
Move to $PIGDIR/tutorial/pigtmp (the directory unpacked from pigtutorial.tar.gz) and execute:
- java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local script1-local.pig
After running the script, the file script1-local-results.txt was successfully created and shows some results. Let’s move on and do something on our own.
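A quick way to look at the output is to print its first few lines. This is a rough sketch that assumes script1-local-results.txt is a plain file in the current directory, as described above; `preview_results` is a hypothetical helper name, and the default of five lines is arbitrary.

```shell
#!/bin/sh
# preview_results (hypothetical helper): fail if the file is missing or empty,
# otherwise print its first N lines (default 5).
preview_results() {
    file=$1
    count=${2:-5}
    if [ ! -s "$file" ]; then
        echo "no results in $file" >&2
        return 1
    fi
    head -n "$count" "$file"
}

# File name taken from the tutorial run; call this from the directory
# in which the script was executed.
preview_results script1-local-results.txt 5 || echo "run script1-local.pig first"
```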
Second: Analyze Wikipedia Logs (again local mode)
I created the directory $PIGDIR/wikianalysis and copied some user log data from Wikipedia into the file ‘stats’. Make sure that the ‘stats’ file does not contain any empty lines: each empty line becomes an empty tuple when the data is loaded into an alias, and you will end up with error messages like...
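One way to get rid of empty lines before loading the file is to filter it with grep. The sketch below is an assumption about how you might do it, not something Pig requires: `strip_blank_lines` is a hypothetical helper, it also drops whitespace-only lines, and the sample rows are made up for the demo (they are not real Wikipedia log data).

```shell
#!/bin/sh
# strip_blank_lines (hypothetical helper): copy a file while dropping lines
# that are empty or contain only whitespace.
#   $1 = input file, $2 = output file
strip_blank_lines() {
    grep -v '^[[:space:]]*$' "$1" > "$2"
}

# Demo on throwaway sample data in a temporary directory.
tmpdir=$(mktemp -d)
printf 'alice\t3\n\nbob\t5\n   \ncarol\t1\n' > "$tmpdir/stats"
strip_blank_lines "$tmpdir/stats" "$tmpdir/stats.clean"

# Print the number of lines that survived the filter.
wc -l < "$tmpdir/stats.clean"
```

For the real data you would run something like `strip_blank_lines stats stats.clean` inside $PIGDIR/wikianalysis and load `stats.clean` instead of `stats`.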
Courtesy: http://markusklems.wordpress.com/
