Pig is a project that Yahoo donated to Apache. It is currently in the Apache Incubator, but its basic functionality is already usable. Today I would like to introduce this useful Pig. Pig is a SQL-like language: a high-level query language built on top of MapReduce. Some of its operations are compiled into the map and reduce phases of the MapReduce model, and users can define their own functions. It was developed by Yahoo's grid computing department as another clone of a Google project, Sawzall.
Supported operations
Arithmetic operators: +, -, *, /
Multiple data types: string, int, float, long
Comparison operators: ==, !=, >, >=, <, <=, eq, neq, gt, gte, lt, lte, matches
Complex data types: bag, tuple, map
Relational operations: FILTER, GROUP BY, ORDER, DISTINCT, UNION, JOIN, FOREACH ... GENERATE
Aggregate functions: COUNT, SUM, AVG, MIN, MAX
Primitive data types supported by Pig: int, long, float, double, char array, byte array
Pig internal data types:
Bag: a collection of tuples, written as {<1,2>,<3,4>}
Tuple: an ordered array of fields, written as <pig,3.14>
Map: a set of key/value pairs, written as ['pig':<'load','store'>, 'web':'hadoop.apache.org']
Atom: a single primitive value, stored as a string, which can also be converted to a numeric type. Written as 'apache.org' or '2.3'.
Data representation:
t = <1, {<2,3>,<4,6>,<5,7>}, ['apache':'hadoop']>
In the example above, t refers to a tuple with three fields: f1, f2, and f3. We can access 1 through t.f1 or t.$0, {<2,3>,<4,6>,<5,7>} through t.f2 or t.$1, and ['apache':'hadoop'] through t.f3 or t.$2.
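The access rules above can be modeled in a few lines of plain Python. This is only a sketch of the idea, not Pig code: the bag is modeled as a list of tuples, the map as a dict, and the helper `field` and the constant `FIELD_NAMES` are hypothetical names introduced for illustration.

```python
# Plain-Python model of the tuple t from the text (not Pig code).
t = (
    1,                          # f1 / $0: an atom
    [(2, 3), (4, 6), (5, 7)],   # f2 / $1: a bag, modeled as a list of tuples
    {"apache": "hadoop"},       # f3 / $2: a map, modeled as a dict
)

# Hypothetical field names, mirroring f1, f2, f3 in the text.
FIELD_NAMES = ("f1", "f2", "f3")


def field(tup, ref):
    """Look up a field by name ('f2') or by position ('$1')."""
    if ref.startswith("$"):
        return tup[int(ref[1:])]
    return tup[FIELD_NAMES.index(ref)]


print(field(t, "f1"), field(t, "$0"))      # both refer to the atom 1
print(field(t, "f2") == field(t, "$1"))    # same bag either way
print(field(t, "f3") == field(t, "$2"))    # same map either way
```

Name-based and position-based references resolve to the same field, which is the point the example in the text makes.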
Pig can be run in either local mode or cluster mode. Let's use an Apache log file and our example Pig script to explain the Pig language. Our log (access.log) contains access records for many days, and we need to know how many page accesses each IP made per hour on January 30, 2007. Before running the program, please make sure you are running Java 1.5 and have downloaded the example file.
Local mode (non-Windows systems only):
Please delete the hadoop-site.xml file. Run:
java -cp .:pig.jar org.apache.pig.Main -x local log.pig
Cluster mode (Windows systems supported):
Make sure your Hadoop cluster is version 0.17.0, and modify the values of fs.default.name, mapred.job.tracker, and mapred.system.dir in hadoop-site.xml so that they match your cluster. Run:
java -cp .:pig.jar org.apache.pig.Main log.pig
View results:
cat logs/20070130;
Script explanation:
Copy access.log to HDFS using Hadoop's copyFromLocal command
copyFromLocal access.log access.log;
Register a jar file that contains user-defined functions (UDFs)
REGISTER udfs.jar;
Set the MapReduce job name
set job.name 'hadoop.org.cn log parser';
Load the log file using a user-defined function
in = LOAD 'access.log' USING cn.org.hadoop.pig.storage.LogStorage();
Because the date format in the NCSA log format is "21/Jan/2007:15:29:24 +0800", convert it to the format 20070121152924
Gen = FOREACH in GENERATE $0, cn.org.hadoop.pig.time.formattime($), *;
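The UDF's source is not shown here, so the following is only a sketch of the conversion it is described as performing, written in plain Python. The function name `format_time` is hypothetical, and the sketch assumes English month abbreviations and that the local wall-clock time is kept (the UTC offset is parsed but not applied).

```python
from datetime import datetime


def format_time(ncsa_timestamp):
    """Convert an NCSA log timestamp to a compact YYYYMMDDHHMMSS string.

    The UTC offset is parsed so that malformed input is rejected,
    but the local wall-clock time is kept unchanged.
    """
    parsed = datetime.strptime(ncsa_timestamp, "%d/%b/%Y:%H:%M:%S %z")
    return parsed.strftime("%Y%m%d%H%M%S")


print(format_time("21/Jan/2007:15:29:24 +0800"))  # 20070121152924
```

A compact, fixed-width timestamp like this sorts correctly as a string and can be filtered with a simple prefix pattern, which the later steps of the script rely on.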
Filter out abnormal rows
result = FILTER Gen BY (NOT IsEmpty($));
Store the result in the temp directory under the user's HDFS directory
STORE result INTO 'temp';
Reset the MapReduce job name
set job.name 'hadoop.org.cn filter parser';
Load the files in the temp directory using the default storage function (PigStorage)
A = LOAD 'temp' AS (ip,date,method,url,protocol,code,bytes);
Extract the records dated 2007-01-30
B = FILTER A BY (date matches '20070130.*');
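As a rough illustration of what this filter selects, here is a plain-Python sketch. It assumes that matches tests the whole string against the regular expression, as Java's regex matching does; the helper `matches` and the sample timestamps are made up for the example.

```python
import re


def matches(value, pattern):
    """Return True if the whole string matches the regular expression."""
    return re.fullmatch(pattern, value) is not None


# Made-up timestamps in the 14-digit format produced by the earlier step.
dates = [
    "20070129235959",
    "20070130000001",
    "20070130152924",
    "20070131000000",
]

selected = [d for d in dates if matches(d, "20070130.*")]
print(selected)  # ['20070130000001', '20070130152924']
```

Because the timestamp starts with the date, the pattern only needs to pin the first eight characters and let `.*` absorb the time of day.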
Because we only care about per-hour results, we call the user-defined function extracttime to extract the hour of the day
C = FOREACH B GENERATE ip, cn.org.hadoop.pig.time.extracttime(date, '8', ') AS hour;
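The exact arguments of the UDF are not recoverable from the line above, so this plain-Python sketch only shows the underlying idea: in a 14-digit YYYYMMDDHHMMSS timestamp, the hour occupies the two characters starting at offset 8. The function name `extract_hour` is hypothetical.

```python
def extract_hour(timestamp):
    """Return the two-digit hour from a YYYYMMDDHHMMSS timestamp."""
    # Offsets 0-3 hold the year, 4-5 the month, 6-7 the day, 8-9 the hour.
    return timestamp[8:10]


print(extract_hour("20070130152924"))  # 15
```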
Group the records by IP and hour
D = GROUP C BY (ip,hour);
Count the number of page accesses for each IP in each hour
E = FOREACH D GENERATE FLATTEN($), COUNT($);
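The group-then-count step can be pictured with a short plain-Python sketch. It is not how Pig executes the query (Pig compiles it to MapReduce jobs); it only shows the result shape. The IP addresses and rows are made up for illustration.

```python
from collections import Counter

# Made-up (ip, hour) rows, standing in for relation C.
rows = [
    ("10.0.0.1", "15"),
    ("10.0.0.1", "15"),
    ("10.0.0.2", "15"),
    ("10.0.0.1", "16"),
]

# Grouping by the (ip, hour) pair and counting each group's members.
counts = Counter(rows)

# Flatten each group key back into columns next to its count.
result = sorted((ip, hour, n) for (ip, hour), n in counts.items())
for row in result:
    print(row)
```

Each output row carries the two grouping fields followed by the number of records in that group, which mirrors flattening the group key and counting the grouped bag.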
Sort by hour in descending order
F = ORDER E BY USING cn.org.hadoop.pig.sort.Desc;
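A minimal plain-Python sketch of the descending sort, assuming the hour is a zero-padded two-digit string so that string order agrees with numeric order. The rows are made up, and this does not reproduce the comparator class used in the script.

```python
# Made-up (ip, hour, count) rows, standing in for relation E.
rows = [
    ("10.0.0.1", "15", 2),
    ("10.0.0.2", "09", 1),
    ("10.0.0.1", "16", 1),
]

# Sort on the hour column, latest hour first.
ordered = sorted(rows, key=lambda row: row[1], reverse=True)
for row in ordered:
    print(row)
```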
Store the results in the output directory
STORE F INTO 'logs/20070130';
The UDF source code is included in the udfs.jar file in the compressed package. The package also contains pig.vim, which provides syntax highlighting for the Pig language in Vim.
pig.vim installation:
1. Copy pig.vim to the ~/.vim/syntax/ directory
2. Edit ~/.vimrc and add the following lines:
augroup filetypedetect
au BufNewFile,BufRead *.pig set filetype=pig syntax=pig
augroup END
For more information about the Pig language, please visit the Pig Wiki.