Pig language


Pig is a project that Yahoo donated to Apache; it is currently in the Apache Incubator, but its basic functionality is already available. Today I would like to introduce you to this useful Pig. Pig Latin is an SQL-like, high-level query language built on top of MapReduce: its operations are compiled into the map and reduce stages of the MapReduce model, and users can also define their own functions. Yahoo's grid computing department developed it as a counterpart to a Google project: Sawzall.

Supported operations
Arithmetic operators: +, -, *, /
Multiple data types: string, int, float, long
Comparison operators: ==, !=, >=, <=, eq, neq, gt, gte, lt, lte, matches
Complex data types: bag, tuple, map
Relational operations: FILTER, GROUP BY, ORDER, DISTINCT, UNION, JOIN, FOREACH ... GENERATE
Aggregate functions: COUNT, SUM, AVG, MIN, MAX
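
As a rough sketch of how these operations combine in a script (the file name and schema below are hypothetical illustrations, not part of the original example):

```pig
-- Load a tab-separated file; the schema is illustrative only
A = LOAD 'scores.txt' AS (name, subject, score);
-- Comparison inside FILTER: keep only passing scores
B = FILTER A BY score >= 60;
-- GROUP plus an aggregate: count passing rows per subject
C = GROUP B BY subject;
D = FOREACH C GENERATE FLATTEN($0), COUNT($1);
STORE D INTO 'passing_by_subject';
```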

Raw types supported by Pig: int, long, float, double, chararray, bytearray

Pig internal data types:
Bag: a collection of tuples, written as: {<1,2>,<3,4>}
Tuple: an ordered list of fields, written as: <pig, 3.14>
Map: a set of key/value pairs, written as: ['pig':<'load','store'>, 'web':'hadoop.apache.org']
Atom: a single atomic value, stored as a string; it can also be interpreted as a number. Written as: 'apache.org' or '2.3'.

Data representation:

T = <1, {<2,3>,<4,6>,<5,7>}, ['Apache':'Hadoop']>

In the example above, the tuple is assigned to T, so T has three fields f1, f2, and f3. We can access 1 through T.f1 or T.$0, access {<2,3>,<4,6>,<5,7>} through T.f2 or T.$1, and access ['Apache':'Hadoop'] through T.f3 or T.$2.
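
Field access works the same way inside a script; a minimal sketch (the relation and file names are hypothetical):

```pig
-- Suppose each row of 'data.txt' is a tuple with three fields
T = LOAD 'data.txt' AS (f1, f2, f3);
-- Positional and named references are interchangeable
X = FOREACH T GENERATE $0, f2;  -- equivalent to GENERATE f1, $1
```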

Pig can run either in local mode or on a cluster. Let's use an Apache log file to explain the Pig language through an example script. Our log (access.log) contains many days of access records, and we want to know how many times each IP accessed each page per hour on January 30, 2007. Before running the program, please make sure you are running Java 1.5 and have downloaded the example files.

Local mode (non-Windows systems only):
Please delete the hadoop-site.xml file, then run:

java -cp .:pig.jar org.apache.pig.Main -x local log.pig

Cluster mode (Windows systems supported):
Make sure your Hadoop cluster is version 0.17.0, and modify the values of fs.default.name, mapred.job.tracker, and mapred.system.dir in hadoop-site.xml so that they match your cluster. Then run:

java -cp .:pig.jar org.apache.pig.Main log.pig

View results:

cat logs/20070130;

Script Explanation:

Copy access.log to HDFS using Hadoop's copyFromLocal command:

copyFromLocal access.log access.log;

Register a jar file that contains user-defined functions (UDFs):

REGISTER udfs.jar;

Set the MapReduce job name:

set job.name 'hadoop.org.cn log parser';

Load the log file with a user-defined storage function:

in = LOAD 'access.log' USING cn.org.hadoop.pig.storage.LogStorage();

Because dates in the NCSA log format look like "21/Jan/2007:15:29:24 +0800", convert them to the 20070121152924 format:

gen = FOREACH in GENERATE $0, cn.org.hadoop.pig.time.FormatTime($1), *;

Filter out abnormal rows:

result = FILTER gen BY (NOT IsEmpty($1));

Store the result in the user's temp directory on HDFS:

STORE result INTO 'temp';

Reset the MapReduce job name:

set job.name 'hadoop.org.cn filter parser';

Load the files in the temp directory with the default storage function (PigStorage):

A = LOAD 'temp' AS (ip, date, method, url, protocol, code, bytes);

Extract the records dated 2007-01-30:

B = FILTER A BY (date matches '20070130.*');

Because we only care about per-hour results, call the user-defined function ExtractTime to extract the hour of the day (the hour occupies positions 8-10 of the 20070121152924 format):

C = FOREACH B GENERATE ip, cn.org.hadoop.pig.time.ExtractTime(date, '8', '10') as hour;

Group with the GROUP operation:

D = GROUP C BY (ip, hour);

Count how many times each IP accessed pages in each hour:

E = FOREACH D GENERATE FLATTEN($0), COUNT($1);

Sort in descending order of hour:

F = ORDER E BY $1 USING cn.org.hadoop.pig.sort.Desc;

Store the result in the output directory:

STORE F INTO 'logs/20070130';

The downloadable archive contains the source code of the udfs.jar file, as well as pig.vim, a syntax-highlighting file for the Pig language under Vim.
pig.vim installation:
1. Copy pig.vim to the ~/.vim/syntax/ directory
2. Edit ~/.vimrc and add the following lines:
augroup filetypedetect
au BufNewFile,BufRead *.pig set filetype=pig syntax=pig
augroup END

For more information about the Pig language, please visit the Pig wiki.
