Pig language


Pig is a project that Yahoo donated to Apache; it is currently in the Apache Incubator, but the basic functionality is already usable. Today I would like to introduce this useful Pig. Pig is a SQL-like, high-level query language built on top of MapReduce: its operations are compiled into the map and reduce stages of the MapReduce model, and users can define their own functions. Yahoo's grid computing department developed it as a counterpart to Google's Sawzall project.

Supported operations
Arithmetic operators: +, -, *, /
Data types: string, int, float, long
Comparison operators: ==, !=, >=, <=, eq, neq, gt, gte, lt, lte, matches
Complex data types: bag, tuple, map
Relational operators: FILTER, GROUP BY, ORDER, DISTINCT, UNION, JOIN, FOREACH ... GENERATE
Aggregate functions: COUNT, SUM, AVG, MIN, MAX (a combined example follows below)
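
As a quick illustration of how these operations fit together, here is a minimal sketch of a Pig script (the file name and field names are made up for illustration):

grades = LOAD 'grades.txt' AS (name, score);               -- load tab-separated records
passed = FILTER grades BY score >= 60;                     -- keep passing scores only
by_name = GROUP passed BY name;                            -- group the remaining records by student
counts = FOREACH by_name GENERATE FLATTEN($0), COUNT($1);  -- name and number of passing records
STORE counts INTO 'pass_counts';                           -- write the result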

Primitive types supported by Pig: int, long, float, double, chararray, bytearray

Pig's internal data types:
Bag: a collection of tuples, written as {<1,2>,<3,4>}
Tuple: an ordered list of fields, written as <pig, 3.14>
Map: a set of key/value pairs, written as ['pig':<'load','store'>, 'web':'hadoop.apache.org']
Atom: a single atomic value, stored as a string but convertible to a numeric type, written as 'apache.org' or '2.3'

Data representation:

T = <1, {<2,3>,<4,6>,<5,7>}, ['apache':'hadoop']>

In the example above, the tuple is assigned to T, so T has three fields f1, f2, and f3. We can access 1 through t.f1 or t.$0, access {<2,3>,<4,6>,<5,7>} through t.f2 or t.$1, and access ['apache':'hadoop'] through t.f3 or t.$2.
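
The same positional and named references can be used inside a script. A minimal sketch, assuming some load function that produces tuples shaped like T above (the file name records.txt is hypothetical):

records = LOAD 'records.txt';                      -- assume each record is a tuple like T
parts = FOREACH records GENERATE $0, $2#'apache';  -- project the atom and a map lookup by key
inner = FOREACH records GENERATE FLATTEN($1);      -- one output tuple per element of the inner bag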

Pig can run either locally or on a cluster. Let's use an Apache log file to explain the Pig language through an example script. Our log (access.log) contains many days of access records; we want to know how many times each IP accessed pages in each hour of January 30, 2007. Before running the program, please make sure you are running Java 1.5 and have downloaded the example files.

Local mode (non-Windows systems only):
Delete the hadoop-site.xml file, then run:

java -cp .:pig.jar org.apache.pig.Main -x local log.pig

Cluster mode (Windows systems supported):
Make sure your cluster is running Hadoop 0.17.0, and modify the values of fs.default.name, mapred.job.tracker, and mapred.system.dir in hadoop-site.xml so that they match your cluster.
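
The hadoop-site.xml entries mentioned above would look roughly like this (the host names, port numbers, and path are placeholders; substitute the values of your own cluster):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/hadoop/mapred/system</value>
  </property>
</configuration>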

java -cp .:pig.jar org.apache.pig.Main log.pig

View results:

cat logs/20070130;

Script Explanation:

Copy access.log to HDFS using Hadoop's copyFromLocal command

copyFromLocal access.log access.log;

Register the jar file that contains the user-defined functions (UDFs)

REGISTER udfs.jar;

Set the MapReduce job name

set job.name 'hadoop.org.cn log parser';

Load the log file using a user-defined load function

in = LOAD 'access.log' USING cn.org.hadoop.pig.storage.LogStorage();

Because the date in the NCSA log format looks like "21/Jan/2007:15:29:24 +0800", convert it to the form 20070121152924

gen = FOREACH in GENERATE $0, cn.org.hadoop.pig.time.FormatTime($1), $2, $3, $4, $5, $6;

Filter out abnormal rows.

result = FILTER gen BY NOT IsEmpty($1);

Store the result in the user's temp directory on HDFS

STORE result INTO 'temp';

Reset the MapReduce job name

set job.name 'hadoop.org.cn filter parser';

Load the files in the temp directory with the default load function (PigStorage)

A = LOAD 'temp' AS (ip, date, method, url, protocol, code, bytes);

Extract the records dated 2007-01-30

B = FILTER A BY (date matches '20070130.*');

Because we only care about per-hour results, call the user-defined function ExtractTime to extract the hour of the day

C = FOREACH B GENERATE ip, cn.org.hadoop.pig.time.ExtractTime(date, '8', '10') AS hour;

Group the records with the GROUP operator

D = GROUP C BY (ip, hour);

Count how many times each IP accessed pages in each hour

E = FOREACH D GENERATE FLATTEN($0), COUNT($1);
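
For readers new to GROUP: each record of D has two fields, the group key (ip, hour) in $0 and a bag holding the matching tuples of C in $1. Using the notation from earlier in this article, one record of D might look like this (the IP and hour values are invented purely to show the shape):

<(10.0.0.1, 13), {<10.0.0.1, 13>, <10.0.0.1, 13>, <10.0.0.1, 13>}>

FLATTEN($0) turns the group key back into separate ip and hour columns, and COUNT($1) counts the tuples in the bag, so the corresponding record of E would be <10.0.0.1, 13, 3>.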

Sort in descending order of hour

F = ORDER E BY $1 USING cn.org.hadoop.pig.sort.Desc;

Store the results to the output directory

STORE F INTO 'logs/20070130';
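
Putting the statements above together (excluding the initial copyFromLocal step), the complete log.pig reads roughly as follows; treat it as a sketch, since the exact UDF arguments may differ in the distributed example files:

REGISTER udfs.jar;
set job.name 'hadoop.org.cn log parser';
in = LOAD 'access.log' USING cn.org.hadoop.pig.storage.LogStorage();
gen = FOREACH in GENERATE $0, cn.org.hadoop.pig.time.FormatTime($1), $2, $3, $4, $5, $6;
result = FILTER gen BY NOT IsEmpty($1);          -- drop rows whose date could not be parsed
STORE result INTO 'temp';
set job.name 'hadoop.org.cn filter parser';
A = LOAD 'temp' AS (ip, date, method, url, protocol, code, bytes);
B = FILTER A BY (date matches '20070130.*');     -- keep January 30, 2007 only
C = FOREACH B GENERATE ip, cn.org.hadoop.pig.time.ExtractTime(date, '8', '10') AS hour;
D = GROUP C BY (ip, hour);
E = FOREACH D GENERATE FLATTEN($0), COUNT($1);   -- ip, hour, number of accesses
F = ORDER E BY $1 USING cn.org.hadoop.pig.sort.Desc;
STORE F INTO 'logs/20070130';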

The compressed package contains the source code for udfs.jar, and it also includes pig.vim, a syntax file that highlights the Pig language in Vim.
pig.vim installation:
1. Copy pig.vim to the ~/.vim/syntax/ directory
2. Edit ~/.vimrc and add the following lines:
augroup filetypedetect
au BufNewFile,BufRead *.pig set filetype=pig syntax=pig
augroup END

For more information about the Pig language, please visit the Pig Wiki.
