Pig Group Usage Example

Source: Internet
Author: User

The GROUP statement collects records that share the same key value. It differs fundamentally from grouping in SQL, where the groups created by a GROUP BY clause must be fed directly into one or more aggregate functions. In Pig Latin there is no direct relationship between GROUP and aggregate functions. The keyword does exactly what its name says: it gathers all records that share a particular key value into a bag, after which you can pass the result to an aggregate function or use it for some other processing. A GROUP triggers a reduce phase.

The contents of the data file are as follows:
$ cat orders.data
1       apple     30    x
2       apple     50    x
3       banana    30    y
4       pear      20    y
5       banana    10    y

  

Load Data and Group
data = load '/orders.data' as (orderid:int, fruit:chararray, amount:int);
GRPD = group data by fruit;

  

Viewing the schema of the grouped relation shows that it has only two fields: group (the grouping key) and data (named after the alias of the relation that was grouped, and holding a bag of all records that share that key).
describe GRPD;
GRPD: {group:chararray,data: {(orderid:int,fruit:chararray,amount:int)}}

  

View grouped data
dump GRPD;
(pear,{(4,pear,20)})
(apple,{(2,apple,50),(1,apple,30)})
(banana,{(5,banana,10),(3,banana,30)})
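
To make the bag idea concrete, here is a minimal sketch in plain Python rather than Pig Latin. It assumes the five sample rows from the dump above; `group_by` is a hypothetical helper, not part of Pig, and a Python list stands in for Pig's unordered bag.

```python
from collections import defaultdict

# Rows mirror the (orderid, fruit, amount) tuples from the sample data.
orders = [
    (1, "apple", 30), (2, "apple", 50), (3, "banana", 30),
    (4, "pear", 20), (5, "banana", 10),
]

def group_by(rows, key):
    """Collect whole rows into one bag (list) per key, like GROUP ... BY."""
    bags = defaultdict(list)
    for row in rows:
        bags[key(row)].append(row)
    return dict(bags)

# Each entry pairs the key ("group") with the bag of complete rows.
grpd = group_by(orders, key=lambda row: row[1])
```

Note that nothing is aggregated here: every original row survives inside its bag, which is the difference from SQL's GROUP BY described earlier.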

  

Use an aggregate function to process a grouped result set. For example, to sum the amount column of each bag:
sums = foreach GRPD generate group, SUM(data.amount);
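
Because grouping and aggregation are separate steps, an aggregate is simply a function applied to each bag. A rough Python sketch of that idea, assuming the same sample rows and using a hypothetical `bags` dictionary in place of the grouped relation:

```python
# Hypothetical grouped result: fruit -> bag of (orderid, fruit, amount) rows.
bags = {
    "pear": [(4, "pear", 20)],
    "apple": [(2, "apple", 50), (1, "apple", 30)],
    "banana": [(5, "banana", 10), (3, "banana", 30)],
}

# Aggregation is a second, independent step: sum the amount column of each bag.
sums = {fruit: sum(amount for _, _, amount in bag) for fruit, bag in bags.items()}

# The same bags can feed any other computation, e.g. a per-group row count.
counts = {fruit: len(bag) for fruit, bag in bags.items()}
```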

The grouping key can also be an expression:
group data by $0 + $1;

  

Group by multiple keys
After grouping, the relation still has two fields: group, which is now a tuple of the keys, and a bag.
group data by (field1, field2)
orders = load '/orders.data' as (orderid:int, fruit:chararray, amount:int, type:chararray);
GRPD = group orders by (fruit, type);
describe GRPD;
GRPD: {group: (fruit:chararray,type:chararray),orders: {(orderid:int,fruit:chararray,amount:int,type:chararray)}}
dump GRPD;
((pear,y),{(4,pear,20,y)})
((apple,x),{(2,apple,50,x),(1,apple,30,x)})
((banana,y),{(5,banana,10,y),(3,banana,30,y)})

  

sums = foreach GRPD generate group, SUM(orders.amount);
dump sums;
((pear,y),20)
((apple,x),80)
((banana,y),40)

  

sums2 = foreach GRPD generate group.$0, group.$1, SUM(orders.amount);
dump sums2;
(pear,y,20)
(apple,x,80)
(banana,y,40)
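
The difference between keeping the key as a tuple and projecting its parts can be sketched in Python. This is only an illustration of the semantics under the sample data shown earlier, not Pig code; the variable names simply mirror the Pig aliases.

```python
from collections import defaultdict

# Sample rows as (orderid, fruit, amount, type).
orders = [
    (1, "apple", 30, "x"), (2, "apple", 50, "x"), (3, "banana", 30, "y"),
    (4, "pear", 20, "y"), (5, "banana", 10, "y"),
]

# Group on a composite key: the key itself is a (fruit, type) tuple.
bags = defaultdict(list)
for row in orders:
    bags[(row[1], row[3])].append(row)

# Like `generate group, SUM(...)`: the key stays a nested tuple.
sums = [(key, sum(r[2] for r in bag)) for key, bag in bags.items()]

# Like `generate group.$0, group.$1, SUM(...)`: the key is unpacked into columns.
sums2 = [(key[0], key[1], sum(r[2] for r in bag)) for key, bag in bags.items()]
```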

  

Group All
Put all the data in a relation into a single group.
GRPD = group orders all;
describe GRPD;
GRPD: {group:chararray,orders: {(orderid:int,fruit:chararray,amount:int,type:chararray)}}
dump GRPD;
(all,{(5,banana,10,y),(4,pear,20,y),(3,banana,30,y),(2,apple,50,x),(1,apple,30,x)})
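
A common reason to group everything is to compute a whole-relation aggregate. The following Python sketch models that under the same sample rows; the list-of-pairs representation is an assumption made for illustration only.

```python
# Sample rows as (orderid, fruit, amount, type).
orders = [
    (1, "apple", 30, "x"), (2, "apple", 50, "x"), (3, "banana", 30, "y"),
    (4, "pear", 20, "y"), (5, "banana", 10, "y"),
]

# GROUP ... ALL: a single output row whose key is the literal string "all"
# and whose bag holds every input row.
grpd = [("all", list(orders))]

# Typical follow-up: aggregate over the whole relation.
key, bag = grpd[0]
total = sum(row[2] for row in bag)
count = len(bag)
```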

  

Cogroup: group multiple datasets
A = load 'data1' as (owner:chararray, pet:chararray);
dump A;
(Alice,turtle)
(Alice,goldfish)
(Alice,cat)
(Bob,dog)
(Bob,cat)
B = load 'data2' as (friend1:chararray, friend2:chararray);
dump B;
(Cindy,Alice)
(Mark,Alice)
(Paul,Bob)
(Paul,Jane)
X = cogroup A by owner, B by friend2;
describe X;
X: {group:chararray,A: {owner:chararray,pet:chararray},B: {friend1:chararray,friend2:chararray}}
dump X;
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
(Jane,{},{(Paul,Jane)})
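
The shape of the cogroup result, one bag per input relation for each key with an empty bag where a relation has no match, can be sketched in Python. `cogroup` here is a hypothetical helper written for this illustration, and it does not model Pig's handling of null keys.

```python
def cogroup(a, a_key, b, b_key):
    """Return {key: (bag_from_a, bag_from_b)}; a missing side is an empty list."""
    result = {}
    for row in a:
        result.setdefault(a_key(row), ([], []))[0].append(row)
    for row in b:
        result.setdefault(b_key(row), ([], []))[1].append(row)
    return result

A = [("Alice", "turtle"), ("Alice", "goldfish"), ("Alice", "cat"),
     ("Bob", "dog"), ("Bob", "cat")]
B = [("Cindy", "Alice"), ("Mark", "Alice"), ("Paul", "Bob"), ("Paul", "Jane")]

# Key A on its first column (owner) and B on its second column (friend2).
X = cogroup(A, lambda r: r[0], B, lambda r: r[1])
```

Jane appears only in B, so her entry carries an empty bag for A, matching the last row of the dump.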

  

PARTITION BY and PARALLEL n
A = load 'input_data';
B = group A by $0 partition by org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;

  

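SimpleCustomPartitioner:
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.pig.impl.io.PigNullableWritable;

public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> {

    @Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        // Integer keys are assigned by value modulo the number of partitions.
        if (key.getValueAsPigType() instanceof Integer) {
            int ret = (((Integer) key.getValueAsPigType()).intValue() % numPartitions);
            return ret;
        } else {
            // Any other key type falls back to its hash code.
            return (key.hashCode()) % numPartitions;
        }
    }
}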
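
The routing decision made by this partitioner can be sketched in a few lines of Python. `get_partition` is a hypothetical stand-in for the Java method, and one difference is worth flagging: Python's `%` never returns a negative result for a positive divisor, whereas Java's `%` can, so this is a sketch of the branching logic rather than an exact port.

```python
def get_partition(key, num_partitions):
    """Pick a reducer index for a grouping key, following the two branches above."""
    if isinstance(key, int):
        # Integer keys are spread by value: key modulo the number of partitions.
        return key % num_partitions
    # Any other key type falls back to its hash.
    return hash(key) % num_partitions

# With PARALLEL 2, integer keys alternate between the two reducers.
assignments = [get_partition(k, 2) for k in (1, 2, 3, 4, 5)]
```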

  

NULL value handling
Null is a special grouping key: all tuples whose key is null are collected into a single group.
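
As a small illustration of that rule, the Python sketch below uses `None` as a stand-in for a Pig null. The rows are hypothetical and were made up for this example.

```python
from collections import defaultdict

# Hypothetical (orderid, fruit) rows; None stands in for a null fruit.
rows = [(1, "apple"), (2, None), (3, "apple"), (4, None)]

bags = defaultdict(list)
for row in rows:
    # None is a valid dictionary key, so every null-keyed row lands in one bag.
    bags[row[1]].append(row)
```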
