The GROUP statement collects records that share the same key value. It differs in an essential way from grouping in SQL, where the groups created by a GROUP BY clause must be fed directly into one or more aggregate functions. In Pig Latin there is no direct link between GROUP and aggregate functions. GROUP does exactly what its name says: it packs all records that have the same key value into a bag, which you can then pass to an aggregate function or use for some other processing. A GROUP statement triggers a reduce phase.

The contents of the data file are as follows:
$ cat orders.data
1 apple 30 x
2 apple 50 x
3 banana 30 y
4 pear 20 y
5 banana 10 y
Load the data and group it:

data = LOAD '/orders.data' AS (orderid:int, fruit:chararray, amount:int);
grpd = GROUP data BY fruit;
The schema of the grouped relation has only two fields: group (the grouping key) and data (named after the alias of the relation that was grouped; it is a bag holding all records of that group).

DESCRIBE grpd;
grpd: {group:chararray,data: {(orderid:int,fruit:chararray,amount:int)}}
View the grouped data:

DUMP grpd;
(pear,{(4,pear,20)})
(apple,{(2,apple,50),(1,apple,30)})
(banana,{(5,banana,10),(3,banana,30)})
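To make the shape of the result concrete, here is a small Python model of what GROUP produces. It is not Pig code, only a sketch of the semantics: each key maps to a bag that holds the complete original records. The function name group_by is made up for this illustration, and the list order inside a bag is incidental, since Pig bags are unordered.

```python
from collections import defaultdict

# Rows of the sample file as (orderid, fruit, amount) tuples.
orders = [
    (1, "apple", 30),
    (2, "apple", 50),
    (3, "banana", 30),
    (4, "pear", 20),
    (5, "banana", 10),
]

def group_by(records, key_index):
    """Return {key: bag}; each bag is a list of whole records (hypothetical helper)."""
    bags = defaultdict(list)
    for record in records:
        bags[record[key_index]].append(record)
    return dict(bags)

grpd = group_by(orders, 1)
for key, bag in grpd.items():
    print(key, bag)
```

Note that the key is not removed from the records: the fruit appears both as the group key and inside every record of the bag, just as in the dump.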
Use an aggregate function to process the grouped result set:

sums = FOREACH grpd GENERATE group, SUM(data.amount);
DUMP sums;
(pear,20)
(apple,80)
(banana,40)
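Because grouping and aggregation are separate steps, the same bags can be handed to any function. The Python sketch below models that idea with the bags from the sample data; the helper name aggregate is an assumption for illustration and does not correspond to any Pig API.

```python
# Bags keyed by fruit, as GROUP produces them for the sample data.
grpd = {
    "pear": [(4, "pear", 20)],
    "apple": [(2, "apple", 50), (1, "apple", 30)],
    "banana": [(5, "banana", 10), (3, "banana", 30)],
}

def aggregate(grouped, func, field_index):
    """Apply func to one column of every bag (hypothetical helper)."""
    return {
        key: func([record[field_index] for record in bag])
        for key, bag in grouped.items()
    }

sums = aggregate(grpd, sum, 2)     # total amount per fruit
counts = aggregate(grpd, len, 2)   # number of orders per fruit
largest = aggregate(grpd, max, 2)  # largest single order per fruit
print(sums, counts, largest)
```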
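The grouping key can also be an expression over fields rather than a single field:

GROUP data BY $0 + $1;

Group by multiple keys
To group on several keys, list them in parentheses, as in GROUP data BY (field1, field2). The grouped relation still has two fields: group, which is now a tuple of the key fields, and the bag of matching records.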
orders = LOAD '/orders.data' AS (orderid:int, fruit:chararray, amount:int, type:chararray);
grpd = GROUP orders BY (fruit, type);
DESCRIBE grpd;
grpd: {group: (fruit:chararray,type:chararray),orders: {(orderid:int,fruit:chararray,amount:int,type:chararray)}}
DUMP grpd;
((pear,y),{(4,pear,20,y)})
((apple,x),{(2,apple,50,x),(1,apple,30,x)})
((banana,y),{(5,banana,10,y),(3,banana,30,y)})

Sum the amounts per group; the key is output as a tuple:

sums = FOREACH grpd GENERATE group, SUM(orders.amount);
DUMP sums;
((pear,y),20)
((apple,x),80)
((banana,y),40)

Reference the key fields by position to output them as separate columns:

sums2 = FOREACH grpd GENERATE group.$0, group.$1, SUM(orders.amount);
DUMP sums2;
(pear,y,20)
(apple,x,80)
(banana,y,40)
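The composite-key case can be modelled the same way in Python. In this sketch, which is not Pig code, the group key is a tuple of the selected fields, and unpacking that tuple plays the role of group.$0 and group.$1. The helper name group_by_fields is made up for the example.

```python
from collections import defaultdict

# Rows of the sample file as (orderid, fruit, amount, type) tuples.
orders = [
    (1, "apple", 30, "x"),
    (2, "apple", 50, "x"),
    (3, "banana", 30, "y"),
    (4, "pear", 20, "y"),
    (5, "banana", 10, "y"),
]

def group_by_fields(records, key_indexes):
    """Group on several fields; the key is a tuple (hypothetical helper)."""
    bags = defaultdict(list)
    for record in records:
        key = tuple(record[i] for i in key_indexes)
        bags[key].append(record)
    return dict(bags)

grpd = group_by_fields(orders, (1, 3))

# Unpack the key tuple into separate columns and sum the amounts.
sums2 = sorted(
    (fruit, kind, sum(record[2] for record in bag))
    for (fruit, kind), bag in grpd.items()
)
print(sums2)
```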
Group All
GROUP ... ALL puts all the records of a relation into a single group, whose key is the literal string all:

grpd = GROUP orders ALL;
DESCRIBE grpd;
grpd: {group:chararray,orders: {(orderid:int,fruit:chararray,amount:int,type:chararray)}}
DUMP grpd;
(all,{(5,banana,10,y),(4,pear,20,y),(3,banana,30,y),(2,apple,50,x),(1,apple,30,x)})
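A single all-encompassing group is mainly useful for aggregates over the whole relation, such as a row count or a grand total. The following Python sketch models that use; group_all is a hypothetical helper, not a Pig function.

```python
# Rows of the sample file as (orderid, fruit, amount, type) tuples.
orders = [
    (1, "apple", 30, "x"),
    (2, "apple", 50, "x"),
    (3, "banana", 30, "y"),
    (4, "pear", 20, "y"),
    (5, "banana", 10, "y"),
]

def group_all(records):
    """Put every record into one bag under the key 'all' (hypothetical helper)."""
    return [("all", list(records))]

grpd = group_all(orders)
key, bag = grpd[0]

row_count = len(bag)                             # number of rows in the relation
total_amount = sum(record[2] for record in bag)  # grand total of the amount column
print(key, row_count, total_amount)
```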
Cogroup
COGROUP groups several datasets on a key at the same time. The result has one bag per input relation:

A = LOAD 'data1' AS (owner:chararray, pet:chararray);
DUMP A;
(Alice,turtle)
(Alice,goldfish)
(Alice,cat)
(Bob,dog)
(Bob,cat)

B = LOAD 'data2' AS (friend1:chararray, friend2:chararray);
DUMP B;
(Cindy,Alice)
(Mark,Alice)
(Paul,Bob)
(Paul,Jane)

X = COGROUP A BY owner, B BY friend2;
DESCRIBE X;
X: {group:chararray,A: {owner:chararray,pet:chararray},B: {friend1:chararray,friend2:chararray}}
DUMP X;
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
(Jane,{},{(Paul,Jane)})
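The Jane row shows the important detail: a key that occurs in only one input still produces a group, with an empty bag for the other input. A minimal Python model of this behaviour, using a made-up helper named cogroup and the two datasets from the example, might look like this:

```python
def cogroup(a, a_key, b, b_key):
    """Return {key: (bag_from_a, bag_from_b)}; a missing side is an empty bag."""
    result = {}
    for record in a:
        result.setdefault(record[a_key], ([], []))[0].append(record)
    for record in b:
        result.setdefault(record[b_key], ([], []))[1].append(record)
    return result

A = [("Alice", "turtle"), ("Alice", "goldfish"), ("Alice", "cat"),
     ("Bob", "dog"), ("Bob", "cat")]
B = [("Cindy", "Alice"), ("Mark", "Alice"), ("Paul", "Bob"), ("Paul", "Jane")]

X = cogroup(A, 0, B, 1)
for key, (bag_a, bag_b) in X.items():
    print(key, bag_a, bag_b)
```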
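PARTITION BY and PARALLEL n
PARTITION BY names a custom partitioner class that decides which reducer receives each key, and PARALLEL n sets the number of reducers:

A = LOAD 'input_data';
B = GROUP A BY $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner PARALLEL 2;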
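SimpleCustomPartitioner:

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.pig.impl.io.PigNullableWritable;

public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> {
    // @Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        if (key.getValueAsPigType() instanceof Integer) {
            // Integer keys: partition by the key value modulo the number of reducers.
            int ret = (((Integer) key.getValueAsPigType()).intValue() % numPartitions);
            return ret;
        } else {
            // Any other key type: partition by the hash code.
            return (key.hashCode()) % numPartitions;
        }
    }
}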
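The routing rule in getPartition can be tried out with a short Python model. This is only a sketch of the same decision logic, not part of Pig or Hadoop, and it matches the Java version only for non-negative values: Python's % never returns a negative result for a positive modulus, whereas Java's % can. String hashes in Python also vary between runs, so only the integer results are predictable.

```python
def get_partition(key, num_partitions):
    """Model of the partitioner: int keys by value, everything else by hash."""
    if isinstance(key, int):
        return key % num_partitions
    return hash(key) % num_partitions

# With PARALLEL 2, even integer keys go to reducer 0 and odd keys to reducer 1.
for key in (1, 2, 3, 4, 5):
    print(key, "-> reducer", get_partition(key, 2))
```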
NULL value handling
Null is a special grouping key: all tuples whose key is null are collected into a single group.
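The effect can be modelled in Python with None standing in for null. The rows below are invented for this illustration and are not taken from the sample file.

```python
from collections import defaultdict

# Hypothetical (orderid, fruit) rows; None stands in for a null fruit.
rows = [(1, "apple"), (2, None), (3, "apple"), (4, None)]

bags = defaultdict(list)
for row in rows:
    bags[row[1]].append(row)  # None is just another key, so null rows share one bag
grpd = dict(bags)

print(grpd)
```

This is unlike a join, where null keys never match each other.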