Detailed explanation of Pig cogroup
Let's start with an example.
%default file test.txt
A = load '$file' as (date, web, name, food);
B = load '$file' as (date, web, name, food);
C = cogroup A by $0, B by $1;
describe C;
illustrate C;
dump C;
$0 and $1 in the cogroup command are the key columns of A and B. If the contents of the two columns differ, two batches of groups are generated: one keyed by the values of A's column and one keyed by the values of B's column. In a group keyed by a value that appears only in A, the bag for B is empty ({}), and vice versa. If the contents are the same, as in C = cogroup A by $1, B by $1;, a single batch of groups is generated, and each group contains all the tuples from both A and B whose key equals that value.
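The behaviour described above can be modeled outside Pig. The following is a minimal sketch in plain Python, not Pig code: the cogroup helper and the sample rows are hypothetical, invented only to illustrate the two cases, and the groups are sorted for readability (Pig does not promise an order).

```python
from collections import defaultdict

def cogroup(a, a_key, b, b_key):
    """Model 'COGROUP a BY $a_key, b BY $b_key' as a list of (key, bag_a, bag_b)."""
    bags = defaultdict(lambda: ([], []))
    for row in a:
        bags[row[a_key]][0].append(row)
    for row in b:
        bags[row[b_key]][1].append(row)
    # Sorted only to make the output deterministic in this sketch.
    return [(key, bags[key][0], bags[key][1]) for key in sorted(bags)]

# Hypothetical rows in the (date, web, name, food) layout used above.
rows = [
    ("d1", "w1", "tom", "rice"),
    ("d1", "w2", "ann", "soup"),
    ("d2", "w1", "bob", "cake"),
]

# Keys come from different columns: every group has one empty bag.
different = cogroup(rows, 0, rows, 1)
# Keys come from the same column: both bags of every group are filled.
same = cogroup(rows, 1, rows, 1)

for group in different:
    print(group)
for group in same:
    print(group)
```

With different key columns, no key value is shared, so each group carries tuples from only one side. With the same key column, each group carries the matching tuples from both sides.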
As for the difference between COGROUP and JOIN, I did not write my own explanation; the following is taken from the web.
The result of a JOIN is flat (a set of tuples), while the result of a COGROUP has a nested structure.
Run the following command:
r1 = cogroup r_student by classNo, r_teacher by classNo;
dump r1;
The result is as follows:
(C01, {(C01, N0103, 65), (C01, N0102, 59), (C01, N0101, 82) },{ (C01, Zhang )})
(C02, {(C02, N0203, 79), (C02, N0202, 82), (C02, N0201, 81) },{ (C02, Sun )})
(C03, {(C03, N0306, 72), (C03, N0302, 92), (C03, N0301, 56) },{ (C03, Wang )})
(C04, {}, {(C04, Dong )})
The results show that:
1) COGROUP and JOIN are similar operations.
2) The generated relation has three fields. The first field is the group key. The second field is a bag containing all the tuples in relation 1 that match the key. The third field is also a bag, containing all the tuples in relation 2 that match the key.
3) COGROUP behaves like an outer join. For example, in the fourth record of the result, the second field is an empty bag because relation 1 has no matching record. In fact, the first statement is equivalent to the following statement:
r1 = cogroup r_student by classNo outer, r_teacher by classNo outer;
If you do not want groups with no matching record in relation 1 or relation 2 to appear in the result, add the inner keyword to that relation to exclude them.
Run the following statements:
r1 = cogroup r_student by classNo inner, r_teacher by classNo outer;
dump r1;
Result:
(C01, {(C01, N0103, 65), (C01, N0102, 59), (C01, N0101, 82) },{ (C01, Zhang )})
(C02, {(C02, N0203, 79), (C02, N0202, 82), (C02, N0201, 81) },{ (C02, Sun )})
(C03, {(C03, N0306, 72), (C03, N0302, 92), (C03, N0301, 56) },{ (C03, Wang )})
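The effect of the inner keyword can be checked with a small model. This is a sketch in plain Python, not Pig: the cogroup function and its left_inner and right_inner flags are hypothetical stand-ins for the keyword, and the rows are taken from the results shown in this article.

```python
from collections import defaultdict

# Rows taken from the dump output in this article.
students = [
    ("C01", "N0103", 65), ("C01", "N0102", 59), ("C01", "N0101", 82),
    ("C02", "N0203", 79), ("C02", "N0202", 82), ("C02", "N0201", 81),
    ("C03", "N0306", 72), ("C03", "N0302", 92), ("C03", "N0301", 56),
]
teachers = [("C01", "Zhang"), ("C02", "Sun"), ("C03", "Wang"), ("C04", "Dong")]

def cogroup(left, right, left_inner=False, right_inner=False):
    """Group both inputs on their first field; an 'inner' side drops groups
    in which that side's bag is empty."""
    bags = defaultdict(lambda: ([], []))
    for row in left:
        bags[row[0]][0].append(row)
    for row in right:
        bags[row[0]][1].append(row)
    result = []
    for key in sorted(bags):
        left_bag, right_bag = bags[key]
        if left_inner and not left_bag:
            continue
        if right_inner and not right_bag:
            continue
        result.append((key, left_bag, right_bag))
    return result

outer_result = cogroup(students, teachers)                   # keeps C04
inner_result = cogroup(students, teachers, left_inner=True)  # drops C04

for group in inner_result:
    print(group)
```

In this model, marking the student side as inner removes the C04 group because its student bag is empty, which matches the three-record result above.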
Flatten:
r2 = foreach r1 generate flatten($1), flatten($2);
dump r2;
The result is as follows:
(C01, N0103, 65, C01, Zhang)
(C01, N0102, 59, C01, Zhang)
(C01, N0101, 82, C01, Zhang)
(C02, N0203, 79, C02, Sun)
(C02, N0202, 82, C02, Sun)
(C02, N0201, 81, C02, Sun)
(C03, N0306, 72, C03, Wang)
(C03, N0302, 92, C03, Wang)
(C03, N0301, 56, C03, Wang)
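As you can see, flattening two bags in the same statement combines them automatically: every tuple of the first bag is paired with every tuple of the second bag, and the fields of each pair are laid out as columns.
For cogroup, the core code I tested is as follows: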
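The pairing performed by two flatten calls can be sketched as a cross product. The snippet below is plain Python written for illustration, not Pig; flatten_two_bags is a hypothetical helper, the C01 and C04 groups come from the results above, and the second teacher "Li" is invented to show the multiplication.

```python
from itertools import product

def flatten_two_bags(group):
    """Model 'generate flatten($1), flatten($2)' for one cogroup record."""
    _key, bag1, bag2 = group
    # Each tuple of bag1 is concatenated with each tuple of bag2.
    return [t1 + t2 for t1, t2 in product(bag1, bag2)]

c01 = ("C01",
       [("C01", "N0103", 65), ("C01", "N0102", 59), ("C01", "N0101", 82)],
       [("C01", "Zhang")])
c04 = ("C04", [], [("C04", "Dong")])

# Hypothetical variant: a second teacher "Li" for the same class.
c01_two_teachers = (c01[0], c01[1], c01[2] + [("C01", "Li")])

rows_c01 = flatten_two_bags(c01)               # 3 x 1 rows
rows_two = flatten_two_bags(c01_two_teachers)  # 3 x 2 rows
rows_c04 = flatten_two_bags(c04)               # empty bag, so no rows

for row in rows_c01:
    print(row)
```

The number of output rows for a group is the product of the two bag sizes, so a group with an empty bag on either side produces nothing.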
industry_existed_Data = LOAD '$industryPath' USING PigStorage(',') AS (industryId:chararray, guid:chararray, sex:chararray, log_type:chararray);
sample_data = limit industry_existed_Data 20;
-- STORE sample_data INTO '/user/wizad/tmp/industry_existed_Data' USING PigStorage (',');
-- Merge with history data
-- (industry_current_data is not defined in this excerpt; it is assumed to come from earlier in the full script)
cogroupIndustryExistCurrentByGuid = COGROUP industry_existed_Data by guid, industry_current_data by guid;
mydata = sample cogroupIndustryExistCurrentByGuid 0.1;
dump mydata;
describe cogroupIndustryExistCurrentByGuid;
-- Dump cogroupIndustryExistCurrentByGuid;
-- STORE mycogroupdata INTO '/user/wizad/tmp/cogroupIndustryExistCurrentByGuid' USING PigStorage (',');
look_for_cogroup = FOREACH cogroupIndustryExistCurrentByGuid GENERATE $0, $2;
describe look_for_cogroup;
IndustryStorageDataTmp = FOREACH cogroupIndustryExistCurrentByGuid generate flatten($2);
IndustryStorageData = DISTINCT IndustryStorageDataTmp;
describe IndustryStorageData;
Output:
The schemas of the three relations are as follows:
cogroupIndustryExistCurrentByGuid:
{
group: chararray,
industry_existed_Data: {industryId: chararray, guid: chararray, sex: chararray, log_type: chararray},
industry_current_data: {joined_ad_campaign_data::industryId: chararray, joined_Orgin_sex_data::distinct_origin_historical_sex::guid: chararray, joined_Orgin_sex_data::social_sex::sex: chararray, joined_Orgin_sex_data::distinct_origin_historical_sex::log_type: chararray}
}
look_for_cogroup:
{
group: chararray,
industry_current_data: {joined_ad_campaign_data::industryId: chararray, joined_Orgin_sex_data::distinct_origin_historical_sex::guid: chararray, joined_Orgin_sex_data::social_sex::sex: chararray, joined_Orgin_sex_data::distinct_origin_historical_sex::log_type: chararray}
}
IndustryStorageData:
{
industry_current_data::joined_ad_campaign_data::industryId: chararray,
industry_current_data::joined_Orgin_sex_data::distinct_origin_historical_sex::guid: chararray,
industry_current_data::joined_Orgin_sex_data::social_sex::sex: chararray,
industry_current_data::joined_Orgin_sex_data::distinct_origin_historical_sex::log_type: chararray
}
The schemas of the three relations look complex because each field name is prefixed with the names of the relations it passed through in earlier joins, which identify where the field came from. To read a field, look only at the last part of its name and its type.
The third schema is the result of flatten($2).
COGROUP has an empty-bag problem. For each group value (the value of the key used by the cogroup), the two relations are grouped separately by that key, and for some keys the bag from one of the relations is empty.
The actual data produced by the Pig code above is shown below. With guid as the key, many guid values have an empty bag {} on one side.
Therefore, when retrieving data, note that flattening only one column loses the data in the other columns whenever the bag in the flattened column is empty.
(-1,{},{(74,9051235c-a391-4dae-ab22-f93d24a12636,-1,-1,),(75,053 Tib,-1,-1,),(74,percentile,-1,-1,),(74,fec1932a-b0e4-4bf0-b504-8ed8f3c159e7,-1,-1,),(74,d74374ec-8cf4-4c4a-b598-9631f6972cbb,-1,-1,),(74,6780962a-bf75-4c4c-a557-94a7de5a3e36,-1,-1,),(-ee3f-4d34-943f-d6f1813afdef,-1,-1,),(74,c5547aca-3b8b-4108-93ba-bf365c106cdd,-1,-1,),(74,e9a986c1-6868-4f7f-baf6-69d8c302583e,-1,-1,),(74,percentile,-1,-1,),(74,f16e6222-a84b-4758-ae71-0613c8f34b29,-1,-1,),(74,47cc25ef-05bc-47f4-a32b-3cddaf0ac22b,-1,-1,),(74,d5c1b6b0-38c3-464b-8cb9-70ced875be5f,-1,-1,),(74,average,-1,-1,),(74,23bb2f0c-d629-479d-800e-b86fc3d6e45c,-1,-1,)})
(a50a17bde79ac018,{},{(74,863010025134441,a50a17bde79ac018,863010025134441,)})
(a51779f736cd3f54,{},{(74,862949029595753,a51779f736cd3f54,862949029595753,)})
(c7ae5867-3b77-4987-b082-ed3867b5c384,{},{(74,353627055387065,c7ae5867-3b77-4987-b082-ed3867b5c384,353627055387065,)})
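The data-loss caveat can be reproduced with a small model. The following is a sketch in plain Python, not Pig: generate_key_and_flatten is a hypothetical helper standing in for FOREACH ... GENERATE $0, flatten($n), and the two groups reuse the class data from the earlier example rather than the guid data above.

```python
def generate_key_and_flatten(groups, bag_index):
    """Model 'FOREACH groups GENERATE $0, flatten($bag_index)'."""
    rows = []
    for group in groups:
        # An empty bag gives zero iterations, so the whole record vanishes.
        for tup in group[bag_index]:
            rows.append((group[0],) + tup)
    return rows

groups = [
    ("C03",
     [("C03", "N0306", 72), ("C03", "N0302", 92), ("C03", "N0301", 56)],
     [("C03", "Wang")]),
    ("C04", [], [("C04", "Dong")]),
]

by_students = generate_key_and_flatten(groups, 1)  # C04 has no students
by_teachers = generate_key_and_flatten(groups, 2)  # both groups survive

# Keys present in the cogroup result but missing after the flatten.
lost_keys = {g[0] for g in groups} - {row[0] for row in by_students}

print(by_students)
print(by_teachers)
print(lost_keys)
```

In this model, flattening the student bag silently drops the C04 record, together with the teacher tuple that was sitting in its other bag. Comparing the key sets before and after the flatten is one simple way to notice such a loss.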