Detailed explanation of Pig cogroup

Last Update:2014-11-11 Source: Internet

Author: User


Starting with an example

%default file test.txt

A = load '$file' as (date, web, name, food);

B = load '$file' as (date, web, name, food);

C = cogroup A by $0, B by $1;

describe C;

illustrate C;

dump C;

Note the $0 and $1 in the cogroup statement. When the two key columns hold different values, two batches of groups are generated: one batch keyed by the values from A and one keyed by the values from B. In a group keyed by a value from A, nothing in B matches, so B's bag in that group is empty ({}), and vice versa. When the key columns hold the same values, as in C = cogroup A by $1, B by $1;, only one batch of groups is generated, and each group contains all the tuples from both A and B whose key equals that value.
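
The grouping behavior described above can be modeled outside Pig. The following is a minimal Python sketch, not Pig code: the cogroup helper and the sample rows are hypothetical, bags are modeled as plain lists, and the real order of groups in Pig output is not guaranteed.

```python
def cogroup(a, key_a, b, key_b):
    """Model of COGROUP: map each key to (bag of tuples from a, bag of tuples from b)."""
    groups = {}
    for row in a:
        groups.setdefault(row[key_a], ([], []))[0].append(row)
    for row in b:
        groups.setdefault(row[key_b], ([], []))[1].append(row)
    return groups

# Hypothetical rows in the (date, web, name, food) layout.
# A and B are loaded from the same file, so both use the same rows.
rows = [
    ("2014-11-01", "siteA", "tom", "rice"),
    ("2014-11-02", "siteB", "ann", "fish"),
]

# Models: cogroup A by $0, B by $1
# Dates never equal site names, so every group has one empty bag.
different = cogroup(rows, 0, rows, 1)

# Models: cogroup A by $1, B by $1
# Same column on both sides, so every group holds tuples from A and from B.
same = cogroup(rows, 1, rows, 1)

for key, (bag_a, bag_b) in different.items():
    print(key, bag_a, bag_b)
```

In this model the first call yields four groups (two dates, two sites), each with one empty bag, while the second call yields two groups whose bags are both non-empty.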

As for the difference between COGROUP and JOIN, I did not write up my own data for it. The following explanation is taken from the Internet.

The result of a JOIN is flat (a set of tuples), while the result of COGROUP is nested.
Run the following commands:
r1 = cogroup r_student by classNo, r_teacher by classNo;
dump r1;
The result is as follows:
(C01, {(C01, N0103, 65), (C01, N0102, 59), (C01, N0101, 82) },{ (C01, Zhang )})
(C02, {(C02, N0203, 79), (C02, N0202, 82), (C02, N0201, 81) },{ (C02, Sun )})
(C03, {(C03, N0306, 72), (C03, N0302, 92), (C03, N0301, 56) },{ (C03, Wang )})
(C04, {}, {(C04, Dong )})
The results show that:
1) COGROUP is similar to JOIN.
2) Each tuple in the resulting relation has three fields. The first field is the grouping key. The second field is a bag containing all the tuples in relation 1 that match the key, and the third field is a bag containing all the tuples in relation 2 that match the key.
3) COGROUP behaves like an outer join. For example, in the fourth record of the result, the second field is empty because relation 1 has no matching record. In fact, the first statement is equivalent to the following statement:
r1 = cogroup r_student by classNo outer, r_teacher by classNo outer;
If you do not want keys that have no matching record in relation 1 or relation 2 to appear in the result, add the inner keyword after that relation to exclude them.
Run the following statements:
r1 = cogroup r_student by classNo inner, r_teacher by classNo outer;
dump r1;
Result:
(C01, {(C01, N0103, 65), (C01, N0102, 59), (C01, N0101, 82) },{ (C01, Zhang )})
(C02, {(C02, N0203, 79), (C02, N0202, 82), (C02, N0201, 81) },{ (C02, Sun )})
(C03, {(C03, N0306, 72), (C03, N0302, 92), (C03, N0301, 56) },{ (C03, Wang )})
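
To make the contrast concrete, here is a small Python model of the three behaviors above (flat join, outer-style cogroup, and cogroup with inner). It is a sketch under assumptions, not Pig's implementation: the cogroup and join helpers and their left_inner/right_inner flags are hypothetical names, and only a subset of the rows from the results above is used.

```python
def cogroup(left, right, left_inner=False, right_inner=False):
    """Model of COGROUP on field 0: returns (key, left_bag, right_bag) tuples."""
    keys = []
    for row in left + right:
        if row[0] not in keys:
            keys.append(row[0])
    groups = []
    for key in keys:
        left_bag = [row for row in left if row[0] == key]
        right_bag = [row for row in right if row[0] == key]
        # "inner" on a relation drops groups where that relation's bag is empty.
        if (left_inner and not left_bag) or (right_inner and not right_bag):
            continue
        groups.append((key, left_bag, right_bag))
    return groups

def join(left, right):
    """Model of an inner JOIN on field 0: returns flat tuples."""
    return [l + r for l in left for r in right if l[0] == r[0]]

r_student = [("C01", "N0103", 65), ("C01", "N0102", 59), ("C01", "N0101", 82)]
r_teacher = [("C01", "Zhang"), ("C04", "Dong")]

outer = cogroup(r_student, r_teacher)                        # C04 kept with an empty bag
inner_left = cogroup(r_student, r_teacher, left_inner=True)  # C04 dropped
flat = join(r_student, r_teacher)                            # no nesting at all

print(outer)
print(inner_left)
print(flat)
```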

Flatten:
r2 = foreach r1 generate flatten($1), flatten($2);
dump r2;
The result is as follows:
(C01, N0103, 65, C01, Zhang)
(C01, N0102, 59, C01, Zhang)
(C01, N0101, 82, C01, Zhang)
(C02, N0203, 79, C02, Sun)
(C02, N0202, 82, C02, Sun)
(C02, N0201, 81, C02, Sun)
(C03, N0306, 72, C03, Wang)
(C03, N0302, 92, C03, Wang)
(C03, N0301, 56, C03, Wang)

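As you can see, when two bags are flattened in the same generate, every tuple in the first bag is paired with every tuple in the second bag of the same group (a cross product), and the fields of both tuples are expanded into separate columns.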
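
The pairing produced by flattening two bags can be sketched in Python with itertools.product. This is only a model of the semantics: flatten_both is a hypothetical helper, and the second teacher ("C01", "Li") is invented here purely to show that rows multiply when both bags hold more than one tuple.

```python
from itertools import product

def flatten_both(groups):
    """Model of: foreach r1 generate flatten($1), flatten($2);"""
    rows = []
    for _key, bag1, bag2 in groups:
        # Each tuple of bag1 is combined with each tuple of bag2 in the same group.
        for t1, t2 in product(bag1, bag2):
            rows.append(t1 + t2)
    return rows

students = [("C01", "N0103", 65), ("C01", "N0102", 59), ("C01", "N0101", 82)]

# One teacher per class, as in the result above: 3 x 1 = 3 rows.
r2 = flatten_both([("C01", students, [("C01", "Zhang")])])

# Hypothetical second teacher: 3 x 2 = 6 rows.
r2_two_teachers = flatten_both(
    [("C01", students, [("C01", "Zhang"), ("C01", "Li")])]
)

for row in r2:
    print(row)
```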

I also tested cogroup myself. The core code is as follows:

industry_existed_Data = LOAD '$industryPath' USING PigStorage(',') AS (industryId: chararray, guid: chararray, sex: chararray, log_type: chararray);

sample_data = limit industry_existed_Data 20;
-- STORE sample_data INTO '/user/wizad/tmp/industry_existed_Data' USING PigStorage (',');

-- Merge with history data
-- (industry_current_data is defined earlier in the full script, not shown here)
cogroupIndustryExistCurrentByGuid = COGROUP industry_existed_Data by guid, industry_current_data by guid;
mydata = sample cogroupIndustryExistCurrentByGuid 0.1;
dump mydata;
describe cogroupIndustryExistCurrentByGuid;
-- Dump cogroupIndustryExistCurrentByGuid;

-- STORE mycogroupdata INTO '/user/wizad/tmp/cogroupIndustryExistCurrentByGuid' USING PigStorage (',');

look_for_cogroup = FOREACH cogroupIndustryExistCurrentByGuid GENERATE $0, $2;
describe look_for_cogroup;

IndustryStorageDataTmp = FOREACH cogroupIndustryExistCurrentByGuid generate flatten($2);
IndustryStorageData = DISTINCT IndustryStorageDataTmp;
describe IndustryStorageData;

Output:
The schemas of the three relations are as follows:
cogroupIndustryExistCurrentByGuid:
{
group: chararray,
industry_existed_Data: {industryId: chararray, guid: chararray, sex: chararray, log_type: chararray},
industry_current_data: {joined_ad_campaign_data::industryId: chararray, joined_Orgin_sex_data::distinct_origin_historical_sex::guid: chararray, joined_Orgin_sex_data::social_sex::sex: chararray, joined_Orgin_sex_data::distinct_origin_historical_sex::log_type: chararray}
}

look_for_cogroup:
{
group: chararray,
industry_current_data: {joined_ad_campaign_data::industryId: chararray, joined_Orgin_sex_data::distinct_origin_historical_sex::guid: chararray, joined_Orgin_sex_data::social_sex::sex: chararray, joined_Orgin_sex_data::distinct_origin_historical_sex::log_type: chararray}
}

IndustryStorageData:
{
industry_current_data::joined_ad_campaign_data::industryId: chararray,
industry_current_data::joined_Orgin_sex_data::distinct_origin_historical_sex::guid: chararray,
industry_current_data::joined_Orgin_sex_data::social_sex::sex: chararray,
industry_current_data::joined_Orgin_sex_data::distinct_origin_historical_sex::log_type: chararray
}

The three schemas look complicated because each field name is prefixed with the names of the relations it came from in earlier joins (a namespace), which identifies the relation the field belongs to. To read a schema, you only need to look at the last name in each field and its type.
The third schema is the result of flatten($2).

cogroup has an empty-bag problem. For each group value (each value of the cogroup key), the two relations are grouped by that key separately, and for some keys the bag from one of the relations is empty.
The actual output of the Pig code above is shown below. With guid as the key, many guid values have an empty bag {} on one side.
Therefore, be careful when extracting data: if you flatten only one column and that column's bag is empty for a group, no row is produced for that group, so the data in the group's other columns is lost.

(-1,), {(74,905 1235c-a391-4dae-ab22-f93d24a12636,-1,-1,), (75,053 Tib,-1,-1,), (74, percentile,-1, -1,), (74, fec1932a-b0e4-4bf0-b504-8ed8f3c159e7,-1,-1,), (74, d74374ec-8cf4-4c4a-b598-9631f6972cbb,-1,-1,), (74,678 0962a-bf75-4c4c-a557-94a7de5a3e36,-1, -1,), (-ee3f-4d34-943f-d6f1813afdef,-1,-1,), (74, c5547aca-3b8b-4108-93ba-bf365c106cdd,-1,-1,), (74, e9a986c1-6868-4f7f-baf6-69d8c302583e,-1, -1,), (74, percentile,-1,-1,), (74, f16e6222-a84b-4758-ae71-0613c8f34b29,-1,-1,), (74, 47cc25ef-05bc-47f4-a32b-3cddaf0ac22b,-1, -1,), (74, d5c1b6b0-38c3-464b-8cb9-70ced875be5f,-1,-1,), (74, average,-1,-1,), (74, 23bb2f0c-d629-479d-800e-b86fc3d6e45c,-1, -1 ,)})
(a50a17bde79ac018,), {(74,863010025134441, a50a17bde79ac018, 863010025134441 ,)})
(a51779f736cd3f54,), {(74,862949029595753, a51779f736cd3f54, 862949029595753 ,)})
(c7ae5867-3b77-4987-b082-ed3867b5c384,), {(74,353627055387065, c7ae5867-3b77-4987-b082-ed3867b5c384, 353627055387065 ,)})
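
The data loss caused by flattening an empty bag can be reproduced in a small Python model. This is a sketch under assumptions: flatten_bag is a hypothetical helper, the guid values are made up, and the pad argument illustrates one possible workaround (substituting a placeholder tuple for an empty bag) that is not part of the original code.

```python
def flatten_bag(groups, pad=None):
    """Model of: foreach ... generate group, flatten(bag);

    With pad=None an empty bag yields no rows, so the group disappears.
    With a pad tuple, an empty bag is replaced by one placeholder tuple.
    """
    rows = []
    for key, bag in groups:
        if not bag and pad is not None:
            bag = [pad]
        for t in bag:
            rows.append((key,) + t)
    return rows

# Hypothetical cogroup output reduced to (group key, one bag).
groups = [
    ("guid-1", [("74", "guid-1", "-1", "-1")]),
    ("guid-2", []),  # empty bag: this key has no tuples on this side
]

lost = flatten_bag(groups)                                  # guid-2 is dropped
kept = flatten_bag(groups, pad=(None, None, None, None))    # guid-2 survives with nulls

print(lost)
print(kept)
```

Whether padding is appropriate depends on whether downstream steps can handle rows with null fields. Filtering out or counting the empty groups before flattening is another option.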
