A simple Pig code example: reporting click and impression statistics by industry
Note: a Pig script is run with run or exec. Apart from cd and ls, the other file commands are rarely needed. This script uses rm and mv as examples, and both are error-prone.
In addition, Pig only actually reads and processes data when it reaches a STORE or DUMP statement. Until then it only parses the code and performs no operations on the data. Therefore, before running rm you must check whether the replacement file has really been generated. Only if it has been generated should the old path be removed and the third path of each group (the temporary output) be renamed with mv.
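The check described above (remove the old directory only once its replacement really exists, then rename the temporary directory) can be sketched on a local filesystem. This is a Python illustration of the idea only, not Pig or HDFS code. The function name swap_in_new_output and the directory names are hypothetical.

```python
import os
import shutil
import tempfile


def swap_in_new_output(tmp_dir: str, final_dir: str) -> bool:
    """Replace final_dir with tmp_dir, but only if tmp_dir was really produced.

    Returns True when the swap happened. Returns False when tmp_dir is missing
    or empty, in which case final_dir is left untouched.
    """
    if not os.path.isdir(tmp_dir) or not os.listdir(tmp_dir):
        return False                  # nothing was generated: keep the old data
    if os.path.isdir(final_dir):
        shutil.rmtree(final_dir)      # plays the role of: rm <final path>
    os.rename(tmp_dir, final_dir)     # plays the role of: mv <tmp path> <final path>
    return True


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    final_dir = os.path.join(base, "industry")
    tmp_dir = os.path.join(base, "industry_tmp")
    os.mkdir(final_dir)

    # The temporary output does not exist yet, so the old directory survives.
    print(swap_in_new_output(tmp_dir, final_dir))

    # Once the temporary output exists and is non-empty, the swap goes through.
    os.mkdir(tmp_dir)
    open(os.path.join(tmp_dir, "part-00000"), "w").close()
    print(swap_in_new_output(tmp_dir, final_dir))
```

The guard comes first on purpose: if the job that should have written the temporary output failed, the history data is still in place for the next run.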
SET job.name 'test_age_reporth_istorical'; -- job name, as shown on the JobTracker page at http://172.XX.XX.XX:50030/jobtracker.jsp (for example, among the successfully completed jobs)
SET job.priority HIGH; -- job priority
-- Register the jar package used to read sequence files and write the analysis result files.
REGISTER piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader(); -- define a short name for the loader that reads binary (sequence) files
-- $date is an external input parameter
%default Cleaned_Log /user/C/data/XXX/cleaned/$date/*/part*
%default AD_Data /user/XXX/data/xxx/metadata/ad/part*
%default Campaign_Data /user/xxx/data/xxx/metadata/campaign/part*
%default Social_Data /user/xxx/data/report/socialdata/part*
-- All output file paths ($file_path and $year have no defaults here, so they must also be passed in):
%default Industry_Path $file_path/report/historical/age/$year/industry
%default Industry_SUM $file_path/report/historical/age/$year/industry_sum
%default Industry_TMP $file_path/report/historical/age/$year/industry_tmp
%default Industry_Brand_Path $file_path/report/historical/age/$year/industry_brand
%default Industry_Brand_SUM $file_path/report/historical/age/$year/industry_brand_sum
%default Industry_Brand_TMP $file_path/report/historical/age/$year/industry_brand_tmp
%default ALL_Path $file_path/report/historical/age/$year/all
%default ALL_SUM $file_path/report/historical/age/$year/all_sum
%default ALL_TMP $file_path/report/historical/age/$year/all_tmp
%default output_path /user/xxx/tmp/result
origin_cleaned_data = LOAD '$Cleaned_Log' USING PigStorage(',') -- read log files
    AS (ad_network_id:chararray,
        xxx_ad_id:chararray,
        guid:chararray,
        id:chararray,
        create_time:chararray,
        action_time:chararray,
        log_type:chararray,
        ad_id:chararray,
        positioning_method:chararray,
        location_accuracy:chararray,
        lat:chararray,
        lon:chararray,
        cell_id:chararray,
        lac:chararray,
        mcc:chararray,
        mnc:chararray,
        ip:chararray,
        connection_type:chararray,
        android_id:chararray,
        android_advertising_id:chararray,
        openudid:chararray,
        mac_address:chararray,
        uid:chararray,
        density:chararray,
        screen_height:chararray,
        screen_width:chararray,
        user_agent:chararray,
        app_id:chararray,
        app_category_id:chararray,
        device_model_id:chararray,
        carrier_id:chararray,
        os_id:chararray,
        device_type:chararray,
        os_version:chararray,
        country_region_id:chararray,
        province_region_id:chararray,
        city_region_id:chararray,
        ip_lat:chararray,
        ip_lon:chararray,
        quadkey:chararray);
-- Loading metadata/ad (adId, campaignId)
metadata_ad = LOAD '$AD_Data' USING PigStorage(',') AS (adId:chararray, campaignId:chararray);
-- Loading metadata/campaign (campaignId, industryId, brandId)
metadata_campaign = LOAD '$Campaign_Data' USING PigStorage(',') AS (campaignId:chararray, industryId:chararray, brandId:chararray);
-- Inner join of ad and campaign
joinAdCampaignByCampaignId = JOIN metadata_ad BY campaignId, metadata_campaign BY campaignId; -- (adId, campaignId, campaignId, industryId, brandId)
-- Drop the redundant columns of joinAdCampaignByCampaignId
joined_ad_campaign_data = FOREACH joinAdCampaignByCampaignId GENERATE $0 AS adId, $3 AS industryId, $4 AS brandId; -- (adId, industryId, brandId)
-- Extract the columns needed for the analysis
origin_historical_age = FOREACH origin_cleaned_data GENERATE xxx_ad_id, guid, log_type; -- (xxx_ad_id, guid, log_type)
-- Remove duplicate rows
distinct_origin_historical_age = DISTINCT origin_historical_age; -- (xxx_ad_id, guid, log_type)
-- Loading metadata_social (guid_social, sex, age, income, edu, hobby)
metadata_social = LOAD '$Social_Data' USING PigStorage(',') AS (guid_social:chararray, sex:chararray, age:chararray, income:chararray, edu:chararray, hobby:chararray);
-- Extract the needed columns from metadata_social
social_age = FOREACH metadata_social GENERATE guid_social, age;
-- Join the social data (social_age) with the log data (distinct_origin_historical_age):
joinedByGUID = JOIN social_age BY guid_social, distinct_origin_historical_age BY guid;
-- (guid_social, age, xxx_ad_id, guid, log_type)
-- Generate the age data to analyze
joined_orgin_age_data = FOREACH joinedByGUID GENERATE xxx_ad_id, guid, log_type, age;
joinedByAdId = JOIN joined_ad_campaign_data BY adId, joined_orgin_age_data BY xxx_ad_id; -- (adId, industryId, brandId, xxx_ad_id, guid, log_type, age)
-- Keep only the columns needed downstream
all_current_data = FOREACH joinedByAdId GENERATE guid, log_type, industryId, brandId, age; -- (guid, log_type, industryId, brandId, age)
-- For the industry analysis
industry_current_data = FOREACH all_current_data GENERATE industryId, guid, age, log_type; -- (industryId, guid, age, log_type)
-- Load everything already stored under the "industry" path
industry_existed_Data = LOAD '$Industry_Path' USING PigStorage(',') AS (industryId:chararray, guid:chararray, age:chararray, log_type:chararray);
-- Merge with the history data
union_Industry = UNION industry_existed_Data, industry_current_data;
distict_union_industry = DISTINCT union_Industry;
group_industry = GROUP distict_union_industry BY ($2, $0, $3); -- (age, industryId, log_type)
count_guid_for_industry = FOREACH group_industry GENERATE FLATTEN(group), COUNT($1.$1);
rm $Industry_SUM;
STORE count_guid_for_industry INTO '$Industry_SUM' USING PigStorage(',');
-- Storing the union of industry data (current and history)
STORE distict_union_industry INTO '$Industry_TMP' USING PigStorage(',');
rm $Industry_Path;
mv $Industry_TMP $Industry_Path;
-- Counting guids for industry and brand
industry_brand_current = FOREACH all_current_data GENERATE age, industryId, brandId, log_type, guid;
-- (age, industryId, brandId, log_type, guid)
-- Load the history data of industry_brand
industry_brand_history = LOAD '$Industry_Brand_Path' USING PigStorage(',') AS (age:chararray, industryId:chararray, brandId:chararray, log_type:chararray, guid:chararray);
-- Union all data of industry_brand
union_industry_brand = UNION industry_brand_current, industry_brand_history;
unique_industry_brand = DISTINCT union_industry_brand;
-- (age, industryId, brandId, log_type, guid)
-- Counting the number of users for industry and brand
group_industry_brand = GROUP unique_industry_brand BY ($0, $1, $2, $3);
count_guid_for_industry_brand = FOREACH group_industry_brand GENERATE FLATTEN(group), COUNT($1.$4);
rm $Industry_Brand_SUM;
STORE count_guid_for_industry_brand INTO '$Industry_Brand_SUM' USING PigStorage(',');
STORE unique_industry_brand INTO '$Industry_Brand_TMP' USING PigStorage(',');
rm $Industry_Brand_Path;
mv $Industry_Brand_TMP $Industry_Brand_Path;
-- Counting the number of users for age and log type
current_data = FOREACH all_current_data GENERATE age, log_type, guid; -- (age, log_type, guid)
-- Load the history data of age and log type
history_data = LOAD '$ALL_Path' USING PigStorage(',') AS (age:chararray, log_type:chararray, guid:chararray);
-- Union current and history data
union_all_data = UNION history_data, current_data;
unique_all_data = DISTINCT union_all_data;
-- Count the number of users
group_all_data = GROUP unique_all_data BY ($0, $1);
count_guid_for_age_logtype = FOREACH group_all_data GENERATE FLATTEN(group), COUNT($1.$2);
rm $ALL_SUM;
STORE count_guid_for_age_logtype INTO '$ALL_SUM' USING PigStorage(',');
STORE unique_all_data INTO '$ALL_TMP' USING PigStorage(',');
rm $ALL_Path;
mv $ALL_TMP $ALL_Path;
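Each of the three report sections in the script follows the same pattern: UNION the current rows with the stored history, apply DISTINCT, GROUP by the reporting keys, and COUNT the guids in each group. The Python sketch below mimics that pattern on a few made-up rows so the numbers can be checked by hand. It only illustrates the logic and does not replace the Pig script. The function name, the sample rows, and the log_type values "click" and "view" are hypothetical.

```python
from collections import Counter


def merge_and_count(history, current, key_positions):
    """Mimic UNION + DISTINCT + GROUP BY + COUNT on tuples.

    history, current: iterables of equal-length tuples (the rows).
    key_positions:    positions of the grouping fields, like ($2, $0, $3) in Pig.
    Returns (distinct_rows, counts_per_key).
    """
    # UNION followed by DISTINCT: a set keeps each full row once.
    distinct_rows = set(history) | set(current)
    # GROUP BY the key fields, then count the rows in each group.
    # Assumption: no field is null (Pig's COUNT skips tuples whose first field is null).
    counts = Counter(tuple(row[i] for i in key_positions) for row in distinct_rows)
    return distinct_rows, dict(counts)


if __name__ == "__main__":
    # Rows use the industry layout: (industryId, guid, age, log_type)
    history = [("i1", "g1", "20", "click"), ("i1", "g2", "20", "click")]
    current = [("i1", "g2", "20", "click"),   # duplicate of a history row
               ("i1", "g3", "20", "view"),
               ("i2", "g1", "30", "click")]

    rows, counts = merge_and_count(history, current, key_positions=(2, 0, 3))
    for key in sorted(counts):
        print(key, counts[key])
```

Because the rows are deduplicated on all fields and the grouping key covers every field except guid, the row count of a group equals the number of distinct guids in it. That is why each section stores the deduplicated union as the new history.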