1: Use a glob pattern to load files by name:
load '/user/wizad/data/wizad/raw/2014-0{6,7---7--}*/3_1/adwords*';
2: Simple usage of filter:
Filter by value:
filter clickdate_all by log_type == '2';
filter mapping_table by mapping_ad_network_id == '3' and mapping_type == '5';
test = filter allrow by (ad_id == '20160901' or ad_id == '20160901' or ad_id == '20160901') and log_type == 2;
test = filter allrow by (INDEXOF(ad_id, '20140901') == 0 or INDEXOF(ad_id, '20140901') == 0 or INDEXOF(ad_id, '20140901') == 0) and log_type == 2;
Combined with the SIZE function:
filter count_imei by (SIZE(cimei) > 14 and SIZE(cimei) < 17);
Regular expressions (matches must cover the whole string, so no surrounding slashes):
filter cimei2 by not cimei matches '^[0-9]*$';
filter cmac2 by cmac matches '[A-F\\d]{2}:[A-F\\d]{2}:[A-F\\d]{2}:[A-F\\d]{2}:[A-F\\d]{2}:[A-F\\d]{2}';
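The filter fragments above can be combined into one small script. This is only a sketch: the file name devices.csv and its two-column layout are assumptions made for illustration, and the MAC pattern is written with [0-9A-F] rather than \d to avoid backslash escaping inside the string literal.

```pig
-- Hypothetical input: devices.csv with lines such as
--   860123456789012,AA:BB:CC:DD:EE:FF
devices = LOAD 'devices.csv' USING PigStorage(',')
          AS (cimei:chararray, cmac:chararray);

-- Keep IMEIs that are 15 or 16 characters long and purely numeric.
valid_imei = FILTER devices BY SIZE(cimei) > 14 AND SIZE(cimei) < 17
                           AND cimei MATCHES '^[0-9]*$';

-- MATCHES tests the entire string, so the pattern describes the full MAC address.
valid_mac = FILTER valid_imei BY cmac MATCHES '([0-9A-F]{2}:){5}[0-9A-F]{2}';

DUMP valid_mac;
```

Because MATCHES follows Java's whole-string matching, a pattern such as '.*foo.*' is needed when a substring match is intended.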
3: Sorting:
order province_count by $2 desc;
4: Using the CONCAT function. It can generate a standalone constant column, for example to attach a label to a computed count.
foreach origin_cleaned_data generate CONCAT('<-_', '->') as cou, guid, log_type;
read_social_14 = foreach metadata_social_14 generate CONCAT('14', '='), guid_social;
all_id = foreach allrow generate id, CONCAT('_', '-') as cc;
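As an illustration of labeling a count with CONCAT, here is a minimal sketch. The input file events.tsv, its schema, and the label text are hypothetical and chosen only for the example.

```pig
-- Hypothetical input: one event per line, tab-separated (id, log_type).
allrow = LOAD 'events.tsv' AS (id:chararray, log_type:int);

-- Put every row into a single group so it can be counted.
grouped = GROUP allrow ALL;

-- CONCAT of two constants yields a constant label column next to the count.
-- COUNT skips tuples whose first field is null.
labeled = FOREACH grouped GENERATE CONCAT('total_rows', '=') AS label,
                                   COUNT(allrow) AS cnt;

DUMP labeled;
```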
5: Replacing values: detect empty values and change them to 'Unknown':
origin_historical = foreach origin_cleaned_data generate wizad_ad_id, guid, log_type,
((province_region_id == '') ? 'Unknown' : province_region_id);
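A comparison against null evaluates to null in Pig, so a bincond that tests only for the empty string leaves null fields as null. If true nulls should also become 'Unknown', a variant along the following lines may help. This is a sketch that assumes a hypothetical tab-separated input file cleaned.tsv with the four columns shown.

```pig
-- Hypothetical input file and schema, for illustration only.
origin_cleaned_data = LOAD 'cleaned.tsv'
    AS (wizad_ad_id:chararray, guid:chararray, log_type:int,
        province_region_id:chararray);

-- Treat both null and empty-string region ids as 'Unknown'.
origin_historical = FOREACH origin_cleaned_data GENERATE
    wizad_ad_id, guid, log_type,
    ((province_region_id IS NULL OR province_region_id == '')
        ? 'Unknown' : province_region_id) AS province_region_id;

DUMP origin_historical;
```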
6: Split into different subsets by value:
split geelytuiguang into android if os_id == 1, ios if os_id == 2;
split ios into ios6 if (INDEXOF(os_version, '7') != 0), ios7 if INDEXOF(os_version, '7') == 0;
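INDEXOF(...) == 0 here means "the version string starts with 7". Rows that satisfy none of the conditions can be collected with OTHERWISE, which is available in Pig 0.10 and later. The sketch below assumes a hypothetical input file devices.tsv; the relation names other_os and ios_rest are also made up for the example.

```pig
-- Hypothetical input: tab-separated (imei, os_id, os_version).
devices = LOAD 'devices.tsv'
          AS (imei:chararray, os_id:int, os_version:chararray);

-- Route rows by OS id; rows matching neither condition go to other_os.
SPLIT devices INTO android IF os_id == 1,
                   ios IF os_id == 2,
                   other_os OTHERWISE;

-- INDEXOF(string, search, startIndex) returns 0 when the string starts with '7'.
SPLIT ios INTO ios7 IF INDEXOF(os_version, '7', 0) == 0,
               ios_rest OTHERWISE;

DUMP ios7;
```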
7: Using the REPLACE function to substitute values:
foreach ios6 generate imei, mac_address as cmac, REPLACE(idfa, 'null', '');
8: Stream rows through an external command (here awk):
en_guid = stream duimei through `awk -F "," '{if ($3 == "null") print $1 "," $2 ","; else print $0}'`;
Some example syntax used in Pig