Original work takes effort to write. If you repost this article, please be sure to cite the original address. Thank you for your cooperation!
http://qindongliang.iteye.com/
This is a series of Pig learning documents that I hope will be useful to everyone. Thank you for following Scattered Fairy (the author's pen name)!
Apache Pig's past and present
How do you write custom UDFs for Apache Pig?
How to implement Hadoop WordCount in 5 lines of Apache Pig code
Apache Pig getting-started notes (I)
Apache Pig study notes (II)
Apache Pig study notes: built-in functions (III)
Playing with Big Data series: integrating Apache Pig with Apache Lucene (I)
Playing with Big Data series: integrating Apache Pig with Apache Solr (II)
Playing with Big Data series: integrating Apache Pig with MySQL (III)
Playing with Big Data series: giving Apache Pig a custom storage format (IV)
Playing with Big Data series: querying a database from Apache Pig with a custom UDF (V)
How to count word frequencies in news with Pig and an integrated word segmenter
In the Hadoop ecosystem, most people who need to analyze massive amounts of data offline choose either Apache Hive or Apache Pig. Hive generally has the larger user base and Pig a much smaller one. This is not because Pig is immature or unstable, but because Hive offers a SQL-like query language that most people find easy to pick up, whereas Pig offers a scripting syntax that resembles the Linux shell, which puts many people off.
If we surveyed the programming world to see whether more people know SQL or shell, Scattered Fairy has no doubt the answer would be SQL. Quite a few programmers never use Linux at all and work entirely on Microsoft's stack: C#, ASP.NET, SQL Server, and Windows servers.
OK, that was a digression, so back to the topic. Engineers who are comfortable with the shell will, I think, fall in love with Pig, because on Linux nothing is more concise and convenient than the shell, and it becomes even more powerful when combined with awk and sed.
We all know that the shell supports function calls, much like JavaScript. By defining a function we can reuse a piece of logic instead of writing the same code over and over: the parts that change become parameters, and the parts that stay the same become the statements in the body. This reduces redundancy and complexity in the code. Just imagine how hard things would be if Java had no methods.
As a shell-like language, Pig also lets us wrap logic in functions so that it can be reused, which is a real advantage over Hive.
Let's look at the syntax for defining a Pig function (also called a macro):
DEFINE (macros):
Supported parameter types:
An alias (a reference to a Pig relation)
Integer
Float
String
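To make the four parameter kinds concrete, here is a minimal sketch of a macro that takes one of each. The macro name filter_and_store, the input file people.csv, its fields, and the output path are all hypothetical and do not come from the original post; treat it as an illustration rather than a tested recipe.

```pig
-- Hypothetical macro showing one parameter of each supported kind:
--   rel       an alias (the relation to work on)
--   min_age   an integer
--   min_score a float
--   out_dir   a string (substituted inside the quotes below)
DEFINE filter_and_store (rel, min_age, min_score, out_dir) RETURNS void {
    kept = FILTER $rel BY age > $min_age AND score >= $min_score;
    STORE kept INTO '$out_dir';
};

-- Hypothetical input file and schema, for illustration only
people = LOAD 'people.csv' USING PigStorage(',') AS (name:chararray, age:int, score:double);

-- A void macro is called as a statement, without assigning its result
filter_and_store(people, 18, 0.5, '/tmp/example_output');
```

Inside the body, each parameter is referenced with a leading dollar sign, and the macro declares void because it stores its result instead of returning a relation.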
Let's work through a few examples to get familiar with macros quickly. First, here is our test data:
1,Zhang San,male,23,China
2,Zhang San,female,32,France
3,Floret,male,20,UK
4,Little Red,male,16,China
5,Little Red,female,25,Luoyang
6,Li Jing,female,25,China Henan Anyang
7,Wang Qiang,male,11,UK
8,Zhang Fei,male,20,United States
Now let's look at the Pig script:
-- Pig function 1: group and count
-- COUNT is applied to the second field of d, the bag of grouped rows
DEFINE group_and_count (A, group_key, number_reduces) RETURNS B {
    d = group $A by $group_key parallel $number_reduces;
    $B = foreach d generate group, COUNT($1);
};
-- Pig function 2: sort and store
-- A: the relation to sort (passed as an alias)
-- order_field: the field to sort by
-- order_type: the sort direction, desc or asc
-- storedir: the HDFS path where the result is stored
-- returns void (no return value)
define my_order (A, order_field, order_type, storedir) returns void {
    d = order $A by $order_field $order_type;
    store d into '$storedir';
};
-- Pig function 3: filter, then call another macro from inside this one
define myfilter (A, field, count) returns B {
    b = filter $A by $field > $count;
    $B = group_and_count(b, 'sex', 1);
};
a = load '/tmp/dongliang/318/person' using PigStorage(',') as (id:int, name:chararray, sex:chararray, age:int, address:chararray);
-------- Pig function 1 test -----------------
-- group by name
--bb = group_and_count(a, name, 1);
-- group by sex
--cc = group_and_count(a, sex, 1);
--dump bb;
--dump cc;
------- Pig function 2 test ------------------
-- sort by age, descending
--my_order(a, age, 'desc', '/tmp/dongliang/318/z');
--dump a;
------- Pig function 3 test ------------------
-- keep rows with age greater than 20, then count them by sex
r = myfilter(a, 'age', 20);
dump r;
In the script above, Scattered Fairy defines three functions:
(1) Group and count
(2) Sort and store the output at a custom path
(3) Filter, then reuse (1) to count the result
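To see what the parameter substitution amounts to, the call to function (3) with the field age and the threshold 20 can be written out by hand as ordinary Pig Latin. This is only a hand-written equivalent, assuming string arguments are substituted without their quotes; the aliases older and by_sex are made up here, and Pig's real expansion uses its own generated alias names.

```pig
-- Same load statement as in the script above
a = LOAD '/tmp/dongliang/318/person' USING PigStorage(',')
    AS (id:int, name:chararray, sex:chararray, age:int, address:chararray);

-- Body of function (3): the field and threshold parameters become "age > 20"
older = FILTER a BY age > 20;

-- Body of function (1), called with the filtered relation, the field sex, and 1 reducer
by_sex = GROUP older BY sex PARALLEL 1;
r = FOREACH by_sex GENERATE group, COUNT(older);

DUMP r;
```

Running Pig with the -dryrun option should write out the actual expanded script, which is a handy way to compare against a hand expansion like this one.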
These three examples should give you a first feel for Pig functions. In the version above, the functions and the calling code all live in one script, which is not very readable and does not get the most out of reuse. In fact, the functions can be split out from the main script. The main script then only needs to import the function script to have all of its functions available. Keeping the function script outside the main script greatly improves its reusability: other scripts can import it too, and a function script can itself reference other function scripts. The one restriction is that references must not be recursive; otherwise Pig reports an error when it executes the script. Here are the separated script files:
One: the function script file
-- Pig function 1: group and count
-- A: the relation to group (passed as an alias)
-- group_key: the field to group by
-- number_reduces: the number of reducers to use
-- returns B: the final result relation
-- COUNT is applied to the second field of d, the bag of grouped rows
DEFINE group_and_count (A, group_key, number_reduces) RETURNS B {
    d = group $A by $group_key parallel $number_reduces;
    $B = foreach d generate group, COUNT($1);
};
-- Pig function 2: sort and store
-- A: the relation to sort (passed as an alias)
-- order_field: the field to sort by
-- order_type: the sort direction, desc or asc
-- storedir: the HDFS path where the result is stored
-- returns void (no return value)
define my_order (A, order_field, order_type, storedir) returns void {
    d = order $A by $order_field $order_type;
    store d into '$storedir';
};
-- Pig function 3: filter, then call another macro from inside this one
-- A: the relation to filter (passed as an alias)
-- field: the field to filter on
-- count: the threshold value
-- returns B: the final result relation
define myfilter (A, field, count) returns B {
    b = filter $A by $field > $count;
    $B = group_and_count(b, 'sex', 1);
};
Two: the main script file
-- import the shared library of Pig functions
import 'Function.pig';
a = load '/tmp/dongliang/318/person' using PigStorage(',') as (id:int, name:chararray, sex:chararray, age:int, address:chararray);
-------- Pig function 1 test -----------------
-- group by name
--bb = group_and_count(a, name, 1);
-- group by sex
--cc = group_and_count(a, sex, 1);
--dump bb;
--dump cc;
------- Pig function 2 test ------------------
-- sort by age, descending
--my_order(a, age, 'desc', '/tmp/dongliang/318/z');
--dump a;
------- Pig function 3 test ------------------
-- keep rows with age greater than 20, then count them by sex
r = myfilter(a, 'age', 20);
dump r;
Note that the name of the imported function file must be enclosed in single quotation marks. With that, we have reused our Pig functions. Doesn't it look a lot like shell syntax? If you are interested, give it a try!
This article is from the "7936494" blog. Please be sure to keep this source link: http://7946494.blog.51cto.com/7936494/1622052
Playing with Big Data series: advanced Apache Pig techniques, function programming (VI)