Big Data Series: Apache Pig Advanced Skills, Function Programming (VI)


Creating original content is not easy. If you reproduce this article, please be sure to cite the original address. Thank you for your cooperation!
http://qindongliang.iteye.com/


This is one of a series of Pig learning documents. I hope it is useful to everyone, and thanks for your attention!

Apache Pig: past and present

How does Apache Pig define custom UDF functions?

How does Apache Pig implement Hadoop WordCount in 5 lines of code?

Apache Pig getting-started learning document (I)

Apache Pig study notes (II)

Apache Pig learning notes: built-in functions (III)

Big Data Series: how Apache Pig integrates with Apache Lucene (I)

Big Data Series: how Apache Pig integrates with Apache Solr (II)

Big Data Series: how Apache Pig integrates with MySQL (III)

Big Data Series: how to give Apache Pig a custom storage format (IV)

Big Data Series: how Apache Pig queries a database with a custom UDF (V)

How to use Pig with a word-segmentation component to count news word frequency?



In the Hadoop ecosystem, when analyzing massive amounts of data offline, most people choose Apache Hive or Apache Pig. In general, Hive has a much larger share of users, while Pig's share is relatively small. This is not because Pig is immature or unstable, but because Hive provides a database-like SQL query language, which makes it very easy for most people to get started, whereas Pig provides a scripting syntax closer to the Linux shell, which many people dislike.

If we surveyed the programming world and counted who knows SQL versus who knows shell, the author thinks the larger group would undoubtedly be SQL. There are quite a few programmers who never use Linux at all and live entirely on the Microsoft stack, from C# and ASP.NET to SQL Server and Windows servers.




OK, enough digression; back to the topic. Engineers who use the shell, I think, will fall in love with Pig, because on a Linux system there is nothing more concise and easy to use than the shell, especially when combined with awk and sed.

We all know that the shell supports function calls, much like JavaScript: by defining a function, we can reuse a piece of logic instead of coding it over and over. The things that change become parameters; the things that stay the same become the statements in the body. This reduces redundancy and complexity in our code. Imagine how unworkable Java would be if it had no methods.

Like the shell, Pig also supports encapsulating logic in functions so that we can reuse it, which is a real advantage over Hive.

Let's look at the syntax for defining a Pig function (also called a macro):

DEFINE (macros):
Supported parameter types:
a reference to a Pig relation alias
integer
float
string
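As a quick illustration before the full examples, here is a minimal sketch of a macro exercising these parameter kinds. The macro and alias names (count_above, rel, kept) are invented for this sketch and do not come from the article:

```
-- Hypothetical macro: keep rows whose $field exceeds a float threshold,
-- then count the kept rows per value of that field, using $reducers reduce tasks.
DEFINE count_above (rel, field, threshold, reducers) RETURNS out {
    kept = FILTER $rel BY $field > $threshold;       -- float parameter
    g    = GROUP kept BY $field PARALLEL $reducers;  -- integer parameter
    $out = FOREACH g GENERATE group, COUNT(kept);    -- returns a relation
};
-- Hypothetical usage: T = count_above(A, age, 19.5, 1);
```

Inside the macro body, parameters are referenced with a `$` prefix, and the returned alias is the one bound with `$out`.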

Let's look at a few examples so we can get familiar quickly. First, our test data:


    1,Zhang San,male,23,China
    2,Zhang San,female,32,France
    3,Floret,male,20,UK
    4,Little Red,male,16,China
    5,Little Red,female,25,Luoyang
    6,Li Jing,female,25,Anyang Henan China
    7,Wang Qiang,male,11,UK
    8,Zhang Fei,male,20,United States



Then look at the Pig script:


    --Define Pig function 1: supports grouped counting
    DEFINE group_and_count (A, group_key, number_reduces) RETURNS B {
        D = GROUP $A BY $group_key PARALLEL $number_reduces;
        $B = FOREACH D GENERATE group, COUNT($1);
    };
    --Define Pig function 2: supports sorting
    --A            the relation alias to sort
    --order_field  the field to sort by
    --order_type   the sort order: desc or asc
    --storedir     the HDFS path to store into
    --returns nothing (void)
    DEFINE my_order (A, order_field, order_type, storedir) RETURNS void {
        D = ORDER $A BY $order_field $order_type;
        STORE D INTO '$storedir';
    };
    --Define Pig function 3: supports filtering, and calls another macro internally
    --Define the filter operation
    DEFINE myfilter (A, field, count) RETURNS B {
        b = FILTER $A BY $field > $count;
        $B = group_and_count(b, 'sex', 1);
    };
    A = LOAD '/tmp/dongliang/318/person' USING PigStorage(',') AS (id:int, name:chararray, sex:chararray, age:int, address:chararray);
    --------Pig function 1 test-----------------
    --Group by name
    --BB = group_and_count(A, name, 1);
    --Group by sex
    --CC = group_and_count(A, sex, 1);
    --dump BB;
    --dump CC;
    -------Pig function 2 test------------------
    --Sort by age, descending
    --my_order(A, age, 'desc', '/tmp/dongliang/318/z');
    --dump A;
    -------Pig function 3 test------------------
    --Filter rows with age greater than 20, then group by sex and count
    R = myfilter(A, 'age', 20);
    dump R;



In the script above, the author defines three functions:
(1) grouped counting
(2) custom output storage
(3) custom filtering, combined with (1) for counting

These three examples should give everyone a preliminary understanding of Pig functions. The functions and the main logic above sit in a single script, which looks unfriendly, and their reusability has not been maximized. In fact, functions can be separated from the main script; then, to use them, we only need to import the function script and all of its functions become available. Separating function scripts out of the main script greatly increases their reusability: we can reference them from other scripts, and a function script can itself reference other function scripts. The one constraint is that references must not be recursive; if they are, Pig will report an error at execution time. Below are the separated script files:

One: Function script file


    --Define Pig function 1: supports grouped counting
    --A               the relation alias to group
    --group_key       the grouping field
    --number_reduces  the number of reduce tasks to use
    --returns the final relation
    DEFINE group_and_count (A, group_key, number_reduces) RETURNS B {
        D = GROUP $A BY $group_key PARALLEL $number_reduces;
        $B = FOREACH D GENERATE group, COUNT($1);
    };
    --Define Pig function 2: supports sorting
    --A            the relation alias to sort
    --order_field  the field to sort by
    --order_type   the sort order: desc or asc
    --storedir     the HDFS path to store into
    --returns nothing (void)
    DEFINE my_order (A, order_field, order_type, storedir) RETURNS void {
        D = ORDER $A BY $order_field $order_type;
        STORE D INTO '$storedir';
    };
    --Define Pig function 3: supports filtering, and calls another macro internally
    --A      the relation alias to filter
    --field  the field to filter on
    --count  the threshold value
    --returns the final relation
    DEFINE myfilter (A, field, count) RETURNS B {
        b = FILTER $A BY $field > $count;
        $B = group_and_count(b, 'sex', 1);
    };



Two: the main script file


    --Import the Pig function library
    IMPORT 'function.pig';
    A = LOAD '/tmp/dongliang/318/person' USING PigStorage(',') AS (id:int, name:chararray, sex:chararray, age:int, address:chararray);
    --------Pig function 1 test-----------------
    --Group by name
    --BB = group_and_count(A, name, 1);
    --Group by sex
    --CC = group_and_count(A, sex, 1);
    --dump BB;
    --dump CC;
    -------Pig function 2 test------------------
    --Sort by age, descending
    --my_order(A, age, 'desc', '/tmp/dongliang/318/z');
    --dump A;
    -------Pig function 3 test------------------
    --Filter rows with age greater than 20, then group by sex and count
    R = myfilter(A, 'age', 20);
    dump R;
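For reference, given the eight sample rows (the ages greater than 20 appear in rows 1, 2, 5, and 6), the final dump R of the function 3 test should print something along these lines, assuming the macros expand as written (group order may vary):

```
(male,1)
(female,3)
```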


Note that the path of the imported function file must be enclosed in single quotation marks. With that, we have achieved reuse of Pig functions. Isn't it very similar to shell syntax? Interested students, hurry up and try it out!


This article is from the "7936494" blog; please be sure to keep this source: http://7946494.blog.51cto.com/7936494/1622052

