Pig in practice: a summary of commonly used Pig syntax for a quick start

Source: Internet
Author: User

This article is a personal summary, compiled largely from Google searches, of commonly used Pig syntax.

Basic concepts:
Relation, bag, tuple, field, data
A relation is a bag.
A bag is a collection of tuples. In Pig, {} denotes a bag.
A tuple is an ordered set of fields. In Pig's data notation, () denotes a tuple.
A field is a piece of column data.
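A sketch of this notation, using made-up sample values:

(Alice, 20)                        -- a tuple with two fields
{(Alice, 20), (Bob, 21)}           -- a bag containing two tuples
(group1, {(Alice, 20), (Bob, 21)}) -- a tuple whose second field is a bag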

Mapping to a database:
Pig             Database
Relation/bag    table
Tuple           one record (row)
Field           one column

However, the number of fields in a Pig tuple is variable, which is different from the fixed columns of a database table.

Run modes and comments:
1. Running modes
A) Local
I. pig -x local
B) Cluster
I. pig -x mapreduce
II. or simply: pig
C) Batch processing
I. Write the commands into a file, e.g. xx.pig
II. Run it in local or cluster mode, e.g. pig xx.pig

2. Comments
A) Block comments: /* */
B) Line comments: --
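Both comment styles in one made-up snippet:

/*
 * A block comment spanning several lines.
 */
A = load '1.txt'; -- a line comment after a statement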

Basic usage framework:
A) Input
I. A = load 'input_path' using PigStorage('\t') as (name:chararray, age:int, gpa:float);
II. Logic: use Pig's PigStorage to read data from the input path, split each row on '\t', and name the fields of each row: name (chararray), age (int), and gpa (float).
B) Intermediate processing
I. B = foreach A generate name;
C) Output
I. Dump B; -- output to the console
II. or: store B into 'output_path' using PigStorage();
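Putting the three steps together, a minimal end-to-end script might look like this (the paths are illustrative):

A = load 'input_path' using PigStorage('\t') as (name:chararray, age:int, gpa:float);
B = foreach A generate name; -- keep only the name column
store B into 'output_path' using PigStorage();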

Basic syntax examples:
1. A real example: calculating averages under a multi-field combination
Requirement: for each combination of columns 2, 3, and 4 in the file, calculate the averages of columns 5 and 6.
Script explanation:
A = load '1.txt' using PigStorage(' ') as (col1:chararray, col2:int, col3:int, col4:int, col5:double, col6:double);
Split each line of 1.txt on ' ', parse the fields according to the schema (col1:chararray, col2:int, col3:int, col4:int, col5:double, col6:double), and load them into bags/tuples/fields.
col1 etc. are aliases for the columns. If no aliases are given, you can reference columns positionally as $0 ... $n, but readability suffers.

The structure of A is A: {col1: chararray, col2: int, col3: int, col4: int, col5: double, col6: double}

B = group A by (col2, col3, col4);
Group A by the combination of col2, col3, and col4; the tuples in each group are collected into a bag.
The structure of B is B: {group: (col2, col3, col4), A: {(col1, col2, col3, col4, col5, col6)}}

C = foreach B generate group, AVG(A.col5), AVG(A.col6);
foreach iterates over the groups in B; inside it, the bag A holds each group's tuples, whose fields are processed with the aggregate function.
The structure of C is C: {group: (col2, col3, col4), double, double}

Dump C; -- print to the console for debugging in standalone runs
store C into 'output'; -- store the result

2. Counting rows
A = load '1.txt' using PigStorage(' ') as (col1:chararray, col2:int, col3:int, col4:int, col5:double, col6:double);
B = group A all;
C = foreach B generate COUNT(A.col2);
Dump C;
Note: if col2 is null in a row, that row is not counted.
"group A all" takes no "by" keyword.
If you want null values included in the count, use COUNT_STAR.
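A sketch of the difference, assuming some rows of A have a null col2:

B = group A all;
C = foreach B generate COUNT(A.col2);      -- skips rows where col2 is null
D = foreach B generate COUNT_STAR(A.col2); -- counts every row, nulls included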

3. Flatten
Same as example 1, but C is different:
C = foreach B generate flatten(group), AVG(A.col5), AVG(A.col6);
The result structure in example 1 is ((col2, col3, col4), avg1, avg2).
After the group is flattened, however, the structure becomes (col2, col3, col4, avg1, avg2). As you can see, flatten is used to un-nest a bag or tuple.

4. Two notes about group:
A) if the key used for group has more than one field (as in the previous example), the key of the data after group is a tuple ), otherwise, it is the same type as the key used for the group.
B) The result obtained after group is a package or relation. each tuple contains two fields. The first field is named 'group', and the second field is a package, is the set of all tuple containing the group value.

5. Loading data as tuples
Syntax: A = load '1.txt' as (T: tuple(col1:chararray, col2:int, col3:int, col4:int, col5:double, col6:double));
Use describe A; to view the structure of A.
Note: an input line must have the form (xx, xx, ..., xx) to be loaded as a tuple. This is suitable for further processing of intermediate results.
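For example, an input line of the following made-up form loads as a single tuple field, and describe should print something like the commented line:

-- a line in 1.txt: (a,1,2,3,4.0,5.0)
A = load '1.txt' as (T: tuple(col1:chararray, col2:int, col3:int, col4:int, col5:double, col6:double));
describe A;
-- A: {T: (col1: chararray,col2: int,col3: int,col4: int,col5: double,col6: double)}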

6. Counting distinct values within each group of a multi-field combination
C = foreach B {
D = distinct A.col5;
generate group, COUNT(D);
};
Note: distinct takes a bag; here it removes duplicate col5 values within each group. The intermediate relation D must be assigned with "=" inside the foreach block, before the generate.

7. Converting a relation into a scalar
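A sketch of what this means in practice (assuming Pig 0.8+; the file name and fields are illustrative): a relation that holds exactly one tuple can be projected as a scalar inside another foreach, for example to normalize a column by its maximum:

A = load '7.txt' as (col1:int);
B = group A all;
C = foreach B generate MAX(A.col1) as max_val; -- C contains exactly one tuple
D = foreach A generate col1, (double)col1 / C.max_val; -- C.max_val is used as a scalar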

8. Using shell commands in Pig for auxiliary processing
A = load '2.txt' as (col1:int, col2:int, col3:int, col4:chararray, col5:int);
B = stream A through `awk '{if ($4 == "=") print $1"\t"$2"\t"$3"\t"999"\t"$5; else print $0}'`;
Dump B;
Purpose: replace "=" in column 4 with 999.

9. Passing parameters to a Pig script:
Syntax: pig -param output_dir="/xxx" xxx.pig
In xxx.pig: store D into '$output_dir';

10. The role of cogroup:
Group the tuples of two relations into bags by the specified fields.
A = load '3.txt' as (acol1:chararray, acol2:int, acol3:int);
B = load '4.txt' as (bcol1:int, bcol2:chararray, bcol3:int);
C = cogroup A by acol1, B by bcol2;
Dump C;
Structure of the result: {key, {bag of A's tuples, possibly empty}, {bag of B's tuples, possibly empty}}
A and B are grouped by the values of A's first column and B's second column. If A contains tuples with a given key, they appear in the corresponding bag; otherwise that bag is empty {}. The same applies to B.

11. piggybank
Piggybank is the repository of user-contributed Pig functions (UDFs).
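To use a piggybank function, register the jar and call the function by its full class name (the jar path is illustrative; UPPER is one of the string functions historically shipped in piggybank):

register /path/to/piggybank.jar;
B = foreach A generate org.apache.pig.piggybank.evaluation.string.UPPER(col1);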

12. A UDF's constructor may be called multiple times, so be especially careful about doing heavy work in it.

13. Loading data from multiple directories
load '/data/201{1,2}'
loads the data under both /data/2011 and /data/2012.

14. group does not ignore null keys, while COUNT ignores null values
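A sketch of this behavior, with made-up data: a null key still forms its own group, but COUNT over a column skips the nulls:

A = load '14.txt' as (col1:int, col2:chararray); -- assume one row has a null col1
B = group A by col1;                             -- a group with key null still appears in B
C = foreach B generate group, COUNT(A.col1);     -- within the null-key group, COUNT(A.col1) is 0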

15. Counting the number of combinations of certain fields:
A = load '15.txt' as (col1:int, col2:int);
B = group A by (col1, col2);
C = group B all;
D = foreach C generate COUNT(B);
Dump D;
Idea: group the data by the fields in question, then group everything together and count the groups.
Note:
A) In D we count B, because C groups all of B's data into a new relation containing a single tuple: its first field is "all", and its second field is a bag made up of all of B's tuples.
B) The argument of COUNT must be a bag. Therefore, before counting, use "group X all" to collect all tuples of X into one bag.

16. How do I get a float when dividing two integers?
A = load '16.txt' as (col1:int, col2:int);
B = foreach A generate (float)col1 / col2;
Dump B;
Note that the cast is applied to col1 before the division, not as (float)(col1/col2) — the latter performs integer division first.

17. Aggregating the union of two datasets
A = load '1.txt' as (col1:int, col2:int);
B = load '2.txt' as (col3:int, col4:int);
C = union A, B;
D = group C by $0;
E = foreach D generate group, SUM(C.$1);
-- or: E = foreach D generate flatten(group), SUM(C.$1);
Dump E;
Note:
The union operation adds one relation's tuples to another. After the union, check how the fields are referenced; use describe to see the structure.
When multiple relations are involved, column references must be qualified with the relation name.

18. Regular expression filtering
Requirement: keep the rows whose URL matches a pattern like "//*.qq.com/".
A = load '18.txt' as (col1:int, col2:chararray);
B = filter A by col2 matches '.*//.*\\.qq\\.com/.*';
Note: ".*" matches zero or more characters.
A literal "." must be escaped as "\."; inside single quotes the backslash itself must also be escaped, giving "\\.".

19. Truncating strings:
SUBSTRING(col1, 0, 4): 0 is the startIndex, 4 is the stopIndex; the character at the stopIndex is not included.

A = load '19.txt' as (datastr:chararray, col2:chararray);
B = foreach A generate SUBSTRING(datastr, 0, 4);
C = distinct B;
Dump C;

20. Concatenating strings:
A = load '20.txt' as (col1:chararray, col2:int);
B = foreach A generate CONCAT(col1, (chararray)col2);
Dump B;
Note: the concatenated fields must be of type chararray; if not, cast them first.
Nested use: CONCAT(a, CONCAT(b, c));

21. Using join to compute the intersection of two datasets (joining on fields of different data types causes an error)
A = load '211.txt' as (a:int);
B = load '212.txt' as (b:int);
C = join A by a, B by b;
D = group C all;
E = foreach D generate COUNT(C);
Dump E;
After the join, the structure of C is: C: {A::a: int, B::b: int}

Deduplicated count:
A = load '211.txt' as (a:int);
B = load '212.txt' as (b:int);
C = join A by a, B by b;
uniq_C = distinct C;
D = group uniq_C all;
E = foreach D generate COUNT(uniq_C);
Dump E;

22. Use the bincond operator "?:" to replace null values.
B = foreach A generate ((col1 is null) ? -1 : col1);

23. Filling in nulls after a calculation
A = load '23.txt' as (col1:int, B: tuple(col2:int, col3:int));
B = foreach A generate col1, flatten(B);
C = group B by col1;
D = foreach C generate group, SUM(B.col1) as sum;
E = foreach D generate group, ((sum is null) ? 0 : sum);
Dump E;

24. The distinct operation removes duplicates. Duplicates can only be identified once the dataset is brought together, so distinct triggers a reduce phase. It also uses the combiner in the map stage to remove part of the duplicate data early and speed up processing.

25. Raising a Pig job's priority: set job.priority high; — this helps the job get scheduled sooner.

26. The "scalars can be only used with projections" error and nested/inner foreach
Requirement: for each value in the first column, count how many records have 3 and how many have 6 in the second column.

A = load '26.txt' as (col1:chararray, col2:int);
B = group A by col1;
C = foreach B {
D = filter A by col2 == 3;
E = filter A by col2 == 6;
generate group, COUNT(D), COUNT(E);
};
Dump C;

27. In grunt mode, Ctrl+A and Ctrl+E take the place of Home and End: they jump to the beginning and end of the line.

28. To join a relation with itself, the same data must be loaded twice, under two aliases; otherwise an error occurs.

29. Outer join
LEFT: all rows of the left relation are kept.

A = load '291.txt' as (col1:int, col2:chararray);
B = load '292.txt' as (col1:int, col2:chararray);
C = join A by col1 left, B by col1;
describe C;
Dump C;
This is consistent with a database's left/right outer joins and inner joins.

30. Chinese characters can be used in filters in Pig scripts, but not in interactive (grunt) mode.

31. Use the SIZE function to count the number of fields in a tuple, the number of tuples in a bag, and the number of key/value pairs in a map.

32. Use "col is null" to test for null, but an empty string "" is not null and cannot be filtered that way — use SIZE(xx) > 0 instead.
filter A by (col1 is not null and (SIZE(col2) > 0));

33. Operators in Pig that trigger the reduce phase
group: the group operation collects all records with the same key together, which requires a shuffle → reduce.
order: all records must be collected together to be sorted, so order triggers the reduce phase. In addition to the job you write, Pig adds an extra M-R job to your data flow, because it needs to sample your dataset to determine the data distribution and avoid the inefficiency caused by severely skewed data.
distinct: records must be collected together to decide whether they are duplicates, so distinct triggers the reduce phase. It also uses the combiner to remove duplicate records in the map stage.
join: join computes overlaps, and records with the same key must be collected together, so join triggers the reduce phase.
limit: records must be collected together to count how many are returned, so limit triggers the reduce phase.
cogroup: similar to group (see above), so it triggers the reduce phase.
cross: computes the cross product of two or more relations, and also triggers the reduce phase.

34. How to count occurrences of a specific character in a string
Shell one-liner: awk -F":" '{print NF-1}'
Split the string on ":" and print the number of fields minus one, i.e. the number of ":" characters.
