[db] Hadoop && Pig

Last Update:2017-12-06 Source: Internet

Author: User



I recently needed to work with Hadoop and found the official Hadoop website genuinely helpful: no padding, it gets straight to how things are used, and a Chinese version is available. Simple and direct.
Hadoop documentation
In MapReduce, the map output is automatically sorted by key.

Pig

There is also Pig Latin, the language of Pig: a procedural data-flow language somewhat similar to the SQL used in MySQL (which I am not very familiar with, awkwardly).
This post summarizes how to use Pig, with small experiments along the way.

Pig data type

Simple types: double | float | long | int | chararray | bytearray
Complex types: tuple | bag | map
A tuple is similar to a MATLAB cell: its elements can have different types, e.g. ('haha', 1).
A bag is a collection of tuples, written with {}, e.g. {('haha', 1), ('hehe', 2)}.
A field is one column of data, identified by name or by position.
A map is a hash table: the key is a chararray and the value can be of any type.
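As a rough analogy only, the three complex types can be pictured with ordinary Python structures. This is a sketch, not how Pig stores data; the map keys and values are made up for illustration.

```python
# A Pig tuple: ordered fields whose types may differ.
pig_tuple = ("haha", 1)

# A Pig bag: a collection of tuples. A list is used here for convenience;
# a real bag has no defined order.
pig_bag = [("haha", 1), ("hehe", 2)]

# A Pig map: chararray keys mapped to values of any type.
# These keys and values are hypothetical.
pig_map = {"name": "haha", "score": 1, "history": pig_bag}

# Fields are addressed by position, like $0 and $1 in Pig.
first_fields = [t[0] for t in pig_bag]
print(first_fields)  # ['haha', 'hehe']
```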

Running and commenting

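Run:
Local mode: pig -x local
Cluster mode: pig -x mapreduce, or just pig
Batch mode (run a Pig script file): append the script name to either command above, e.g. pig xx.pig
Comments:
Line comment: --
Block comment: /* */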

Pig Latin

>> cat 1.txt
a 1 2 3 4.2 9.8
a 3 0 5 3.5 2.1
b 7 9 9 - -
a 7 9 9 2.6 6.2
a 1 2 5 7.7 5.9
a 1 2 3 1.4 0.2

[] marks an optional clause.
Pig Latin keywords are case insensitive; the names of relations, fields, and functions are case sensitive.

LOAD

alias = LOAD 'data_path' [USING function] [AS schema];
A = LOAD '1.txt' USING PigStorage(' ') AS (col1:chararray, col2:int, col3:int, col4:int, col5:double, col6:double);

Each row of 1.txt is split on the delimiter and parsed into the named columns col1, col2, and so on. If no schema is given, fields can be referenced by position as $0 through $n.
By default Pig uses PigStorage() to read data from local disk or a Hadoop path, and org.apache.hcatalog.pig.HCatLoader() reads Hive tables. The argument in () is the field delimiter.
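What the LOAD statement does to 1.txt can be modelled in plain Python. This is only a sketch of the semantics, not Pig's implementation: SCHEMA, cast, and load are hypothetical names, and turning an unparseable field such as "-" into None is an assumption chosen to match the empty fields in the (b,7,9,9,,) output shown in the GROUP section.

```python
# Sample rows copied from 1.txt; "-" marks a missing value.
RAW = """a 1 2 3 4.2 9.8
a 3 0 5 3.5 2.1
b 7 9 9 - -
a 7 9 9 2.6 6.2
a 1 2 5 7.7 5.9
a 1 2 3 1.4 0.2"""

# Hypothetical stand-in for the AS (...) schema: one converter per column.
SCHEMA = [("col1", str), ("col2", int), ("col3", int),
          ("col4", int), ("col5", float), ("col6", float)]

def cast(value, converter):
    """Convert one field; a value that cannot be converted becomes None."""
    try:
        return converter(value)
    except ValueError:
        return None

def load(text, delimiter=" "):
    """Split every line on the delimiter and cast each field by position."""
    rows = []
    for line in text.splitlines():
        fields = line.split(delimiter)
        rows.append(tuple(cast(f, conv) for f, (_, conv) in zip(fields, SCHEMA)))
    return rows

A = load(RAW)
print(A[2])  # ('b', 7, 9, 9, None, None)
```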

GROUP

B = GROUP A BY (col2, col3, col4);
out:
((1,2,3),{(a,1,2,3,1.4,0.2),(a,1,2,3,4.2,9.8)})
((1,2,5),{(a,1,2,5,7.7,5.9)})
((3,0,5),{(a,3,0,5,3.5,2.1)})
((7,9,9),{(a,7,9,9,2.6,6.2),(b,7,9,9,,)})

GROUP A BY (col2, col3, col4) collects the tuples of A that share the same key into one bag. Schema: B: {group: (col2, col3, col4), A: bag {tuple, tuple}}
The result of GROUP always has two fields: the first is named 'group' and holds the key, and the second is a bag containing every tuple whose key equals 'group'.
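The shape of the GROUP result can be reproduced with a dictionary of lists. This is a sketch under the assumption that the rows have already been parsed into tuples; group_by is a hypothetical helper, and the order of tuples inside each bag is not guaranteed in Pig.

```python
from collections import defaultdict

# Rows of 1.txt after loading; None stands for a null field.
A = [("a", 1, 2, 3, 4.2, 9.8), ("a", 3, 0, 5, 3.5, 2.1),
     ("b", 7, 9, 9, None, None), ("a", 7, 9, 9, 2.6, 6.2),
     ("a", 1, 2, 5, 7.7, 5.9), ("a", 1, 2, 3, 1.4, 0.2)]

def group_by(rows, key_indexes):
    """Return (group, bag) pairs: the key tuple and all rows sharing it."""
    bags = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        bags[key].append(row)
    # Sort by key only so the listing is stable.
    return sorted(bags.items(), key=lambda pair: pair[0])

# Group on col2, col3, col4 (positions 1, 2, 3).
B = group_by(A, (1, 2, 3))
for group, bag in B:
    print(group, bag)
```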

FOREACH

C = FOREACH B GENERATE group, AVG(A.col5), AVG(A.col6);
out:
((1,2,3),2.8,5.0)
((1,2,5),7.7,5.9)
((3,0,5),3.5,2.1)
((7,9,9),2.6,6.2)
where (1.4 + 4.2) / 2 = 2.8

FOREACH iterates over every tuple of a relation (here, one tuple per group) and processes it.
AVG (average), COUNT, MIN, and MAX work essentially like the Excel functions of the same names.
Schema: C: {group: (col2, col3, col4), double, double}.
FOREACH and GENERATE are normally used together. With large data, use FOREACH ... GENERATE as early as possible to drop unneeded fields and reduce the amount of data exchanged.
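The averages in the output can be checked by hand with a small Python model. It is a sketch only: avg is a hypothetical helper that skips null values, which is consistent with the (7,9,9) group averaging to 2.6 and 6.2 even though one of its tuples has empty fields.

```python
# The grouped relation B from the GROUP step; None stands for a null field.
B = [
    ((1, 2, 3), [("a", 1, 2, 3, 1.4, 0.2), ("a", 1, 2, 3, 4.2, 9.8)]),
    ((1, 2, 5), [("a", 1, 2, 5, 7.7, 5.9)]),
    ((3, 0, 5), [("a", 3, 0, 5, 3.5, 2.1)]),
    ((7, 9, 9), [("a", 7, 9, 9, 2.6, 6.2), ("b", 7, 9, 9, None, None)]),
]

def avg(values):
    """Mean of the non-null values, or None when every value is null."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

# One output tuple per group: the key, then the averages of col5 and col6.
C = [(group,
      avg([row[4] for row in bag]),
      avg([row[5] for row in bag]))
     for group, bag in B]

for group, avg5, avg6 in C:
    print(group, round(avg5, 2), round(avg6, 2))
```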

FILTER

d = filter A by $0 == 'a';
out:
(a,1,2,3,4.2,9.8)
(a,3,0,5,3.5,2.1)
(a,7,9,9,2.6,6.2)
(a,1,2,5,7.7,5.9)
(a,1,2,3,1.4,0.2)

FILTER keeps only the rows that satisfy a condition. Several conditions can be combined with AND and OR, which is useful for removing useless values such as null, 0, or -1.
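In Python terms a filter is a list comprehension with a predicate. The sketch below mirrors the col1 == 'a' filter and adds a second, made-up condition (col5 greater than 3.0) to show conditions combined with "and"; the null check is written out explicitly.

```python
# Rows of 1.txt after loading; None stands for a null field.
A = [("a", 1, 2, 3, 4.2, 9.8), ("a", 3, 0, 5, 3.5, 2.1),
     ("b", 7, 9, 9, None, None), ("a", 7, 9, 9, 2.6, 6.2),
     ("a", 1, 2, 5, 7.7, 5.9), ("a", 1, 2, 3, 1.4, 0.2)]

# Same condition as the Pig statement: keep rows whose first field is 'a'.
D = [row for row in A if row[0] == "a"]

# Hypothetical combined condition: first field is 'a', col5 is present,
# and col5 is greater than 3.0.
E = [row for row in A
     if row[0] == "a" and row[4] is not None and row[4] > 3.0]

print(len(D), len(E))  # 5 3
```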

CONCAT & SUBSTRING

B = FOREACH A GENERATE CONCAT($0, (chararray)$1, (chararray)$2, (chararray)$3);
out:
(a123)
(a305)
(b799)
(a799)
(a125)
(a123)
C = foreach B generate (chararray)SUBSTRING($0,0,2);
out:
(a1)
(a3)
(b7)
(a7)
(a1)
(a1)

CONCAT joins strings end to end. SUBSTRING($0, 0, 2) takes the characters in the half-open interval [0, 2): the start index is included and the end index is excluded.
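Python slices use the same half-open convention, so the two steps can be mirrored directly. This is a sketch using only the first four columns of 1.txt; str() plays the role of the (chararray) cast.

```python
# First four columns of every row of 1.txt.
A = [("a", 1, 2, 3), ("a", 3, 0, 5), ("b", 7, 9, 9),
     ("a", 7, 9, 9), ("a", 1, 2, 5), ("a", 1, 2, 3)]

# CONCAT: convert every field to a string and join them.
B = ["".join(str(field) for field in row) for row in A]

# SUBSTRING(s, 0, 2): index 0 is included, index 2 is excluded.
C = [s[0:2] for s in B]

print(B)  # ['a123', 'a305', 'b799', 'a799', 'a125', 'a123']
print(C)  # ['a1', 'a3', 'b7', 'a7', 'a1', 'a1']
```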

Display and storage

DUMP C;
STORE C INTO 'output_path';

DUMP prints a relation to the console; every out: listing above was produced with it. STORE writes a relation to the given output path.

JOIN, UNION, COGROUP, CROSS

a.txt:
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
b.txt:
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
A = LOAD 'a.txt' USING PigStorage(',');
B = LOAD 'b.txt' USING PigStorage(',');

JOIN

C = JOIN A BY $0, B BY $1;
out:
(2,Tie,Hank,2)
(2,Tie,Joe,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)

JOIN concatenates the rows of the two relations whose keys match. This is an inner join. A common pattern is to join a small table against a large table, which also acts as a partial filter.
There is also a left join, written with LEFT OUTER.
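The inner join above can be modelled as an index lookup. This is a minimal sketch, not Pig's join strategy: inner_join is a hypothetical helper, and the result is sorted only so that it lines up with the listing.

```python
from collections import defaultdict

A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]

def inner_join(left, left_key, right, right_key):
    """Concatenate every left row with every right row that has the same key."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    # Rows without a partner on the other side are dropped.
    return [l + r for l in left for r in index.get(l[left_key], [])]

# JOIN A BY $0, B BY $1
C = sorted(inner_join(A, 0, B, 1))
for row in C:
    print(row)
```

(1,Scarf) and (Ali,0) have no partner on the other side, so neither appears in the result.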

UNION

D = UNION A, B;
out:
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)

UNION can also be applied to relations that have different numbers of fields.

COGROUP

E = COGROUP A BY $0, B BY $1;
E = COGROUP A BY $0, B BY $1 outer;
out:
(0,{},{(Ali,0)})
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Hank,2),(Joe,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
F = COGROUP A BY $0 inner, B BY $1;
out:
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Hank,2),(Joe,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})

COGROUP outputs nested tuples, one for each distinct key. The first field of each tuple is the key. The remaining fields are bags holding the tuples of each relation that match the key: the first bag contains the matching tuples from A, the second those from B, and a relation with no match contributes an empty bag {}.
COGROUP is an outer grouping by default, so the two statements for E are equivalent. Marking A as INNER drops the groups in which the bag from A is empty, which is why key 0 is missing from F.
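Both COGROUP variants can be modelled in a few lines of Python. This is a sketch: cogroup and its left_inner flag are hypothetical names, the flag only stands in for writing INNER after the first relation, and the order of tuples inside a bag is not guaranteed in Pig.

```python
A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]

def cogroup(left, left_key, right, right_key, left_inner=False):
    """Return (key, left_bag, right_bag) for every key seen in either input."""
    keys = sorted({row[left_key] for row in left} |
                  {row[right_key] for row in right})
    result = []
    for key in keys:
        left_bag = [row for row in left if row[left_key] == key]
        right_bag = [row for row in right if row[right_key] == key]
        # Models INNER on the first relation: skip keys it does not contain.
        if left_inner and not left_bag:
            continue
        result.append((key, left_bag, right_bag))
    return result

E = cogroup(A, 0, B, 1)                   # default outer behaviour
F = cogroup(A, 0, B, 1, left_inner=True)  # A marked as inner

print(E[0])      # (0, [], [('Ali', 0)])
print(len(F))    # 4
```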

CROSS

F = CROSS A, B;
out:
(1,Scarf,Hank,2)
(1,Scarf,Eve,3)
(1,Scarf,Ali,0)
(1,Scarf,Hank,4)
(1,Scarf,Joe,2)
(3,Hat,Hank,2)
(3,Hat,Eve,3)
(3,Hat,Ali,0)
(3,Hat,Hank,4)
(3,Hat,Joe,2)
(4,Coat,Hank,2)
(4,Coat,Eve,3)
(4,Coat,Ali,0)
(4,Coat,Hank,4)
(4,Coat,Joe,2)
(2,Tie,Hank,2)
(2,Tie,Eve,3)
(2,Tie,Ali,0)
(2,Tie,Hank,4)
(2,Tie,Joe,2)

CROSS computes the Cartesian product. Each tuple of the first relation is concatenated with every tuple of the second, so the size of the output is the product of the sizes of the input relations.
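The size claim is easy to verify with itertools.product from the Python standard library. A sketch using the same two inputs:

```python
from itertools import product

A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]

# Every tuple of A concatenated with every tuple of B.
F = [a + b for a, b in product(A, B)]

print(len(F))  # 20
print(F[0])    # (2, 'Tie', 'Joe', 2)
```

With 4 tuples in A and 5 in B the result has 4 x 5 = 20 tuples, which is why CROSS gets expensive quickly on large inputs.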

That is all for now; more may be added later.



