[db] Hadoop && Pig

Source: Internet
Author: User

Hadoop

I recently needed to work with Hadoop and found the official Hadoop website refreshingly good: no fluff, it shows directly how to use things, and there is even a Chinese version. Simple and direct!
Hadoop documentation
In MapReduce, the output of the map phase is sorted automatically (by key) before it reaches the reducers.
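
The automatic sorting can be pictured with a toy model in plain Python. This is only a sketch of the idea, not Hadoop code: the names map_phase and shuffle_sort are made up for illustration, and a real cluster does this per partition and across machines.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Emit (word, 1) pairs in the order the words are read (hypothetical mapper)."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    """Model of the framework step: sort the map output by key, then group equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

# The mapper emits keys unsorted; the reducer side sees them sorted and grouped.
reducer_input = shuffle_sort(map_phase(["pig hadoop", "hadoop map pig pig"]))
print(reducer_input)
```

The mapper itself never sorts anything; the ordering comes from the step between map and reduce.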

Pig

There is also Pig Latin, the language of Pig: a procedural language that is somewhat similar to the SQL used in MySQL (which, awkwardly, I am not very familiar with).
This post summarizes how to use Pig, with a few experiments as examples along the way.

Pig data types
    • double | float | long | int | chararray
      | bytearray
    • tuple | bag | map
      A tuple is similar to a MATLAB cell: its elements can have different types, e.g. ('haha', 1)
      A bag is a collection of tuples, written with {}, e.g. {('haha', 1), ('hehe', 2)}
      A field is one column of data inside a tuple
      A map is a hash table: the key is a chararray, the value can be of any type
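
For readers who think in Python, the three complex types can be pictured roughly as follows. The mapping is an analogy assumed for illustration only; in particular a Pig bag is unordered, and the list here just stands in for it.

```python
# Tuple: an ordered row whose fields may have different types.
t = ("haha", 1)

# Bag: a collection of tuples (duplicates allowed); a list is used as a stand-in.
bag = [("haha", 1), ("hehe", 2)]

# Map: chararray keys mapped to values of any type.
m = {"name": "haha", "count": 1}

# Fields are addressed by position, much like $0 and $1 in Pig.
first_fields = [row[0] for row in bag]
print(first_fields)
```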
Running and commenting
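Running:
    Local mode:   pig -x local
    Cluster mode: pig -x mapreduce (or just pig)
    Batch mode:   append the name of the Pig script to either command above, e.g. pig xx.pig
Comments:
    Line comment:  --
    Block comment: /* */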
Pig Latin
    >> cat 1.txt
    a 1 2 3 4.2 9.8
    a 3 0 5 3.5 2.1
    b 7 9 9 - -
    a 7 9 9 2.6 6.2
    a 1 2 5 7.7 5.9
    a 1 2 3 1.4 0.2

Square brackets [] mark an optional part.
Pig keywords are case insensitive (relation names, field names and function names are case sensitive).

LOAD
    alias = LOAD 'data_path' [USING function] [AS schema];
    A = LOAD '1.txt' USING PigStorage(' ') AS (col1:chararray, col2:int, col3:int, col4:int, col5:double, col6:double);

Each row of 1.txt is split on the delimiter and parsed into the named columns col1, col2, ...; if no schema is given, the fields can be referenced by position as $0 ... $n.
By default Pig uses PigStorage() to read data from the local disk or a Hadoop path; org.apache.hcatalog.pig.HCatLoader() reads Hive tables. The argument in () is the field delimiter.
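
What the AS schema does to each line can be sketched in Python. This is a rough model under the assumption that a value which cannot be cast (the "-" entries in 1.txt) ends up as null, which matches the empty fields in the outputs of this post; SCHEMA and parse_line are names invented for the sketch.

```python
# Column names and Python casts standing in for the Pig schema used above.
SCHEMA = [("col1", str), ("col2", int), ("col3", int),
          ("col4", int), ("col5", float), ("col6", float)]

def parse_line(line, delimiter=" "):
    """Split one line on the delimiter and cast each field; a failed cast becomes None."""
    row = []
    for raw, (_, cast) in zip(line.split(delimiter), SCHEMA):
        try:
            row.append(cast(raw))
        except ValueError:
            row.append(None)  # plays the role of Pig's null
    return tuple(row)

print(parse_line("a 1 2 3 4.2 9.8"))
print(parse_line("b 7 9 9 - -"))
```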

GROUP
    B = GROUP A BY (col2, col3, col4);

    out:
    ((1,2,3),{(a,1,2,3,1.4,0.2),(a,1,2,3,4.2,9.8)})
    ((1,2,5),{(a,1,2,5,7.7,5.9)})
    ((3,0,5),{(a,3,0,5,3.5,2.1)})
    ((7,9,9),{(a,7,9,9,2.6,6.2),(b,7,9,9,,)})

This groups A by (col2, col3, col4) and puts the tuples of each group into one bag. Schema of B: {group: (col2, col3, col4), A: {(tuple), (tuple), ...}}
GROUP turns the data into (group, bag) pairs: the first field is named 'group' and holds the key; the second field is a bag, named after the original relation, that contains all tuples whose key equals 'group'.
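
The shape of the GROUP result can be reproduced with a small Python model. It is a sketch only: group_by is a made-up helper, the rows are the parsed contents of 1.txt with None for null, and the sorted order is chosen here for readability rather than guaranteed by Pig.

```python
from collections import defaultdict

# Rows of 1.txt after loading; None stands for null.
A = [("a", 1, 2, 3, 4.2, 9.8), ("a", 3, 0, 5, 3.5, 2.1), ("b", 7, 9, 9, None, None),
     ("a", 7, 9, 9, 2.6, 6.2), ("a", 1, 2, 5, 7.7, 5.9), ("a", 1, 2, 3, 1.4, 0.2)]

def group_by(rows, key_positions):
    """Return (group, bag) pairs: the key tuple and every row that carries that key."""
    bags = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in key_positions)
        bags[key].append(row)
    return sorted(bags.items())

# Group on col2, col3, col4 (positions 1, 2, 3).
B = group_by(A, (1, 2, 3))
for group, bag in B:
    print(group, bag)
```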

FOREACH
    C = FOREACH B GENERATE group, AVG(A.col5), AVG(A.col6);

    out:
    ((1,2,3),2.8,5.0)
    ((1,2,5),7.7,5.9)
    ((3,0,5),3.5,2.1)
    ((7,9,9),2.6,6.2)

    where (1.4 + 4.2) / 2 = 2.8

FOREACH iterates over every tuple of a relation (here, every group) and processes it.
AVG (average), COUNT (count), MIN, MAX: the names are basically the same as the corresponding Excel functions.
Schema of C: {group: (col2, col3, col4), double, double}.
FOREACH and GENERATE are normally used together. When the data is large, it is recommended to use FOREACH ... GENERATE as early as possible to drop unneeded fields and reduce the amount of data exchanged.
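
The averages in the output can be checked with a few lines of Python. The sketch assumes that AVG skips null values, which is consistent with the (7,9,9) group above where the b row has no numbers; avg is a hypothetical helper and only two of the groups are shown.

```python
def avg(values):
    """Average that ignores None; returns None when no value is left."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

# Two of the (group, bag) pairs produced by the GROUP step; None stands for null.
B = [((1, 2, 3), [("a", 1, 2, 3, 1.4, 0.2), ("a", 1, 2, 3, 4.2, 9.8)]),
     ((7, 9, 9), [("a", 7, 9, 9, 2.6, 6.2), ("b", 7, 9, 9, None, None)])]

# For each group keep the key and the averages of col5 and col6 (positions 4 and 5).
C = [(group,
      round(avg([row[4] for row in bag]), 2),
      round(avg([row[5] for row in bag]), 2))
     for group, bag in B]
print(C)
```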

FILTER
    d = filter A by $0 == 'a';

    out:
    (a,1,2,3,4.2,9.8)
    (a,3,0,5,3.5,2.1)
    (a,7,9,9,2.6,6.2)
    (a,1,2,5,7.7,5.9)
    (a,1,2,3,1.4,0.2)

FILTER keeps only the rows that satisfy a condition. Several conditions can be combined with AND and OR, for example to drop useless values such as null, 0, -1, ...
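
In Python terms a filter is a list comprehension over the rows. The sketch below repeats the condition from the example and adds a combined condition; the second condition is made up purely to show and/or and is not from the original experiment.

```python
# Rows of 1.txt after loading; None stands for null.
A = [("a", 1, 2, 3, 4.2, 9.8), ("a", 3, 0, 5, 3.5, 2.1), ("b", 7, 9, 9, None, None),
     ("a", 7, 9, 9, 2.6, 6.2), ("a", 1, 2, 5, 7.7, 5.9), ("a", 1, 2, 3, 1.4, 0.2)]

# Same condition as the Pig example: first field equals 'a'.
D = [row for row in A if row[0] == "a"]

# Combined condition: col5 is not null AND (col2 is 1 OR col2 is 7).
E = [row for row in A if row[4] is not None and (row[1] == 1 or row[1] == 7)]

print(len(D), len(E))
```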

CONCAT & SUBSTRING
    B = FOREACH A GENERATE CONCAT($0, (chararray)$1, (chararray)$2, (chararray)$3);

    out:
    (a123)
    (a305)
    (b799)
    (a799)
    (a125)
    (a123)

    C = foreach B generate (chararray)SUBSTRING($0,0,2);

    out:
    (a1)
    (a3)
    (b7)
    (a7)
    (a1)
    (a1)

CONCAT joins strings together; SUBSTRING cuts a string by index over the half-open interval [0, 2), i.e. the start is included and the end is excluded.
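
Python slicing uses the same half-open convention, so the two steps can be mirrored directly. The helper names concat and substring are invented for this sketch, and only the first three rows are used.

```python
def concat(*parts):
    """Cast every part to a string and join them, like CONCAT with chararray casts."""
    return "".join(str(p) for p in parts)

def substring(s, start, stop):
    """Characters from start (inclusive) to stop (exclusive)."""
    return s[start:stop]

# First four fields of the first three rows of 1.txt.
rows = [("a", 1, 2, 3), ("a", 3, 0, 5), ("b", 7, 9, 9)]

B = [concat(*row) for row in rows]
C = [substring(s, 0, 2) for s in B]
print(B, C)
```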

Display and storage
    DUMP C;
    STORE C INTO 'output_path';

DUMP prints a relation to the screen; all the out: listings above were produced this way. STORE writes the relation to storage.

JOIN, UNION, COGROUP, CROSS
    a.txt:
    (2,Tie)
    (4,Coat)
    (3,Hat)
    (1,Scarf)

    b.txt:
    (Joe,2)
    (Hank,4)
    (Ali,0)
    (Eve,3)
    (Hank,2)

    A = LOAD 'a.txt' USING PigStorage(',');
    B = LOAD 'b.txt' USING PigStorage(',');
JOIN
    C = JOIN A BY $0, B BY $1;

    out:
    (2,Tie,Hank,2)
    (2,Tie,Joe,2)
    (3,Hat,Eve,3)
    (4,Coat,Hank,4)

Rows of the two relations are combined wherever their keys match. This is an inner join; it is typically used to join a small table with a large one, which also partly acts as a filter.
There is also a left join: LEFT OUTER.
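
The difference between the inner join and the left outer join can be sketched in Python on the same data. Both functions are naive nested-loop models written for this post, not how Pig executes a join, and None stands for null.

```python
A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]

def inner_join(left, right, lkey, rkey):
    """Concatenate every pair of rows whose key fields are equal."""
    return [l + r for l in left for r in right if l[lkey] == r[rkey]]

def left_outer_join(left, right, lkey, rkey, right_width):
    """Like inner_join, but unmatched left rows are kept and padded with None."""
    out = []
    for l in left:
        matches = [r for r in right if r[rkey] == l[lkey]]
        if not matches:
            matches = [(None,) * right_width]
        out.extend(l + r for r in matches)
    return out

C = inner_join(A, B, 0, 1)
L = left_outer_join(A, B, 0, 1, right_width=2)
print(sorted(C))
print(L)
```

In the left outer version the Scarf row survives even though no row of B has key 1.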

UNION
    D = UNION A, B;

    out:
    (Joe,2)
    (Hank,4)
    (Ali,0)
    (Eve,3)
    (Hank,2)
    (2,Tie)
    (4,Coat)
    (3,Hat)
    (1,Scarf)

UNION can also be applied to relations that have different numbers of fields.

COGROUP
    E = COGROUP A BY $0, B BY $1;
    E = COGROUP A BY $0, B BY $1 outer;

    out:
    (0,{},{(Ali,0)})
    (1,{(1,Scarf)},{})
    (2,{(2,Tie)},{(Hank,2),(Joe,2)})
    (3,{(3,Hat)},{(Eve,3)})
    (4,{(4,Coat)},{(Hank,4)})

    F = COGROUP A BY $0 inner, B BY $1;

    out:
    (1,{(1,Scarf)},{})
    (2,{(2,Tie)},{(Hank,2),(Joe,2)})
    (3,{(3,Hat)},{(Eve,3)})
    (4,{(4,Coat)},{(Hank,4)})

The result is a set of nested tuples. COGROUP generates one tuple for each distinct key. The first field of each tuple is the key; the remaining fields are bags holding the tuples from each relation that match the key. The first bag holds the matching tuples from A, the second those from B; where nothing matches, the bag is empty: {}.
COGROUP behaves like an outer join by default.
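
A small Python model makes the nested structure and the effect of inner easier to see. cogroup and its left_inner flag are invented for this sketch; the flag imitates writing inner after A, which drops every key whose bag from A is empty.

```python
A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]

def cogroup(left, right, lkey, rkey, left_inner=False):
    """One (key, left_bag, right_bag) tuple per distinct key found in either input."""
    keys = sorted({l[lkey] for l in left} | {r[rkey] for r in right})
    out = []
    for k in keys:
        left_bag = [l for l in left if l[lkey] == k]
        right_bag = [r for r in right if r[rkey] == k]
        if left_inner and not left_bag:
            continue  # inner on the left side: skip keys the left input lacks
        out.append((k, left_bag, right_bag))
    return out

E = cogroup(A, B, 0, 1)
F = cogroup(A, B, 0, 1, left_inner=True)
for row in E:
    print(row)
```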

CROSS
    F = CROSS A, B;

    out:
    (1,Scarf,Hank,2)
    (1,Scarf,Eve,3)
    (1,Scarf,Ali,0)
    (1,Scarf,Hank,4)
    (1,Scarf,Joe,2)
    (3,Hat,Hank,2)
    (3,Hat,Eve,3)
    (3,Hat,Ali,0)
    (3,Hat,Hank,4)
    (3,Hat,Joe,2)
    (4,Coat,Hank,2)
    (4,Coat,Eve,3)
    (4,Coat,Ali,0)
    (4,Coat,Hank,4)
    (4,Coat,Joe,2)
    (2,Tie,Hank,2)
    (2,Tie,Eve,3)
    (2,Tie,Ali,0)
    (2,Tie,Hank,4)
    (2,Tie,Joe,2)

CROSS computes the Cartesian product: each tuple in the first relation is concatenated with every tuple in the second. The size of the output is the product of the sizes of the inputs.
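
The size claim is easy to verify in Python with itertools.product, used here as a stand-in for CROSS on the same two small tables.

```python
from itertools import product

A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]

# Concatenate every tuple of A with every tuple of B.
F = [a + b for a, b in product(A, B)]

# 4 rows times 5 rows gives 20 rows.
print(len(F), len(A) * len(B))
```

This growth is why a cross product on large inputs gets expensive very quickly.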

That is all for now; more may be added later.

