Hadoop && Pighadoop
Recently need to use the Hadoop operation, found that the website of Hadoop really conscience, not so much nonsense, directly understand how to use, but also Chinese, simple rough ah!!!
Hadoop document
In MapReduce, the output of map has automatic sorting function!!!
Pig
There is also a pig language, a process language, similar to MySQL (this thing is not familiar ~ awkward).
Summarize the use of pig, incidentally experiment example.
Pig data type
- Double | float | Long | int | Chararry
| ByteArray
- Tuple | Bag | Map
Tuple similar to Matlab cell, element type can be different (' haha ', 1)
Bag equals a set of tuple, denoted by {}, {(' haha ', 1), (' hehe ', 2)}
field is the column data ID
Map equals hash table, key is chararray,value for any type
Running and commenting
运行:本地模式:pig -x local集群模式:pig -x mapreduce 或者 pig批处理pig文件,上两行命令后接pig文件名,pig xx.pig注释:行注释 --段注释 /**/
Pig Latin
>> cat 1.txta 1 2 3 4.2 9.8a 3 0 5 3.5 2.1b 7 9 9 - -a 7 9 9 2.6 6.2a 1 2 5 7.7 5.9a 1 2 3 1.4 0.2
[] represents an option
Pig command is case insensitive
LOAD
load = LOAD ‘data_path‘ [USING function] [AS schema]A = LOAD ‘1.txt‘ USING PigStorage(‘ ‘) AS (col1:chararray, col2:int, col3:int, col4:int, col5:double, col6:double);
The data of each row of 1.txt is divided into the corresponding col1,col2 column name for data parsing, if not specified, it can be $0 $n
indexed.
The pig defaults PigStorage()
to reading the local disk or Hadoop path data, org.apache.hcatalog.pig.HCatLoader()
reading the hive table, and () the delimiter.
GROUP
B = GROUP A BY (col2, col3, col4);out:((1,2,3),{(a,1,2,3,1.4,0.2),(a,1,2,3,4.2,9.8)})((1,2,5),{(a,1,2,5,7.7,5.9)})((3,0,5),{(a,3,0,5,3.5,2.1)})((7,9,9),{(a,7,9,9,2.6,6.2),(b,7,9,9,,)})
Use (col2, col3, COL4) to group A, and then group each tuple into one bag. B:{group: (COL2,COL3,COL4), a:bag:{tuple,tuple}}
Group grouping operations, grouping data into Group_col:bag, the first field is named ' Group ', and the second field is bag, which contains all the tuple values that correspond to ' group '.
Foreach
C = FOREACH B GENERATE group, AVG(A.col5), AVG(A.col6);out:((1,2,3),2.8,5.0)((1,2,5),7.7,5.9)((3,0,5),3.5,2.1)((7,9,9),2.6,6.2)其中(1.4+4,2)/2=2.8
foreach is a tuple that iterates through each group and processes it.
AVG(平均),COUNT(计数),MIN,MAX
Basic and Excel abbreviations are consistent.
C:{group: (COL2,COL3,COL4), double,double}.
Typically foreach and generate are used in a piece, and it is recommended to use the Foreach generate to filter out excess information as early as possible to reduce data exchange when the data is large.
FILTER
d = filter A by $0 == ‘a‘;out:(a,1,2,3,4.2,9.8)(a,3,0,5,3.5,2.1)(a,7,9,9,2.6,6.2)(a,1,2,5,7.7,5.9)(a,1,2,3,1.4,0.2)
Filter data as required, you can connect multiple conditions with and or to filter out useless informationnull 0 -1...
CONCAT & SUBSTRING
B = FOREACH A GENERATE CONCAT($0, (chararray)$1,(chararray)$2,(chararray)$3);out:(a123)(a305)(b799)(a799)(a125)(a123)C = foreach B generate (chararray)SUBSTRING($0,0,2);out:(a1)(a3)(b7)(a7)(a1)(a1)
Concat stitching two strings, substring the string by length [0,2) left closed right open interval.
Display and storage
DUMP C;STORE C INTO ‘output_path‘
Dump shows all the out parts above. Store is storage.
Join,union,cogroup,cross
a.txt:(2,Tie)(4,Coat)(3,Hat)(1,Scarf)b.txt:(Joe,2)(Hank,4)(Ali,0)(Eve,3)(Hank,2)A = LOAD ‘a.txt‘ USING PigStorage(‘,‘);B = LOAD ‘b.txt‘ USING PigStorage(‘,‘);
JOIN
C = JOIN A BY $0, B BY $1;out:(2,Tie,Hank,2)(2,Tie,Joe,2)(3,Hat,Eve,3)(4,Coat,Hank,4)
Follow the key to get the line added. Inner join, generally with a small table join B large table, play a part of the role of filtering.
There's another one.left join: left outer
UNION
D = UNION A, B;out:(Joe,2)(Hank,4)(Ali,0)(Eve,3)(Hank,2)(2,Tie)(4,Coat)(3,Hat)(1,Scarf)
You can perform a union operation on datasets of different field numbers.
Cogroup
E = COGROUP A BY $0, B BY $1;E = COGROUP A BY $0, B BY $1 outer;out:(0,{},{(Ali,0)})(1,{(1,Scarf)},{})(2,{(2,Tie)},{(Hank,2),(Joe,2)})(3,{(3,Hat)},{(Eve,3)})(4,{(4,Coat)},{(Hank,4)})F = COGROUP A BY $0 inner, B BY $1;out:(1,{(1,Scarf)},{})(2,{(2,Tie)},{(Hank,2),(Joe,2)})(3,{(3,Hat)},{(Eve,3)})(4,{(4,Coat)},{(Hank,4)})
Outputs a set of nested tuple structures. Cogroup generates a tuple for each of the different keys. The first field of each tuple is key. The other fields are the bag of tuples that match the key value in each relationship. The first bag is a matched tuple in a, the second bag is B, and no match is empty {}.
Cogroup The default type of outer connection.
Cross
F = CROSS A, B;out:(1,Scarf,Hank,2)(1,Scarf,Eve,3)(1,Scarf,Ali,0)(1,Scarf,Hank,4)(1,Scarf,Joe,2)(3,Hat,Hank,2)(3,Hat,Eve,3)(3,Hat,Ali,0)(3,Hat,Hank,4)(3,Hat,Joe,2)(4,Coat,Hank,2)(4,Coat,Eve,3)(4,Coat,Ali,0)(4,Coat,Hank,4)(4,Coat,Joe,2)(2,Tie,Hank,2)(2,Tie,Eve,3)(2,Tie,Ali,0)(2,Tie,Hank,4)(2,Tie,Joe,2)
Cross Cartesian product. Each tuple in the first relationship is concatenated with all tuples in the second one. The size of the output of this operation is the product of the size of the input relationship.
For the time being so much, the follow-up may be replenished.
Reference
http://blackproof.iteye.com/blog/1791980
Http://www.360doc.com/content/15/0520/20/13670635_472030452.shtml
http://blog.csdn.net/gg584741/article/details/51712242
https://www.codelast.com/%E5%8E%9F%E5%88%9Bpig%E4%B8%AD%E7%9A%84%E4%B8%80%E4%BA%9B%E5%9F%BA%E7%A1%80%E6%A6%82% e5%bf%b5%e6%80%bb%e7%bb%93/
Http://www.aboutyun.com/thread-6713-1-1.html
[db] Hadoop && Pig