1: The difference between Pig and Hive
Pig and Hive are similar: both provide SQL-like languages, and both are translated into MapReduce jobs on Hadoop underneath.
The difference is that implementing a piece of business logic with Pig takes a series of step-by-step statements, while with Hive a single SQL query is often enough.
Pig is recommended if you want to get the result of a fairly complex piece of business logic in a short time; Hive is recommended if you need to run certain tasks on a regular schedule.
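As a rough illustration only (the file a.txt, its (id, name) layout, and an equivalent Hive table named a are all assumptions), counting records per id takes several statements in Pig but one query in Hive:
-- Pig: step-by-step statements
A = LOAD 'a.txt' AS (id:int, name:chararray);
B = GROUP A BY id;
C = FOREACH B GENERATE group, COUNT(A);
DUMP C;
-- Hive: a single SQL query
SELECT id, COUNT(*) FROM a GROUP BY id;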
2: Comparison of Pig and MapReduce
Pig's advantage: common processing logic is already encapsulated, so you can use the corresponding commands directly, whereas with MapReduce you have to implement the code yourself. With Pig you also do not need to worry about optimizing your code yourself. Hand-written MapReduce can run into data skew problems, while Pig can help you avoid them.
3: Pig's application scenarios
Its main use is data cleaning.
4: How to use Pig
1: Execute commands interactively in the Pig (Grunt) command line
2: Execute from the shell
bin/pig -e "A = LOAD 'a.txt' AS (id:int, name:chararray); DUMP A;"
3: Execute with a script
vi my.pig
-- single-line comment
/*
multi-line comment
*/
A = LOAD 'a.txt' AS (id:int, name:chararray);
DUMP A;
Then execute:
bin/pig my.pig
5: Data types in Pig
Basic data types
int, long, float, double, chararray, bytearray, boolean, datetime, biginteger, bigdecimal
Note: chararray represents the string type.
Composite data types
tuple, bag, map
Note the literal forms:
tuple: (1,zs)
bag: {(1,zs),(2,ls)}
map: [key#value]
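A minimal sketch of declaring these composite types in a LOAD schema (the file complex.txt and its layout are assumptions):
A = LOAD 'complex.txt' AS (t:tuple(a:int, b:int), bg:bag{tp:tuple(x:int)}, m:map[]);
DESCRIBE A;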
6: Some commands in Pig
LOAD: load data
A = LOAD 'a.txt' AS (id:int, name:chararray);
Note: the schema after AS describes the structure of the file's data. This statement finds the file a.txt in the current directory and parses its contents according to (id:int, name:chararray). For this to work, the two columns in the file must be separated by tabs.
If the columns in the loaded data are not tab-separated, you need to specify the delimiter when loading. The example below splits on commas:
Example: A = LOAD 'a.txt' USING PigStorage(',') AS (id:int, name:chararray);
There must be a space between A and =. Adding spaces between the parts of a command is recommended in general; it keeps commands readable and less error-prone.
DESCRIBE: view the structure of a relation, similar to describing a table in SQL
Example: DESCRIBE A;
GROUP: grouping, similar to GROUP BY in SQL
Example: B = GROUP A BY id;
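The grouped relation nests the original tuples inside a bag, which matters for the aggregate functions shown later. A quick way to see this (a sketch, reusing the (id, name) schema from the LOAD example):
B = GROUP A BY id;
DESCRIBE B;
-- B: {group: int, A: {(id: int, name: chararray)}}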
FOREACH: iterate over the data in a relation
Example: C = FOREACH A GENERATE id, name;
You can also reference fields by position to get the same data:
C = FOREACH A GENERATE $0, $1;
FILTER: filtering
Example: D = FILTER A BY name == 'zs';
Common comparison operators: ==, !=, >=, <=, >, <
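A short sketch combining a comparison with a boolean operator (reusing the assumed (id, name) schema from above):
D = FILTER A BY id >= 1 AND name != 'zs';
DUMP D;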
JOIN: similar to table joins in SQL
Inner join:
C = JOIN A BY id, B BY id;
Outer joins:
Left outer join: C = JOIN A BY id LEFT OUTER, B BY id;
The left-hand relation is the baseline: every row on the left is returned, whether or not it has a match on the right.
Right outer join: C = JOIN A BY id RIGHT OUTER, B BY id;
The right-hand relation is the baseline: every row on the right is returned.
Full outer join: C = JOIN A BY id FULL OUTER, B BY id;
Rows from both sides are returned.
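A runnable sketch of the inner join (the files a.txt and b.txt and their layouts are assumptions):
A = LOAD 'a.txt' AS (id:int, name:chararray);
B = LOAD 'b.txt' AS (id:int, score:int);
C = JOIN A BY id, B BY id;
-- the result carries both schemas: (A::id, A::name, B::id, B::score)
DUMP C;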
7: More Pig commands
LIMIT: similar to LIMIT in SQL; takes a subset of a data set
Example: B = LIMIT A 10; (takes the first 10 rows)
Note that Pig still reads all of the data; it just returns only the specified number of rows.
ORDER BY: sorting, similar to ORDER BY in SQL
Example: B = ORDER A BY id;
The default is ascending order; if you want descending order, add DESC after the field.
You can specify more than one sort field after ORDER BY:
B = ORDER A BY id DESC, name DESC; (sorts the data in A in descending order by id; where the ids are equal, the name field is used, also in descending order)
SPLIT: split a data set into several relations based on conditions
Example: SPLIT A INTO X1 IF x > 10, X2 IF x == 20, X3 IF (x > 1 AND x < 10);
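Two behaviors of SPLIT are worth noting (standard Pig semantics): a row can land in more than one output (here x == 20 satisfies both the first and second conditions), and rows that match no condition are silently dropped. A sketch with an assumed single-column file nums.txt:
A = LOAD 'nums.txt' AS (x:int);
SPLIT A INTO X1 IF x > 10, X2 IF x == 20, X3 IF (x > 1 AND x < 10);
DUMP X2;
-- rows with x == 20, which also appear in X1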
UNION: similar to UNION ALL in SQL
Example: C = UNION A, B;
Normally the two relations being unioned have the same schema. If they do not, there are two cases:
1: The id field in A is float and the id field in B is double. float can be converted to double, so in the relation C produced by the union, id has type double.
2: The id field in A is float and the id field in B is chararray. These two types cannot be converted to each other, so execution fails with an error.
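A sketch of case 1 (the input files and layouts are assumptions):
A = LOAD 'a.txt' AS (id:float, name:chararray);
B = LOAD 'b.txt' AS (id:double, name:chararray);
C = UNION A, B;
DESCRIBE C;
-- id should be promoted to double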
8: Notes on Pig commands
1: All commands end with a semicolon (;).
2: Pig keywords are case-insensitive; you can write them in uppercase or lowercase. Function names, however, must be written in uppercase, because that is how they are defined in Pig. The names of the relations (temporary variables) used during processing are also case-sensitive.
3: Commands entered in Pig are not executed immediately; only when a DUMP or STORE command runs are the previously defined commands actually executed.
4: If a command fails, you only need to fix that command and re-run it.
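A sketch of this lazy execution (nothing runs until the DUMP):
A = LOAD 'a.txt' AS (id:int, name:chararray);
B = FILTER A BY id > 0;
-- no MapReduce job has started yet; this line triggers the whole pipeline:
DUMP B;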
9: Pig command-line extensions
fs: HDFS commands can be executed inside Pig
sh: shell commands can be executed inside Pig (available since version 0.8)
clear: clear the screen
exec: run a Pig script from the Pig command line, for example: exec my.pig
history: view the commands executed in the Pig command line
quit: exit the Pig command line
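A sketch of a Grunt session using these commands (the paths are assumptions):
grunt> fs -ls /user/hadoop
grunt> sh date
grunt> exec my.pig
grunt> quit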
10: Built-in functions in Pig
AVG: compute an average
Example: B = GROUP A BY id;
C = FOREACH B GENERATE group, AVG(A.score);
(this assumes A also has a score field; after grouping, the original tuples of A are available as a bag named A)
SUM: compute a sum
MAX: find the maximum value
MIN: find the minimum value
These functions are used the same way as AVG.
Note: the function names must be uppercase.
COUNT(): count the total number of rows.
1: Group first, then count, in the same way as the functions above.
2: To do the equivalent of a SELECT COUNT(*) FROM table operation, you still need a grouping, because COUNT needs all of the data; so put all of the rows into a single group:
Use B = GROUP A ALL; (ALL is a keyword)
then C = FOREACH B GENERATE COUNT(A);
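A complete sketch pulling these functions together (a tab-separated scores.txt with (id, score) rows is an assumption):
A = LOAD 'scores.txt' AS (id:int, score:int);
B = GROUP A BY id;
C = FOREACH B GENERATE group, AVG(A.score), SUM(A.score), MAX(A.score), MIN(A.score), COUNT(A);
DUMP C;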
10: Custom functions in Pig
Custom functions are written in Java: create a class that extends the EvalFunc class and implement its exec method, which is where the data is processed.
First add the Maven dependencies:
<dependency>
<groupId>org.apache.pig</groupId>
<artifactId>pig</artifactId>
<version>0.14.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0</version>
</dependency>
package pig;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // return null for empty or missing input
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        try {
            // the first field of the tuple is expected to be a chararray (String)
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Convert uppercase failed", e);
        }
    }
}
After the code is implemented, it needs to be packaged.
The dependencies do not need to be bundled into this package; in Eclipse, right-click and use Export to export it as a jar.
Put the jar on the server where Pig runs; putting it under Pig's root directory is recommended.
When you want to use the function later, you first need to register the jar in Pig:
REGISTER myudfs.jar;
After that, the function defined above can be used as follows.
Example:
-- vi myscript.pig
REGISTER myudfs.jar;
A = LOAD 'a.txt' AS (name:chararray, age:int);
B = FOREACH A GENERATE pig.UPPER(name);
DUMP B;
Run it in local mode:
pig -x local myscript.pig
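Optionally, the standard DEFINE statement can give the fully qualified UDF a shorter alias (a sketch; the alias name is an assumption):
DEFINE toUpper pig.UPPER();
B = FOREACH A GENERATE toUpper(name);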
11: Hands-on example 1 (WLAN internet access log processing)
1: First upload the data to the server.
2: Load the data in Pig:
A = LOAD 'HTTP_20130313143750.dat' AS (reportTime:long, msisdn:chararray, apmac:chararray, acmac:chararray, host:chararray, siteType:chararray, upPackNum:long, downPackNum:long, upPayLoad:long, downPayLoad:long, httpStatus:chararray);
3: Extract the required columns:
B = FOREACH A GENERATE msisdn, upPackNum, downPackNum, upPayLoad, downPayLoad;
4: Group the data:
C = GROUP B BY msisdn;
5: Sum the upstream and downstream packet and payload counters:
D = FOREACH C GENERATE group, SUM(B.upPackNum), SUM(B.downPackNum), SUM(B.upPayLoad), SUM(B.downPayLoad);
6: Store the cleaned result:
STORE D INTO 'wlan_result';
12: Hands-on example 2 (Tomcat access log processing)
Compute PV and UV.
PV (page views): the total number of records in the file.
UV (unique visitors): the number of distinct IPs that appear in the file.
1: Upload the data to the server.
2: Load the data with Pig:
A = LOAD 'access_2015_03_30.log' USING PigStorage(' ') AS (ip:chararray, one:chararray, two:chararray, time:chararray, timezone:chararray, method:chararray, url:chararray, http:chararray, status:long, data:long);
Note: the columns in this log file are separated by spaces, so the delimiter must be specified in LOAD.
3: Extract the required columns:
B = FOREACH A GENERATE ip, url;
4: Compute PV
1) Put all of the rows in B into a single group using the ALL keyword:
C = GROUP B ALL;
2) Count them with the COUNT function:
PV = FOREACH C GENERATE COUNT(B);
5: Compute UV
1) Group the data in B using ip as the grouping field, so that after grouping the values of the grouping field contain no duplicates:
C = GROUP B BY ip;
2) Process the grouped data in C; since we only need the distinct IPs, it is enough to keep the group field:
D = FOREACH C GENERATE group;
3) Use the ALL keyword to put all of the rows in D into a single group:
E = GROUP D ALL;
4) Count the total number of distinct IPs with the COUNT function:
UV = FOREACH E GENERATE COUNT(D);
6: Combine PV and UV into a single relation:
PV_UV = JOIN PV BY '1', UV BY '1';
7: A time field still needs to be added to the PV and UV data:
END = FOREACH PV_UV GENERATE '2013-05-30', $0, $1;
8: Store the cleaned result:
STORE END INTO 'pv_uv';
13: Pig extensions
Set the name of the MapReduce job in Pig:
set job.name 'my-job-name';
It is recommended to put this statement on the first line of every Pig script and give each script a distinct job name, which makes the job easier to find in the cluster's job list.
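set works the same way for other standard Pig settings; a sketch (the values are examples only):
set job.name 'pv-uv-daily';
set default_parallel 10;
-- default_parallel sets the default number of reducers for this script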