We often use MapReduce for data analysis, but when the business logic is complicated, writing MapReduce directly becomes a very involved task. For example, you may need to perform a lot of preprocessing or conversion on the data to fit the MapReduce processing model, and writing MapReduce programs and publishing and running the jobs is time-consuming.
The emergence of Pig makes up for this deficiency. Pig lets you focus on the data and the business logic rather than on data format conversion and MapReduce programming. In essence, when you process data with Pig, Pig itself generates a series of MapReduce operations in the background to execute the task, but this process is transparent to the user.
Pig Installation
Pig runs as a client-side application. Even if you intend to use Pig against a Hadoop cluster, you do not need to install Pig on the cluster itself; Pig submits jobs and interacts with Hadoop from the client machine.
1) Download Pig
Go to http://mirror.bit.edu.cn/apache/pig/ and download an appropriate version, such as Pig 0.12.0.
2) Decompress the file to an appropriate directory:
tar -xzf pig-0.12.0.tar.gz
3) Set environment variables
export PIG_HOME=/home/hadoop/pig
export PATH=$PATH:$PIG_HOME/bin
If the Java environment variables are not set, you also need to set JAVA_HOME, for example:
export JAVA_HOME=/usr/local/jdk1.7.0_51
4) Verification
Run the following command to check whether Pig is available:
pig -help
Pig Execution Mode
Pig has two execution modes:
1) Local Mode
In local mode, Pig runs in a single JVM and can access local files. This mode is suitable for processing small-scale data or learning.
Run the following command to start Pig in local mode:
pig -x local
2) MapReduce Mode
In MapReduce mode, Pig converts queries into MapReduce jobs and submits them to Hadoop (a cluster or a pseudo-distributed setup).
Check whether your Pig version supports your Hadoop version; each Pig release supports only specific Hadoop versions. You can visit the Pig official website for version support information.
Pig uses the HADOOP_HOME environment variable. If it is not set, Pig falls back to its own bundled Hadoop libraries, but there is no guarantee that those bundled libraries are compatible with the Hadoop version you actually use, so we recommend setting HADOOP_HOME explicitly. You also need to set the following variable:
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop
Next, tell Pig the NameNode and JobTracker of the Hadoop cluster it should use. Generally, once Hadoop is correctly installed and configured, this information is already available and no additional configuration is required.
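If the cluster configuration is not picked up automatically, you can also state it in Pig's conf/pig.properties file. The snippet below is a minimal sketch for a Hadoop 1.x-style cluster; the host names namenode and jobtracker and the ports are placeholders, not values from this article:

```properties
# Hypothetical cluster endpoints -- replace with your own NameNode and JobTracker
fs.default.name=hdfs://namenode:9000
mapred.job.tracker=jobtracker:8021
```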
The default Pig mode is mapreduce. You can also use the following command to set the mode:
pig -x mapreduce
Running Pig Programs
There are three ways to execute a Pig program:
1) Script Mode
Directly run a file containing a Pig script. For example, the following command runs all the commands in the local scripts.pig file:
pig scripts.pig
2) Grunt Mode
Grunt provides an interactive running environment where you can edit and execute commands on the command line.
Grunt also keeps a command history, which you can access with the up and down arrow keys.
Grunt supports automatic command completion as well. For example, if you type a = foreach B g and press the Tab key, the command line automatically becomes a = foreach B generate. You can even customize how commands are auto-completed; see the related documentation for details.
3) Embedded Mode
You can run Pig programs from Java, much as you run SQL statements through JDBC.
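As a rough illustration of embedded mode, the sketch below uses Pig's PigServer API. It assumes the Pig and Hadoop jars are on the classpath, and the input path, class name, and alias names are invented for this example:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class MaxTemperature {
    public static void main(String[] args) throws IOException {
        // ExecType.LOCAL runs in a single JVM; use ExecType.MAPREDUCE for a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Register Pig Latin statements one by one, as in the Grunt shell.
        pig.registerQuery("records = load 'input/temperature1.txt' "
                + "as (year:chararray, temperature:int);");
        pig.registerQuery("valid = filter records by temperature != 999;");
        pig.registerQuery("grouped = group valid by year;");
        pig.registerQuery("max_temp = foreach grouped "
                + "generate group, MAX(valid.temperature);");
        // openIterator triggers execution and streams back the result tuples.
        Iterator<Tuple> it = pig.openIterator("max_temp");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}
```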
Pig Latin Editors
PigPen is an Eclipse plug-in that provides common functions for developing and running Pig programs in Eclipse, such as script editing and running: http://wiki.apache.org/pig/PigPen
Other editors, such as vim, also offer Pig script editing support.
Basic Usage
The following example uses Pig to calculate the maximum temperature for each year. Assume the data file has the following content (one record per row, fields separated by tabs):
1990 21
1990 18
1991 21
1992 30
1992 999
1990 23
Start Pig in local mode and enter the following commands in sequence (note that each statement must end with a semicolon):
records = load '/home/hadoop/input/temperature1.txt' as (year:chararray, temperature:int);
dump records;
describe records;
valid_records = filter records by temperature != 999;
grouped_records = group valid_records by year;
dump grouped_records;
describe grouped_records;
max_temperature = foreach grouped_records generate group, MAX(valid_records.temperature);
-- Note: valid_records here is a field name; you can see the detailed structure of grouped_records in the output of the describe command above.
dump max_temperature;
The final result is:
(1990,23)
(1991,21)
(1992,30)
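The interactive statements above can also be collected into a single file for script mode; the file name max_temp.pig below is just an illustrative choice:

```pig
-- max_temp.pig: maximum recorded temperature per year
records = load '/home/hadoop/input/temperature1.txt' as (year:chararray, temperature:int);
valid_records = filter records by temperature != 999;
grouped_records = group valid_records by year;
max_temperature = foreach grouped_records generate group, MAX(valid_records.temperature);
dump max_temperature;
```

Run it with pig -x local max_temp.pig.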
Note:
1) If an error is reported after you run a Pig command and the error message contains the following information:
WARN org.apache.pig.backend.hadoop20.PigJobControl - falling back to default JobControl (not using hadoop 0.20 ?)
java.lang.NoSuchFieldException: runnerState
your Pig version is probably incompatible with your Hadoop version. In this case, you can rebuild Pig for the specific Hadoop version. After downloading the Pig source code, go to the source root directory and run:
ant clean jar-withouthadoop -Dhadoopversion=23
Note: the version number depends on your Hadoop; here, 23 works for Hadoop 2.2.0.
2) Pig can work in only one mode at a time. For example, in MapReduce mode Pig can read only HDFS files; if you use load to read a local file, an error is returned.
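For example, the same load statement is resolved against different filesystems depending on the mode; the paths below are hypothetical:

```pig
-- Local mode (pig -x local): the path is read from the local filesystem
records = load '/home/hadoop/input/temperature1.txt' as (year:chararray, temperature:int);

-- MapReduce mode (pig -x mapreduce): the path is resolved on HDFS,
-- e.g. under /user/hadoop on the cluster, not on the local disk
records = load '/user/hadoop/input/temperature1.txt' as (year:chararray, temperature:int);
```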