Original address: http://www.linuxidc.com/Linux/2014-03/99055.htm
MapReduce is commonly used for data analysis. When the business logic is complex, however, working directly with MapReduce becomes cumbersome: you often need a lot of preprocessing or transformation to fit the data into the MapReduce processing model, and writing a MapReduce program, then publishing and running the job, is time-consuming.
Pig makes up for this shortcoming. It lets you focus on the data and the business logic rather than on data format conversion and MapReduce programming. When you process data with Pig, Pig generates a series of MapReduce jobs in the background to carry out the task, but this process is transparent to the user.
Installation of Pig
Pig runs as a client program. Even if you intend to use Pig with a Hadoop cluster, you do not need to install anything on the cluster: Pig submits jobs from the local machine and interacts with Hadoop.
1) Download Pig
Go to http://mirror.bit.edu.cn/apache/pig/ to download the appropriate version, such as Pig 0.12.0.
2) Extract the archive to the appropriate directory
tar -xzf pig-0.12.0.tar.gz
3) Set environment variables
export PIG_HOME=/home/hadoop/pig
export PATH=$PATH:$PIG_HOME/bin
If the Java environment variable is not already set, you also need to set JAVA_HOME, for example:
export JAVA_HOME=/usr/local/jdk1.7.0_51
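To confirm that the variables took effect in the current shell, something like the following could be used. It is only a sketch: check_pig_env is a made-up helper name, not part of Pig, and the paths are the example values from this step. It checks only that the variables are set, not that the directories exist.

```shell
# Example values from this step; adjust to your own install locations.
export PIG_HOME=/home/hadoop/pig
export JAVA_HOME=/usr/local/jdk1.7.0_51
export PATH="$PATH:$PIG_HOME/bin"

# check_pig_env is a hypothetical helper, not part of Pig.
# It prints one line per setting and returns non-zero if anything is missing.
check_pig_env() {
    status=0
    for name in PIG_HOME JAVA_HOME; do
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            echo "$name=$value"
        else
            echo "$name is not set"
            status=1
        fi
    done
    # Only look for the bin directory on PATH when PIG_HOME is known.
    if [ -n "${PIG_HOME:-}" ]; then
        case ":$PATH:" in
            *":$PIG_HOME/bin:"*) echo "PATH contains $PIG_HOME/bin" ;;
            *) echo "PATH does not contain $PIG_HOME/bin"; status=1 ;;
        esac
    fi
    return $status
}

check_pig_env
```

To keep the settings across logins, the three export lines are usually appended to a shell startup file such as ~/.bashrc.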
4) Verification
Run the following command to see if Pig is available:
pig -help
Pig execution modes
Pig has two modes of execution, namely:
1) Local mode
In local mode, Pig runs in a single JVM and can access local files. This mode is suitable for processing small amounts of data or for learning.
Run the following command to start Pig in local mode:
pig -x local
2) MapReduce mode
In MapReduce mode, Pig translates queries into MapReduce jobs and submits them to Hadoop (either a real cluster or a pseudo-distributed setup).
Check whether your version of Pig supports the version of Hadoop you are using. Each Pig release supports only specific Hadoop versions; the Pig website lists the supported versions.
Pig uses the HADOOP_HOME environment variable. If the variable is not set, Pig falls back on its own bundled Hadoop libraries, but there is no guarantee that they are compatible with the Hadoop version you actually run, so it is recommended to set HADOOP_HOME explicitly. You also need to set the following variable (be sure to configure it if the Pig and Hadoop versions do not match):
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop
Next, Pig needs to know the NameNode and JobTracker of the Hadoop cluster it uses. In general, once Hadoop is installed and configured correctly, this information is already available and no additional configuration is required.
MapReduce is Pig's default mode. You can also select it explicitly with the following command:
pig -x mapreduce
Running Pig programs
There are three ways to run Pig programs:
1) Script mode
Run a file that contains a Pig script directly. For example, the following command runs all the commands in the local file scripts.pig:
pig scripts.pig
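As a rough illustration of script mode, the sketch below writes a small script to a file and runs it in local mode. The scratch directory, the file names scripts.pig and sample.txt, and the two sample records are placeholders chosen for this example; the pig call is skipped if Pig is not on the PATH.

```shell
# Work in a scratch directory; all file names here are illustrative.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Two tab-separated sample records (year, temperature).
printf '1990\t21\n1991\t18\n' > sample.txt

# The script holds the same statements you would otherwise type interactively.
cat > scripts.pig <<'EOF'
records = load 'sample.txt' as (year:chararray, temperature:int);
dump records;
EOF

# Run the script in local mode so that 'sample.txt' is read from the local disk.
if command -v pig >/dev/null 2>&1; then
    pig -x local scripts.pig
else
    echo "pig not found on PATH; skipping run"
fi
```

Keeping the statements in a file makes a job repeatable, which is the main reason to prefer script mode over typing the same lines interactively.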
2) Grunt mode
Grunt provides an interactive shell in which you can type and run commands at the command line.
Grunt also keeps a command history, which you can browse with the up and down arrow keys.
Grunt supports command auto-completion. For example, when you type a = foreach b g and press the Tab key, the command line automatically becomes a = foreach b generate. You can even customize the auto-completion behavior; see the relevant documentation for details.
3) Embedded mode
You can run Pig programs from Java, much like running SQL statements through JDBC.
Pig Latin Editor
PigPen is an Eclipse plugin that provides common functionality for developing Pig programs in Eclipse, such as editing and running scripts. See: http://wiki.apache.org/pig/PigPen
Some other editors, such as Vim, also support editing Pig scripts.
Simple usage
As an example, we show how to use Pig to find the maximum temperature for each year. Suppose the data file contains the following (one record per line, tab-separated):
1990 21
1990 18
1991 21
1992 30
1992 999
1990 23
Start Pig in local mode, then enter the following commands (note that each statement ends with a semicolon):
records = load '/home/hadoop/input/temperature1.txt' as (year:chararray, temperature:int);
dump records;
describe records;
valid_records = filter records by temperature != 999;
grouped_records = group valid_records by year;
dump grouped_records;
describe grouped_records;
max_temperature = foreach grouped_records generate group, MAX(valid_records.temperature);
-- Note: valid_records is the field name; you can see the structure of grouped_records in the output of the describe command above.
dump max_temperature;
The end result is:
(1990,23)
(1991,21)
(1992,30)
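As a quick sanity check on this result, the same computation can be reproduced outside Pig with standard shell tools. This is only an illustrative cross-check, not how Pig works internally; the temporary data file and the function name max_per_year are assumptions made for the example.

```shell
# Recreate the sample data (tab-separated year and temperature) in a temp file.
datafile="$(mktemp)"
printf '1990\t21\n1990\t18\n1991\t21\n1992\t30\n1992\t999\n1990\t23\n' > "$datafile"

# Drop the 999 readings, keep the largest temperature seen per year,
# then print the result in the same (year,max) form as Pig's dump output.
max_per_year() {
    awk -F '\t' '
        $2 != 999 { t = $2 + 0; if (!($1 in max) || t > max[$1]) max[$1] = t }
        END { for (y in max) printf "(%s,%d)\n", y, max[y] }
    ' "$1" | sort
}

max_per_year "$datafile"
```

The awk pattern plays the role of the filter statement, the array indexed by year corresponds to the group step, and the running comparison stands in for MAX.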
Notes:
1) If you run a Pig command and the error output contains the following:
WARN org.apache.pig.backend.hadoop20.PigJobControl - falling back to default JobControl (not using hadoop 0.20 ?)
java.lang.NoSuchFieldException: runnerState
then your version of Pig may not be compatible with your Hadoop version. In that case you can rebuild Pig for your specific Hadoop version. After downloading the source code, go to the source root directory and run the following command:
ant clean jar-withouthadoop -Dhadoopversion=23
Note: The version number depends on the Hadoop version you use; 23 applies to Hadoop 2.2.0.
2) Pig works in only one mode at a time. For example, in MapReduce mode it can read only HDFS files; using load to read a local file results in an error.