Pig Installation and Simple Use (Pig 0.12.0, Hadoop 2.2.0)


We often use MapReduce for data analysis, but when the business logic is complicated, writing raw MapReduce becomes a tedious task. For example, you may need extensive preprocessing or conversion to fit the data into the MapReduce processing model. In addition, writing MapReduce programs and publishing and running the jobs is time-consuming.


Pig makes up for this deficiency. It lets you focus on the data and the business logic rather than on data format conversion and MapReduce programming. Under the hood, Pig generates a series of MapReduce operations to execute your tasks, but this process is transparent to the user.
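To make the idea concrete, here is a minimal Python sketch (purely illustrative, not Pig's actual generated code) of the map, shuffle, and reduce phases that a simple group-and-count query compiles into:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (key, value) pair for each input record."""
    for line in lines:
        word = line.strip()
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key (done by the framework itself)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values of each group."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["pig", "hadoop", "pig"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'pig': 2, 'hadoop': 1}
```

When you write the equivalent query in Pig Latin, Pig plans and submits these phases for you as Hadoop jobs.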

Pig Installation

Pig runs as a client-side application. Even if you intend to use Pig on a Hadoop cluster, you do not need to install anything on the cluster itself: Pig submits jobs from the local machine and interacts with Hadoop from there.

1) Download Pig

Go to http://mirror.bit.edu.cn/apache/pig/ (or any Apache mirror) and download a suitable version, such as Pig 0.12.0.

2) Decompress the file to an appropriate directory:

tar -xzf pig-0.12.0.tar.gz

3) Set environment variables

export PIG_HOME=/home/hadoop/pig

export PATH=$PATH:$PIG_HOME/bin

If the Java environment variables are not set, you also need to set JAVA_HOME, for example:

export JAVA_HOME=/usr/local/jdk1.7.0_51

4) Verification

Run the following command to check whether Pig is available:

pig -help

Pig Execution Mode

Pig has two execution modes:

1) Local Mode

In local mode, Pig runs in a single JVM and can access local files. This mode is suitable for processing small-scale data or learning.

Run the following command to start Pig in local mode:

pig -x local

2) MapReduce Mode

In MapReduce mode, Pig converts queries into MapReduce jobs and submits them to Hadoop (a real cluster or a pseudo-distributed setup).

Check whether your Pig version supports the Hadoop version you are running: each Pig release supports only specific Hadoop versions. You can visit the Pig official website for compatibility information.

Pig uses the HADOOP_HOME environment variable to locate Hadoop. If this variable is not set, Pig falls back to its own bundled Hadoop libraries, which may not match the Hadoop version you actually run, so we recommend setting HADOOP_HOME explicitly. You also need to set the following variable:

export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop

Next, Pig needs to know the Namenode and Jobtracker of the Hadoop cluster it will use. Normally, if Hadoop is correctly installed and configured, this information is already available in the configuration and no additional setup is required.

mapreduce is Pig's default mode, but you can also set it explicitly with the following command:

pig -x mapreduce

Running Pig Programs

There are three ways to run a Pig program:

1) Script Mode

Run a file containing a Pig script directly. For example, the following command runs all the commands in the local file scripts.pig:

pig scripts.pig

2) Grunt Mode

Grunt provides an interactive running environment where you can edit and execute commands on the command line.

Grunt also keeps a command history, accessible through the up and down arrow keys.

Grunt supports automatic command completion as well. For example, if you type a = foreach B g and press the Tab key, the line automatically becomes a = foreach B generate. You can even customize the completion behavior; see the documentation for details.

3) Embedded Mode

You can run Pig programs from Java, similar to the way JDBC is used to run SQL programs.

Pig Latin Editor

PigPen is an Eclipse plug-in that provides common functions for developing and running Pig programs in Eclipse, such as script editing and running: http://wiki.apache.org/pig/PigPen

Other editors, such as vim, also provide Pig script editing support.

Simple Use

The following example uses Pig to calculate the maximum temperature per year. Assume the data file contains the following content (one record per line, fields separated by tabs):

1990 21

1990 18

1991 21

1992 30

1992 999

1990 23

Start Pig in local mode and enter the following statements in sequence (note that each statement must end with a semicolon):

records = load '/home/hadoop/input/temperature1.txt' as (year:chararray, temperature:int);

dump records;

describe records;

valid_records = filter records by temperature != 999;

grouped_records = group valid_records by year;

dump grouped_records;

describe grouped_records;

max_temperature = foreach grouped_records generate group, MAX(valid_records.temperature);

-- Note: valid_records here is a field name inside grouped_records; you can see its exact structure in the output of the describe command above.

dump max_temperature;

The final result is:

(1990,23)

(1991,21)

(1992,30)
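For reference, the load → filter → group → MAX pipeline above can be mimicked in plain Python (a sketch of the semantics only; Pig actually executes it as MapReduce jobs):

```python
from collections import defaultdict

# Sample records as (year, temperature), matching the tab-separated file above.
records = [("1990", 21), ("1990", 18), ("1991", 21),
           ("1992", 30), ("1992", 999), ("1990", 23)]

# filter records by temperature != 999;
valid_records = [(y, t) for y, t in records if t != 999]

# group valid_records by year;
grouped_records = defaultdict(list)
for year, temp in valid_records:
    grouped_records[year].append(temp)

# foreach grouped_records generate group, MAX(valid_records.temperature);
max_temperature = {year: max(temps) for year, temps in grouped_records.items()}

for year in sorted(max_temperature):
    print((year, max_temperature[year]))
```

Running this prints ('1990', 23), ('1991', 21), and ('1992', 30), matching the Pig output above.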

Note:

1) If an error is reported when you run a Pig command and the message contains the following:

WARN org.apache.pig.backend.hadoop20.PigJobControl - falling back to default JobControl (not using hadoop 0.20?)

java.lang.NoSuchFieldException: runnerState

then your Pig version is probably incompatible with your Hadoop version. In this case, you can recompile Pig for the Hadoop version you use. Download the source code, go to the source root directory, and run:

ant clean jar-withouthadoop -Dhadoopversion=23

Note: the version number depends on your Hadoop release; 23 works for Hadoop 2.2.0.

2) Pig works in only one mode at a time. For example, in MapReduce mode Pig can only read files from HDFS; if you use load on a local file, an error is returned.

