Pig installation and simple use (Pig version 0.13.0, Hadoop version 2.5.0)


Original address: http://www.linuxidc.com/Linux/2014-03/99055.htm

We use MapReduce for data analysis, but when the business logic gets more complex, using MapReduce directly becomes cumbersome: on one hand, you often need to do a lot of preprocessing or transformation of the data to fit the MapReduce processing model; on the other hand, writing, publishing, and running MapReduce jobs is time-consuming.


Pig fills this gap well. It lets you focus on the data and the business logic itself rather than on data format conversion and MapReduce programming. Essentially, when you process data with Pig, Pig generates a series of MapReduce jobs in the background to perform the task, and this process is transparent to the user.

Installation of Pig

Pig runs as a client program, and even if you're ready to use pig on a Hadoop cluster, you don't need to do any installation on the cluster. Pig submits jobs locally and interacts with Hadoop.

1) Download Pig

Go to http://mirror.bit.edu.cn/apache/pig/ to download the appropriate version, such as Pig 0.12.0.

2) Unzip the file to the appropriate directory

tar -xzf pig-0.12.0.tar.gz

3) Setting Environment variables

export PIG_HOME=/home/hadoop/pig

export PATH=$PATH:$PIG_HOME/bin

If the Java environment variable is not yet set, you also need to set JAVA_HOME, for example:

export JAVA_HOME=/usr/local/jdk1.7.0_51

4) Verification

Run the following command to see if Pig is available:

pig -help

Pig Execution mode

Pig has two modes of execution, namely:

1) Local mode

In local mode, Pig runs in a single JVM and accesses the local file system. This mode is suitable for processing small-scale data or for learning.

Run the following command to set local mode:

pig -x local

2) MapReduce mode

In MapReduce mode, Pig translates queries into MapReduce jobs and submits them to Hadoop (which can be either a real cluster or a pseudo-distributed installation).

You should check that the version of Pig you are using supports the version of Hadoop you run. A given Pig release supports only specific Hadoop versions; version-compatibility information is available on the Pig website.

Pig uses the HADOOP_HOME environment variable. If that variable is not set, Pig falls back to its own bundled Hadoop libraries, but there is no guarantee that they are compatible with the Hadoop version you are actually using, so it is recommended to set HADOOP_HOME explicitly. You also need to set the following variable (be sure to configure it if the Pig and Hadoop versions do not match):

export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop

Next, you need to tell Pig the NameNode and JobTracker addresses of the Hadoop cluster it will use. In general, once Hadoop is properly installed and configured, this information is already available and no additional configuration is required.
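If Pig does not pick these addresses up from the Hadoop configuration directory on PIG_CLASSPATH, they can also be supplied in Pig's own conf/pig.properties file. A minimal sketch, with placeholder host names and ports for a pseudo-distributed setup (these are the classic Hadoop 1 property names; on a YARN cluster the Hadoop configuration directory normally supplies the equivalents instead):

```properties
# conf/pig.properties -- point Pig at the cluster explicitly
fs.default.name=hdfs://localhost:9000
mapred.job.tracker=localhost:9001
```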

Pig's default mode is MapReduce; you can also set it explicitly with the following command:

pig -x mapreduce

Run Pig Program

There are three ways to run Pig programs:

1) Script mode

Run a file containing a Pig script directly. For example, the following command runs all the commands in the local file scripts.pig:

pig scripts.pig

2) Grunt mode

Grunt provides an interactive environment where you can edit and execute commands at the command line.

Grunt also keeps a history of commands, accessed with the up and down arrow keys.

Grunt supports automatic completion of commands. For example, when you enter a = foreach b g and press the Tab key, the command line automatically completes it to a = foreach b generate. You can even customize the details of the auto-completion feature; refer to the relevant documentation for details.

3) Embedded mode

You can run Pig programs from Java, much as you run SQL programs through JDBC.
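As a sketch of what embedded mode looks like (assuming the Pig jars are on the classpath, and using a placeholder input path), Pig's PigServer class can register and run the same statements you would type in Grunt:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPigExample {
    public static void main(String[] args) throws IOException {
        // Local execution, analogous to "pig -x local" on the command line.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // The same statements you would type in Grunt.
        pig.registerQuery("records = LOAD '/tmp/temperature1.txt' AS (year:chararray, temperature:int);");
        pig.registerQuery("valid = FILTER records BY temperature != 999;");

        // Iterate over the results, like DUMP.
        Iterator<Tuple> it = pig.openIterator("valid");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```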

Pig Latin Editor

PigPen is an Eclipse plugin that provides common functionality for developing Pig programs in Eclipse, such as script editing and running: http://wiki.apache.org/pig/PigPen

Some other editors also provide the ability to edit pig scripts, such as Vim.

Simple use

As an example, we show how to use Pig to compute the maximum temperature for each year. Suppose the data file contents are as follows (one record per line, tab-separated):

1990 21

1990 18

1991 21

1992 30

1992 999

1990 23


Start Pig in local mode, and then type the following commands (note the semicolon ending each statement):

records = LOAD '/home/hadoop/input/temperature1.txt' AS (year:chararray, temperature:int);

DUMP records;

DESCRIBE records;

valid_records = FILTER records BY temperature != 999;

grouped_records = GROUP valid_records BY year;

DUMP grouped_records;

DESCRIBE grouped_records;

max_temperature = FOREACH grouped_records GENERATE group, MAX(valid_records.temperature);

-- Note: valid_records here is a field name; you can see the specific structure of grouped_records in the DESCRIBE output of the previous statement.

DUMP max_temperature;

The end result is:

(1990,23)

(1991,21)

(1992,30)
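For comparison, here is the same filter/group/max pipeline restated in plain Python. This is not how Pig executes the query, just an illustration of the logic, using the sample records from above:

```python
# Plain-Python equivalent of the Pig pipeline: filter out the 999
# sentinel value, group by year, and keep the maximum temperature per year.
from collections import defaultdict

# The sample records from the article (year, temperature).
records = [
    ("1990", 21), ("1990", 18), ("1991", 21),
    ("1992", 30), ("1992", 999), ("1990", 23),
]

max_temperature = defaultdict(lambda: float("-inf"))
for year, temperature in records:
    if temperature != 999:  # FILTER records BY temperature != 999
        max_temperature[year] = max(max_temperature[year], temperature)

for year in sorted(max_temperature):
    print(f"({year},{max_temperature[year]})")
```

Running this prints the same three tuples as the Pig output above.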

Attention:

1) If you run the Pig command and the error message contains the following information:

WARN org.apache.pig.backend.hadoop20.PigJobControl - falling back to DefaultJobControl (not using Hadoop 0.20?)

java.lang.NoSuchFieldException: runnerState

your version of Pig may not be compatible with your Hadoop version. You can rebuild Pig for your specific version of Hadoop. After downloading the source code, go to the source root directory and execute the following command:

ant clean jar-withouthadoop -Dhadoopversion=23

Note: the version number depends on the specific Hadoop release; 23 works for Hadoop 2.2.0.

2) Pig can only work in one mode at a time. For example, in MapReduce mode it can only read HDFS files; if you use LOAD to read a local file, it will fail with an error.

