Original article by Inkfish. Not for commercial reproduction; if you repost it, please credit the source (http://blog.csdn.net/inkfish).
Pig is a project that Yahoo! donated to Apache. It is currently in the Apache Incubator phase, and the current version is v0.5.0. Pig is a large-scale data analysis platform built on Hadoop. It provides a SQL-like language called Pig Latin, whose compiler translates SQL-like data analysis requests into a series of optimized MapReduce operations. Pig offers a simple operational and programming interface for complex, massively parallel computation over large data sets. This article describes how to install Pig and run a simple example; it is mainly referenced and translated from the official Pig Setup documentation.
Prerequisites:
Linux/Unix system, or Windows with Cygwin (I am using Ubuntu 8.04);
Hadoop 0.20.X;
JDK 1.6 or higher;
Ant 1.7 (optional; required only if you want to compile Pig yourself);
JUnit 4.5 (optional; required only if you want to run the unit tests).
Installing Pig
1. Download Pig
You can download the latest Pig release from Pig's official home page. At the time of writing, the latest version is Pig 0.5.0.
2. Extract the archive
$ tar -xvf pig-0.5.0.tar.gz
I usually put Pig in the /opt/hadoop/pig-0.5.0 directory.
3. Set environment variables
To make later Pig upgrades easier, I created a symbolic link: the environment variable points to the symlink directory, and the symlink points to the latest Pig version.
$ ln -s /opt/hadoop/pig-0.5.0 /opt/hadoop/pig
Edit /etc/environment and add the path of Pig's bin subdirectory to PATH (or modify ~/.bashrc or ~/.profile instead).
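As a minimal sketch of the PATH change, assuming the /opt/hadoop/pig symlink created above and a Bourne-style shell, lines like the following could go into ~/.bashrc or ~/.profile. The variable name PIG_INSTALL is only a convenience chosen for this sketch. The export syntax applies to shell startup files; /etc/environment is not a shell script and takes plain KEY=value lines instead.

```shell
# Assumption: Pig is reachable through the /opt/hadoop/pig symlink.
# PIG_INSTALL is a hypothetical convenience name used only in this sketch.
PIG_INSTALL=/opt/hadoop/pig

# Append Pig's bin subdirectory to the search path.
export PATH="$PATH:$PIG_INSTALL/bin"
```

Because PATH refers to the symlink rather than a versioned directory, upgrading Pig later only requires pointing the symlink at the new version.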
4. Verify the installation
Open a new terminal and type the env command; you should see that the PATH change has taken effect. Then type the pig -help command. If a help message appears, Pig has been installed correctly.
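The PATH check can also be scripted. The following is a rough sketch, assuming a POSIX shell; check_pig is a hypothetical helper name, not part of Pig. It only reports whether a command named pig can be found and does not start Pig.

```shell
# Report whether a command named "pig" can be found on the current PATH.
# check_pig is a hypothetical helper name used only in this sketch.
check_pig() {
    if command -v pig >/dev/null 2>&1; then
        echo "pig found at: $(command -v pig)"
    else
        echo "pig not found on PATH"
    fi
}

check_pig
```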
Pig Execution Modes
1. Local mode
In local mode, Pig runs on a single computer only.
2. MapReduce mode
In MapReduce mode, Pig requires access to a Hadoop cluster and an HDFS installation.
Pig Invocation Methods
1. Grunt shell: enter commands interactively to run tasks;
2. Pig script: run tasks from a script file;
3. Embedded mode: embed Pig in Java source code and run tasks through Java.
Pig Sample Code
The three invocation methods are demonstrated below. First, here is the source code the examples use. It is the same as in the official documentation, with the following modifications:
1. Fixed an error in the official documentation: on the last line of id.pig, the full-width single quotation marks around id.out are replaced with half-width single quotes;
2. Fixed an error in the official documentation: in Idmapreduce.java, the first line of the runIdQuery method had a semicolon at the end;
3. Following common Java naming conventions, the class names are capitalized.
Script file: id.pig
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
store B into 'id.out';
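To make clear what the script computes: it splits each line of passwd on ':' and keeps only the first field. The same projection can be sketched outside Pig with cut. This is only an illustration, and the two sample lines are made-up data in passwd format, not output from a real system.

```shell
# Hypothetical sample input in passwd format, made up for this sketch.
sample='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin'

# Keep only the first colon-separated field of each line,
# the same projection as "generate $0 as id" in the script.
ids=$(printf '%s\n' "$sample" | cut -d: -f1)
echo "$ids"
```

For this sample, the result is the two user names, root and daemon, one per line.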
Java file for local mode: Idlocal.java
import java.io.IOException;
import org.apache.pig.PigServer;

public class Idlocal {
    public static void main(String[] args) {
        try {
            // Run Pig in local mode on this machine.
            PigServer pigServer = new PigServer("local");
            runIdQuery(pigServer, "passwd");
        } catch (Exception e) {
            // Print the error instead of silently swallowing it.
            e.printStackTrace();
        }
    }

    public static void runIdQuery(PigServer pigServer, String inputFile) throws IOException {
        // Load the colon-delimited input file.
        pigServer.registerQuery("A = load '" + inputFile + "' using PigStorage(':');");
        // Keep only the first field and name it id.
        pigServer.registerQuery("B = foreach A generate $0 as id;");
        // Write relation B to id.out.
        pigServer.store("B", "id.out");
    }
}
Java file for MapReduce mode: Idmapreduce.java
import java.io.IOException;
import org.apache.pig.PigServer;

public class Idmapreduce {
    public static void main(String[] args) {
        try {
            // Run Pig in MapReduce mode against the Hadoop cluster.
            PigServer pigServer = new PigServer("mapreduce");
            runIdQuery(pigServer, "passwd");
        } catch (Exception e) {
            // Print the error instead of silently swallowing it.
            e.printStackTrace();
        }
    }

    public static void runIdQuery(PigServer pigServer, String inputFile) throws IOException {
        // Load the colon-delimited input file.
        pigServer.registerQuery("A = load '" + inputFile + "' using PigStorage(':');");
        // Keep only the first field and name it id.
        pigServer.registerQuery("B = foreach A generate $0 as id;");
        // Write relation B to idout.
        pigServer.store("B", "idout");
    }
}
Both Java classes need to be compiled. The compile commands are:
$ javac -cp .:/opt/hadoop/pig/pig-0.5.0-core.jar Idlocal.java
$ javac -cp .:/opt/hadoop/pig/pig-0.5.0-core.jar Idmapreduce.java
If pig-0.5.0-core.jar is not in the current directory, you must give its full path, as above.
1. Grunt shell mode
To use the Grunt shell, first start it with the pig command. You can add the parameter "-x local" to select local mode or "-x mapreduce" to select MapReduce mode; MapReduce mode is the default.
$ pig -x local
$ pig
$ pig -x mapreduce
Enter the commands line by line:
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
grunt> store B into 'out';
Here, "dump B" displays the result on screen, and "store B into 'out'" writes the result to the out file/folder. In local mode, the out file is written to the current directory; in MapReduce mode, the out folder requires an absolute path.
2. Pig script mode
In script mode, run the pig command followed by the .pig file to execute, for example:
$ pig -x local id.pig
$ pig id.pig
$ pig -x mapreduce id.pig
3. Embedded mode
Running in embedded mode is no different from running an ordinary Java class, for example:
$ java -cp .:/opt/hadoop/pig/pig-0.5.0-core.jar Idmapreduce
$ java -cp .:/opt/hadoop/pig/pig-0.5.0-core.jar Idlocal