Apache Pig Getting Started Learning Document (1)

Source: Internet
Author: User

1. Installation of Pig
(i) Software requirements
(ii) Downloading Pig
(iii) Compiling Pig
2. Running Pig
(i) Execution modes
(ii) Interactive mode
(iii) Script mode
3. Pig Latin statements
(i) Loading data
(ii) Working with data
(iii) Storing intermediate results
(iv) Storing final results
(v) Debugging Pig Latin
4. Pig property value management
5. Some Pig precautions

1. Installation of Pig
(i) Software requirements
Required:
(1) Hadoop

Http://hadoop.apache.org/common/releases.html
You can run different versions of Pig side by side as long as the corresponding HADOOP_HOME is set for each; if you do not set HADOOP_HOME, Pig will by default run against its embedded version of Hadoop (1.0.0).
(2) Java 1.6+

http://java.sun.com/javase/downloads/index.jsp
You need to install the JDK and set JAVA_HOME.
Optional:
Python 2.5 (required if you write UDFs in Python)
JavaScript 1.7 (required if you write UDFs in JavaScript)
JRuby 1.6.7 (required if you write UDFs in JRuby)
Groovy 1.8.6 (required if you write UDFs in Groovy)
Ant 1.7 (required if you want to build Pig from source; recommended)
JUnit 4.5 (required if you want to run the unit tests)
(ii) Downloading Pig
Note the following points:
1. Download the most recent stable release of Apache Pig.
2. Unzip the downloaded Pig archive, and note these two points:
Pig's main script, pig, is in the bin directory (/pig-n.n.n/bin/pig); it defines Pig's environment variables.
Pig's properties file, pig.properties, is in the conf directory (/pig-n.n.n/conf/pig.properties). You can also point to this file with an absolute path via the PIG_CONF_DIR environment variable.

3. Configure Pig's environment variables, as in the following code:
$ export PATH=/<my-path-to-pig>/pig-n.n.n/bin:$PATH
4. Test whether the Pig installation succeeded with the pig -help command.

(iii) Compiling Pig
1. Check out the Pig source code from SVN:
svn co http://svn.apache.org/repos/asf/pig/trunk
2. Go to Pig's root directory and run the ant command to build Pig.
3. Verify that pig.jar was created, and run the unit tests with ant test.
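
Putting those steps together, a build session might look like this (a sketch following the steps above; the default ant target builds pig.jar in the source root, and the test target can take a while to run):
$ svn co http://svn.apache.org/repos/asf/pig/trunk
$ cd trunk
$ ant
$ ant test
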
2. Running Pig

Pig can be run in a number of different modes:

No.  Mode              Local mode   Hadoop cluster mode
1    Interactive mode  Supported    Supported
2    Batch mode        Supported    Supported

(i) Execution modes:
Pig has two execution modes, or run types:
Local mode: Running in local mode is very simple; you only need one machine, and all files and scripts live on the local disk. Specify this mode with the -x flag (for example: pig -x local). Local mode does not support MapReduce (thread-level) parallelism, because in the current version of Hadoop, Hadoop's LocalJobRunner is not a thread-safe class.
Hadoop cluster mode: Hadoop cluster mode is also known as MapReduce mode. If a Hadoop cluster is already installed on your machines and running normally, cluster mode is Pig's default mode: without any declaration or specification, Pig jobs always run in cluster mode. Of course, you can also specify the mode explicitly with pig or pig -x mapreduce.
Example:
Starting with the pig command:
(1) pig -x local (local mode)
(2) pig -x mapreduce (cluster mode)
Starting with the java command:
(1) java -cp pig.jar org.apache.pig.Main -x local (local mode)
(2) java -cp pig.jar org.apache.pig.Main -x mapreduce (cluster mode)

(ii) Interactive mode:
We can use Pig interactively through the Grunt shell. To invoke the Grunt shell, just execute the pig command, and then we can operate Pig from the command line, as in the example below:
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;

(iii) Script mode
We can package a series of Pig processing steps into a script file whose name ends with the .pig suffix. Anyone who has written shell scripts on Linux will find this very familiar: just as we put our Linux commands into a .sh script, it is very convenient to execute and easy to manage.

If we now have a test.pig script (a minimal example is sketched after these commands), how do we run it?
(1) Run it in local mode: pig -x local test.pig
(2) Run it in cluster mode: pig -x mapreduce test.pig
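
For reference, a minimal test.pig might contain little more than the Grunt example from the previous section (the passwd file and the first-column field are just illustrative assumptions):
-- test.pig: load the passwd file and keep only the first (user name) column
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
store B into 'id.out';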


Benefits of using a Pig script file:
We can encapsulate Pig Latin statements and Pig commands in a single script file whose name ends with .pig, which is very helpful for telling these scripts apart.

We can also run a Pig script from the command line and from the Grunt shell, using the run or exec commands; I will not give an example here, as it will be covered in a later article.
Pig scripts also support external parameters, similar to shell script parameters, which is very flexible; this too will be written up in a later article.

Comments in Pig:
(1) Multi-line comment: /* pig script statements */
(2) Single-line comment: -- pig script statements
Note:
Pig supports running scripts or jar packages directly from HDFS, Amazon S3, or another distributed file system; if the script lives on a distributed system, we need to give its network URL at run time, for example:

$ pig hdfs://nn.mydomain.com:9020/myscripts/script.pig


3. Pig Latin statements:

In Pig, Pig Latin is the basic language for processing data, similar to the way we use SQL statements in a database system.

We use Pig Latin statements to take an input and, after a series of processing steps, produce an output; so in any Pig script, only the load (read data) and store (write data) statements are strictly required.

In addition, Pig syntax blocks may include expressions and schemas. A Pig Latin statement can span multiple lines, and every statement must end with a semicolon ( ; ).


Pig Latin statements are usually organized as follows:
(i) A LOAD statement to load data from the file system
(ii) A series of transformation statements to process the data
(iii) A DUMP statement to display the results, or a STORE statement to save the results

Only DUMP and STORE statements produce output.
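
For example, a minimal script following that shape might look like the following (the file student.txt and its three columns are assumptions used only for illustration):
-- load: read tab-separated records with an explicit schema
A = load 'student.txt' using PigStorage('\t') as (name:chararray, age:int, gpa:float);
-- transform: keep only students older than 18
B = filter A by age > 18;
-- output: show the result on screen (or persist it with: store B into 'adults';)
dump B;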




(i) Loading data:
Read data into Pig using the LOAD statement and a load/store function (the default load/store function is PigStorage).
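Because PigStorage is the default, the using clause can even be omitted for tab-delimited input; the two forms below are equivalent (the file name and schema are assumptions):
raw1 = load 'data.tsv' as (f1:chararray, f2:int);
raw2 = load 'data.tsv' using PigStorage('\t') as (f1:chararray, f2:int);
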
(ii) Working with data
Pig allows you to work with data in many different ways. If we are just getting started, becoming familiar with the following operators will help us use and understand Pig (a combined example follows this list):
Use the FILTER statement to filter tuples or rows of data (similar to WHERE in SQL).
Use the FOREACH statement to work with columns of data (similar to SELECT field1, field2, ... in SQL, which returns only the listed columns).
Use the GROUP statement to group data (like GROUP BY in SQL).
Use COGROUP and inner/outer JOIN to group or join two or more relations (similar to JOIN in SQL).
Use the UNION statement to merge the results of two or more relations, and use the SPLIT statement to split one relation into several smaller ones (note that I say "tables" only for ease of understanding; Pig has no concept of a table, although a relation is a similar structure).
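Putting a few of these operators together on hypothetical relations (the file names and fields below are assumptions used only for illustration):
users  = load 'users.txt'  using PigStorage(',') as (id:int, name:chararray, age:int);
visits = load 'visits.txt' using PigStorage(',') as (user_id:int, url:chararray);
adults = filter users by age >= 18;                 -- like WHERE in SQL
names  = foreach adults generate id, name;          -- like SELECT id, name
by_age = group users by age;                        -- like GROUP BY
joined = join users by id, visits by user_id;       -- like JOIN
split users into minors if age < 18, grown if age >= 18;
merged = union minors, grown;                       -- merge two relations back together
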
(iii) Storing intermediate results
The intermediate results Pig generates are stored in a temporary location on HDFS, and that location must already exist on HDFS. The location can be configured with the pig.temp.dir property; by default it is the /tmp directory. Before version 0.7 this value was fixed; from 0.7 onward we can change the path flexibly through configuration.
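
For example, the temporary directory can be overridden when launching a script (the HDFS path below is an assumption and must already exist):
$ pig -Dpig.temp.dir=/user/hadoop/pig_tmp test.pig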

(iv) Storing final results
Using the STORE statement and a load/store function, the result set can be written to the file system; the default storage format is PigStorage. In the testing phase, we can use the DUMP statement to display results directly on screen, which is convenient for debugging; in a production environment we typically use the STORE statement to persist our result set.
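
A minimal STORE looks like this (the output path and delimiter are assumptions; the output directory must not exist yet):
store B into '/output/adults' using PigStorage(',');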

(v) Debugging Pig Latin
Pig provides several operators to help us debug our results (a short Grunt session follows this list):
Use the DUMP statement to display results on the terminal screen.
Use the DESCRIBE statement to show a relation's schema (similar to viewing a table's structure).
Use the EXPLAIN statement to display the logical or physical execution plan, which helps us see the map/reduce execution plan.
Use the ILLUSTRATE statement to view the statement's execution step by step.
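
In the Grunt shell, continuing the earlier A/B example, this looks like the following (output omitted):
grunt> describe B;
grunt> explain B;
grunt> illustrate B;
grunt> dump B;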

In addition, Pig defines some very useful shortcut aliases to help us debug scripts quickly:
Alias for dump: \d
Alias for describe: \de
Alias for explain: \e
Alias for illustrate: \i
Quit: \q
4. Pig property value management
Pig supports Java properties files; we can customize Pig's behavior through such a file, and we can use Pig's help command to see all of Pig's property values.



How do we specify Pig property values?

In the pig.properties file; note that the directory containing this file must be on the Java classpath.
With the -D option on the command line, specifying a single Pig property, for example: pig -Dpig.tmpfilecompression=true
With the -P option, specifying your own properties file, for example: pig -P mypig.properties
With the set command, for example: set pig.exec.nocombiner true

Note: the properties file uses the standard Java properties file format.
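
As an illustration, a small pig.properties might contain entries like these (the specific values are assumptions, not recommendations):
# pig.properties -- standard Java properties format
pig.tmpfilecompression=true
pig.exec.nocombiner=false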



Their precedence is as follows (entries further to the right take precedence):
pig.properties < -D Pig property < -P properties file < set command

Hadoop configuration properties are specified in the same way as Pig properties.

All Hadoop and Pig property values are eventually collected together inside Pig, and they are available to any UDF;
for example, the UDFContext object exposes them, and to access these properties we can call its getJobConf method.


5. Some Pig precautions
1. Make sure your JDK is properly installed.
2. Make sure the bin directory containing Pig's execution script is on your PATH:
export PATH=/<my-path-to-pig>/pig-0.9.0/bin:$PATH
3. Make sure your PIG_HOME environment variable is valid:
export PIG_HOME=/<my-path-to-pig>/pig-0.9.0
4. Configure the ant script if you want to build Pig's documentation.
5. Configure PIG_CLASSPATH to point to all the configuration files required by the cluster, including Hadoop's core-site.xml, hdfs-site.xml and mapred-site.xml.
6. Get to know some basic Pig UDFs:
ExtractHour, extracts the hour from each row of data
NGramGenerator, generates n-gram words
NonURLDetector, removes rows where the column is empty or the value is a URL
ScoreGenerator, calculates the score of an n-gram
ToLower, converts text to lowercase
TutorialUtil, splits a query string into a set of words


The UDFs above are fairly typical examples. I suggest beginners take a look at them first; it does not matter if you do not understand them yet, since the odds of needing a UDF early on are not particularly high, and the most important thing is to learn the basic syntax. As for installing and configuring the environment: if you are using stock Apache Hadoop, the steps above work well, because this document is translated from the official Apache documentation; if your English is good, you can read the original directly at http://pig.apache.org/docs/r0.12.0/start.html. If you are on another Hadoop distribution, such as CDH or HDP, you may have installed Pig through CM (Cloudera Manager) or AM (Ambari), which saves you the installation process and lets you use Pig to process data right away. Still, I suggest beginners work through the setup themselves first; once you are proficient, you can use a management tool to install automatically, and this way you learn in more depth. If you have questions after reading, corrections are welcome, as are messages to the public account.


If you have any questions, you are welcome to scan and follow the public account: I am the Siege Engineer (WOSHIGCS).
This public account shares content about big data technology, the Internet, and related topics; it is also a warm little home for technical exchange. If you have any problems, you can leave a message at any time, and everyone is welcome to visit!
