Hadoop Learning Note 16: Pig Framework Learning


I. About Pig: Don't Think a Pig Can't Work

1.1 Introduction to Pig

Pig is a Hadoop-based platform for analyzing large data sets. It provides a SQL-like language called Pig Latin and translates SQL-like data analysis requests into a series of optimized MapReduce operations. Pig thus offers a simple operating and programming interface for complex, massive parallel data computations.

Compared with the Java MapReduce API, Pig provides a higher level of abstraction for processing large data sets: it offers richer data structures than MapReduce, generally multi-valued and nested ones, and a more powerful set of data transformations, including the join operation that MapReduce leaves the programmer to hand-code.

Pig consists of two parts:

    • Pig Latin, the language used to describe data flows.
    • The execution environment that runs Pig Latin programs. There are currently two: a local execution environment in a single JVM, and a distributed execution environment on a Hadoop cluster (see the launch commands below).
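
Pig is started in one environment or the other from the command line; a minimal sketch (the mapreduce mode assumes a Hadoop cluster configured as in section 2.2):

pig -x local        # local mode: a single JVM working against the local filesystem
pig -x mapreduce    # mapreduce mode (the default): works against HDFS and the cluster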

Inside Pig, every operation or transformation processes its input and produces output, and these transformations are converted into a series of MapReduce jobs. Pig hides how that conversion is done, so the engineer can concentrate on the data rather than on the details of execution.
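
To get a feel for the abstraction, here is a minimal Pig Latin sketch, the classic word count; '/tmp/input.txt' is a hypothetical text file, and these five lines replace what would otherwise be a complete MapReduce program:

-- word count in Pig Latin ('/tmp/input.txt' is a hypothetical input file)
lines   = LOAD '/tmp/input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;  -- split each line into words
grouped = GROUP words BY word;                                     -- collect identical words
counts  = FOREACH grouped GENERATE group, COUNT(words);            -- count each group
DUMP counts;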

1.2 Characteristics of Pig

(1) It focuses on analyzing large data sets;
(2) It runs on a cluster computing architecture. Yahoo's Pig provides multi-layered abstractions that simplify parallel computing for ordinary users: it automatically translates a user's query into an effective parallel evaluation plan and then executes that plan on the physical cluster;
(3) It provides a SQL-like operation syntax;
(4) It is open source.

1.3 The Difference Between Pig and Hive

For developers, using the Java API directly can be tedious and error-prone, and it also limits the flexibility with which Java programmers can program on Hadoop. Hadoop offers two solutions that make Hadoop programming easier: Pig and Hive.

Pig is a programming language that simplifies common tasks on Hadoop: it can load data, express transformations on the data, and store the final results. Pig's built-in operations make sense of semi-structured data such as log files, and Pig is extensible, supporting custom data types and transformations written in Java.

Hive plays the role of a data warehouse on Hadoop. Hive adds structure to data in HDFS and allows it to be queried with a SQL-like syntax. Like Pig, Hive's core functionality is extensible.

Pig and Hive are often confused. Hive is more suitable for data warehouse tasks; it is used primarily for static structures and for work that requires frequent analysis, and its similarity to SQL makes it an ideal point of intersection between Hadoop and other BI tools. Pig gives developers more flexibility over large data sets and allows concise scripts to be developed for transforming data streams and embedding them into larger applications. Pig is relatively lightweight compared to Hive; its main advantage is that it drastically reduces the amount of code needed compared with using the Hadoop Java API directly. Because of this, Pig continues to attract many software developers.

II. Installing and Configuring Pig

2.1 Preparation

Download the Pig tarball. The version used here is pig-0.11.1, which has been uploaded to a Baidu network disk (URL: HTTP://PAN.BAIDU.COM/S/1O6IDFHK).

(1) Upload it to the virtual machine with an FTP tool such as Xftp or CuteFTP.

(2) Decompression

tar -zxvf pig-0.11.1.tar.gz

(3) renaming

mv pig-0.11.1 pig

(4) Modify /etc/profile by adding the lines below, then reload the configuration file with source /etc/profile:

export PIG_HOME=/usr/local/pig
export PATH=.:$HADOOP_HOME/bin:$PIG_HOME/bin:$HBASE_HOME/bin:$ZOOKEEPER_HOME/bin:$JAVA_HOME/bin:$PATH
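
A quick way to verify the installation once the profile is reloaded (a sketch; the banner text varies by build):

source /etc/profile
pig -version    # should report Apache Pig version 0.11.1 if the tarball above was used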

2.2 Configuring Pig to Work with Hadoop

Enter $PIG_HOME/conf, edit the pig.properties file, and add the following two lines:

fs.default.name=hdfs://hadoop-master:9000

mapred.job.tracker=hadoop-master:9001
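
With these properties in place, Grunt should start up connected to the cluster; a quick sanity check (a sketch, assuming HDFS is running at hadoop-master:9000):

pig                # starts the Grunt shell in the default mapreduce mode
grunt> fs -ls /    # lists the HDFS root through Pig's built-in fs command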

III. A Pig Usage Example

3.1 Background of the Data File

This example reuses the mobile internet logs from the fifth note in this series, on processing mobile internet logs with a custom type; the task is to compute traffic statistics over those logs with Pig Latin. Each log line consists of tab-separated fields; as the LOAD statement in section 3.2 shows, the second field is the phone number (msisdn) and four long fields later in the line hold the traffic values to be summed. (The file is available at: Http://pan.baidu.com/s/1dDzqHWX)

  PS: upload the file to HDFS before using Pig; here it is placed in the /testdir/input directory:

hadoop fs -put HTTP_20130313143750.dat /testdir/input
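
To confirm that the file landed where Pig expects it (a quick check):

hadoop fs -ls /testdir/input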

3.2 LOAD: Converting Data in HDFS into a Schema Pig Can Handle

(1) First type pig to enter the Grunt shell, then use the LOAD command to map the raw file onto a schema that Pig can handle:

grunt> A = LOAD '/testdir/input/HTTP_20130313143750.dat' AS (t0:long,
msisdn:chararray, t2:chararray, t3:chararray, t4:chararray, t5:chararray, t6:long, t7:long, t8:long, t9:long, t10:chararray);

(2) Pig only interprets this statement for now; it is converted into a MapReduce job once results are actually requested.

(3) You can view the results by using the following command:

grunt> DUMP A;
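
Besides DUMP, which actually runs the job, Grunt can inspect a relation cheaply; a short sketch:

grunt> DESCRIBE A;    -- prints the schema declared in the LOAD statement
grunt> ILLUSTRATE A;  -- runs a small data sample through the plan as a preview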

3.3 FOREACH: Extracting the Useful Fields from A

(1) Here we only need the mobile phone number and the four traffic fields, so we iterate over A and extract that subset of fields into B:

grunt> B = FOREACH A GENERATE msisdn, t6, t7, t8, t9;

(2) You can view the results by using the following command:

grunt> DUMP B;

3.4 GROUP: Grouping the Data

(1) With the useful fields extracted, the results show that one phone number may have several records, so here we group by phone number:

grunt> C = GROUP B BY msisdn;

(2) You can view the results by using the following command:

grunt> DUMP C;
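
GROUP produces one tuple per phone number, holding the group key and a bag of the matching B records; this is why the next step refers to group and B.t6. Checking the schema makes it concrete (output sketched from the types declared above):

grunt> DESCRIBE C;
-- C: {group: chararray, B: {(msisdn: chararray, t6: long, t7: long, t8: long, t9: long)}}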

3.5 GENERATE: Traffic Summary

(1) After grouping by phone number, each phone number corresponds to several traffic records, so we use FOREACH again to traverse the grouped data and total the four traffic fields; the aggregate function SUM() is used here:

grunt> D = FOREACH C GENERATE group, SUM(B.t6), SUM(B.t7), SUM(B.t8), SUM(B.t9);

(2) You can view the results by using the following command:

grunt> DUMP D;

3.6 STORE: Storing the Results in HDFS for Persistence

(1) The traffic statistics are now complete, but the result still lives only inside Pig; to persist it, store it in HDFS:

grunt> STORE D INTO '/testdir/output/wlan_result';

(2) View the stored results via the HDFS shell:

hadoop fs -text /testdir/output/wlan_result/part-r-*
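
By default STORE writes tab-separated text through PigStorage; a different delimiter can be passed explicitly (a sketch with a hypothetical output path):

grunt> STORE D INTO '/testdir/output/wlan_result_csv' USING PigStorage(',');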


Author: Zhou Xurong

Source: http://www.cnblogs.com/edisonchou/

The copyright of this article belongs to the author and cnblogs. Reprints are welcome; however, without the author's consent this paragraph must be retained, and an obvious link to the original must be given on the article page.

