Apache Pig is a high-level query language for large-scale data processing. Working with Hadoop, it has a multiplier effect: a data-processing program written in Pig can be many times shorter than the equivalent program written in a language such as Java or C++. Apache Pig provides a higher level of abstraction over the MapReduce framework through a SQL-like data-processing scripting language called Pig Latin. With Pig we can sort, filter, and sum loaded data, group it (GROUP BY), and join it (JOIN); users can also define their own functions to operate on data sets, the well-known UDFs (user-defined functions).
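As a minimal illustration of these operations, here is a hedged Pig Latin sketch; the file names and field names are hypothetical:

    -- Load two hypothetical tab-delimited files
    users = LOAD 'users.txt' AS (name:chararray, age:int);
    views = LOAD 'views.txt' AS (user:chararray, url:chararray);

    -- Filter, join, group, aggregate, and sort
    adults  = FILTER users BY age >= 18;
    joined  = JOIN adults BY name, views BY user;
    grouped = GROUP joined BY adults::name;
    counts  = FOREACH grouped GENERATE group AS name, COUNT(joined) AS n;
    sorted  = ORDER counts BY n DESC;
    DUMP sorted;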
Pig has two execution modes: Local mode and MapReduce mode. In Local mode, Pig runs against a single local host; in MapReduce mode, it connects to a Hadoop cluster installation and HDFS, and Pig automatically allocates and reclaims cluster resources. Because the Pig system automatically optimizes the MapReduce programs it generates, programmers writing Pig Latin do not have to worry about program efficiency; the automatic optimization saves a great deal of programming time. In both Local mode and MapReduce mode there are three ways of running Pig: the Grunt shell, script files, and embedded programs, as sketched below.
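For reference, a sketch of the standard command-line invocations (the script name is hypothetical):

    # Grunt shell in Local mode (single local host, local file system)
    pig -x local

    # Grunt shell in MapReduce mode (the default; uses the Hadoop cluster and HDFS)
    pig -x mapreduce

    # Script file mode: run a saved Pig Latin script in either mode
    pig -x local myscript.pig
    pig myscript.pig

Embedded program mode instead drives Pig from Java code through the PigServer class.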
Pig is a programming language that simplifies common Hadoop tasks: it can load data, express transformations on that data, and store the final result. Pig's built-in operations make sense of semi-structured data such as log files, and Pig can be extended with custom data types and transformations written in Java.
Pig's design philosophy for data types is summed up in a slogan: pigs eat anything. The input data can be in any format; Pig natively supports popular formats such as tab-delimited text files, and users can add functions to support other data formats. Pig does not require metadata or a schema for the data, but it can use them if they are available, as in the example below.
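For instance, the same file can be loaded with or without declaring a schema; PigStorage with a tab delimiter is Pig's default loader, and the file name here is hypothetical:

    -- With a declared schema: fields get names and types
    a = LOAD 'data.txt' USING PigStorage('\t') AS (name:chararray, age:int);

    -- Without a schema: fields are addressed positionally as $0, $1, ...
    b = LOAD 'data.txt';
    c = FOREACH b GENERATE $0, (int)$1;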
Apache Pig basic architecture
Pig's implementation consists of five main parts:
Pig Latin, the language framework that implements Pig's input, output, and human-computer interaction.
Zebra, the middle tier between Pig and HDFS/Hadoop. Zebra is a client for MapReduce jobs; it manages Hadoop's physical storage metadata in a structured way and serves as a data abstraction layer for Hadoop. Zebra has two core classes, TableStore (write) and TableLoad (read), which operate on data stored in Hadoop (see the sketch after this list).
Streaming, which is divided into four components: 1. Pig Latin; 2. the logical layer; 3. the physical layer; 4. the streaming implementation. Streaming creates a MapReduce job, sends it to the appropriate cluster, and monitors the job's entire execution in the cluster environment.
The MapReduce framework, which performs the distributed computation on each machine.
HDFS, which stores the final data.
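As a hedged sketch of the Zebra usage mentioned above (the paths are hypothetical, and the Pig-facing class names TableLoader/TableStorer are my assumption about how the read and write classes were exposed to Pig):

    -- Read a Zebra table into Pig, then write a Zebra table back out
    a = LOAD '/path/to/zebra_table' USING org.apache.hadoop.zebra.pig.TableLoader();
    STORE a INTO '/path/to/output' USING org.apache.hadoop.zebra.pig.TableStorer('');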
Contrast with Hive
Language: Hive can perform operations such as INSERT and DELETE, but I have not found a way to "insert" data with Pig.
Schemas: Hive has at least the concept of a "table"; Pig has essentially no such concept. The so-called tables are constructed inside Pig Latin scripts.
Partitions: Since Pig has no concept of tables, partitions are basically meaningless for Pig, whereas "partitioning" is a well-understood notion in Hive.
Server: Hive can start a server via Thrift, providing remote calls; I have not found such a feature in Pig.
Shell: In Pig you can run classic commands such as ls and cat, which is very handy (see the sketch after this list); I have never felt such a need when using Hive.
Web interface: Hive has one; Pig does not.
JDBC/ODBC: Hive has them; Pig does not.
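To illustrate the shell point above, the Grunt shell accepts classic file-system commands directly (the paths are hypothetical):

    grunt> ls /user/hadoop
    grunt> cat /user/hadoop/output/part-r-00000
    grunt> cd /user/hadoop/output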
Pig application scenarios
Ad hoc data queries used only by the relevant technical staff.
Time-sensitive data processing needs, where Pig lets you quickly write a script and start running it without creating tables or other preparatory work.
Pig includes:
Pig Latin, a SQL-like data processing language
An execution engine running on Hadoop that converts Pig Latin scripts into MapReduce programs run on the Hadoop cluster
Pig's advantages:
Simple coding
Optimized for common operations
Extensible: users can write custom UDFs (see the sketch after this list)
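A minimal sketch of the UDF extensibility, assuming a hypothetical UPPER function packaged in myudfs.jar:

    -- Register the jar containing the UDF, then call the function by its full name
    REGISTER myudfs.jar;
    a = LOAD 'users.txt' AS (name:chararray);
    b = FOREACH a GENERATE myudfs.UPPER(name);
    DUMP b;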
Pig's main users
Yahoo!: More than 90% of MapReduce jobs are generated by Pig
Twitter: More than 80% of MapReduce jobs are generated by Pig
LinkedIn: Most MapReduce jobs are generated by Pig
Other major users: Salesforce, Nokia, AOL, comScore
Pig's main developers
Hortonworks
Twitter
Yahoo!
Cloudera
Pig tools
PiggyBank: the official Pig function library
Elephant Bird: Twitter's Pig library
DataFu: LinkedIn's Pig library
Ambrose: Twitter's Pig job monitoring system
Mortar Data: a cloud-based Pig cluster management system
Pig positioning
The Pig Latin language is very similar to traditional database languages, but Pig Latin focuses on querying data rather than on modifying or deleting it. A Pig script is usually written in the following pattern.
Read data from the file system with a LOAD statement
Process the data through a series of "transformation" statements
Write the result to the file system with a STORE statement, or print it to the screen with a DUMP statement
The LOAD and STORE statements have strict syntax rules; the key is to use the transformation statements flexibly to process the data, as in the sketch below.
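Putting the three steps together, a minimal complete script might look like this (the file and field names are hypothetical):

    -- 1. Read data from the file system
    records = LOAD 'input/logs.txt' AS (user:chararray, bytes:long);

    -- 2. Process the data through a series of transformation statements
    grouped = GROUP records BY user;
    totals  = FOREACH grouped GENERATE group AS user, SUM(records.bytes) AS total;

    -- 3. Write the result to the file system, or print it to the screen
    STORE totals INTO 'output/totals';
    -- DUMP totals;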
Pig Latin features:
Easy to program: simple, highly parallel data analysis tasks are easy to accomplish.
Automatic optimization: the way tasks are encoded lets the system optimize execution automatically, so users can focus on logic rather than efficiency.
Extensible: users can easily write their own functions for special-purpose processing.
A Pig Latin program consists of a series of operations and transformations. Each operation or transformation processes its input and produces output, and together these operations describe a data flow. Inside Pig, the transformations are translated into a series of MapReduce jobs. Like MapReduce, Pig is designed for batch processing, so it is not suitable for every data processing task: if you only want to query a small portion of a large dataset, Pig will not perform well, because it scans the entire dataset or most of it.
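You can inspect this translation with Pig's EXPLAIN statement, which prints the logical, physical, and MapReduce execution plans for a relation (the input file is hypothetical):

    grunt> a = LOAD 'input.txt' AS (f:chararray);
    grunt> b = GROUP a BY f;
    grunt> EXPLAIN b;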
Reference material
Pig Website: http://pig.apache.org