High-level language for the Hadoop framework: Apache Pig


Apache Pig is a high-level query language for large-scale data processing that works with Hadoop. It can have a multiplier effect when processing large amounts of data: a Pig program can be many times shorter than an equivalent data-processing program written in a language such as Java or C++, while achieving the same result. Apache Pig provides a higher level of abstraction for processing large data sets through its data-processing scripting language for the MapReduce framework, called Pig Latin. With Pig Latin we can sort, filter, sum, group (GROUP BY), and join (JOIN) the loaded data. Pig also lets users define their own functions to operate on data sets, the so-called UDFs (user-defined functions).
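As a sketch of the operations listed above, a short Pig Latin script might look as follows (the file names and field names here are hypothetical, chosen only for illustration):

```pig
-- Load tab-delimited user and page-view data (hypothetical files and fields)
users = LOAD 'users.txt' AS (name:chararray, age:int);
views = LOAD 'views.txt' AS (name:chararray, url:chararray);

-- Filter, group, aggregate, and join
adults  = FILTER users BY age >= 18;
by_user = GROUP views BY name;
counts  = FOREACH by_user GENERATE group AS name, COUNT(views) AS n;
joined  = JOIN adults BY name, counts BY name;

-- Print the result to the screen
DUMP joined;
```

Each statement names an intermediate relation; nothing actually runs until an output statement such as DUMP or STORE is reached.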

Pig has two modes of operation: Local mode and MapReduce mode. When Pig runs in Local mode, it accesses only a single local host; when Pig runs in MapReduce mode, it accesses a Hadoop cluster installation and HDFS, and Pig automatically allocates and reclaims cluster resources. Because the Pig system automatically optimizes the generated MapReduce program, the programmer writing Pig Latin does not have to worry about program efficiency; this automatic optimization saves a lot of programming time. In both Local and MapReduce mode, Pig offers three ways to run programs: the Grunt shell (interactive), script files, and embedded programs.
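For instance, the mode is selected when launching Pig, and the Grunt shell then accepts Pig Latin statements interactively (the script and file names below are hypothetical):

```pig
-- Launch commands (entered at the operating-system shell):
--   pig -x local                  start Grunt in Local mode
--   pig -x mapreduce              start Grunt in MapReduce mode
--   pig -x local myscript.pig     run a script file directly

-- Inside the Grunt shell, statements are entered one at a time:
-- grunt> logs = LOAD 'access.log' AS (line:chararray);
-- grunt> DUMP logs;
```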

Pig is a programming environment that simplifies the common tasks performed on Hadoop: it can load data, express transformations on that data, and store the final result. Pig's built-in operations make sense of semi-structured data such as log files. At the same time, Pig is extensible with custom data types written in Java, and supports data conversion.

Pig's data-type design philosophy is summed up in a slogan: pigs eat anything. The input data can be in any format; Pig natively supports the popular formats, such as tab-delimited text files, and users can add functions to support other data formats. Pig does not require metadata or a schema for the data, but it can use one if it is available.
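The optional schema shows up directly in the LOAD statement. A minimal sketch (file name hypothetical):

```pig
-- With an explicit schema: fields get names and types
a = LOAD 'data.tsv' AS (id:int, name:chararray, score:double);

-- Without a schema ("pigs eat anything"): fields are addressed by position
b = LOAD 'data.tsv';
c = FOREACH b GENERATE $0, $2;   -- the first and third fields
```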

Apache Pig basic architecture

Pig's implementation consists of five main parts:

Pig Latin: the human-computer interaction part of Pig, a framework for describing data input, transformation, and output.

Zebra: the middle tier between Pig and HDFS/Hadoop. Zebra acts as a client for MapReduce jobs, manages Hadoop's physical storage metadata in a structured way, and serves as a data abstraction layer for Hadoop. Zebra has two core classes, TableStore (write) and TableLoad (read), which operate on the data stored in Hadoop.

Streaming: Pig streaming is divided into four components: 1. Pig Latin; 2. the logical layer; 3. the physical layer; 4. the streaming implementation. Streaming creates a MapReduce job, sends it to the appropriate cluster, and monitors the entire execution of the job in the cluster environment.

MapReduce: the framework (algorithm) for distributed computation on each machine.

HDFS: the part responsible for final data storage.

Comparison with Hive

Language: Hive can perform operations such as "insert/delete", but Pig offers no obvious way to "insert" data.

Schemas: Hive has at least the concept of a "table", whereas Pig basically has no table concept; the so-called tables are constructed within Pig Latin scripts.

Partitions: Since Pig has no concept of tables, it makes little sense to talk about partitions in Pig, whereas Hive has a genuine notion of partitioning.

Server: Hive can rely on Thrift to start a server and provide remote calls; Pig has no such feature.

Shell: In Pig's shell you can run classic commands such as ls and cat, which is very convenient; with Hive that need never arises.

Web interface: Hive has one; Pig does not.

JDBC/ODBC: Hive has them; Pig does not.

Pig application scenarios

Data queries intended only for the relevant technical staff.

Ad-hoc, time-sensitive data processing: with Pig you can quickly write a script and start running it, without creating tables or doing other preparatory work.

Pig includes:

Pig Latin, an SQL-like data processing language

The Pig Latin execution engine, which runs on Hadoop and converts Pig scripts into MapReduce programs that run on the Hadoop cluster

Pig's advantages:

Simple coding

Optimized for common operations

Scalable. Custom UDFs
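Custom UDFs are typically packaged in a jar, registered, and then called like built-in functions. A minimal sketch (the jar, class name, and input file are all hypothetical):

```pig
-- Register a jar containing a user-defined function and give it a short alias
REGISTER 'myudfs.jar';
DEFINE UPPER com.example.pig.ToUpper();

-- Apply the UDF to every record
names   = LOAD 'users.txt' AS (name:chararray);
shouted = FOREACH names GENERATE UPPER(name);
```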

Pig's main users

Yahoo!: more than 90% of MapReduce jobs are generated by Pig

Twitter: more than 80% of MapReduce jobs are generated by Pig

LinkedIn: most MapReduce jobs are generated by Pig

Other major users: Salesforce, Nokia, AOL, comScore

Pig's main developers

Hortonworks

Twitter

Yahoo!

Cloudera

Pig tool

Piggybank (Pig official function library)

Elephant Bird: Twitter's Pig library

DataFu: LinkedIn's Pig library

Ambrose: Twitter's Pig job monitoring system

Mortar Data: cloud-based Pig cluster management system

Pig positioning

Pig Latin is very similar to traditional database languages, but it focuses on querying data rather than modifying or deleting it. A Pig program is usually written in the following pattern.

Read data from the file system with a LOAD statement.

Process the data through a series of "transformation" statements.

Write the processed output to the file system with a STORE statement, or display it on screen with a DUMP statement.

The LOAD and STORE statements have strict syntax rules; the key is the flexibility of the transformation statements used to process the data.
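The LOAD, transform, STORE/DUMP pattern above can be sketched as a word-count script (paths and field names hypothetical):

```pig
-- LOAD: read from the file system
raw = LOAD 'input/words.tsv' AS (word:chararray);

-- Transform: group and count
grp = GROUP raw BY word;
cnt = FOREACH grp GENERATE group AS word, COUNT(raw) AS freq;

-- STORE to the file system, or DUMP to the screen instead
STORE cnt INTO 'output/wordcount';
-- DUMP cnt;
```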

Pig Latin features:

Easy to program: simple, highly parallel data analysis tasks are easy to accomplish.

Automatic optimization: the way tasks are encoded allows the system to optimize execution automatically, letting users focus on logic rather than efficiency.

Extensibility: users can easily write their own functions for special-purpose processing.

A Pig Latin program consists of a series of operations and transformations. Each operation or transformation processes its input and produces output, and together they describe a data flow. Internally, Pig translates these transformations into a series of MapReduce jobs. Pig is not suitable for every data processing task; like MapReduce, it is designed for batch processing. If you only want to query a small portion of a large data set, Pig will not perform well, because it scans the entire data set or most of it.

Reference material

Pig Website: http://pig.apache.org
