Apache Pig is a high-level query language for large-scale data processing. Working with Hadoop, it has a multiplier effect: a data-processing program written in Pig can be many times shorter than the equivalent program written in a language such as Java or C++. Apache Pig provides a higher level of abstraction over the MapReduce framework through a SQL-like data-processing scripting language called Pig Latin. With Pig we can sort, filter, and sum loaded data, group it (GROUP BY), and join it (JOIN); users can also define their own functions to operate on data sets, the well-known UDFs (user-defined functions).
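As a minimal illustration of these operations, here is a hedged Pig Latin sketch; the file names and field names are hypothetical:

    -- Load two hypothetical tab-delimited files
    users = LOAD 'users.txt' AS (name:chararray, age:int);
    views = LOAD 'views.txt' AS (user:chararray, url:chararray);

    -- Filter, join, group, aggregate, and sort
    adults  = FILTER users BY age >= 18;
    joined  = JOIN adults BY name, views BY user;
    grouped = GROUP joined BY adults::name;
    counts  = FOREACH grouped GENERATE group AS name, COUNT(joined) AS n;
    sorted  = ORDER counts BY n DESC;
    DUMP sorted;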
Pig has two execution modes: Local mode and MapReduce mode. In Local mode, Pig runs against a single local host; in MapReduce mode, it connects to a Hadoop cluster installation and HDFS, and Pig automatically allocates and reclaims cluster resources. Because the Pig system automatically optimizes the MapReduce programs it generates, programmers writing Pig Latin do not have to worry about program efficiency; the automatic optimization saves a great deal of programming time. In both Local mode and MapReduce mode there are three ways of running Pig: the Grunt shell, script files, and embedded programs, as sketched below.
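For reference, a sketch of the standard command-line invocations (the script name is hypothetical):

    # Grunt shell in Local mode (single local host, local file system)
    pig -x local

    # Grunt shell in MapReduce mode (the default; uses the Hadoop cluster and HDFS)
    pig -x mapreduce

    # Script file mode: run a saved Pig Latin script in either mode
    pig -x local myscript.pig
    pig myscript.pig

Embedded program mode instead drives Pig from Java code through the PigServer class.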
Pig is a programming language that simplifies common Hadoop tasks: it can load data, express transformations on that data, and store the final result. Pig's built-in operations make sense of semi-structured data such as log files, and Pig can be extended with custom data types and transformations written in Java.
Pig's design philosophy for data types is summed up in a slogan: pigs eat anything. The input data can be in any format; Pig natively supports popular formats such as tab-delimited text files, and users can add functions to support other data formats. Pig does not require metadata or a schema for the data, but it can use them if they are available, as in the example below.
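For instance, the same file can be loaded with or without declaring a schema; PigStorage with a tab delimiter is Pig's default loader, and the file name here is hypothetical:

    -- With a declared schema: fields get names and types
    a = LOAD 'data.txt' USING PigStorage('\t') AS (name:chararray, age:int);

    -- Without a schema: fields are addressed positionally as $0, $1, ...
    b = LOAD 'data.txt';
    c = FOREACH b GENERATE $0, (int)$1;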
Apache Pig basic architecture
Pig's implementation consists of five main parts:
Pig Latin, the language framework that implements Pig's input, output, and human-computer interaction.
Zebra, the middle tier between Pig and HDFS/Hadoop. Zebra is a client for MapReduce jobs; it manages Hadoop's physical storage metadata in a structured way and serves as a data abstraction layer for Hadoop. Zebra has two core classes, TableStore (write) and TableLoad (read), which operate on data stored in Hadoop (see the sketch after this list).
Streaming, which is divided into four components: 1. Pig Latin; 2. the logical layer; 3. the physical layer; 4. the streaming implementation. Streaming creates a MapReduce job, sends it to the appropriate cluster, and monitors the job's entire execution in the cluster environment.
The MapReduce framework, which performs the distributed computation on each machine.
HDFS, which stores the final data.
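As a hedged sketch of the Zebra usage mentioned above (the paths are hypothetical, and the Pig-facing class names TableLoader/TableStorer are my assumption about how the read and write classes were exposed to Pig):

    -- Read a Zebra table into Pig, then write a Zebra table back out
    a = LOAD '/path/to/zebra_table' USING org.apache.hadoop.zebra.pig.TableLoader();
    STORE a INTO '/path/to/output' USING org.apache.hadoop.zebra.pig.TableStorer('');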
Contrast with Hive
Language: Hive can perform operations such as INSERT and DELETE, but I have not found a way to "insert" data with Pig.
Schemas: Hive has at least the concept of a "table"; Pig has essentially no such concept. The so-called tables are constructed inside Pig Latin scripts.
Partitions: Since Pig has no concept of tables, partitions are basically meaningless for Pig, whereas "partitioning" is a well-understood notion in Hive.
Server: Hive can start a server via Thrift, providing remote calls; I have not found such a feature in Pig.
Shell: In Pig you can run classic commands such as ls and cat, which is very handy (see the sketch after this list); I have never felt such a need when using Hive.
Web interface: Hive has one; Pig does not.
JDBC/ODBC: Hive has them; Pig does not.
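To illustrate the shell point above, the Grunt shell accepts classic file-system commands directly (the paths are hypothetical):

    grunt> ls /user/hadoop
    grunt> cat /user/hadoop/output/part-r-00000
    grunt> cd /user/hadoop/output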
Pig application scenarios
Ad hoc data queries used only by the relevant technical staff.
Time-sensitive data processing needs, where Pig lets you quickly write a script and start running it without creating tables or other preparatory work.
Pig includes:
Pig Latin, a SQL-like data processing language
An execution engine running on Hadoop that converts Pig Latin scripts into MapReduce programs run on the Hadoop cluster
Pig's advantages:
Simple coding
Optimized for common operations
Extensible: users can write custom UDFs (see the sketch after this list)
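A minimal sketch of the UDF extensibility, assuming a hypothetical UPPER function packaged in myudfs.jar:

    -- Register the jar containing the UDF, then call the function by its full name
    REGISTER myudfs.jar;
    a = LOAD 'users.txt' AS (name:chararray);
    b = FOREACH a GENERATE myudfs.UPPER(name);
    DUMP b;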
Pig's main users
Yahoo!: More than 90% of MapReduce jobs are generated by Pig
Twitter: More than 80% of MapReduce jobs are generated by Pig
LinkedIn: Most MapReduce jobs are generated by Pig
Other major users: Salesforce, Nokia, AOL, comScore
Pig's main developers
Hortonworks
Twitter
Yahoo!
Cloudera
Pig tools
PiggyBank: the official Pig function library
Elephant Bird: Twitter's Pig library
DataFu: LinkedIn's Pig library
Ambrose: Twitter's Pig job monitoring system
Mortar Data: a cloud-based Pig cluster management system
Pig positioning
The Pig Latin language is very similar to traditional database languages, but Pig Latin focuses on querying data rather than on modifying or deleting it. A Pig script is usually written in the following pattern.
Read data from the file system with a LOAD statement
Process the data through a series of "transformation" statements
Write the result to the file system with a STORE statement, or print it to the screen with a DUMP statement
The LOAD and STORE statements have strict syntax rules; the key is to use the transformation statements flexibly to process the data, as in the sketch below.
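Putting the three steps together, a minimal complete script might look like this (the file and field names are hypothetical):

    -- 1. Read data from the file system
    records = LOAD 'input/logs.txt' AS (user:chararray, bytes:long);

    -- 2. Process the data through a series of transformation statements
    grouped = GROUP records BY user;
    totals  = FOREACH grouped GENERATE group AS user, SUM(records.bytes) AS total;

    -- 3. Write the result to the file system, or print it to the screen
    STORE totals INTO 'output/totals';
    -- DUMP totals;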
Pig Latin features:
Easy to program: simple, highly parallel data analysis tasks are easy to accomplish.
Automatic optimization: the way tasks are encoded lets the system optimize execution automatically, so users can focus on logic rather than efficiency.
Extensible: users can easily write their own functions for special-purpose processing.
A Pig Latin program consists of a series of operations and transformations. Each operation or transformation processes its input and produces output, and together these operations describe a data flow. Inside Pig, the transformations are translated into a series of MapReduce jobs. Like MapReduce, Pig is designed for batch processing, so it is not suitable for every data processing task: if you only want to query a small portion of a large dataset, Pig will not perform well, because it scans the entire dataset or most of it.
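You can inspect this translation with Pig's EXPLAIN statement, which prints the logical, physical, and MapReduce execution plans for a relation (the input file is hypothetical):

    grunt> a = LOAD 'input.txt' AS (f:chararray);
    grunt> b = GROUP a BY f;
    grunt> EXPLAIN b;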
Reference material
Pig Website: http://pig.apache.org