Apache Pig Entry 1 - Introduction / Basic Architecture / Comparison with Hive

This article is divided into four parts: 1. Introduction 2. Basic Architecture 3. Comparison with Hive 4. Usage

I. Introduction
Google engineers developed a tool called Sawzall to make it easier to write MapReduce jobs. Google published several papers about it online, but the code itself is not open source; only the design philosophy is. In the previous article I mentioned that Hadoop has likewise launched Pig, a language similar to Sawzall and based on the papers Google published.

Pig is an abstraction layer for processing very large datasets. In plain MapReduce there are only two functions, map and reduce; for every job you have to write the code, compile it, deploy it, and then run it on Hadoop, which all takes time. Pig not only simplifies MapReduce development, it also lets you convert data between different representations; for example, some of the transformations involved in a join are not easy to implement directly in MapReduce.

Apache Pig can be run locally: unpack the distribution and type "bin/pig -x local" to start it directly, which is very simple. This is the so-called local mode, but it is rarely used that way; in practice Pig is connected to an HDFS/Hadoop cluster. You can think of Apache Pig as a scripting shell over the MapReduce framework. Its SQL-like language is called Pig Latin. In a Pig Latin script we can sort, filter, sum, group, and join the loaded data, and Pig also lets users define their own operations on a dataset, the so-called UDFs (user-defined functions).
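To make this concrete, here is a minimal Pig Latin sketch of those operations; the file names and field names are hypothetical, not from the original article:

logs = LOAD 'access.log' USING PigStorage('\t') AS (ip:chararray, url:chararray, bytes:long);
big = FILTER logs BY bytes > 1024;                                      -- filter
by_ip = GROUP big BY ip;                                                -- group
totals = FOREACH by_ip GENERATE group AS ip, SUM(big.bytes) AS total;   -- sum
ranked = ORDER totals BY total DESC;                                    -- sort
owners = LOAD 'ip_owner.txt' USING PigStorage('\t') AS (ip:chararray, owner:chararray);
joined = JOIN ranked BY ip, owners BY ip;                               -- join
DUMP joined;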

After translation, a Pig Latin script becomes one or more MapReduce jobs, which classify and summarize result sets processed in parallel by many threads, processes, or independent machines. The map() and reduce() functions run in parallel; even when they are not on the same machine, they each work on their own set of tasks at the same time. When all processing is complete, the results are sorted, formatted, and saved to files. Pig uses MapReduce to split the computation into two stages: in the first stage the input is divided into small blocks and distributed to the nodes that store the data, spreading out the computational load; in the second stage the results of the first stage are aggregated. This achieves very high throughput: a small amount of code and effort can drive thousands of machines in parallel, making full use of the computers' resources and eliminating bottlenecks.
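If you are curious how a given script maps onto those stages, Pig's EXPLAIN statement prints the logical, physical, and MapReduce plans for a relation. A small sketch (aliases and file name are hypothetical):

raw = LOAD 'passwd' USING PigStorage(':') AS (user:chararray, pw:chararray, uid:int);
by_uid = GROUP raw BY uid;                      -- the grouping becomes the shuffle between map and reduce
counts = FOREACH by_uid GENERATE group AS uid, COUNT(raw) AS n;
EXPLAIN counts;                                 -- prints the logical/physical/MapReduce plans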

Pig therefore makes it easy to query terabytes of mostly unstructured data, for example a pile of files spread across the disks of several machines: log4j output, health-status logs from thousands of online servers, transaction records, IP access records, application service logs. We usually need to count these records, extract fields from them, or query for abnormal entries, turn them into reports, and convert the raw data into valuable information. Such queries get complex, and at that point a product like MySQL can no longer meet our needs for speed and efficiency, whereas Apache Pig can help us achieve this goal.

Conversely, if as an experiment you export 100 rows of MySQL data into a text file and query it with Pig, you will be very disappointed at how slow such a tiny query is. The reason is that a MapReduce job has to be generated in between, which is unavoidable, so querying small amounts of data is not what Pig is for; it is like chopping vegetables with Guan Gong's broadsword. In addition, Pig offers an API so that it can be called from a Java program (shown in section IV). That, then, is my admittedly one-sided understanding of Apache Pig. Thank you.


II. Basic Architecture

On the whole, large amounts of data are gathered on HDFS, and MapReduce operations are simplified by writing SQL-like scripts, letting a few lines of code drive thousands of machines in parallel computation.

The Pig implementation includes five main components:

1. Pig Latin: the language implemented by the Pig framework itself, which provides the human-computer interaction for input and output.
2. Zebra: the middle layer between Pig and HDFS/Hadoop and the client through which MapReduce jobs are written. Zebra uses a structured language to manage Hadoop's physical storage metadata and also serves as Hadoop's data abstraction layer; its two core classes, TableStore (write) and TableLoad (read), operate on data in Hadoop.
3. Pig Streaming: itself divided into four parts: 1. Pig Latin 2. the Logical Layer 3. the Physical Layer 4. Streaming, which creates a Map/Reduce job, submits it to the appropriate cluster, and monitors the entire execution of the job in that cluster environment.
4. MapReduce: the framework (algorithm) that performs the distributed computation on each machine.
5. HDFS: the part that ultimately stores the data.

III. Comparison with Hive
Comparing a plane with a train is pointless, because the two are not really comparable: both are fast means of transport, but their scopes of use are completely different. In the same way, Hive and Pig are both Hadoop projects and have a lot in common, yet Hive has something of a database about it, while Pig is basically a tool (a scripting layer) over MapReduce. Both have their own language of expression, both aim to simplify writing MapReduce, and both read and write data that is ultimately stored in the HDFS distributed file system. So Pig and Hive are somewhat similar but also somewhat different. For a simple comparison, here are a few points:


Let me ramble a bit:
Language
You can perform insert/delete operations in Hive, but I have not found any data insertion method in Pig. Please allow me to consider this as the biggest difference for the time being.

Schemas
Hive at least has the concept of a "table"; Pig basically has none. The so-called table is just whatever relation you create inside a Pig Latin script, and Pig has no metadata store to speak of.
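To illustrate, whatever "schema" Pig has lives only in the script itself, declared when the data is loaded. A minimal sketch, assuming the passwd file from section IV:

users = LOAD 'passwd' USING PigStorage(':') AS (name:chararray, pw:chararray, uid:int, gid:int);
DESCRIBE users;    -- prints the schema declared above; no metastore is involved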

Partitions
Pig has no table concept, so partitions are a non-issue for Pig; mention "partitions" to Hive, on the other hand, and it knows exactly what you mean.

Server
Hive can start a Thrift-based server to accept remote calls. I searched for a long time and have not found such a feature in Pig; if you make a new discovery, please tell me. Someone has also built a REST interface for Hive.

Shell
In Pig's shell you can run classic, handy commands such as ls and cat, whereas I never felt the need for that when using Hive.
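For example, inside Pig's interactive Grunt shell you can browse HDFS directly (the paths are hypothetical):

grunt> ls /user/root
grunt> cat /user/root/passwd
grunt> fs -ls /user/root

The fs command simply passes its arguments straight through to hadoop fs.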

Web Interface
Hive has one; Pig does not.

JDBC/ODBC
Pig: none; Hive: yes.


IV. Usage

1. Start/Run
There are two servers here: one is the Pig server and the other is the HDFS server.
First, configure the Pig server so that its configuration file points at the HDFS server. Edit:
vim /work/pig/conf/pig.properties
Add the following content:
fs.default.name=hdfs://192.168.1.201:9000/    # points to the HDFS server
mapred.job.tracker=192.168.1.201:9001         # points to the MapReduce job tracker

If this is the first time you run it, create the user's home directory on the Hadoop HDFS server and upload the local /etc/passwd file into it, using the following two commands:
hadoop fs -mkdir /user/root
hadoop fs -put /etc/passwd /user/root/passwd

Next, create a script to run: on the Pig server, use vim to create a file named javabloger_testscript.pig with the following content:
LoadFile = load 'passwd' using PigStorage(':');
Result = foreach LoadFile generate $0 as id;
dump Result;
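A slightly richer variant of the same script, given here only as a sketch (the output path is hypothetical), declares a schema for the first few columns and writes the result back to HDFS instead of dumping it:

LoadFile = load 'passwd' using PigStorage(':') as (user:chararray, pw:chararray, uid:int);
Result = foreach LoadFile generate user;
store Result into 'passwd_users' using PigStorage(',');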

Run the script, for example: pig javabloger_testscript.pig. Pig prints the job's execution status followed by the dumped result.

2. Run the Java code and print the result.
import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class LocalPig {
    public static void main(String[] args) {
        try {
            PigServer pigServer = new PigServer("local");
            runIdQuery(pigServer, "passwd");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void runIdQuery(PigServer pigServer, String inputFile) throws IOException {
        pigServer.registerQuery("LoadFile = load '" + inputFile + "' using PigStorage(':');");
        pigServer.registerQuery("Result = foreach LoadFile generate $0 as id;");
        Iterator<Tuple> result = pigServer.openIterator("Result");
        while (result.hasNext()) {
            Tuple t = result.next();
            System.out.println(t);
        }
        // pigServer.store("Result", "output");
    }
}

-End-

Original article: Apache Pig Entry 1 - Introduction / Basic Architecture / Comparison with Hive.
