Recently I spent a few weeks using Pig to analyze the search log data from our website, and I liked it a lot, so today I am writing a note on where Pig came from. Outside the big data world, probably very few people know what Pig does. That includes programmers who do not work on big data, as well as friends in other industries who do neither programming nor big data. Such readers may well take the name literally: one look at the title and they chuckle, silently translating it as "Notes of the Apache Pig. This Apache pig must be quite something if it can write notes."
Just kidding. On to the topic. I will try to keep this as easy to follow as I can, so that everyone comes away understanding what Pig actually does.
Pig started out at Yahoo as a Hadoop-based parallel processing framework. Yahoo later donated it to the Apache Software Foundation (an open source software foundation), which now maintains it as an Apache project. Pig is a platform for analyzing massive data sets on Hadoop. It provides a SQL-like language called Pig Latin, and its compiler translates Pig Latin analysis requests into a series of optimized MapReduce operations. Pig offers a simple operating and programming interface for complex, massively parallel data computation, and it is as simple, clear, and easy to use as Hive, the framework open-sourced by Facebook for working with Hadoop in a SQL style.
So what did Yahoo mainly use Pig for?
1) Ingest and analyze user behavior logs (clickstream analysis, search content analysis, and so on), and use the results to improve the matching and ranking algorithms and thus the quality of search and advertising services.
2) Build and update the search index. The content fetched by the web crawler arrives as a stream of data, and processing it includes deduplication, link analysis, content categorization, popularity calculation (click counts, PageRank), and finally building the inverted index.
3) Process semi-structured data from subscription (feed) services. This includes deduplication, geographic location resolution, and named entity recognition.
Using Pig to drive Hadoop makes processing massive amounts of data very simple. Without Pig we would have to write MapReduce code by hand, which is tedious because each MapReduce job has one narrowly defined task: cleaning the data is one job, processing is another, and filtering, counting, and sorting are each a job of their own. Wiring these jobs into a DAG (with sequential dependencies) is inconvenient, though bearable. Worse, every small change means recompiling the whole job, packaging it into a jar, and submitting it to the Hadoop cluster again, and debugging is hard on top of that. That is why large internet and e-commerce companies today rarely handle their tasks with hand-written MapReduce alone; most rely on tools or open source frameworks instead.
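To give a feel for the plumbing described above, here is a minimal in-memory sketch in Python. It is not Hadoop code: `run_job` and the mapper and reducer functions are hypothetical names made up for illustration, and the sample log is invented. It only shows that cleaning, counting, and sorting each become a separate map/shuffle/reduce job that has to be chained by hand.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Simulate one MapReduce job in memory: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    output = []
    for key in sorted(shuffled):  # reducers see keys in sorted order
        output.extend(reducer(key, shuffled[key]))
    return output

# Job 1: clean the raw lines (trim, lowercase, drop empty ones).
def clean_map(line):
    query = line.strip().lower()
    if query:
        yield query, None

def clean_reduce(query, values):
    for _ in values:
        yield query

# Job 2: count how often each query occurs.
def count_map(query):
    yield query, 1

def count_reduce(query, ones):
    yield query, sum(ones)

# Job 3: sort by count, descending, by using the negated count as the key.
def sort_map(pair):
    query, n = pair
    yield -n, query

def sort_reduce(neg_n, queries):
    for query in queries:
        yield query, -neg_n

raw_log = ["Pig ", "hadoop", "", "pig", "Hive", "PIG"]  # invented sample data
cleaned = run_job(raw_log, clean_map, clean_reduce)
counted = run_job(cleaned, count_map, count_reduce)
ranked = run_job(counted, sort_map, sort_reduce)
print(ranked)
```

Even in this toy form, three small steps need six functions plus the chaining code. On a real cluster each job would also have to be compiled, packaged, and submitted.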
With the arrival of the data tsunami, traditional databases (Oracle, DB2) can no longer meet the demands of massive data processing, and MapReduce has gradually become the de facto standard for data processing, applied across all kinds of industries. We can no longer expect every customer to quickly develop application code themselves. All we can do is make their work easier so that, as with SQL, they can run "cloud" operations after some simple training.
Pig was designed to hide the cumbersome details of MapReduce development and to give users a near-SQL processing language, Pig Latin, so that handling massive amounts of data becomes easier. Pig translates Pig Latin statements into a set of MapReduce (MR) jobs and chains them together as a data flow.
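The data-flow idea can be sketched as follows. Each comment shows a Pig Latin statement, and the Python line under it mimics in memory what that step produces. The file name, field names, and sample rows are assumptions made up for this illustration, and the Python is only an emulation, not how Pig executes the script.

```python
# Hypothetical search-log rows: (user, query). A None query is a malformed row.
rows = [("u1", "pig"), ("u2", "hive"), ("u1", None), ("u3", "pig")]

# logs = LOAD 'search.log' AS (user:chararray, query:chararray);
logs = rows

# clean = FILTER logs BY query IS NOT NULL;
clean = [r for r in logs if r[1] is not None]

# grouped = GROUP clean BY query;
grouped = {}
for user, query in clean:
    grouped.setdefault(query, []).append((user, query))

# counts = FOREACH grouped GENERATE group AS query, COUNT(clean) AS n;
counts = [(query, len(bag)) for query, bag in grouped.items()]

# ranked = ORDER counts BY n DESC;
ranked = sorted(counts, key=lambda t: t[1], reverse=True)

print(ranked)
```

Every statement names an intermediate relation that the next statement consumes, which is what "data flow" means here. Pig works out how to turn those statements into MapReduce jobs.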
A simple Pig processing flow looks like this:
[Figure: a simple Pig processing flow]
The execution engine looks like this:
[Figure: the Pig execution engine]
In Pig, every step is a data flow, which makes scripts easy to follow: you state what you want and Pig produces it. Where the built-in operators fall short, you can extend Pig with a user-defined function (UDF). Because each step says explicitly what it does, a Pig script is often easier to follow than the equivalent SQL, and the language is easy to learn. In the big data era, Pig is an approachable way to analyze massive amounts of data.
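As a rough illustration of the UDF idea, the sketch below applies a custom function to one field of every record, much as a UDF would be called inside a FOREACH ... GENERATE step. The names `normalize_query` and `foreach_generate` and the sample records are hypothetical. The registration step that a real Pig script needs is omitted.

```python
def normalize_query(query):
    """Hypothetical UDF: lowercase a query and collapse repeated whitespace."""
    return " ".join(query.lower().split())

def foreach_generate(relation, udf):
    """Apply `udf` to the query field of every (user, query) record."""
    return [(user, udf(query)) for user, query in relation]

logs = [("u1", "  Apache   PIG "), ("u2", "pig latin")]  # invented sample data
normalized = foreach_generate(logs, normalize_query)
print(normalized)
```

The point is that the custom logic lives in one small function, while the surrounding data flow stays unchanged.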
Finally, some good news: the latest Pig release (0.14) adds two important features:
(1) Support for running Pig on Tez
(2) Support for storing data in the ORC format
If you can't wait to get to know Pig, do not hesitate to visit the Pig website at http://pig.apache.org/, where a very complete and rich set of introductory and learning materials is waiting for you.
Apache Pig: Past and Present