Playing with big data: how Apache Pig integrates with Apache Lucene

Source: Internet
Author: User
Tags: solr, hadoop, ecosystem

 
Before the article begins, let's briefly review Pig's history.

1. What is Pig?

Pig was originally a Hadoop-based parallel processing framework at Yahoo. Yahoo later donated Pig to the Apache Software Foundation as an open source project, and it is now maintained by Apache. Pig is a platform for analyzing massive data sets on Hadoop: it provides an SQL-like language called Pig Latin and translates SQL-like data analysis requests into a series of optimized MapReduce operations. Pig offers a simple operating and programming interface for complex, massively parallel data computations, making it as simple, clear, and easy to use as Facebook's open source Hive, an open source framework for operating Hadoop in an SQL-like way.


2. What is Pig used for?

To answer this question, we have to go back to Yahoo's original purposes for using Pig:

1) Ingest and analyze user behavior log data (clickstream analysis, search content analysis, etc.) to improve the matching and ranking algorithms, and thus the quality of the search and advertising services.
2) Build and update the search index. The content crawled by the web crawler is a form of streaming data; processing it includes deduplication, link analysis, content classification, click-based popularity calculation (PageRank), and finally building the inverted lists.
3) Process semi-structured subscription (feed) data, including deduplication, geographic location resolution, and named entity recognition.

3. Pig's position in the Hadoop ecosystem





OK, now to the main topic. I have recently been working on a project analyzing click-throughs for the search keywords on our site. All of our site's log data is recorded in Hadoop, and the initial tasks and their significance are as follows:

(1) Extract the search records of our site from the logs
(2) Analyze the number of searches in a given period
(3) Analyze the number of clicks on a keyword in a given period
(4) From these data, identify the boundary keywords: searches with no clicks, searches with clicks, frequently searched keywords with few clicks, and rarely searched keywords with many clicks
(5) Use the analysis of these keywords to assess the quality of our site search, and provide a reference for optimizing and improving the search scheme
(6) Store the analyzed data in a Lucene or Solr index to provide a flexible and powerful retrieval method



The specific process of analyzing the data with Pig will not be detailed here; interested friends can leave a message on the public account to ask about it. Today we mainly look at how to store the results of Pig's analysis in a Lucene index. As for why we chose Lucene-family index storage rather than a database or plain files in HDFS, the biggest reason is speed. Storing the analyzed data directly in HDFS works well as storage: it is replicated and disaster tolerant. But querying, reading, filtering, and transforming that data is very troublesome and, frankly, slow: every time, you have to read the data, compute the final result with a Java program, and then hand it to the front end for display. Even when the data is not large, a lot of time is consumed in the intermediate reads, transfers, and analysis. So, while analyzing keyword conversion rates, we will take the opportunity to look at how to integrate Pig with Lucene, Solr, or Elasticsearch.


Pig and Hive can themselves store files of various formats in HDFS or HBase, including binary, JSON, Avro, and bzip, gzip, LZO, Snappy, and ORC formats, but they cannot store a Lucene index directly in HDFS. Why the index cannot be written to HDFS directly has to do with how the inverted index's file structure is organized; interested friends can ask about it on the public account. Although we cannot store the index directly, we can extend Pig or Hive through their UDF mechanisms so that they support index storage. Note that although this appears to store the index in HDFS, in essence it builds the index in a local temporary directory and then copies it to HDFS behind the scenes, a kind of compromise. In Pig, you need to define two UDFs to complete the index store: one creates the index and the other outputs it. Implementations by experts already exist on GitHub, so what we need to do is:
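To make the "build locally, then copy" compromise concrete, here is a minimal sketch of what the store UDF boils down to. This is my own illustration, not the GitHub project's actual code; it assumes the Lucene 4.10.x and Hadoop 2.x APIs, and the paths and sample document are made up.

Java code:

// Sketch of the core idea behind such a store UDF: build the Lucene index
// in a local temporary directory, then copy the finished files to HDFS.
// Assumes Lucene 4.10.x and Hadoop 2.x; paths are illustrative only.
import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LocalIndexThenCopyToHdfs {

    public static void main(String[] args) throws Exception {
        File localDir = new File("/tmp/lucene-local-index"); // local temp directory

        // 1. Build the index on the local disk, exactly as in a normal Lucene app.
        IndexWriterConfig cfg =
                new IndexWriterConfig(Version.LUCENE_4_10_2, new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(localDir), cfg)) {
            Document doc = new Document();
            doc.add(new TextField("desc", "hello pig and lucene", Field.Store.YES));
            writer.addDocument(doc);
        }

        // 2. The "disguised copy": ship the finished index directory up to HDFS.
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyFromLocalFile(new Path(localDir.getAbsolutePath()),
                             new Path("/tmp/data/20150303/luceneindex"));
        fs.close();
    }
}

Presumably the real UDF wires this same pattern into Pig's StoreFunc and OutputFormat plumbing, so that each reduce task builds its own local index and uploads it when the task commits.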

(1) Visit this address and download the ZIP package.
(2) Extract it, and in an Eclipse project modify the code to suit your environment (is the Lucene version compatible? the Hadoop version? the Pig version?).
(3) Repackage it into a jar using Ant.
(4) In Pig, register the dependent jar packages and use the index store.

Here is my test script:

Pig script:

--register the dependent jar packages
REGISTER /home/search/nsconvent/spig/20150112/lucenepig/pigudf.jar;
REGISTER /home/search/nsconvent/spig/20150112/lucenepig/lucene-analyzers-common-4.10.2.jar;
REGISTER /home/search/nsconvent/spig/20150112/lucenepig/lucene-core-4.10.2.jar;
REGISTER /home/search/nsconvent/spig/20150112/lucenepig/lucene-queryparser-4.10.2.jar;

--declare an alias for the store UDF (note: define it without constructor
--arguments; passing arguments in the DEFINE causes a reflection exception)
DEFINE LuceneStore com.pig.support.lucene.LuceneStore;

--load the data
A = LOAD '/tmp/data/20150303/tt.txt' USING PigStorage(',') AS (lbl:chararray, desc:chararray, score:int);

--build the index and store it on HDFS; note that each field needs its simple
--Lucene options configured (store it? tokenize it?)
STORE A INTO '/tmp/data/20150303/luceneindex' USING LuceneStore('store[true]:tokenize[true]');



At this point, we have successfully stored the index on HDFS, but don't celebrate yet; this is just a beginning. You may wonder: can an index stored in HDFS be queried or accessed directly? The answer is yes, but it is not recommended to read the index from HDFS directly. Even with Hadoop's block cache to speed things up, performance is still relatively low, unless the machines in your cluster have memory to spare. Otherwise, it is recommended to copy the index to the local disk first and then search it there, which is a somewhat troublesome interim step. In a follow-up article I will explain how to send the result sets Pig generates directly into Solr or Elasticsearch over HTTP, which is very convenient and fast.
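To make the "copy to local disk first" advice concrete, here is a hedged sketch, again assuming Lucene 4.10.x and Hadoop 2.x; the "desc" field and the paths simply echo the Pig script above.

Java code:

// Sketch of the recommended pattern: copy the index from HDFS to the local
// disk, then open and search it there with ordinary Lucene calls.
import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SearchLocalCopyOfHdfsIndex {

    public static void main(String[] args) throws Exception {
        // 1. Pull the index directory down from HDFS to the local disk.
        File localDir = new File("/tmp/local-index-copy");
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyToLocalFile(new Path("/tmp/data/20150303/luceneindex"),
                           new Path(localDir.getAbsolutePath()));
        fs.close();

        // 2. Open the local copy and search it. Note: if the store wrote
        // per-task subdirectories, open the subdirectory that actually
        // contains the segment files instead.
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(localDir))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser(Version.LUCENE_4_10_2, "desc",
                    new StandardAnalyzer()).parse("hello");
            for (ScoreDoc sd : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(sd.doc).get("desc"));
            }
        }
    }
}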

That's all for this article. If you have any questions, ask on the public account; its QR code is at the end of the text, so you can follow it directly with your phone. OK, no more rambling. If I don't leave now, I'll be miserably squeezed onto the subway!

