How to integrate Apache Pig with Apache Lucene

Source: Internet
Author: User
Tags: hadoop ecosystem


Before diving in, let's briefly review Pig's history:

1. What is Pig?

Pig was originally a Hadoop-based parallel processing framework developed at Yahoo. Yahoo later donated Pig to the Apache Software Foundation, which maintains it today. Pig is a platform for large-scale data analysis on Hadoop. Its SQL-like language is called Pig Latin, and its compiler converts Pig Latin data-analysis requests into a series of optimized MapReduce jobs. Pig thus provides a simple operational and programming interface for complex, massive parallel data computations. In this respect it resembles Facebook's open-source Hive (a framework for querying Hadoop data in a SQL-like way): simple, clear, and easy to use.
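To make "SQL-like language compiled into MapReduce jobs" concrete, here is a classic word-count sketch in Pig Latin (a hypothetical example, not from the original article; the input path and field names are assumptions). These few lines compile into a full map-shuffle-reduce pipeline:

```pig
-- hypothetical word-count sketch; input/output paths are assumptions
lines  = LOAD '/tmp/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/tmp/wordcount' USING PigStorage(',');
```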

2. What is Pig used for?

To answer this question, we have to go back to Yahoo's original uses of Pig:

1) Ingest and analyze user behavior log data (click-stream analysis, search content analysis, etc.) to improve matching and ranking algorithms and raise the quality of search and advertising services.
2) Build and update the search index. Content captured by web crawlers is a form of stream data; processing it includes de-duplication, link analysis, content classification, and PageRank computation, and finally building an inverted index.
3) Process semi-structured data subscription (data feed) services, including de-duplication, geographic location resolution, and named entity recognition.

3. Pig's position in the Hadoop Ecosystem

OK, on to the subject of this article. A project sanxian recently worked on is click-through rate analysis of keywords for our site search. All of the log data for our entire site is recorded on Hadoop. The initial tasks and goals of this project are as follows:

(1) Extract the data belonging to our site's own search
(2) Count the number of searches for each keyword in a given period
(3) Count the number of clicks for each keyword in a given period
(4) Use the data to find boundary keywords: keywords that are searched but never clicked, and keywords that are searched but rarely clicked
(5) Analyze these keywords to evaluate our site's search quality and provide a reference for optimizing and improving the search solution
(6) Store the analyzed data in a Lucene or Solr index to provide flexible and powerful retrieval
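The counting and filtering steps above can be sketched in Pig Latin (a hypothetical sketch only; the log path, field layout, and event-type values are all assumptions, since the article does not detail the actual schema):

```pig
-- hypothetical sketch of steps (1)-(4); paths and schema are assumptions
logs     = LOAD '/logs/site/2015*' USING PigStorage('\t')
           AS (ts:chararray, type:chararray, keyword:chararray);
searches = FILTER logs BY type == 'search';
clicks   = FILTER logs BY type == 'click';
s_cnt    = FOREACH (GROUP searches BY keyword)
           GENERATE group AS kw, COUNT(searches) AS searches;
c_cnt    = FOREACH (GROUP clicks BY keyword)
           GENERATE group AS kw, COUNT(clicks) AS clicks;
-- left outer join so keywords with zero clicks survive
joined   = FOREACH (JOIN s_cnt BY kw LEFT OUTER, c_cnt BY kw)
           GENERATE s_cnt::kw AS kw, s_cnt::searches AS searches,
                    (c_cnt::clicks IS NULL ? 0L : c_cnt::clicks) AS clicks;
no_click = FILTER joined BY clicks == 0;  -- the boundary keywords of step (4)
```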

The specific data-analysis process using Pig is not detailed here; if you are interested, you can leave a message on the public account. The question for this article is: how do we store the results analyzed by Pig in a Lucene index? And why choose Lucene index storage instead of a database or plain files on HDFS? The biggest reason is speed. The data analyzed earlier was stored directly on HDFS, which is good for storage, backup, and disaster tolerance, but querying, reading, filtering, and transforming it is very troublesome and slow: every time you need the data, you have to read it from HDFS, compute the final result in a Java program, and render it to the front end. Even though the data volume is not that large, reading, transferring, and analyzing it takes a lot of time. So this time, while analyzing keyword conversion rates, I took the opportunity to study how to integrate Pig with Lucene, Solr, or ElasticSearch.

Pig and Hive can store files directly on HDFS or HBase in binary, json, avro, bzip, gzip, lzo, snappy, orc, and other compressed formats, but a Lucene index cannot be written directly to HDFS. Why not is related to the file structure of inverted indexes; if you are interested, you can leave a message on the public account. Although you cannot write the index directly to HDFS, you can extend Pig or Hive through their UDF mechanisms so that they support index storage. Note that even though the index ends up on HDFS, under the hood it is first written to a local temporary directory and then copied to HDFS, which is a compromise. In Pig, we need to define two UDFs to complete index storage: one mainly creates the index, and the other handles index output. An expert has already implemented this on GitHub. What we need to do:

(1) Download the compressed package from that address.
(2) Extract the parts you need and, in an Eclipse project, modify the code to suit your environment (is your Lucene version compatible? Your Hadoop version? Your Pig version?).
(3) Use ant to package it into a jar.
(4) Register the relevant jar packages in Pig and use the index storage.

The following is sanxian's test script:

--register dependent packages
REGISTER /home/search/nsconvent/spig/20150112/lucenepig/pigudf.jar;
REGISTER /home/search/nsconvent/spig/20150112/paiepig/lucene-analyzers-common-4.10.2.jar;
REGISTER /home/search/nsconvent/spig/20150112/paiepig/lucene-core-4.10.2.jar;
REGISTER /home/search/nsconvent/spig/20150112/paiepig/lucene-queryparser-4.10.2.jar;

--declare the UDF alias (note: only a no-argument constructor can be referenced here; including constructor parameters in the reflection causes an exception)
DEFINE LuceneStore com.pig.support.lucene.LuceneStore;

--load data
a = load '/tmp/data/20150303/tt.txt' using PigStorage(',') as (lbl:chararray, desc:chararray, score:int);

--generate the index and store it on HDFS; be sure to configure how each field is handled (stored? tokenized?)
store a into '/tmp/data/20150303/luceneindex' using LuceneStore('store[true]:tokenize[true]');

So far, we have successfully stored the index on HDFS. But don't celebrate yet; this is just the beginning. You may wonder: can an index stored on HDFS be queried or accessed directly? The answer is yes, but it is not recommended to read the index directly from HDFS. Even with Hadoop's block cache enabled to speed things up, performance is still relatively low, unless your cluster machines have memory to spare. We recommend copying the index to a local disk for retrieval instead. In subsequent articles, sanxian will introduce how to send Pig-generated result sets to Solr or ElasticSearch over HTTP, which is much more convenient.
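Copying the index from HDFS to a local disk, as recommended above, is a single HDFS shell command (the source path comes from the script above; the local target directory is an assumption):

```shell
# copy the Lucene index generated above from HDFS to local disk
# (the local path /data/local/luceneindex is an assumption)
hadoop fs -get /tmp/data/20150303/luceneindex /data/local/luceneindex
```

A local IndexReader can then open that directory directly for fast retrieval.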
