On-site full-text retrieval solution based on Lucene/XML

Document directory
  • Let the database handle exact matching and use an independent system for fuzzy matching
  • Data Synchronization Policy
  • Result sorting Policy
  • Keyword highlighting in search results

Author: chedong  Email: chedong (at) bigfoot.com / chedong (at) chedong.com

Written: 2003/05
Last updated: 03/16/2005 16:30:45
Feedback (please read this before asking a question)

Copyright notice: You may reprint this document freely. When reprinting, please keep the original source and author information as hyperlinks, together with this statement.
http://www.chedong.com/tech/weblucene.html

Keywords: Lucene xml xslt Web Site Search Engine

Summary:
Building a general XML interface for Lucene has long been one of my goals: to make it easier to embed full-text retrieval in web applications.

  • An XML data input interface: raw data exported from any kind of database can be imported into the full-text index, keeping the data source platform-independent;
  • XML-based search result output: front-end results are rendered through XSLT;

MySQL   \                                      / JSP
Oracle - DB ==> XML ==> (Lucene index) ==> XML - ASP
MSSQL   /                                  |   \ PHP
MS Word |                                  |            / XHTML
PDF     /                                  = XSLT =>   -- TEXT
                                                        \ XML
          \ ______________ WebLucene ______________ /
The procedure is as follows:
  1. Export data in XML format using scripts;
  2. Import the XML data source to the Lucene index;
  3. Get the XML result output from the web interface and generate HTML pages through XSLT.
Necessity of intra-site full-text retrieval

Large search engines have become more and more powerful, and many sites now use Google's site search (site:domain.com) instead of building full-text retrieval over their own database. However, relying on a large search engine such as Google for intra-site search has the following drawbacks:

  • Limited coverage: a search engine does not index all of a site's content through deep traversal. For example, Google prefers static pages that have been updated recently over dynamic pages, and it may even periodically drop content from a site that lacks inbound links;
  • Slow updates: a search engine revisits a given site only at intervals, so much content takes time before it can be indexed by Google. At the time of writing, a Google dance cycle was roughly 21 days;
  • Inaccurate content: a search engine has to rely on page-content extraction to filter out navigation bars, headers, and footers, which is less precise than extracting data directly from the back-end database; such summarization and de-duplication mechanisms are very difficult to implement well;
  • Limited control over output: a site often needs richer output, such as sorting by time, price, click count, or category.
System Construction

Download:
http://sourceforge.net/projects/weblucene/

Import an XML Data source:

As long as the data source can be exported into a three-level XML structure (table => record => field), the IndexRunner command-line tool can import it:

For example, export from a database: news_dump.xml
<?xml version="1.0" encoding="GB2312"?>
<Table>
<Record>
<Title>title</Title>
<Author>author</Author>
<Content>content</Content>
<Pubtime>2003-06-29</Pubtime>
</Record>
<Record>
<Title>my title</Title>
<Author>chedong</Author>
<Content>ABC</Content>
<Pubtime>2003-06-30</Pubtime>
</Record>
...
</Table>
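
Such a dump can be produced by any script or language. Purely as an illustration, here is a minimal JDBC sketch of the export step; the "news" table, its columns, the driver class, and the connection URL are all hypothetical placeholders, not part of WebLucene itself.

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class NewsDump {
    public static void main(String[] args) throws Exception {
        // Placeholder driver and URL: substitute whatever database you export from.
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/news", "user", "password");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
                "SELECT title, author, content, pubtime FROM news");

        // Write the three-level Table/Record/Field structure shown above.
        Writer out = new OutputStreamWriter(
                new FileOutputStream("news_dump.xml"), "GB2312");
        out.write("<?xml version=\"1.0\" encoding=\"GB2312\"?>\n<Table>\n");
        while (rs.next()) {
            out.write("<Record>\n");
            out.write("  <Title>" + escape(rs.getString(1)) + "</Title>\n");
            out.write("  <Author>" + escape(rs.getString(2)) + "</Author>\n");
            out.write("  <Content>" + escape(rs.getString(3)) + "</Content>\n");
            out.write("  <Pubtime>" + rs.getString(4) + "</Pubtime>\n");
            out.write("</Record>\n");
        }
        out.write("</Table>\n");
        out.close();
        rs.close();
        stmt.close();
        conn.close();
    }

    // Escape the characters that are special in XML text content.
    static String escape(String s) {
        return s.replaceAll("&", "&amp;").replaceAll("<", "&lt;")
                .replaceAll(">", "&gt;").replaceAll("\"", "&quot;");
    }
}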

IndexRunner -i news_dump.xml -o c:\index -t Title,Content -n Author
-i news_dump.xml: use news_dump.xml as the data source
-o c:\index: create the index library under the c:\index directory
Besides storing the Title, Author, Content, and Pubtime fields, the index is built according to the following rules:
-t Title,Content: build a tokenized (word-segmented) full-text index from the Title and Content fields
-n Author: build an untokenized index from the Author field
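
In Lucene terms, the -t and -n options correspond roughly to tokenized and untokenized fields. A minimal sketch, assuming the Lucene 1.x API current when this article was written, of how one <Record> from news_dump.xml might become a Lucene Document; this is an illustration of the concept, not WebLucene's actual code:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class AddRecord {
    public static void main(String[] args) throws Exception {
        // true = create a new index under c:\index (false would extend an existing one)
        IndexWriter writer =
                new IndexWriter("c:\\index", new StandardAnalyzer(), true);

        Document doc = new Document();
        // -t fields: tokenized, full-text searchable
        doc.add(Field.Text("Title", "my title"));
        doc.add(Field.Text("Content", "ABC"));
        // -n field: indexed as a single untokenized keyword
        doc.add(Field.Keyword("Author", "chedong"));
        doc.add(Field.Keyword("Pubtime", "2003-06-30"));

        writer.addDocument(doc);
        writer.optimize();   // merge segments for faster searching
        writer.close();
    }
}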

For RSS data sources:
<?xml version="1.0"?>
<rss version="0.92">
<channel>
<title>Amazon: Books Arts &amp; Photography</title>
<link>http://www.lockergnome.com/</link>
<description>Amazon RSS Feed</description>
<lastBuildDate>Sun, 29 Jun 2003 01:05:01 GMT</lastBuildDate>
<docs>http://www.lockergnome.com/</docs>
<webMaster>amazonfeed@lockergnome.com (Lockergnome RSS Generator)</webMaster>
<item>
<title>The Artist's Way: A Spiritual Path To Higher Creati -- $11.17</title>
<link>http://www.amazon.com/exec/obidos/ASIN/1585421464/lockergnomedigit?ref=nosim&amp;dev-it=d34huvgkb34yfx</link>
<description>http://www.lockergnome.com/</description>
</item>
...
</channel>

IndexRunner -i http://www.example.com/rss.xml -o c:\index -t title,description -n link -l 4
-l 4 indicates that the 4th-level nodes (the children of each item) are used as the field mapping.

IndexRunner also provides the -a and -m options, for incremental indexing and batch index optimization respectively.
-a: incremental indexing, meaning the index is extended from the existing one instead of being rebuilt.
-m [mergeFactor]: in Lucene, mergeFactor is an optimization parameter for batch indexing. It controls how many records (documents) are processed before the index is written to disk. Writing more frequently uses less memory, but slows indexing down; when importing large amounts of data you should therefore lengthen the write interval and let more of the indexing happen in memory.
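
In terms of the Lucene 1.x API, the two options map roughly to the following sketch; the assumption that WebLucene simply forwards them to IndexWriter is mine, and the value 50 is only an example.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class AppendTuned {
    public static void main(String[] args) throws Exception {
        // create=false extends the existing index instead of rebuilding it (-a)
        IndexWriter writer =
                new IndexWriter("c:\\index", new StandardAnalyzer(), false);
        // mergeFactor (-m): buffer roughly this many documents per disk write;
        // a larger value keeps more work in memory and speeds up bulk imports
        writer.mergeFactor = 50;
        // ... addDocument() calls for the new records go here ...
        writer.optimize();
        writer.close();
    }
}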

Search result output:

What follows are some of the ideas behind the system design:
XML as an industrial standard

I remember a news report a while ago about a fried-chicken chain running short of french fries. The incident reveals a highly efficient production system: for a global enterprise such as a fast-food chain, guaranteeing consistent french fry quality in every region at the lowest cost has to rely on machines rather than chefs. If the fry-cutting machine had to handle potatoes of arbitrary shapes, its complexity and maintenance cost would be very high, so the potatoes themselves must strictly follow an industrial standard; only then can a machine with a simple design produce standard-compliant fries. The fry-processing machinery is therefore designed strictly around the potato association's industrial standard for potatoes. High-quality, standardized raw material greatly reduces the cost of downstream processing equipment, so the overall cost is still well worth it.

For software developers, the data exchanged among applications and between enterprises is like those potatoes and cabbages. Interfaces designed around the strict XML standard act as the industrial standard for back-end data exchange between enterprises: although not as compact as a simple CSV format, they can greatly reduce the post-processing cost of downstream steps.

It is not hard to see why HTML browsers such as IE and Mozilla are more than 10 MB in size, while an XML parser is usually only a few hundred KB. Apart from providing a user interface, an HTML browser must include a huge amount of error-tolerant handling for all the nonstandard HTML out there, whereas an XML processor, with its strict and simple syntax rules, can be very small and efficient. And the smaller the footprint, the wider the applicability: this matters especially for devices with limited hardware, such as mobile phones.

Although XML has great potential for back-end data exchange, it will not immediately replace HTML on the front end: much of the HTML produced through XSLT still needs CSS for presentation, i.e. XML =XSLT=> HTML + CSS. But because so many existing pages are plain HTML, XML does not need to replace these mechanisms right away.

In addition, XML and Java are practically ready out of the box for application internationalization: once an XML data source is parsed in Java it is Unicode internally, so Japanese, Traditional Chinese, and German content can all be searched in the same index library. Support for additional languages then becomes merely a matter of designing the interface for each language.

GBK        \                              / BIG5
BIG5      -- UNICODE ====> UNICODE      -- GB2312
SJIS      -- (XML)          (XML)       -- SJIS
ISO-8859-1 /                              \ ISO-8859-1

Another side benefit of using XML is that developers generally do not have a thorough understanding of Java character set handling (in fact, the JVM's default file.encoding property is affected by the system's locale settings). XML-based input makes the character decoding process transparent: you do not have to tell the application how to decode or encode the data source, because the XML declaration carries that information. The learning cost of XML is still relatively high, though; if the learning cost of HTML is 1, XML is perhaps 10, and XSLT may well be 100.
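
To make that transparency concrete, here is a small sketch using the standard JAXP SAX API: the parser is given a raw byte stream, reads the encoding from the XML declaration itself (GB2312 in the news_dump.xml example above), and always hands the application Unicode characters, independent of file.encoding. The file name is taken from the earlier example.

import java.io.FileInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EncodingDemo {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Note: an InputStream, not a Reader -- the parser does the decoding.
        parser.parse(new InputSource(new FileInputStream("news_dump.xml")),
                new DefaultHandler() {
                    public void characters(char[] ch, int start, int length) {
                        // ch[] is already Unicode here, whatever the source encoding was
                        System.out.println(new String(ch, start, length));
                    }
                });
    }
}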
Full-text retrieval acceleration for traditional database applications: let the database handle exact matching and use an independent system for fuzzy matching

Once a site's content has accumulated to the order of tens of thousands of records, intra-site full-text search becomes the primary way for users to locate information, and keyword search is the access method users are most familiar with. Traditional database-driven web applications therefore have a strong need for full-text retrieval.

However, the dreadful LIKE '%keyword%' query can consume more than 90% of a database server's CPU. The full-text search built into database servers such as Oracle and MSSQL is largely unsuited to web applications. Another drawback of the database is that the result set returned even for a simple query can be very large: the database does not know how to optimize for the first 100 results, which are what users care about most. Earlier statistics suggest that the first 100 results satisfy the needs of over 95% of users.

Cache design: in our experience there is no need to build a result cache into the application itself; enabling the built-in caching of the front-end application server, or of a reverse-proxy cache server, is sufficient.
Data Synchronization Policy

In general, full-text retrieval and databases are two fundamentally different application modes, and a full-text retrieval system usually does not need the same real-time synchronization as a database; it is designed instead as a heavily cached system. Synchronizing database data into the full-text index is generally done by periodically exporting the database to XML with a script and then feeding it into the Lucene full-text index; updates and deletions of the original records can be handled by periodic re-indexing. In WebLucene, indexing is performed by the IndexRunner command-line program.
Result sorting Policy

Another important requirement for intra-site full-text indexing is customizable sorting: by time, by price, by click count, and so on. By default a Lucene full-text index supports only ranking by how well the keywords match the original text, and any sort based on a field value requires traversing the data again, which degrades performance by an order of magnitude (making it comparable to a LIKE '%...%' search). Inside the index, apart from the match score, only the internal record ID can be sorted cheaply. A more efficient way to achieve custom sorting is therefore: at index time, make the document order in Lucene follow the desired rule (for example, time), and then at search time sort the search results by the index record ID (or its reverse).
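
A minimal sketch of that idea against the Lucene 1.x API of the era: assuming documents were added in chronological order, collecting the internal document IDs of the matches and walking them in reverse yields a newest-first listing without any field-based sort. The query string and field names are placeholders, not WebLucene's own code.

import java.util.BitSet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class NewestFirst {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("c:\\index");
        Query query = QueryParser.parse("lucene", "Content", new StandardAnalyzer());

        final BitSet matches = new BitSet();
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                matches.set(doc);          // remember every matching doc ID
            }
        });

        // Walk the IDs from high to low: later IDs were indexed later,
        // so this is effectively "sort by time, descending".
        int shown = 0;
        for (int id = matches.length() - 1; id >= 0 && shown < 10; id--) {
            if (matches.get(id)) {
                System.out.println(searcher.doc(id).get("Title"));
                shown++;
            }
        }
        searcher.close();
    }
}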
Keyword highlighting in search results

Keywords in the search results are highlighted in red or bold. To show the relevant context more appropriately, the highlighting step limits the range of text that is scanned, then reads the excerpt as a token stream through an analyzer and marks up the matching terms.
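
The actual WebLucene highlighting code is not shown here; the following is only a sketch of the general approach, assuming the Lucene 1.x analyzer API: run the analyzer over a limited excerpt of the hit and use each token's offsets to wrap matching terms in a highlight tag. The analyzer choice and the <font> markup are illustrative assumptions.

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class Highlighter {
    public static String highlight(String excerpt, String keyword,
                                   Analyzer analyzer) throws Exception {
        TokenStream stream = analyzer.tokenStream("Content",
                new StringReader(excerpt));
        StringBuffer out = new StringBuffer();
        int last = 0;
        for (Token t = stream.next(); t != null; t = stream.next()) {
            // copy the untokenized text between the previous token and this one
            out.append(excerpt.substring(last, t.startOffset()));
            String term = excerpt.substring(t.startOffset(), t.endOffset());
            if (t.termText().equalsIgnoreCase(keyword)) {
                out.append("<font color=\"red\">").append(term).append("</font>");
            } else {
                out.append(term);
            }
            last = t.endOffset();
        }
        out.append(excerpt.substring(last));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(highlight("WebLucene is a Lucene XML wrapper",
                "lucene", new StandardAnalyzer()));
    }
}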
Full-text retrieval and integration with other applications

In fact, the core is a Lucene XML interface: data import in SAX mode and result output in DOM mode.

XML data source definition:
Any data source that can be mapped onto the table => record => field hierarchy can be indexed, so the WebLucene index design is quite flexible and can be used directly to index RSS, for example.

XML result definition: modeled on the design of Google's XML interface

Even without the servlet interface, the DOMSearcher, which provides XML output, can easily be integrated into various application systems.
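
DOMSearcher's exact API is not shown in this article; as an illustration only, the sketch below shows how a DOM result document, however obtained, can be rendered to HTML with a stylesheet through the standard JAXP transform API. The "result.xsl" stylesheet and output file name are placeholders.

import java.io.FileWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

public class RenderResult {
    // Transform an XML search result (a DOM Document) into an HTML page.
    public static void render(Document searchResult) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("result.xsl"));
        transformer.transform(new DOMSource(searchResult),
                new StreamResult(new FileWriter("result.html")));
    }
}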

References:

Some modules used in system design:
Jakarta Lucene:
http://jakarta.apache.org/lucene/

Xerces / Xalan
http://xml.apache.org/

Log4j
http://jakarta.apache.org/log4j/

Google's XML interface definition:
http://www.google.com/google.dtd

http://www.chedong.com/tech/weblucene.html
