Copyright notice: You may reprint this article freely, provided that you include a hyperlink to the original source, the author information, and this notice.
http://www.chedong.com/tech/weblucene.html
Content summary:
Providing a generic XML interface for Lucene has long been my biggest wish, because it makes embedding full-text search in Web applications much more convenient:
An XML data input interface: suitable for importing data from all kinds of database-backed sources into the full-text index, which keeps the index independent of the data source's platform;
XML-based search result output: the front end can render the results conveniently through XSLT.
MySQL   \                                           / JSP
Oracle  - DB - ==> XML ==> (Lucene Index) ==> XML  - ASP
MSSQL                                               - PHP
MS Word /                                 \         / XHTML
PDF     /                                  =XSLT=>  - TEXT
                                                    \ XML
              \_________WebLucene__________/
The process of use is as follows:
Export the data to XML format with a script;
Import the XML data source into the Lucene index;
Fetch the XML results from the Web interface and generate HTML pages via XSLT.
Why in-site full-text search is necessary
Large search engines keep getting more powerful, and many sites have replaced their own database "full-text" search with Google's site search (site:domain.com). But relying on a large search engine such as Google for in-site search has the following drawbacks:
Limited coverage: a search engine does not traverse a site deeply and index all of its content. Google, for example, prefers static pages that were updated recently and is less fond of dynamic pages; it also gradually discards site content it lacks regular access to;
Slow updates: a search engine revisits a site only at certain intervals, so much content takes a while to enter Google's index. The current Google Dance cycle is about 21 days;
Inaccurate content: a search engine has to use page content extraction to filter out navigation bars, page footers, and so on, rather than reading the data directly from the back-end database, and this kind of summarization mechanism is hard to get right;
No control over output: there may be further output requirements, such as sorting by time, by price, or by click count, or filtering by category.
System Setup
Download:
http://sourceforge.net/projects/weblucene/
Importing XML data sources:
As long as the data source can be exported as a 3-level XML structure, it can be imported with the IndexRunner command-line tool:
Export from the database: news_dump.xml
<?xml version="1.0" encoding="GB2312"?>
<Table>
  <Record>
    <Title>title</Title>
    <Author>author</Author>
    <Content>content</Content>
    <PubTime>2003-06-29</PubTime>
  </Record>
  <Record>
    <Title>my title</Title>
    <Author>chedong</Author>
    <Content>abc</Content>
    <PubTime>2003-06-30</PubTime>
  </Record>
  ...
</Table>
IndexRunner -i news_dump.xml -o c:\index -t Title,Content -n Author
-i news_dump.xml: use news_dump.xml as the data source
-o c:\index: build the index in the c:\index directory
The index is built from the Title, Author, Content, and PubTime fields, according to the following rules:
-t Title,Content: a tokenized full-text index (TokenIndex) built from the two fields Title and Content
-n Author: an untokenized index (NoTokenIndex) built from the Author field
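As an illustration of the export step, the following Python sketch dumps database rows into the three-level table/record/field layout shown above. The in-memory SQLite table `news` and the helper name `rows_to_xml` are assumptions made for the example; they are not part of WebLucene.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(cursor, root_tag="Table", record_tag="Record"):
    """Build a Table/Record/field tree from the rows of an executed cursor."""
    # Column names come from the cursor metadata and become field elements.
    columns = [col[0] for col in cursor.description]
    root = ET.Element(root_tag)
    for row in cursor:
        record = ET.SubElement(root, record_tag)
        for name, value in zip(columns, row):
            field = ET.SubElement(record, name)
            field.text = "" if value is None else str(value)
    return root


# Hypothetical source table standing in for a real news database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (Title TEXT, Author TEXT, Content TEXT, PubTime TEXT)")
conn.executemany(
    "INSERT INTO news VALUES (?, ?, ?, ?)",
    [
        ("title", "author", "content", "2003-06-29"),
        ("my title", "chedong", "abc", "2003-06-30"),
    ],
)

cursor = conn.execute("SELECT Title, Author, Content, PubTime FROM news ORDER BY PubTime")
table = rows_to_xml(cursor)

# ElementTree escapes &, < and > in text, so the dump stays well-formed.
xml_bytes = ET.tostring(table, encoding="utf-8", xml_declaration=True)
print(xml_bytes.decode("utf-8"))
```

Generating the file through an XML library rather than string concatenation avoids the most common export bug, unescaped markup characters in the data.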
For RSS data sources:
<?xml version="1.0"?>
<rss version="0.92">
<channel>
  <title>Amazon: Books Arts &amp; Photography</title>
  <link>http://www.lockergnome.com/</link>
  <description>Amazon RSS Feed</description>
  <lastBuildDate>Sun, June 2003 01:05:01 GMT</lastBuildDate>
  <docs>http://www.lockergnome.com/</docs>
  <webMaster>amazonfeed@lockergnome.com (Lockergnome RSS Generator)</webMaster>
  <item>
    <title>The Artist's Way: A Spiritual Path to Higher Creativity - $11.17</title>
    <link>http://www.amazon.com/exec/obidos/ASIN/1585421464/lockergnomedigit/?ref=nosim&amp;dev-it=d34huvgkb34yfx</link>
    <description>http://www.lockergnome.com/</description>
  </item>
  ...
</channel>
</rss>
IndexRunner -i http://www.example.com/rss.xml -o c:\index -t title,description -n link -l 4
-l 4: take the nodes at the 4th level as the fields to map.
IndexRunner also offers the -a and -m options, for incremental indexing and batch index optimization:
-a: incremental indexing, which extends an existing index
-m mergeFactor: in Lucene, mergeFactor is an optimization parameter for batch indexing that controls how many records (documents) are processed before the index is written out once. Writing more frequently uses less memory but slows indexing down, so for bulk data imports you should lengthen the interval between writes and let more of the indexing happen in memory.
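The trade-off behind -m can be illustrated with a toy buffer. This is not Lucene code: documents simply accumulate in memory and are "written" once every `merge_factor` documents. The class name `BufferedIndexWriter` and its counters are hypothetical, and real Lucene merge behaviour is more involved.

```python
class BufferedIndexWriter:
    """Toy model: buffer documents in memory, flush every merge_factor docs."""

    def __init__(self, merge_factor):
        if merge_factor < 1:
            raise ValueError("merge_factor must be at least 1")
        self.merge_factor = merge_factor
        self.buffer = []          # documents currently held in memory
        self.flush_count = 0      # how many times the buffer was written out
        self.peak_buffered = 0    # largest number of docs held at once

    def add_document(self, doc):
        self.buffer.append(doc)
        self.peak_buffered = max(self.peak_buffered, len(self.buffer))
        if len(self.buffer) >= self.merge_factor:
            self.flush()

    def flush(self):
        # Only count a write when there is something to write.
        if self.buffer:
            self.flush_count += 1
            self.buffer.clear()

    def close(self):
        self.flush()


def simulate(total_docs, merge_factor):
    """Return (number of writes, peak buffered docs) for one import run."""
    writer = BufferedIndexWriter(merge_factor)
    for i in range(total_docs):
        writer.add_document({"id": i})
    writer.close()
    return writer.flush_count, writer.peak_buffered


for factor in (10, 100, 1000):
    flushes, peak = simulate(10_000, factor)
    print(f"merge_factor={factor}: {flushes} writes, peak buffer {peak} docs")
```

A larger factor trades memory (the peak buffer) for fewer writes, which is the direction recommended above for bulk imports.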
Search Results Output:
Below are some of the ideas behind the system's design:
XML as an industry standard
I remember a report about KFC running short of French fries. The incident illustrates an efficient way to manage a system: for a global enterprise such as a fast-food chain, the cheapest way to guarantee the quality of its fries is to rely on machines rather than chefs. If the fry-making machine had to handle potatoes of many different shapes, it would be complex and expensive to maintain. So the potatoes must strictly meet an industry standard, which lets a machine of simpler construction turn out standard fries, and the machines in turn are designed strictly around that potato standard. High-quality raw material greatly reduces the cost of downstream processing equipment, so the overall cost still works out favorably.
For software developers, the data exchanged between applications and between enterprises is like those potatoes and cabbages. An interface designed to the strict XML standard serves as the industry standard for back-end data exchange between enterprises: although it is less efficient than a simple format such as CSV, it can greatly reduce the cost of downstream processing.
It is not hard to see why HTML browsers such as IE and Mozilla are more than 10 MB in size, while a parser for XML is generally only a few hundred KB. Apart from the parser having no user interface, an important reason is that HTML browsers must provide a great deal of fault-tolerant handling for nonstandard HTML code. Because XML syntax is strict and simple, an XML processor can be made very short and efficient. Smaller size means wider adaptability, which matters especially on devices with low-spec hardware, such as mobile phones.
XML has great potential for back-end data exchange, but on the presentation side it will not replace HTML any time soon: much of the HTML generated via XSLT still has to be styled with CSS (XML ==XSLT==> HTML + CSS). And because so many pages are already built in HTML, there is no need to replace these existing mechanisms with XML.
In addition, XML and Java are an excellent match for application internationalization: XML data sources are Unicode once they are parsed in Java, so a single index can be searched in Japanese, traditional Chinese, and German alike. Supporting other languages is then just a matter of designing the user interface for each language.
GBK        \                             / BIG5
BIG5       - Unicode  ====>  Unicode    - GB2312
SJIS       -  (XML)           (XML)     - SJIS
ISO-8859-1 /                             \ ISO-8859-1
Another benefit of using XML: developers often do not realize that Java's default character set (in fact the JVM's default file.encoding property) is affected by the system locale settings. XML-based input makes the decoding of the data transparent: there is no need to explain to users how to decode and encode the data source. However, the learning cost of XML is relatively high: if learning HTML costs 1, XML may cost 10, and XSLT may cost as much as 100.
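The decoding point can be seen in a small Python sketch (Python rather than Java, for brevity): the same record is serialized in two encodings, and the parser uses each document's own XML declaration to hand back identical Unicode strings. The sample text and the helper names are assumptions for illustration; the example sticks to ISO-8859-1 and UTF-8, which the standard-library parser accepts.

```python
import xml.etree.ElementTree as ET

TITLE = "Größe und Maße"  # sample text with non-ASCII characters


def make_xml(encoding):
    """Serialize one record as bytes in the given encoding, with a matching declaration."""
    doc = (
        f'<?xml version="1.0" encoding="{encoding}"?>'
        f"<Record><Title>{TITLE}</Title></Record>"
    )
    return doc.encode(encoding)


def parse_title(xml_bytes):
    """Let the parser choose the decoder from the XML declaration."""
    return ET.fromstring(xml_bytes).findtext("Title")


latin1_bytes = make_xml("ISO-8859-1")
utf8_bytes = make_xml("UTF-8")

# The byte streams differ, but the parsed strings are the same Unicode text.
print(latin1_bytes != utf8_bytes)
print(parse_title(latin1_bytes) == parse_title(utf8_bytes) == TITLE)
```

The caller never names an encoding when parsing; that information travels with the data, which is what makes the process transparent to the user.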
Accelerating full-text retrieval in traditional database applications
The database handles exact matching, while fuzzy matching is implemented by a separate system.
Once a site's content grows to the million-record level, in-site full-text search becomes the most important way for users to locate content, and keyword search is the approach users know best. Traditional database-backed Web applications therefore have a strong need for full-text search.
But the dreaded %like% database query (SQL LIKE '%keyword%') may consume more than 90% of the CPU on the database server, and the full-text search built into database servers such as Oracle and MSSQL is generally not well suited to Web applications. Another disadvantage of the database is that the result set of a simple query can be very large: the database does not know how to optimize for the first 100 results, which are the ones the user cares about most. According to earlier statistics, the first 100 results are usually enough to satisfy more than 95% of user needs.
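The "first 100 results" observation suggests keeping only the best k hits instead of materializing and sorting the whole result set. A minimal Python sketch, assuming hits arrive as (score, doc_id) pairs from some hypothetical matcher:

```python
import heapq
import random


def top_k(hits, k=100):
    """Return the k highest-scoring (score, doc_id) pairs, best first.

    heapq.nlargest keeps at most k items in memory while scanning,
    so the full result set never has to be sorted.
    """
    return heapq.nlargest(k, hits)


def fake_hits(n, seed=0):
    """Hypothetical hit stream: n documents with pseudo-random scores."""
    rng = random.Random(seed)
    for doc_id in range(n):
        yield (rng.random(), doc_id)


best = top_k(fake_hits(200_000), k=100)
print(len(best), best[0][0] >= best[-1][0])
```

Because the hit stream is a generator, memory use stays proportional to k rather than to the number of matching documents.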
Cache design: in our experience, there is no need to build result caching into the application itself. Relying on the built-in caching mechanism of the front-end application server, or on a reverse-proxy cache server, is enough.
Data synchronization strategy
Generally speaking, full-text search and databases are two different application patterns, and a full-text retrieval system usually does not need to be synchronized with the database in real time. If the system is designed for a low-update, heavily cached pattern, database data can typically be synchronized into the full-text index by a script that periodically exports the data to XML and feeds it into Lucene's full-text index. Updates and deletions of the original records can generally be handled by rebuilding the index periodically. In WebLucene, the indexing part is implemented by the IndexRunner command-line program.
Result sorting strategy
Another important requirement for in-site full-text search is customizable sorting: by time, by price, by clicks, and so on. By default, Lucene's full-text index sorts only by how well the keywords match the original text. Any sort based on the value of a field cannot avoid traversing the data again, which degrades performance by an order of magnitude (equivalent to doing a %like% query). Besides the match score, the only other thing in the index that results can be sorted by is the ID of the index record. A more efficient way to implement custom sorting is therefore to insert records into the Lucene index in an order that follows some rule, such as time, and then at search time sort the results by index record ID, ascending or descending.
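A toy Python model of this trick (not Lucene code): records are added in publication-time order so that the internal ID grows with time, and a search then orders hits by ID without re-reading any field. The class `TinyIndex` and its whitespace tokenizer are hypothetical.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index whose document IDs follow insertion order."""

    def __init__(self):
        self.docs = []                      # doc_id -> stored record
        self.postings = defaultdict(list)   # term -> doc_ids, ascending

    def add(self, record):
        doc_id = len(self.docs)
        self.docs.append(record)
        for term in set(record["Title"].lower().split()):
            self.postings[term].append(doc_id)
        return doc_id

    def search(self, term, newest_first=True):
        # Posting lists are already sorted by doc_id, so ordering by ID
        # only needs a reversal; no stored field is read for sorting.
        ids = self.postings.get(term.lower(), [])
        ordered = reversed(ids) if newest_first else ids
        return [self.docs[i] for i in ordered]


records = [
    {"Title": "Lucene tips", "PubTime": "2003-06-30"},
    {"Title": "XML and Lucene", "PubTime": "2003-06-28"},
    {"Title": "Lucene sorting", "PubTime": "2003-06-29"},
]

index = TinyIndex()
# Insert in PubTime order so that a larger doc_id means a newer record.
for rec in sorted(records, key=lambda r: r["PubTime"]):
    index.add(rec)

for hit in index.search("lucene"):
    print(hit["PubTime"], hit["Title"])
```

The cost of the sort is paid once, at indexing time; the limitation is that only one such ordering can be baked into a given index.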
Implementing keyword highlighting in search results
In search results, the keywords are marked in red or bold. To show the relevant context more appropriately, highlighting restricts the scan to a limited range, then uses an analyzer to read the specified words out as a stream, and then
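One way to sketch keyword highlighting in Python, under the assumption that restricting the scanning range means only the first N characters of a field are examined. The function name `highlight`, the 200-character default, and the <b> markup are illustrative choices, not WebLucene's actual implementation.

```python
import html
import re


def highlight(text, keywords, scan_limit=200, tag="b"):
    """Wrap keyword matches in <tag> within the first scan_limit characters."""
    window = text[:scan_limit]
    terms = [re.escape(k) for k in keywords if k]
    if not terms:
        return html.escape(window)
    pattern = re.compile("|".join(terms), re.IGNORECASE)

    parts = []
    last = 0
    for match in pattern.finditer(window):
        # Escape the plain text between matches, then wrap the match itself.
        parts.append(html.escape(window[last:match.start()]))
        parts.append(f"<{tag}>{html.escape(match.group(0))}</{tag}>")
        last = match.end()
    parts.append(html.escape(window[last:]))
    return "".join(parts)


snippet = highlight("Lucene makes full-text search easy & fast", ["lucene", "search"])
print(snippet)
```

Limiting the window bounds the work per hit regardless of document length, and escaping the surrounding text keeps the snippet safe to embed in an HTML page.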
Integrating full-text search with other applications
At its core, the system is an XML interface for Lucene: SAX-style data import and DOM-style result output.
XML data source definition:
Anything that can be mapped to a Table => Record => Field hierarchy will do. The WebLucene index design is therefore quite flexible and can even be used to index RSS directly.
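The mapping can be sketched with Python's standard SAX parser: elements at a chosen depth are treated as fields and their parent elements as records, which mirrors the level option described earlier for RSS. The handler name `LevelMapper` and the convention that depth counts from 1 at the root are assumptions, not WebLucene's code.

```python
import xml.sax


class LevelMapper(xml.sax.ContentHandler):
    """Collect elements at field_level as fields of the enclosing record."""

    def __init__(self, field_level=3):
        super().__init__()
        self.field_level = field_level
        self.depth = 0
        self.records = []
        self._record = None   # dict for the record being read, if any
        self._text = None     # text chunks of the field being read, if any

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == self.field_level - 1:
            self._record = {}      # a record starts one level above its fields
        elif self.depth == self.field_level and self._record is not None:
            self._text = []        # start buffering this field's text

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if self.depth == self.field_level and self._text is not None:
            self._record[name] = "".join(self._text).strip()
            self._text = None
        elif self.depth == self.field_level - 1 and self._record is not None:
            if self._record:       # skip record-level elements without fields
                self.records.append(self._record)
            self._record = None
        self.depth -= 1


def parse_records(xml_text, field_level=3):
    handler = LevelMapper(field_level)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.records


TABLE = "<Table><Record><Title>my title</Title><Author>chedong</Author></Record></Table>"
RSS = (
    '<rss version="0.92"><channel><title>feed</title>'
    "<item><title>first</title><link>http://www.example.com/1</link></item>"
    "</channel></rss>"
)

print(parse_records(TABLE, field_level=3))
print(parse_records(RSS, field_level=4))
```

Because the handler keys only on depth, the same code reads the database dump at level 3 and the RSS feed at level 4 without knowing any element names in advance.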
XML result definition: modeled on the design of Google's XML interface.
Even without the servlet interface, the DOMSearcher that provides the XML output can easily be integrated into a variety of application systems.
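To make the idea of XML result output concrete, here is a minimal Python sketch that serializes a list of hits as an XML document. The element and attribute names (`SearchResults`, `Result`, `rank`, and so on) are invented for this example; they follow neither Google's DTD nor WebLucene's actual output format.

```python
import xml.etree.ElementTree as ET


def results_to_xml(query, hits, start=0):
    """Serialize search hits as an XML string (element names are made up)."""
    root = ET.Element("SearchResults", {"query": query, "total": str(len(hits))})
    for rank, hit in enumerate(hits, start=start + 1):
        result = ET.SubElement(root, "Result", {"rank": str(rank)})
        for field, value in hit.items():
            ET.SubElement(result, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


hits = [
    {"Title": "my title", "Author": "chedong", "PubTime": "2003-06-30"},
    {"Title": "R&D notes", "Author": "author", "PubTime": "2003-06-29"},
]
xml_out = results_to_xml("title", hits)
print(xml_out)
```

Any consumer that can parse XML, whether an XSLT stylesheet or another application, can then work from this output without linking against the search code.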
Resources:
Some of the modules used in the system design:
Jakarta Lucene:
http://jakarta.apache.org/lucene/
Xerces / Xalan:
http://xml.apache.org/
Log4j:
http://jakarta.apache.org/log4j/
Google's XML interface definition:
http://www.google.com/google.dtd