1. Main technologies used:
Lucene 2.3.1
IK_CAnalyzer 1.4 (Chinese word segmentation)
HtmlParser 1.6 (HTML/text parser). Drawback: it cannot skip the content of <!-- ... --> comments.
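One possible workaround for the comment limitation noted above is to strip HTML comments from the raw markup before it reaches the parser. The following is only a minimal sketch: HtmlCommentStripper and stripComments are hypothetical names, not part of HtmlParser, and a regular expression will not handle every malformed comment.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of HtmlParser): removes HTML comments
// from raw markup before the markup is handed to the parser.
final class HtmlCommentStripper {
    // (?s) lets '.' match line breaks, so multi-line comments are covered;
    // the reluctant '*?' stops at the first closing "-->".
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");

    static String stripComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello<!-- hidden\nnote --> world</p>";
        System.out.println(stripComments(html));
    }
}
```

The cleaned string can then be passed to the parser in place of the original page source.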
2. Implementation notes:
Each category is indexed incrementally once a day. The indexed fields are: type, URL, text content, title, author, and time.
3. Create a table on Oracle 10g:
-- Create table
create table IZ_SEARCH_ENGINE
(
  ID             NUMBER not null,
  INDEX_DIR      VARCHAR2(50),
  TYPE           VARCHAR2(500),                  -- type
  TYPE_DESC      VARCHAR2(50),                   -- description of the type
  TABLE_MAXVALUE VARCHAR2(50),                   -- maximum ID value of the table
  TABLE_SQLS     CLOB,                           -- SQL selecting the rows not yet indexed, e.g. select ... from XXX where id > #ID#; #ID# is taken from TABLE_MAXVALUE
  STATUS         VARCHAR2(20) default 'offline', -- currently unused
  TYPE_TRUETYPE  VARCHAR2(50)                    -- currently unused
)
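The ID placeholder in TABLE_SQLS has to be replaced with the stored TABLE_MAXVALUE before the statement is run. The sketch below shows one way this substitution might look. IncrementalSql and bind are hypothetical names, the table and column names in main are made up for illustration, and the code assumes TABLE_MAXVALUE always holds a numeric ID (parsing it as a number also keeps arbitrary text out of the SQL).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: fills the ID placeholder of a TABLE_SQLS template
// with the value stored in TABLE_MAXVALUE.
final class IncrementalSql {
    // Accepts the placeholder with or without spaces inside the '#' marks.
    private static final Pattern ID_PLACEHOLDER = Pattern.compile("#\\s*ID\\s*#");

    static String bind(String tableSqls, String tableMaxValue) {
        // Assumption: the stored maximum is a numeric ID. Parsing it rejects
        // anything else with a NumberFormatException.
        long lastId = Long.parseLong(tableMaxValue.trim());
        return ID_PLACEHOLDER.matcher(tableSqls)
                .replaceAll(Matcher.quoteReplacement(Long.toString(lastId)));
    }

    public static void main(String[] args) {
        // Hypothetical table and column names, for illustration only.
        String template = "select id, title from book where id > #ID#";
        System.out.println(bind(template, "1024"));
    }
}
```

After a successful indexing run, TABLE_MAXVALUE would be updated to the largest ID just indexed so that the next run only picks up newer rows.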
4. Key Java code for indexing:
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
// IK_CAnalyzer comes from the IKAnalyzer 1.4 jar and must be imported as well

String INDEX_DIR = "/home/xue24_index_book"; // index directory
// true = create a new index, overwriting any existing one; pass false to append to an existing index
IndexWriter writer = new IndexWriter(INDEX_DIR, new IK_CAnalyzer(), true); // open the index with the Chinese word-segmentation analyzer
Document doc = new Document(); // create a new document
doc.add(new Field("type", "Community", Field.Store.YES, Field.Index.TOKENIZED)); // stored, tokenized field: type
doc.add(new Field("title", "title", Field.Store.YES, Field.Index.TOKENIZED)); // stored, tokenized field: title
writer.addDocument(doc); // add the document to the index
writer.optimize(); // merge segments to optimize the index
writer.close(); // close the index
5. Key JSP code for searching:
String INDEX_DIR_BOOK = "/home/xue24_index/book";
String INDEX_DIR_BBS = "/home/xue24_index/bbs";
Searcher[] searchers = new Searcher[2];
searchers[0] = new IndexSearcher(INDEX_DIR_BOOK);
searchers[1] = new IndexSearcher(INDEX_DIR_BBS);
Searcher searcher = new MultiSearcher(searchers); // search both indexes as one
MultiFieldQueryParser queryParser = new MultiFieldQueryParser(new String[] {"title", "content", "author"}, new IK_CAnalyzer());
Query query = queryParser.parse(keyword); // parse the keyword into a query over the three fields
Hits hits = searcher.search(query); // run the search
out.println("Total results found: " + hits.length());
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    out.println("title: " + doc.get("title"));
}
6. Schedule the incremental indexing to run regularly, either with a Linux cron job or with the Quartz scheduler.
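For illustration, the daily schedule can also be sketched in plain Java with ScheduledExecutorService from the JDK. This is a stand-in for cron or Quartz, not either of them, and it requires Java 8 or later for java.time. DailyIndexScheduler, millisUntilNextRun, and scheduleDaily are hypothetical names, and the 02:00 run time is an arbitrary assumption.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical scheduler: runs a job once a day at a fixed local time.
final class DailyIndexScheduler {
    // Milliseconds from 'now' until the next occurrence of 'runAt'.
    static long millisUntilNextRun(LocalDateTime now, LocalTime runAt) {
        LocalDateTime next = now.toLocalDate().atTime(runAt);
        if (!next.isAfter(now)) {
            next = next.plusDays(1); // today's slot has passed, use tomorrow's
        }
        return Duration.between(now, next).toMillis();
    }

    // Starts a single-threaded scheduler that repeats the job every 24 hours.
    // The caller is responsible for shutting the returned executor down.
    static ScheduledExecutorService scheduleDaily(Runnable job, LocalTime runAt) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        long initialDelay = millisUntilNextRun(LocalDateTime.now(), runAt);
        pool.scheduleAtFixedRate(job, initialDelay,
                TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
        return pool;
    }

    public static void main(String[] args) {
        // Only demonstrates the delay calculation; no scheduler is started here.
        LocalDateTime now = LocalDateTime.of(2024, 1, 1, 3, 0);
        System.out.println(millisUntilNextRun(now, LocalTime.of(2, 0)));
    }
}
```

In a real deployment the job passed to scheduleDaily would run the incremental indexing. A fixed 24-hour period drifts across daylight-saving changes, which is one reason cron or Quartz is usually the better fit.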