System functional requirements:
1. You can customize the list of websites to be searched;
2. You can search the page content of the websites on that list.
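A minimal sketch of the first requirement, assuming the site list is kept as a set of host names. The class SiteList and its methods are hypothetical names used for illustration only; they are not part of Heritrix or Lucene.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: a user-editable list of sites that the search engine covers. */
class SiteList {
    // Host names are stored lower-cased, in insertion order.
    private final Set<String> hosts = new LinkedHashSet<>();

    /** Adds a site by host name, e.g. "example.org". */
    void add(String host) {
        hosts.add(host.trim().toLowerCase(Locale.ROOT));
    }

    /** Removes a site; returns true if it was on the list. */
    boolean remove(String host) {
        return hosts.remove(host.trim().toLowerCase(Locale.ROOT));
    }

    /** Returns true if the URL's host is a listed site or a subdomain of one. */
    boolean covers(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return false; // malformed URLs are never in scope
        }
        if (host == null) {
            return false; // e.g. relative URLs or "mailto:" links
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String site : hosts) {
            if (host.equals(site) || host.endsWith("." + site)) {
                return true;
            }
        }
        return false;
    }
}
```

The same list could feed the crawler its seed sites and decide which downloaded pages are worth indexing; how it is wired into the crawler is left open here.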
Main function modules:
Web spider: collects, parses, and saves the content (web pages) of the websites on the target list.
Full-text indexing/retrieval: indexes the content of those websites and provides full-text search over it.
Solution:
Web spider: uses the open-source framework Heritrix, a crawler framework built from interchangeable components, so you can plug in your own. Download page: http://crawler.archive.org/index.html. For details on using Heritrix, see the relevant literature or the author's article "Crawling webpages with Heritrix crawlers"; I will not elaborate on it here.
Full-text indexing/retrieval: this part is implemented with Lucene. Lucene is a subproject of the Apache Software Foundation's Jakarta project. It is open source, and it is not a complete full-text search engine but a full-text search engine architecture: it provides a complete query engine and index engine, plus some text analysis engines (for two Western languages, English and German). Lucene aims to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it. Lucene official website: http://lucene.apache.org/.
Core classes of the Lucene indexing process:
IndexWriter: performs write operations on the index.
Directory: describes where the index is stored.
Analyzer: analyzes text, extracts tokens, and removes useless information.
Document: a virtual document, that is, a collection of fields.
Field: each document contains one or more fields with different names. Each field holds a piece of data that may be queried during a search or retrieved from the index.
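Example code:
    import org.apache.lucene.analysis.*;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.*;

    // Lucene 1.x API; indexDir is the path of the index directory.
    Directory dir = FSDirectory.getDirectory(indexDir, true); // true: create a new index
    Analyzer analyzer = new SimpleAnalyzer(); // lower-cases text and splits it on non-letters
    IndexWriter writer = new IndexWriter(dir, analyzer, true);
    Document doc = new Document();
    doc.add(Field.Keyword("ID", "1000")); // stored and indexed, not tokenized
    doc.add(Field.UnIndexed("name", "Yao Ming")); // stored only, not searchable
    doc.add(Field.UnStored("Intro", "Yao Ming is a player of Houston Rockets.")); // indexed, not stored
    writer.addDocument(doc);
    writer.close();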
Core classes of the Lucene search process:
IndexSearcher: searches the index created by IndexWriter.
Term: the basic unit of search. It consists of a pair of strings: the name of a field and a value in that field.
Query: the abstract query class.
TermQuery: the most basic query type, used to match documents that contain a specific value in a specific field.
Hits: a simple container holding pointers to the ranked search results.
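Sample code:
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    IndexSearcher searcher = new IndexSearcher(directory);
    // A TermQuery is not analyzed: SimpleAnalyzer indexed "Yao" as "yao".
    Term t = new Term("Intro", "yao");
    Query query = new TermQuery(t);
    Hits hits = searcher.search(query);
    assertEquals("JUnit test", 1, hits.length());
    searcher.close();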
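To show what a term lookup amounts to, here is a toy inverted index in plain Java. It is not Lucene code: the class ToyIndex and its methods are hypothetical names invented for illustration, and the tokenization only approximates what SimpleAnalyzer does (lower-casing and splitting on non-letters).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index (not Lucene): maps "field:term" to the ids of matching documents. */
class ToyIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Indexes every field of one document and returns the id assigned to it. */
    int addDocument(Map<String, String> fields) {
        int docId = nextDocId++;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Lower-case the text and split it on anything that is not a letter.
            String[] tokens = field.getValue().toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
            for (String token : tokens) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(field.getKey() + ":" + token, k -> new TreeSet<>())
                            .add(docId);
                }
            }
        }
        return docId;
    }

    /** Exact lookup of one term in one field; the term itself is not analyzed. */
    SortedSet<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.emptySortedSet());
    }
}
```

Indexing a document whose "Intro" field is "Yao Ming is a player of Houston Rockets." and then calling search("Intro", "yao") returns that document's id, while search("Intro", "Yao") returns nothing, because the index only contains the lower-cased token.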
After Heritrix has downloaded the web pages, their content has to be extracted. In the Java world, open-source libraries such as htmlparser or jsoup can be used for this.
htmlparser: a pure Java library for parsing HTML. It does not depend on other Java libraries and is mainly used to transform or extract HTML. It parses HTML quickly and tolerates malformed markup. Project page: http://sourceforge.net/projects/htmlparser/.
jsoup: a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for retrieving and manipulating data through DOM traversal, CSS selectors, and jQuery-like operations. Official website: http://jsoup.org/.
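To make the extraction step concrete without pulling in a dependency, the sketch below strips markup with regular expressions. This is a deliberately naive stand-in, not how htmlparser or jsoup work internally, and it will mishandle malformed or unusual HTML; the class name NaiveHtmlText is hypothetical.

```java
import java.util.regex.Pattern;

/** Naive stand-in for an HTML parser: strips markup with regular expressions. */
class NaiveHtmlText {
    // Whole <script>...</script> and <style>...</style> elements, across lines.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns the visible text of an HTML string, with whitespace collapsed. */
    static String extract(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode only a handful of common entities; "&amp;" goes last so that
        // "&amp;lt;" becomes "&lt;" rather than "<".
        text = text.replace("&nbsp;", " ").replace("&lt;", "<").replace("&gt;", ">")
                   .replace("&quot;", "\"").replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

In a real system a proper parser is the better choice; with jsoup, for example, the comparable call is Jsoup.parse(html).text().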
When Lucene builds an index, the text must first be split into words (tokenized). Lucene ships with some analyzers, but they mainly target English and German. Chinese text is not separated by spaces the way English is, so a Chinese word segmentation component has to be added. Many Chinese word segmentation tools are available, such as IKAnalyzer, ictclas4j, and JE-Analysis.
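To illustrate why splitting on spaces is not enough, the sketch below uses the simplest dictionary-free approach: every pair of adjacent Chinese characters becomes a token (overlapping bigrams), while runs of other letters and digits are kept as whole lower-cased words. This is not what the dictionary-based tools listed above do; BigramTokenizer is a hypothetical name, and the code is plain Java rather than a Lucene Analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Dictionary-free sketch: overlapping two-character tokens for Chinese, whole words otherwise. */
class BigramTokenizer {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        int i = 0;
        while (i < cps.length) {
            if (isHan(cps[i])) {
                int start = i;
                while (i < cps.length && isHan(cps[i])) {
                    i++;
                }
                emitBigrams(cps, start, i, tokens);
            } else if (Character.isLetterOrDigit(cps[i])) {
                int start = i;
                while (i < cps.length && Character.isLetterOrDigit(cps[i]) && !isHan(cps[i])) {
                    i++;
                }
                tokens.add(new String(cps, start, i - start).toLowerCase(Locale.ROOT));
            } else {
                i++; // skip whitespace and punctuation
            }
        }
        return tokens;
    }

    private static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    /** Emits every adjacent pair in cps[start, end); a lone character is kept as-is. */
    private static void emitBigrams(int[] cps, int start, int end, List<String> out) {
        if (end - start == 1) {
            out.add(new String(cps, start, 1));
            return;
        }
        for (int j = start; j + 1 < end; j++) {
            out.add(new String(cps, j, 2));
        }
    }
}
```

For the input 姚明是球员 this produces the tokens 姚明, 明是, 是球, 球员. Bigrams need no dictionary but also index pairs that are not real words (明是, 是球), which is the trade-off dictionary-based segmenters avoid.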
The whole system is built mainly on Heritrix and Lucene. You can build and refine a search engine on top of these two frameworks.