Lucene Learning: Word Segmentation and Highlighting

Source: Internet
Author: User
Tags: createindex

 

First, under the path E:\testlucene\workspacese, create the folder indexdocs and three TXT files: l1.txt, l2.txt, and l3.txt.
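If you would rather create the folder and files from code than by hand, a throwaway sketch like the following would do it (same paths as above; paste in the file contents shown below):

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

// One-off setup helper; not part of the original post.
public class CreateTestFiles {
    public static void main(String[] args) throws Exception {
        File workspace = new File("E:\\testlucene\\workspacese");
        new File(workspace, "indexdocs").mkdirs(); // also creates workspacese itself
        for (String name : new String[] { "l1.txt", "l2.txt", "l3.txt" }) {
            Writer w = new OutputStreamWriter(new FileOutputStream(new File(workspace, name)), "UTF-8");
            w.write("..."); // replace with the corresponding file content shown below
            w.close();
        }
    }
}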

l1.txt content:

111111111111111111111111111111111111111111111111111111111111111111111111111
Information retrieval means finding, within a collection of information, the information that matches a user's needs.
Besides text, the retrieved information can also include multimedia such as images, audio, and video; here we mainly discuss text retrieval.
Full-text retrieval: compares the user's query request against every word in the full text, without regard to any semantic match between the query and the text.
Among information retrieval tools, full-text retrieval is the most general and practical (in essence, it is keyword matching).

Data retrieval: both the query and the data in the information system follow a fixed format and have a fixed structure,
which allows specific fields to be searched. Its performance and applicability are limited, and semantic matching is not supported.

Knowledge retrieval: emphasizes knowledge and semantic matching (the most complex; it is as if the system already knew the answer to the question
and searched directly for that answer).

Full-text retrieval means that a computer indexing program scans every word in a document and builds an index entry for each word,
recording how often and where the word appears in the article; when a user queries a word, the search program looks it up in the pre-built index and feeds the matching results back to the user.

In data retrieval, both the query and the data in the information system must follow a fixed format and structure, which allows specific fields to be searched.
For example, if data is stored in the form "time, person, location, event", a query can be: location = "Beijing". The performance of data retrieval depends on the field scheme used and on the user's understanding of that scheme, so it has significant limitations.
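The two full-text retrieval paragraphs above are describing an inverted index: a map from each word to the places where it occurs, built once so that a query becomes a lookup instead of a scan. A toy sketch of the idea in plain Java (illustration only; this is not the Lucene API):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index: maps each word to the positions where it occurs.
public class ToyInvertedIndex {
    public static void main(String[] args) {
        String[] words = "full text search scans every word and indexes every word".split(" ");
        Map<String, List<Integer>> index = new HashMap<String, List<Integer>>();
        for (int pos = 0; pos < words.length; pos++) {
            List<Integer> postings = index.get(words[pos]);
            if (postings == null) {
                postings = new ArrayList<Integer>();
                index.put(words[pos], postings);
            }
            postings.add(pos);
        }
        // "every" occurs twice, so the lookup returns both positions: [4, 8]
        System.out.println("every -> " + index.get("every"));
    }
}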

 

l2.txt content:

 

2222222222222222222222222222222222222222222222222222222222222222222222222
Note: software that collects information from the Internet is called a crawler, spider, or web robot (it sits outside the search engine proper).
Crawlers visit each web page on the Internet; each time a page is visited, its content is sent back to the local server.
The main task of information processing is to build indexes over the locally collected information, in preparation for queries.
The function of the analyzer (word divider): it splits text resources, cutting the text by rule into the smallest units (keywords) to be indexed.

 

l3.txt content:

333333333333333333333333333333333333333333333333333333333333333333333333
Chinese word segmentation: segmenting Chinese is complicated, because Chinese text is neither simply characters nor simply words;
moreover, a string that is a word in one place may not be a word elsewhere. For example, in the phrase "hats and clothing" (帽子和服装),
the substring "和服" ("kimono") is not actually a word. There are three approaches to Chinese word segmentation: single-character (unigram) segmentation, bigram segmentation, and dictionary-based segmentation.
Unigram segmentation: splits the text one Chinese character at a time.
Bigram segmentation: splits the text into overlapping two-character tokens.
Dictionary-based segmentation: builds candidate words according to some algorithm and matches them against a prepared dictionary; if a candidate matches, it is split off as a word.
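To make the first two approaches concrete, here is a minimal sketch of unigram and bigram splitting on a plain string. This is an illustration only: dictionary-based segmentation additionally needs a word list and a matching algorithm, and real analyzers also handle punctuation, mixed scripts, and stop words.

import java.util.ArrayList;
import java.util.List;

// Naive unigram and bigram splitters, mirroring what StandardAnalyzer
// (one CJK character per token) and CJKAnalyzer (overlapping pairs) do
// with Chinese text in Lucene 3.x.
public class NaiveCjkSplit {

    static List<String> unigrams(String text) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < text.length(); i++) {
            out.add(text.substring(i, i + 1)); // one character = one token
        }
        return out;
    }

    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            out.add(text.substring(i, i + 2)); // overlapping two-character tokens
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "中文分词"; // "Chinese word segmentation"
        System.out.println(unigrams(text)); // [中, 文, 分, 词]
        System.out.println(bigrams(text));  // [中文, 文分, 分词]
    }
}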

 

Now that the preparation is complete, let's look at the code:

 

File2Document.java

 

package lucene.study;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;

/**
 * @author xudongwang 2012-2-2
 *
 * Email: xdwangiflytek@gmail.com
 */
public class File2Document {

    /**
     * File ---> Document
     *
     * @param filePath file path
     * @return Document object
     */
    public static Document file2Document(String filePath) {
        // Fields to store: name, content, size, path
        File file = new File(filePath);

        Document document = new Document();
        // Store.YES: store the value; Store.NO: don't store it;
        // Store.COMPRESS: compress, then store.
        // Index.ANALYZED: analyze (tokenize), then index;
        // Index.NOT_ANALYZED: index without analyzing; Index.NO: don't index.
        document.add(new Field("name", file.getName(), Store.YES, Index.ANALYZED));
        document.add(new Field("content", readFileContent(file), Store.YES, Index.ANALYZED));
        // Not analyzed, but sometimes an index is still needed;
        // the file length is converted to a String.
        document.add(new Field("size", String.valueOf(file.length()), Store.YES, Index.NOT_ANALYZED));
        // We never query by file path, so the path is not analyzed either.
        document.add(new Field("path", file.getAbsolutePath(), Store.YES, Index.NOT_ANALYZED));
        return document;
    }

    /**
     * Read the file content.
     *
     * @param file file object
     * @return the file content
     */
    private static String readFileContent(File file) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
            StringBuffer content = new StringBuffer();
            try {
                for (String line = null; (line = reader.readLine()) != null;) {
                    content.append(line).append("\n");
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            // try {
            //     byte[] temp = content.toString().getBytes("UTF-8");
            //     String tt = new String(temp, "gb2312");
            //     System.out.println(tt);
            // } catch (UnsupportedEncodingException e) {
            //     e.printStackTrace();
            // }
            return content.toString();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * <pre>
     * Two ways to get the "name" attribute value:
     * 1. Field field = document.getField("name");
     *    field.stringValue();
     * 2. document.get("name");
     * </pre>
     *
     * @param document
     */
    public static void printDocumentInfo(Document document) {
        System.out.println("index name --> " + document.get("name"));
        // System.out.println("content --> " + document.get("content"));
        System.out.println("index path --> " + document.get("path"));
        System.out.println("index size --> " + document.get("size"));
    }
}
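A quick way to smoke-test the helper, sketched here as a hypothetical addition (it assumes l1.txt already exists at the path created earlier):

package lucene.study;

import org.apache.lucene.document.Document;

// Hypothetical smoke test; not part of the original post.
public class File2DocumentTest {
    public static void main(String[] args) {
        Document doc = File2Document.file2Document("E:\\testlucene\\workspacese\\l1.txt");
        File2Document.printDocumentInfo(doc); // prints the stored name, path, and size
    }
}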

AnalyzerDemo.java

 

package lucene.study;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {

    public void analyze(Analyzer analyzer, String text) {
        System.out.println("----------- analyzer: " + analyzer.getClass());
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
        CharTermAttribute termAtt = (CharTermAttribute) tokenStream.getAttribute(CharTermAttribute.class);
        // TypeAttribute typeAtt = (TypeAttribute) tokenStream.getAttribute(TypeAttribute.class);
        try {
            while (tokenStream.incrementToken()) {
                System.out.println(termAtt.toString());
                // System.out.println(typeAtt.type());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        AnalyzerDemo demo = new AnalyzerDemo();

        System.out.println("--------------- test English");
        String enText = "Hello, my name is suolong, my csdn blog address is http://blog.csdn.net/lushuaiyin";
        System.out.println(enText);

        System.out.println("by StandardAnalyzer:");
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
        demo.analyze(analyzer, enText);

        System.out.println("by SimpleAnalyzer:");
        Analyzer analyzer2 = new SimpleAnalyzer(Version.LUCENE_35);
        demo.analyze(analyzer2, enText);

        System.out.println("The result above shows that StandardAnalyzer does not split on '.', while SimpleAnalyzer does.");
        System.out.println();

        System.out.println("----------------> test Chinese");
        // The original post uses a Chinese sentence here; it means
        // "thanks to the original author Wang Xudong".
        String znText = "感谢原作者王旭东";
        System.out.println(znText);

        System.out.println("by StandardAnalyzer:");
        // The output shows that every single character becomes a keyword,
        // so efficiency is certainly very low.
        demo.analyze(analyzer, znText);

        System.out.println("by CJKAnalyzer (bigram segmentation):");
        Analyzer analyzer3 = new CJKAnalyzer(Version.LUCENE_35);
        demo.analyze(analyzer3, znText);
    }
}

Console output:

 

--------------- test English
Hello, my name is suolong, my csdn blog address is http://blog.csdn.net/lushuaiyin
by StandardAnalyzer:
----------- analyzer: class org.apache.lucene.analysis.standard.StandardAnalyzer
hello
my
name
suolong
my
csdn
blog
address
http
blog.csdn.net
lushuaiyin
by SimpleAnalyzer:
----------- analyzer: class org.apache.lucene.analysis.SimpleAnalyzer
hello
my
name
is
suolong
my
csdn
blog
address
is
http
blog
csdn
net
lushuaiyin
The result above shows that StandardAnalyzer does not split on '.', while SimpleAnalyzer does.

----------------> test Chinese
感谢原作者王旭东
by StandardAnalyzer:
----------- analyzer: class org.apache.lucene.analysis.standard.StandardAnalyzer
感
谢
原
作
者
王
旭
东
by CJKAnalyzer (bigram segmentation):
----------- analyzer: class org.apache.lucene.analysis.cjk.CJKAnalyzer
感谢
谢原
原作
作者
者王
王旭
旭东

 

Highlighting example

HighlighterDemo.java

 

package lucene.study;

import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.Scorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class HighlighterDemo {

    /** Source file paths. */
    private String filePath01 = "E:\\testlucene\\workspacese\\l1.txt";
    private String filePath02 = "E:\\testlucene\\workspacese\\l2.txt";
    private String filePath03 = "E:\\testlucene\\workspacese\\l3.txt";

    /** Index path. */
    private String indexPath = "E:\\testlucene\\workspacese\\indexdocs";

    /**
     * Analyzer. Here we use the default StandardAnalyzer
     * (general-purpose, but not good at Chinese).
     */
    private Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

    /**
     * Create the index.
     *
     * @throws Exception
     */
    public void createIndex() throws Exception {
        File indexFile = new File(indexPath);
        Directory directory = FSDirectory.open(indexFile);

        // The writer configuration takes two arguments, the version and the
        // analyzer; there are other options, not covered here.
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        conf.setOpenMode(OpenMode.CREATE);

        // IndexWriter operates on the index library (add, delete, update);
        // it takes two arguments, the directory and the writer configuration.
        IndexWriter indexWriter = new IndexWriter(directory, conf);

        // The documents
        Document doc01 = File2Document.file2Document(filePath01);
        Document doc02 = File2Document.file2Document(filePath02);
        Document doc03 = File2Document.file2Document(filePath03);

        // Add the documents to the index library.
        indexWriter.addDocument(doc01);
        indexWriter.addDocument(doc02);
        indexWriter.addDocument(doc03);

        indexWriter.close(); // close the writer and release resources; the index is now created
    }

    /**
     * Search.
     *
     * @param queryStr search keyword(s)
     * @throws Exception
     */
    public void search(String queryStr) throws Exception {
        // 1. Parse the search text into a Query object.
        // Specify which fields to query.
        String[] fields = { "name", "content" };
        // QueryParser parses user input: it scans the input string and builds a Query object.
        QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_35, fields, analyzer);
        // Query: Lucene supports fuzzy, phrase, and combined queries through
        // classes such as TermQuery, BooleanQuery, RangeQuery, and WildcardQuery.
        Query query = queryParser.parse(queryStr);

        // 2. Run the query.
        File indexFile = new File(indexPath);
        Directory directory = FSDirectory.open(indexFile);
        // IndexSearcher is used to query the index library.
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // A Filter can filter the results, masking content you do not want to see.
        Filter filter = null;
        // 10000 is the maximum number of documents fetched from the index per query;
        // TopDocs is similar to a collection.
        TopDocs topDocs = indexSearcher.search(query, filter, 10000);
        System.out.println("A total of [" + topDocs.totalHits + "] documents contain matches for \"" + queryStr + "\"");
        // Note: the hit count is the number of matching documents,
        // not the number of matches inside each document.

        // Prepare the highlighter.
        Formatter formatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
        Scorer scorer = new QueryScorer(query);
        Highlighter highlighter = new Highlighter(formatter, scorer);
        Fragmenter fragmenter = new SimpleFragmenter(100); // fragments of 100 characters
        highlighter.setTextFragmenter(fragmenter); // decides whether a summary is generated and how long it is

        // 3. Print the results.
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            int docSn = scoreDoc.doc; // internal document number
            // Fetch the document by its number so it can be highlighted.
            Document document = indexSearcher.doc(docSn);
            // Returns the highlighted fragment; if the field value contains
            // no keyword, null is returned.
            String highlighterStr = highlighter.getBestFragment(analyzer, "content", document.get("content"));
            if (highlighterStr == null) {
                String content = document.get("content");
                int endIndex = Math.min(20, content.length());
                highlighterStr = content.substring(0, endIndex); // at most the first 20 characters
            }
            System.out.println("------- highlighted content after processing ------ start ------------");
            System.out.println(highlighterStr);
            System.out.println("------- highlighted content after processing ------ end ------------");
            document.getField("content").setValue(highlighterStr);
            // File2Document.printDocumentInfo(document); // print the document info
        }
        // Not in the original post: release the searcher and reader.
        indexSearcher.close();
        indexReader.close();
    }

    public static void main(String[] args) throws Exception {
        HighlighterDemo highlighterDemo = new HighlighterDemo();
        highlighterDemo.createIndex(); // create the index
        // lucene.createRamIndex(); // create an in-memory index
        highlighterDemo.search("Word Segmentation"); // search text
        System.out.println("--------------------------------------------------------------------");
        highlighterDemo.search("Search");
        System.out.println("--------------------------------------------------------------------");
        highlighterDemo.search("Index");
        System.out.println("--------------------------------------------------------------------");
    }
}
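One aside before the console output: the comment on the analyzer field notes that StandardAnalyzer handles Chinese poorly (it indexes one character at a time). A natural variation, sketched here as an assumption rather than something the original post does, is to swap in the bigram CJKAnalyzer that AnalyzerDemo already demonstrated:

// Variation (not in the original post): use bigram segmentation for Chinese
// content by replacing the analyzer field in HighlighterDemo.
// Requires: import org.apache.lucene.analysis.cjk.CJKAnalyzer;
private Analyzer analyzer = new CJKAnalyzer(Version.LUCENE_35);

Because HighlighterDemo shares this single analyzer field between createIndex() and search(), the same segmentation is then applied consistently at index time and at query time.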

Console output:

 

A total of [3] documents contain matches for "Word Segmentation"
------- highlighted content after processing ------ start ------------
333333333333333333333333333333333333333333333333333333333333333333333333
Chinese <font color='red'>word</font> <font color='red'>segmentation</font>: segmenting Chinese is complicated, because Chinese text is neither simply characters nor
------- highlighted content after processing ------ end ------------
------- highlighted content after processing ------ start ------------
The function of the analyzer (<font color='red'>word</font> divider): it splits text resources, cutting the text by rule into the smallest units (keywords) to be indexed
------- highlighted content after processing ------ end ------------
------- highlighted content after processing ------ start ------------
Besides text, the retrieved information can also include multimedia such as images, audio, and video; here we mainly discuss text retrieval. Full-text retrieval: compares the user's query request against every <font color='red'>word</font> in the full text, without regard to any semantic
------- highlighted content after processing ------ end ------------
--------------------------------------------------------------------
A total of [2] documents contain matches for "Search"
------- highlighted content after processing ------ start ------------
111111111111111111111111111111111111111111111111111111111111111111111111111
Information <font color='red'>retrieval</font> means finding, within a collection of information, the information that matches a user's needs
------- highlighted content after processing ------ end ------------
------- highlighted content after processing ------ start ------------
called a crawler, spider, or web robot (it sits outside the <font color='red'>search</font> engine proper). Crawlers visit each web page on the Internet; each time a page is visited, its content is sent back to the local server
------- highlighted content after processing ------ end ------------
--------------------------------------------------------------------
A total of [2] documents contain matches for "Index"
------- highlighted content after processing ------ start ------------
The main task of information processing is to build <font color='red'>indexes</font> over the locally collected information, in preparation for queries
------- highlighted content after processing ------ end ------------
------- highlighted content after processing ------ start ------------
semantic matching. Knowledge retrieval: emphasizes knowledge and semantic matching (the most complex; it is as if the system already knew the answer to the question and searched directly for that answer).
Full-text retrieval means that a computer <font color='red'>indexing</font> program scans every word in a document and builds an <font color='red'>index</font> entry for each word
------- highlighted content after processing ------ end ------------
--------------------------------------------------------------------

I put the content printed on the console into an HTML page to see the effect.
