Lucene Learning Notes (1): Introduction, Writing a Document Index, and Reading Documents


What is Lucene?

Lucene's official documentation: http://lucene.apache.org/core/

Lucene is a full-text search toolkit.

What can Lucene do?

1. Acquire content (Acquire contents)
Lucene does not provide crawler functionality; if you need to acquire content, you must build your own crawler application.
Lucene only does the indexing and searching work.

2. Build the document (Build document)
Documents are usually composed of fields, such as title, body, abstract, and so on.
You need to ensure the documents share a consistent format (TXT, for example).
During this step you can use semantic analysis to refine what gets saved, and decide how important a field or document is by assigning boost (weight) values.
Boost values can be applied at indexing time, or applied later at search time.

3. Analyze the document (Analyze document)
This step determines how text is broken into meaningful tokens: whether spelling errors are corrected, whether synonyms are injected, and whether singular and plural forms are folded together.
It also covers whether accented variants are preserved and, for languages not written in Latin script, how word boundaries are identified at all.

4. Index the document (Index document)

5. Search
Supports single-term and compound queries, phrase queries, wildcards, fuzzy queries, and result ranking.
Also supports spelling correction of queries, and so on (see the query-syntax sketch after this list).

6. Build the query (Build query)

7. Run the query (Search query)

8. Render the results (Render results)
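
To make items 2 and 5 concrete, here is a minimal sketch of my own (not from the original post) showing the classic QueryParser syntax for phrase, wildcard, fuzzy, compound, and boosted queries. It assumes the Lucene 5.3.1 jars from the pom.xml in step two below; the field name "contents" and the class name QuerySyntaxDemo are my own choices:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class QuerySyntaxDemo {
    public static void main(String[] args) throws Exception {
        // Parser for the "contents" field, using the standard analyzer
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());

        Query phrase = parser.parse("\"quick fox\"");  // phrase query: the terms must appear together
        Query wild   = parser.parse("te?t OR tes*");   // wildcards: ? = one character, * = any characters
        Query fuzzy  = parser.parse("roam~");          // fuzzy query: tolerates small spelling differences
        Query bool   = parser.parse("+lucene -solr");  // compound query: must contain / must not contain
        Query boost  = parser.parse("lucene^2 notes"); // query-time boost: "lucene" matches count double

        // Each piece of syntax is parsed into the corresponding Query subclass
        for (Query q : new Query[] { phrase, wild, fuzzy, bool, boost }) {
            System.out.println(q.getClass().getSimpleName() + ": " + q);
        }
    }
}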


Lucene's analyzers (word breakers)

An introduction to the common analyzers


WhitespaceAnalyzer

Only splits tokens at whitespace; it performs no other processing and does not support Chinese.


SimpleAnalyzer

Strips all non-letter characters and lowercases everything. Note that this analyzer therefore also discards digits; it does not support Chinese either.


StopAnalyzer

Similar to SimpleAnalyzer, but it additionally removes so-called stop words such as "the", "a", "this". Chinese is not supported here either.


StandardAnalyzer

Handles English the same way StopAnalyzer does, and supports Chinese by splitting it one character at a time.


CJKAnalyzer

Supports Chinese, Japanese, and Korean; the three letters are the abbreviation of those three languages. In practice it is rarely used for Chinese, because its Chinese support is poor: it splits text into overlapping two-character pairs, which I personally find rather exotic. A comparison sketch follows below.


SmartChineseAnalyzer

A true Chinese word-segmentation analyzer. Compared with character-by-character splitting of Chinese, it produces proper words, although some searches are still not handled very well.
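
To make the differences concrete, here is a minimal comparison sketch of my own (not from the original post); it assumes the Lucene 5.3.1 jars from the pom.xml in step two below, including lucene-analyzers-smartcn, are on the classpath. Each call prints the tokens one analyzer produces:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerCompareDemo {

    // Prints the tokens the given analyzer produces for the given text
    static void show(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("contents", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print("[" + term + "] ");
        }
        ts.end();
        ts.close();
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        String en = "The Quick-Brown Fox No.42";
        show(new WhitespaceAnalyzer(), en); // splits on whitespace only, keeps case and punctuation
        show(new SimpleAnalyzer(), en);     // letters only, lowercased, digits dropped
        show(new StopAnalyzer(), en);       // like SimpleAnalyzer, plus English stop words removed
        show(new StandardAnalyzer(), en);   // grammar-based tokens, lowercased, stop words removed

        String zh = "我是中国人";
        show(new StandardAnalyzer(), zh);      // one token per Chinese character
        show(new CJKAnalyzer(), zh);           // overlapping two-character pairs
        show(new SmartChineseAnalyzer(), zh);  // dictionary-based word segmentation
    }
}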

Step one: Open Eclipse and create a Maven project.

Step two: Configure the pom.xml file with the following code:

<!-- JUnit: because this is a Java program that uses @Test, its jar is needed -->
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
    <scope>test</scope>
</dependency>

<!-- Lucene core packages. The following three are all the Lucene jars you need
     if the indexed files are entirely in English: core is the kernel package,
     queryparser is the query jar, analyzers-common provides the analyzers.
     Write these three into pom.xml and you are all set. -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>5.3.1</version>
</dependency>
<!-- Query parser -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>5.3.1</version>
</dependency>
<!-- Analyzer -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>5.3.1</version>
</dependency>

<!-- For retrieving documents that are entirely in Chinese, add these two on top
     of the three above. The "highlighter" jar is optional at this point, but it
     will be used later, so it is worth adding now to learn more. -->
<!-- Chinese analyzer smartcn -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-smartcn</artifactId>
    <version>5.3.1</version>
</dependency>
<!-- Highlighter -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>5.3.1</version>
</dependency>

Step three: Create a package.

Step four: Write the index for the documents, with the following code:

package com.lucene.demo;

import java.io.File;
import java.io.FileReader;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * Class overview:
 * (1) writes a simple index for a set of documents,
 * (2) reads documents back based on that index,
 * (3) searches the indexed documents and returns the matching paths.
 */
public class Indexer {

    // Writes the index to the specified directory
    private IndexWriter writer;

    /**
     * Constructor: instantiates the IndexWriter.
     */
    public Indexer(String indexDir) throws Exception {
        // Open the index directory
        Directory dir = FSDirectory.open(Paths.get(indexDir));
        // Instantiate the analyzer
        Analyzer analyzer = new StandardAnalyzer();
        // Instantiate the IndexWriterConfig
        IndexWriterConfig con = new IndexWriterConfig(analyzer);
        // Instantiate the IndexWriter
        writer = new IndexWriter(dir, con);
    }

    /**
     * Closes the index writer.
     * @throws Exception
     */
    public void close() throws Exception {
        writer.close();
    }

    /**
     * Indexes all files in the specified directory.
     * @throws Exception
     */
    public int index(String dataDir) throws Exception {
        // Collect the files to index, then index each one in a loop
        File[] files = new File(dataDir).listFiles();
        for (File file : files) {
            indexFile(file);
        }
        // Return how many documents have been indexed
        return writer.numDocs();
    }

    /**
     * Indexes a single file.
     * @throws Exception
     */
    private void indexFile(File file) throws Exception {
        System.out.println("Indexing file: " + file.getCanonicalPath());
        // Build a Document for the file's contents
        Document document = getDocument(file);
        // Write the document into the index
        writer.addDocument(document);
    }

    /**
     * Builds a Document with three fields.
     * A Document is comparable to a row in a database.
     * @throws Exception
     */
    private Document getDocument(File file) throws Exception {
        // Materialize the document
        Document doc = new Document();
        // add(): adds a field to the document, which determines what can be searched
        doc.add(new TextField("contents", new FileReader(file)));
        // Field.Store.YES stores the file name in the index; NO means it is indexed but not stored
        doc.add(new TextField("fileName", file.getName(), Field.Store.YES));
        // Store the full path in the index
        doc.add(new TextField("fullPath", file.getCanonicalPath(), Field.Store.YES));
        // Return the document
        return doc;
    }

    // Test: write the index
    public static void main(String[] args) {
        // Directory where the index will be written
        String indexDir = "E:\\lucenedemo";
        // Directory holding the data to be indexed
        String dataDir = "E:\\lucenedemo\\data";
        // The index writer
        Indexer indexer = null;
        int numIndexed = 0;
        // Index start time
        long start = System.currentTimeMillis();
        try {
            // Create the indexer on the specified index path
            indexer = new Indexer(indexDir);
            // Index the data path; the int return value is how many files were indexed
            numIndexed = indexer.index(dataDir);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                indexer.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        // Index end time
        long end = System.currentTimeMillis();
        // Display the result
        System.out.println("Indexed " + numIndexed + " files, cost " + (end - start) + " milliseconds");
    }
}
When indexing finishes, the program prints each file as it is indexed, followed by a summary line with the file count and the elapsed time.

Step five: Read documents back through the indexed fields, with the following code:

import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * Reads documents back through the indexed fields.
 * @author LXY
 */
public class ReaderByIndexerTest {

    public static void search(String indexDir, String q) throws Exception {
        // Open the path holding the index files
        Directory dir = FSDirectory.open(Paths.get(indexDir));
        // Open a reader over everything under that directory
        IndexReader reader = DirectoryReader.open(dir);
        // Create the index searcher
        IndexSearcher is = new IndexSearcher(reader);
        // Instantiate the analyzer
        Analyzer analyzer = new StandardAnalyzer();
        /**
         * Create the query parser.
         * The first parameter is the field to query;
         * the second parameter is the analyzer.
         */
        QueryParser parser = new QueryParser("contents", analyzer);
        // Parse the incoming query string q
        Query query = parser.parse(q);
        // Search start time
        long start = System.currentTimeMillis();
        /**
         * Run the query.
         * The first parameter is the parsed query;
         * the second parameter is the maximum number of hits to return.
         */
        TopDocs hits = is.search(query, 10);
        // Search end time
        long end = System.currentTimeMillis();
        System.out.println("Matched \"" + q + "\", total cost " + (end - start)
                + " milliseconds, found " + hits.totalHits + " records");
        /**
         * Traverse hits.scoreDocs to fetch each document.
         * ScoreDoc: a scored document reference used to fetch the Document;
         * scoreDocs: the array of hits inside this TopDocs.
         * @throws Exception
         */
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = is.doc(scoreDoc.doc);
            System.out.println(doc.get("fullPath"));
        }
        // Close the reader
        reader.close();
    }

    // Test
    public static void main(String[] args) {
        String indexDir = "E:\\lucenedemo";
        String q = "Zygmunt Saloni";
        try {
            search(indexDir, q);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Running the search prints the full path of each file whose English text contains the query phrase.


Summary:

The above shows how Lucene writes an index for simple documents and then performs a full-text search based on that index. Having read this far, you should understand Lucene's workflow: yes, it is writing and reading. Writing means building an index over the documents to be retrieved; reading means searching through that index for the requested content. So when you face very large document collections, Lucene can help you retrieve, locate, and quickly obtain information.

Master the core code for writing the index and for reading documents through it.
Notes:

1. The index path specified in the code (E:\\lucenedemo) must be a valid path that actually contains the index files; otherwise the following exception occurs:
org.apache.lucene.index.IndexNotFoundException: no segments* file found in SimpleFSDirectory@D:\lucenedemo lockFactory=org.apache.lucene.store.NativeFSLockFactory@eb724: files: []
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:726)
	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:50)
	at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:63)
	at ReaderByIndexer.ReaderByIndexerTest.search(ReaderByIndexerTest.java:32)
	at ReaderByIndexer.ReaderByIndexerTest.main(ReaderByIndexerTest.java:87)
This exception has two usual causes: either the files under the D:\\lucenedemo folder cannot be found (occasionally a NullPointerException appears instead), or the indexer was not closed.

2. After indexing completes, if you want to run the test again, first delete the index files previously generated in the folder; otherwise you will retrieve duplicate documents.

3. Make sure the test parameters and paths are completely correct, especially the query parameter; otherwise you will find 0 documents.

4. The documents to be retrieved in this example must be entirely in English.

5. Make sure the imports are correct.


Friendly reminder:

Even simply writing an index for a specified set of documents requires you to be proficient with each core class. This is just the entry point to Lucene; CRUD operations on the indexed documents come next, and a small preview sketch follows.
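
As a preview of that CRUD, here is a minimal sketch of my own (not from the original post) for the update and delete half, again against the Lucene 5.3.1 API. The index path, the "id" field, and the class name are hypothetical; note that it keys documents on an untokenized StringField, because Term matching is exact and analyzed TextField tokens make unreliable keys:

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexCrudSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical index directory for this sketch
        Directory dir = FSDirectory.open(Paths.get("E:\\lucenedemo\\crud"));
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        // Create: add a document keyed by an exact-match "id" field
        Document doc = new Document();
        doc.add(new StringField("id", "1", Field.Store.YES));
        doc.add(new TextField("contents", "first version", Field.Store.YES));
        writer.addDocument(doc);

        // Update: atomically deletes documents matching the term, then adds the new one
        Document newDoc = new Document();
        newDoc.add(new StringField("id", "1", Field.Store.YES));
        newDoc.add(new TextField("contents", "second version", Field.Store.YES));
        writer.updateDocument(new Term("id", "1"), newDoc);

        // Delete: removes every document whose id term matches
        writer.deleteDocuments(new Term("id", "1"));

        writer.close();
    }
}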

