Searching Source Code with Lucene


Some websites let the software development community share information by publishing developer guides, white papers, FAQs, and source code. As the amount of information grows and more developers contribute their own knowledge, these sites provide search engines to search all of the information available on the site. While such search engines work well for text files, they severely restrict searches over source code: the search engine treats source code as plain text, so it offers little more than the venerable grep tool already does over a large set of source files.

In this article I recommend Lucene, an open-source search engine written in Java, and show how to search source code by extracting and indexing the relevant source code elements. I restrict the discussion to Java source code, but the same approach can be applied to source code in other programming languages.

This article provides a brief overview of the key aspects of search engines in the Lucene environment. For more details, see the resources section.

Overview
Lucene is one of the most popular open-source search engine libraries. It consists of core APIs for text indexing and search. Lucene can build an index over a set of text files and lets you search that index with complex queries such as +title:Lucene -content:search, search AND Lucene, and +search +code. Before going into the details of searching, let me introduce some of Lucene's functionality.

Indexing text in Lucene

A search engine scans all of the data to be searched and stores it in a structure that can be accessed efficiently. The best known such structure is the inverted index. For example, suppose you want to index a set of meeting records. First, each meeting record file is split into separate parts, or fields, such as title, author, email, abstract, and content. Second, the content of each field is tokenized and keywords, or terms, are extracted. An inverted index for the meeting records can then be built as shown in the following table.

Table: inverted index built from the meeting records

For each term in a field, two things are stored: the number of files in which the term appears (the document frequency, or DF) and the IDs of the files that contain the term. Other details can be stored for each term as well, such as the number of times the term appears in each file and the positions at which it appears. In any case, the important point for us is that indexing files with Lucene means storing them in a format that allows efficient querying and retrieval.
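As a rough, purely conceptual illustration (this is not Lucene's API or on-disk format), an inverted index can be pictured as a map from each term to the documents that contain it, together with per-document term counts:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy inverted index: term -> (docId -> term frequency in that document).
// Lucene's real index is far more compact and also records term positions.
public class ToyInvertedIndex {
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    public void addTerm(String term, int docId) {
        postings.computeIfAbsent(term, t -> new HashMap<>())
                .merge(docId, 1, Integer::sum);      // count occurrences per document
    }

    // Document frequency (DF): how many documents contain the term.
    public int documentFrequency(String term) {
        return postings.getOrDefault(term, Collections.emptyMap()).size();
    }

    // IDs of the documents that contain the term.
    public Set<Integer> documentsContaining(String term) {
        return postings.getOrDefault(term, Collections.emptyMap()).keySet();
    }
}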

Analyze the indexed text

Lucene uses an analyzer to process the text being indexed. Before the text is indexed, the analyzer tokenizes it, extracts the relevant words, discards common words, stems words (that is, reduces them to their root form, so that bowling, bowler, and bowls all become bowl), and performs other processing. Common analyzers provided by Lucene include:
SimpleAnalyzer: tokenizes the string into words and converts them to lowercase.
StandardAnalyzer: tokenizes the string while recognizing acronyms, e-mail addresses, host names, and so on; discards common English stop words (a, an, the, to); and lowercases the tokens.
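As a minimal sketch of what an analyzer produces, the snippet below prints the tokens StandardAnalyzer emits for a sentence. It assumes the older Lucene 1.x analysis API that the code in this article targets, where TokenStream.next() returns a Token:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream stream = analyzer.tokenStream("content",
                new StringReader("The Bowler went to a Bowling alley"));
        Token token;
        while ((token = stream.next()) != null) {
            // Prints: bowler, went, bowling, alley
            // ("The", "to", and "a" are dropped as stop words; no stemming is applied).
            System.out.println(token.termText());
        }
    }
}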

Searching the index
After the index structure has been created, you can construct complex queries that specify fields and terms and run them against the index. For example, the query abstract:system AND email:abc@mit.edu returns all files whose abstract contains system and whose e-mail field contains abc@mit.edu. Looking at the inverted index above, that query would return doc15. The files that match a query are ranked by how often the terms appear in each file and by how many documents contain the terms. Lucene implements this ranking mechanism and gives us the flexibility to change it.
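As a small sketch of how such a query is built programmatically, the following parses the query above with QueryParser. It assumes the Lucene 1.x static parse method used later in this article and an index with abstract and email fields, as in the example:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class QueryDemo {
    public static void main(String[] args) throws Exception {
        // "abstract" is the default field, used for terms that name no field.
        Query query = QueryParser.parse(
                "abstract:system AND email:abc@mit.edu",
                "abstract", new StandardAnalyzer());
        System.out.println(query.toString());   // prints the parsed, per-field query
    }
}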

Source Code Search Engine

Now that we know the basics of a search engine, let's look at how to implement a search engine for source code. When searching for Java sample code, developers typically look for Java classes that:
Inherit a specific class or implement an interface.
Call a specific method.
Use a specific Java class.

A combination of these criteria satisfies a developer looking for code, so the search engine should let developers query on these aspects individually or in combination. IDEs (integrated development environments) have a further limitation: most of the available tools support source code search based on only one of the above criteria, and they lack the flexibility to combine criteria in a single query.

Now we start to build a source code search engine that supports these requirements.

Writing a Source Code Analyzer
The first step is to write an analyzer that extracts the relevant source code elements and filters out the noise, so that the index is as good as possible and contains only relevant code. Java keywords such as public, null, for, and if appear in almost every source file, much like the common English words the, a, an, and of. The analyzer must therefore remove these keywords from the index.

We create a Java source code analyzer by extending Lucene's abstract Analyzer class. The source code of the JavaSourceCodeAnalyzer class is listed below; it implements the tokenStream(String, Reader) method. The class defines two sets of stop words, which can be removed during indexing using the StopFilter class provided by Lucene. The tokenStream method checks which field is being indexed. If the field is "comment", the input is tokenized and lowercased with LowerCaseTokenizer, the English stop words (a limited set) are removed with StopFilter, and PorterStemFilter is applied to strip common suffixes (stemming). If the field being indexed is not "comment", the analyzer tokenizes and lowercases the input with LowerCaseTokenizer and removes the Java keywords with StopFilter.

package com.infosys.lucene.code;

import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.*;

public class JavaSourceCodeAnalyzer extends Analyzer {
    private Set javaStopSet;
    private Set englishStopSet;
    private static final String[] JAVA_STOP_WORDS = {
        "public", "private", "protected", "interface",
        "abstract", "implements", "extends", "null", "new",
        "switch", "case", "default", "synchronized",
        "do", "if", "else", "break", "continue", "this",
        "assert", "for", "instanceof", "transient",
        "final", "static", "void", "catch", "try",
        "throws", "throw", "class", "finally", "return",
        "const", "native", "super", "while", "import",
        "package", "true", "false" };
    private static final String[] ENGLISH_STOP_WORDS = {
        "a", "an", "and", "are", "as", "at", "be", "but",
        "by", "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "s", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with" };

    public JavaSourceCodeAnalyzer() {
        super();
        javaStopSet = StopFilter.makeStopSet(JAVA_STOP_WORDS);
        englishStopSet = StopFilter.makeStopSet(ENGLISH_STOP_WORDS);
    }

    // "comment" fields get English stop-word removal and stemming;
    // all other fields get Java keyword removal.
    public TokenStream tokenStream(String fieldName, Reader reader) {
        if (fieldName.equals("comment"))
            return new PorterStemFilter(new StopFilter(
                new LowerCaseTokenizer(reader), englishStopSet));
        else
            return new StopFilter(
                new LowerCaseTokenizer(reader), javaStopSet);
    }
}
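As a quick sanity check (a sketch, not part of the original listing), the analyzer can be exercised directly. Using the same older TokenStream.next() API as above, this prints the tokens kept for a fragment of Java code once the keywords have been filtered out:

package com.infosys.lucene.code;

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class AnalyzerCheck {
    public static void main(String[] args) throws Exception {
        JavaSourceCodeAnalyzer analyzer = new JavaSourceCodeAnalyzer();
        TokenStream stream = analyzer.tokenStream("code",
                new StringReader("public void insertRow() { return; }"));
        Token token;
        while ((token = stream.next()) != null) {
            // Prints only "insertrow": public, void, and return are Java stop words,
            // and LowerCaseTokenizer splits on non-letters and lowercases.
            System.out.println(token.termText());
        }
    }
}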

Writing the JavaSourceCodeIndexer Class
The second step is to generate the index. The most important classes used to create an index are IndexWriter, Analyzer, Document, and Field. A Lucene Document instance is created for each source code file. The source file is parsed and its code-related syntactic elements are extracted: import declarations, class name, inherited classes, implemented interfaces, implemented methods, the parameters used by each method, and the code of each method. These syntactic elements are added as separate Field instances to the Document instance, and the Document is then added to the index through the IndexWriter instance that maintains the index.

The JavaSourceCodeIndexer class uses a JavaParser class to parse Java files and extract the syntactic elements; the Eclipse 3.0 ASTParser could be used instead. The details of JavaParser are not explored here, because any parser that can extract the relevant source code elements will do. As elements are extracted from a source code file, Field instances are created and added to the Document instance; a sketch of how the indexer puts this together follows the field descriptions below.



Lucene has four different field types for controlling how a field is indexed: Keyword, UnIndexed, UnStored, and Text.
A Keyword field is not analyzed, but is indexed and stored in the index as-is. The JavaSourceCodeIndexer class uses this field type for the import declarations.
An UnIndexed field is neither analyzed nor indexed, but its value is stored in the index verbatim. Since we usually want to store the location of a file but rarely search on the file name itself, the Java file name is stored as an UnIndexed field.
An UnStored field is the opposite of an UnIndexed field: it is analyzed and indexed, but its value is not stored in the index. Storing the full source of every method would take a lot of space, so method source code is indexed as UnStored fields; the source of a method can always be retrieved from the Java source file itself, which keeps the index small.
A Text field is analyzed, indexed, and stored. Class names are stored as Text fields.
To summarize, the JavaSourceCodeIndexer class stores import declarations as Keyword fields, the file name as an UnIndexed field, method source code as UnStored fields, and class names as Text fields.
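As a minimal sketch (not the original JavaSourceCodeIndexer listing), here is how such an indexer might add one parsed file to the index. The JavaParser interface and its accessor methods are hypothetical placeholders, and the field names filename, import, class, and code follow the usage described above and in the search example later:

package com.infosys.lucene.code;

import java.io.File;
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Hypothetical parser interface; any parser that can extract these elements will do.
interface JavaParser {
    String getClassName();
    List getImportDeclarations();   // List of String
    List getMethodSources();        // List of String, one entry per method body
}

public class IndexerSketch {
    public static void indexFile(IndexWriter writer, File javaFile, JavaParser parser)
            throws Exception {
        Document doc = new Document();
        doc.add(Field.UnIndexed("filename", javaFile.getAbsolutePath())); // stored only
        doc.add(Field.Text("class", parser.getClassName()));              // analyzed, indexed, stored
        for (Iterator i = parser.getImportDeclarations().iterator(); i.hasNext();) {
            doc.add(Field.Keyword("import", (String) i.next()));          // indexed as-is, stored
        }
        for (Iterator i = parser.getMethodSources().iterator(); i.hasNext();) {
            doc.add(Field.UnStored("code", (String) i.next()));           // analyzed, indexed, not stored
        }
        writer.addDocument(doc);
    }
}

The IndexWriter would typically be opened with the analyzer written earlier, for example new IndexWriter(indexDir, new JavaSourceCodeAnalyzer(), true), so that the analyzed fields pass through it during indexing.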



You can use Luke to preview and modify indexes created with Lucene. Luke is a handy open-source tool for inspecting indexes. Figure 1 shows an index created by the JavaSourceCodeIndexer class.


Figure 1: Index in Luke

As you can see, the import declarations are stored without being tokenized or analyzed, while class names and method names are converted to lowercase before they are stored.

Query Java source code
After the multi-field index has been created, you can query it with Lucene. Lucene provides two important classes for searching files: IndexSearcher and QueryParser. QueryParser parses the query expression entered by the user, and IndexSearcher retrieves the documents that satisfy the query. Queries can combine the indexed fields: for example, import:org.w3c.* finds files that import classes from org.w3c packages, and code:insert finds methods whose code contains the term insert.


You can combine the different indexed syntactic elements to form rich query conditions and search the code. The sample code used for searching is listed below.
import java.io.File;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class JavaCodeSearch {
    public static void main(String[] args) throws Exception {
        File indexDir = new File(args[0]);
        String q = args[1];                    // e.g. "jgraph code:insert"
        Directory fsDir = FSDirectory.getDirectory(indexDir, false);
        IndexSearcher is = new IndexSearcher(fsDir);

        // Use the source code analyzer by default, but analyze the "import"
        // field with KeywordAnalyzer, since it was indexed as a keyword.
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new JavaSourceCodeAnalyzer());
        analyzer.addAnalyzer("import", new KeywordAnalyzer());

        Query query = QueryParser.parse(q, "code", analyzer);
        long start = System.currentTimeMillis();
        Hits hits = is.search(query);
        long end = System.currentTimeMillis();
        System.err.println("Found " + hits.length() +
            " docs in " + (end - start) + " millisec");
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(doc.get("filename")
                + " with a score of " + hits.score(i));
        }
        is.close();
    }
}

The IndexSearcher instance uses FSDirectory to open the directory containing the index. An analyzer then analyzes the query string so that it is in the same form as the index terms (stemmed, lowercased, stop words removed, and so on). Here Lucene imposes a restriction: QueryParser analyzes every field of the query with the single analyzer passed to it, which is a problem for fields that were indexed as keywords without analysis. To solve this, Lucene's PerFieldAnalyzerWrapper class lets you specify the analyzer to use for each field of the query. For the query string import:org.w3c.* AND code:document, KeywordAnalyzer parses org.w3c.* and JavaSourceCodeAnalyzer parses document. If a query term does not name a field, QueryParser uses the default field, code. PerFieldAnalyzerWrapper analyzes the query string accordingly and QueryParser returns the parsed Query instance. The IndexSearcher instance executes the Query and returns a Hits instance containing the files that satisfy the query.

Conclusion

This article introduced Lucene, a text search engine library, and showed how source code search can be achieved with a custom analyzer and a multi-field index. Only the basic functionality of a code search engine has been covered here; more sophisticated analyzers could improve search performance and produce better query results. A search engine like this lets users in a developer community search for and share source code.
