Use Lucene to search Java source code (1)

Last Update:2018-12-05 Source: Internet

Author: User

Some websites allow the software development community to share information by publishing developer guides, white papers, FAQs, and source code. As the amount of information grows and more developers contribute their own knowledge, these websites provide a search engine for searching all existing information on the site. Although these search engines handle text files well, they are severely limited when developers search source code. The search engine treats source code as plain text and is therefore no different from a sophisticated grep tool that can handle a large number of source files.

In this article, I propose using Lucene, an open-source search engine written in Java, to search source code by extracting and indexing the relevant source code elements. Here, I restrict the search to Java source code. However, Lucene can also be used to search the source code of other programming languages.

This article provides a brief overview of the key aspects of search engines in the Lucene environment.

Overview
Lucene is one of the most popular open-source search engine libraries. It consists of core APIs for text indexing and search. Lucene can create indexes for a set of text files and lets you search those indexes with complex queries such as +title:Lucene -content:Search, Search AND Lucene, and +search +code. Before going into the search details, let me introduce some of Lucene's features.

Index text in Lucene

A search engine scans all the data to be searched and stores it in a structure that allows efficient retrieval. The best-known structure of this kind is the inverted index. For example, suppose you want to index a set of meeting records. First, each meeting record file is divided into several independent parts, or fields, such as title, author, email, abstract, and content. Second, the content of each field is tokenized and the keywords, or terms, are extracted. In this way, you can create an inverted index for the meeting records as shown in the following table.

....

For each term in a field, two things are stored: the number of documents in which the term appears (the document frequency, DF) and the ID of each document that contains the term. Other details are also saved for each term, such as the number of times it appears in each document and the positions at which it appears. In any case, the important point is that indexing files with Lucene means storing them in a specific format that allows efficient querying and retrieval.
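
The structure described above can be sketched in plain Java. This is only an illustration of the idea, not Lucene's actual index format; the class name ToyInvertedIndex, its methods, and the document IDs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a toy in-memory inverted index, not Lucene's index format.
final class ToyInvertedIndex {
    // term -> (document ID -> positions at which the term occurs in that document)
    private final Map<String, Map<String, List<Integer>>> postings = new TreeMap<>();

    // Lowercases the text, splits it on anything other than ASCII letters,
    // and records the position of every term.
    void add(String docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            postings.computeIfAbsent(token, t -> new LinkedHashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    // Document frequency (DF): the number of documents that contain the term.
    int documentFrequency(String term) {
        return postings.getOrDefault(term, Collections.emptyMap()).size();
    }

    // Positions of the term in one document; the list size is the term frequency.
    List<Integer> positions(String term, String docId) {
        return postings.getOrDefault(term, Collections.emptyMap())
                       .getOrDefault(docId, Collections.emptyList());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("Doc1", "Lucene indexing and search");
        index.add("Doc2", "Search engines index text");
        System.out.println(index.documentFrequency("search")); // 2
        System.out.println(index.positions("search", "Doc1")); // [3]
    }
}
```

Looking up a term is a single map access, which is why queries do not need to rescan the original files.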

Analyze the indexed text

Lucene uses analyzers to process the text being indexed. Before text is indexed, an analyzer tokenizes it, extracts the relevant words, discards common words, stems words (reduces them to their root form, for example reducing bowling, bowler, and bowls to bowl), and performs other processing. The common analyzers provided by Lucene are:
SimpleAnalyzer: tokenizes a string into a set of words and converts them to lowercase.
StandardAnalyzer: tokenizes a string into a set of words, recognizing acronyms, email addresses, host names, and so on, and discards common English stop words (a, an, the, to).
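
To make the analysis steps concrete, here is a minimal plain-Java sketch of tokenizing, lowercasing, and stop-word removal. It does not use Lucene's analyzer classes and does not attempt stemming; the class name ToyAnalyzer and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: mimics the tokenize/lowercase/stop-word steps in plain Java.
final class ToyAnalyzer {
    // The small English stop-word list used as the example in the text.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "to"));

    // Splits the text into runs of letters and lowercases each run.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Drops every token that appears in STOP_WORDS.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("The Bowler went to a Bowling alley.");
        System.out.println(tokens);                  // [the, bowler, went, to, a, bowling, alley]
        System.out.println(removeStopWords(tokens)); // [bowler, went, bowling, alley]
    }
}
```

A stemming step would additionally map bowler and bowling to a common root, so that a query for one form matches the others.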

Searching the index
After the index structure is created, you can search it by building complex queries that specify the fields to be searched and the terms to look for. For example, the query abstract:system AND email:abc@mit.edu returns all documents that contain system in the abstract and have abc@mit.edu in the email address. That is, a search against the inverted index in the table above returns Doc15. Documents that match the query are ranked according to the number of times the term appears in the document and the number of documents that contain the term. Lucene implements a ranking mechanism and gives us the flexibility to change it.
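
The following plain-Java sketch shows how a fielded AND query can be answered from per-field postings and how the matches can be ranked. The scoring is a simplified TF-IDF-style formula chosen for illustration, not Lucene's actual scoring; the class ToyFieldSearch, the sample documents, and the field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative only: fielded AND queries over a toy index with TF-IDF-style ranking.
final class ToyFieldSearch {
    // field -> term -> document ID -> term frequency in that field
    private final Map<String, Map<String, Map<String, Integer>>> index = new HashMap<>();
    private final Set<String> docIds = new HashSet<>();

    // Indexes whitespace-separated, lowercased tokens of one field of one document.
    void add(String docId, String field, String text) {
        docIds.add(docId);
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            index.computeIfAbsent(field, f -> new HashMap<>())
                 .computeIfAbsent(token, t -> new HashMap<>())
                 .merge(docId, 1, Integer::sum);
        }
    }

    // Returns the IDs of documents matching every "field:term" clause, best score first.
    List<String> searchAll(String... clauses) {
        Map<String, Double> scores = null;
        for (String clause : clauses) {
            int colon = clause.indexOf(':');
            String field = clause.substring(0, colon);
            String term = clause.substring(colon + 1).toLowerCase(Locale.ROOT);
            Map<String, Integer> postings = index
                    .getOrDefault(field, Collections.emptyMap())
                    .getOrDefault(term, Collections.emptyMap());
            // Assumed weighting: terms found in fewer documents count for more.
            double idf = 1.0 + Math.log((double) docIds.size() / (postings.size() + 1));
            Map<String, Double> next = new HashMap<>();
            for (Map.Entry<String, Integer> e : postings.entrySet()) {
                // Keep a document only if it matched all earlier clauses too (AND).
                if (scores == null || scores.containsKey(e.getKey())) {
                    double prior = scores == null ? 0.0 : scores.get(e.getKey());
                    next.put(e.getKey(), prior + e.getValue() * idf);
                }
            }
            scores = next;
        }
        final Map<String, Double> ranked = scores == null ? new HashMap<>() : scores;
        List<String> result = new ArrayList<>(ranked.keySet());
        result.sort((a, b) -> {
            int byScore = Double.compare(ranked.get(b), ranked.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return result;
    }

    public static void main(String[] args) {
        ToyFieldSearch search = new ToyFieldSearch();
        search.add("Doc15", "abstract", "distributed system design for a storage system");
        search.add("Doc15", "email", "abc@mit.edu");
        search.add("Doc7", "abstract", "operating system notes");
        search.add("Doc7", "email", "abc@mit.edu");
        search.add("Doc9", "abstract", "system overview");
        search.add("Doc9", "email", "xyz@example.org");
        System.out.println(search.searchAll("abstract:system", "email:abc@mit.edu")); // [Doc15, Doc7]
    }
}
```

Doc15 ranks first here because system occurs twice in its abstract, while Doc9 is excluded because it fails the email clause.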

Source Code Search Engine

Now that we know the basics of search engines, let's look at how to implement a search engine for source code. When searching for Java sample code, developers are typically interested in classes that:
Extend a specific class or implement an interface.
Call a specific method.
Use a specific Java class.

A combination of the above criteria meets the needs of developers looking for relevant code. Therefore, the search engine should allow developers to query on these aspects individually or in combination. IDEs (integrated development environments) have a further limitation: most of the available tools support source code search based on only one of the above criteria and lack the flexibility to combine these criteria in a single query.

Now we start to build a source code search engine that supports these requirements.

Writing the Source Code Analyzer
The first step is to write an analyzer that extracts or removes source code elements, so that the index is as good as possible and contains only relevant code. Java keywords such as public, null, for, and if are similar to the common words of English (the, a, an, of). Therefore, the analyzer must remove these keywords from the index.

We create a Java source code analyzer by extending Lucene's abstract class Analyzer. The source code of the JavaSourceCodeAnalyzer class, which implements the tokenStream(String, Reader) method, is listed below. The class defines two sets of stop words, which are removed during indexing using the StopFilter class provided by Lucene. The tokenStream method checks which field is being indexed. If the field is "comment", it first tokenizes the input and converts it to lowercase using the LowerCaseTokenizer class, then removes the English stop words (a limited set) using the StopFilter class, and finally uses PorterStemFilter to strip common morphological suffixes. If the field is not "comment", the analyzer tokenizes the input and converts it to lowercase using LowerCaseTokenizer, and removes the Java keywords using StopFilter.

package com.infosys.lucene.code;

import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.*;

// Analyzer for Java source code. Written against the older Lucene API
// in which Analyzer exposes tokenStream(String, Reader).
public class JavaSourceCodeAnalyzer extends Analyzer {
    private Set javaStopSet;
    private Set englishStopSet;

    // Java keywords, removed from code fields
    private static final String[] JAVA_STOP_WORDS = {
        "public", "private", "protected", "interface",
        "abstract", "implements", "extends", "null", "new",
        "switch", "case", "default", "synchronized",
        "do", "if", "else", "break", "continue", "this",
        "assert", "for", "instanceof", "transient",
        "final", "static", "void", "catch", "try",
        "throws", "throw", "class", "finally", "return",
        "const", "native", "super", "while", "import",
        "package", "true", "false" };

    // Common English words, removed from the "comment" field
    private static final String[] ENGLISH_STOP_WORDS = {
        "a", "an", "and", "are", "as", "at", "be", "but",
        "by", "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "s", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with" };

    public JavaSourceCodeAnalyzer() {
        super();
        javaStopSet = StopFilter.makeStopSet(JAVA_STOP_WORDS);
        englishStopSet = StopFilter.makeStopSet(ENGLISH_STOP_WORDS);
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        if (fieldName.equals("comment")) {
            // Comments: lowercase, drop English stop words, then stem
            return new PorterStemFilter(new StopFilter(
                    new LowerCaseTokenizer(reader), englishStopSet));
        } else {
            // Code: lowercase and drop Java keywords
            return new StopFilter(
                    new LowerCaseTokenizer(reader), javaStopSet);
        }
    }
}

Writing the JavaSourceCodeIndexer class
The second step is to generate the index. The important classes used to create an index are IndexWriter, Analyzer, Document, and Field. A Lucene Document instance is created for each source code file. The source file is parsed and the relevant syntax elements are extracted: import declarations, the class name, the extended class, the implemented interfaces, the implemented methods, the parameters used by those methods, and the code of each method. Each of these syntax elements is added to its own Field instance in the Document instance. The Document instance is then added to the index using the IndexWriter instance that stores the index.

The JavaSourceCodeIndexer class uses the JavaParser class to parse Java files and extract the syntax elements. You can also use the Eclipse 3.0 ASTParser. The details of the JavaParser class are not explored here, because other parsers can also be used to extract the relevant source code elements. As elements are extracted from the source file, Field instances are created and added to the Document instance.
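
As a rough illustration of the extraction step only, the sketch below pulls a few syntax elements out of a source string with regular expressions and groups them by field name. It is not the JavaSourceCodeIndexer or JavaParser class, and a real parser is far more robust than these naive patterns; the class ToySourceFieldExtractor and the field names "import", "class", "extends", "implements", and "method" are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: naive regex extraction standing in for a real Java parser.
final class ToySourceFieldExtractor {
    private static final Pattern IMPORT =
            Pattern.compile("^\\s*import\\s+([\\w.*]+)\\s*;", Pattern.MULTILINE);
    private static final Pattern CLASS = Pattern.compile(
            "\\bclass\\s+(\\w+)(?:\\s+extends\\s+([\\w.]+))?"
            + "(?:\\s+implements\\s+([\\w.,\\s]+?))?\\s*\\{");
    // Only matches methods that carry an access modifier and a return type.
    private static final Pattern METHOD = Pattern.compile(
            "(?:public|protected|private)\\s+"
            + "(?:(?:static|final|abstract|synchronized)\\s+)*"
            + "[\\w<>\\[\\].]+\\s+(\\w+)\\s*\\(");

    // Returns field name -> extracted values, one entry per hypothetical index field.
    static Map<String, List<String>> extract(String source) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (String name : new String[] {"import", "class", "extends", "implements", "method"}) {
            fields.put(name, new ArrayList<>());
        }
        Matcher m = IMPORT.matcher(source);
        while (m.find()) {
            fields.get("import").add(m.group(1));
        }
        m = CLASS.matcher(source);
        if (m.find()) { // first class declaration only
            fields.get("class").add(m.group(1));
            if (m.group(2) != null) {
                fields.get("extends").add(m.group(2));
            }
            if (m.group(3) != null) {
                for (String iface : m.group(3).trim().split("\\s*,\\s*")) {
                    fields.get("implements").add(iface);
                }
            }
        }
        m = METHOD.matcher(source);
        while (m.find()) {
            fields.get("method").add(m.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        String source =
                "package demo;\n"
                + "import java.io.Reader;\n"
                + "import java.util.*;\n"
                + "public class MyAnalyzer extends Analyzer implements Closeable, Runnable {\n"
                + "    public TokenStream tokenStream(String field, Reader reader) { return null; }\n"
                + "    public void run() {}\n"
                + "}\n";
        System.out.println(extract(source));
    }
}
```

In an indexer, each map entry would become one field of the document for that source file, which is what makes queries such as "classes that extend Analyzer" possible.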

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
