Use Lucene to search Java source code (2)

Source: Internet
Author: User
import java.io.File;
import java.util.ArrayList;

import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import com.infosys.lucene.code.JavaParser.*;

public class JavaSourceCodeIndexer {
    private static JavaParser parser = new JavaParser();
    private static final String IMPLEMENTS = "implements";
    private static final String IMPORT = "import";
    ...

    public static void main(String[] args) throws Exception {
        File indexDir = new File("C:\\Lucene\\Java");
        File dataDir = new File("C:\\JavaSourceCode");
        // true = create a new index, overwriting any existing one
        IndexWriter writer = new IndexWriter(indexDir,
            new JavaSourceCodeAnalyzer(), true);
        indexDirectory(writer, dataDir);
        writer.close();
    }

    public static void indexDirectory(IndexWriter writer, File dir) throws Exception {
        File[] files = dir.listFiles();
        for (int i = 0; i < files.length; i++) {
            File f = files[i];
            // Create a Lucene Document
            Document doc = new Document();
            // Use JavaParser to parse the file
            parser.setSource(f);
            addImportDeclarations(doc, parser);
            addComments(doc, parser);
            // Extract class elements using the parser
            JClass cls = parser.getDeclaredClass();
            addClass(doc, cls);
            // Store the file name in the Lucene Document
            doc.add(Field.UnIndexed(FILENAME, f.getName()));
            writer.addDocument(doc);
        }
    }

    private static void addClass(Document doc, JClass cls) {
        // For each class, add a class name field
        doc.add(Field.Text(CLASS, cls.className));
        String superCls = cls.superClass;
        if (superCls != null)
            // Add the class it extends as the extends field
            doc.add(Field.Text(EXTENDS, superCls));
        // Add the interfaces it implements
        ArrayList interfaces = cls.interfaces;
        for (int i = 0; i < interfaces.size(); i++)
            doc.add(Field.Text(IMPLEMENTS, (String) interfaces.get(i)));
        // Add details on the methods declared
        addMethods(cls, doc);
        // Recurse into inner classes
        ArrayList innerCls = cls.innerClasses;
        for (int i = 0; i < innerCls.size(); i++)
            addClass(doc, (JClass) innerCls.get(i));
    }

    private static void addMethods(JClass cls, Document doc) {
        ArrayList methods = cls.methodDeclarations;
        for (int i = 0; i < methods.size(); i++) {
            JMethod method = (JMethod) methods.get(i);
            // Add method name field
            doc.add(Field.Text(METHOD, method.methodName));
            // Add return type field
            doc.add(Field.Text(RETURN, method.returnType));
            ArrayList params = method.parameters;
            for (int k = 0; k < params.size(); k++)
                // For each method, add the parameter types
                doc.add(Field.Text(PARAMETER, (String) params.get(k)));
            String code = method.codeBlock;
            if (code != null)
                // Index the method code block without storing it
                doc.add(Field.UnStored(CODE, code));
        }
    }

    private static void addImportDeclarations(Document doc, JavaParser parser) {
        ArrayList imports = parser.getImportDeclarations();
        if (imports == null) return;
        for (int i = 0; i < imports.size(); i++)
            // Add import declarations as keyword fields
            doc.add(Field.Keyword(IMPORT, (String) imports.get(i)));
    }
}

Lucene has four field types: Keyword, UnIndexed, UnStored, and Text. Choosing the right type for each piece of data keeps the index both searchable and compact.
A Keyword field is not analyzed, but it is indexed and its value is stored in the index verbatim. The JavaSourceCodeIndexer class uses this type for import declarations.

An UnIndexed field is neither analyzed nor indexed, but its value is stored in the index verbatim. We usually want to retrieve a file's location but rarely search by file name, so this type is used to store the Java file name.

An UnStored field is the opposite of an UnIndexed field: it is analyzed and indexed, but its value is not stored in the index. Storing the entire source code of every method would take a lot of space, so the method code block is indexed as an UnStored field. The source of a method can always be retrieved from the Java source file itself, which keeps the index small.

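Text fields are analyzed, indexed, and stored during indexing. The class name is saved as a Text field. The following table summarizes how the JavaSourceCodeIndexer class uses each field type.

Table 1: Field types used by JavaSourceCodeIndexer

Field type | Constants in the listing above | Content
Keyword | IMPORT | Import declarations
UnIndexed | FILENAME | Java file name
UnStored | CODE | Method code block
Text | CLASS, EXTENDS, IMPLEMENTS, METHOD, RETURN, PARAMETER | Class name, superclass, interfaces, method name, return type, parameter types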
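The four field types differ only in three switches: whether the value is analyzed, whether it is indexed, and whether it is stored. The sketch below models that as a plain Java enum. It is only an illustration: FieldKind, searchable(), and retrievable() are hypothetical names, not part of Lucene.

```java
/** Toy model of the four field kinds; these are not Lucene's own classes. */
enum FieldKind {
    //        analyzed, indexed, stored
    KEYWORD  (false,    true,    true),
    UNINDEXED(false,    false,   true),
    UNSTORED (true,     true,    false),
    TEXT     (true,     true,    true);

    final boolean analyzed;
    final boolean indexed;
    final boolean stored;

    FieldKind(boolean analyzed, boolean indexed, boolean stored) {
        this.analyzed = analyzed;
        this.indexed = indexed;
        this.stored = stored;
    }

    /** A field can match a query only if it was indexed. */
    boolean searchable() {
        return indexed;
    }

    /** A field's value can be read back from a search hit only if it was stored. */
    boolean retrievable() {
        return stored;
    }
}

final class FieldKindDemo {
    public static void main(String[] args) {
        // Print one line per field kind with its two practical consequences.
        for (FieldKind kind : FieldKind.values()) {
            System.out.println(kind + ": searchable=" + kind.searchable()
                + ", retrievable=" + kind.retrievable());
        }
    }
}
```

Read this way, the file name (UnIndexed) can be shown in results but not searched, while the method code block (UnStored) can be searched but must be read back from the source file.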
You can use Luke to browse and modify indexes created with Lucene. Luke is an open-source tool for inspecting indexes. Figure 1 shows an index created by the JavaSourceCodeIndexer class.

Figure 1: Index in Luke

As you can see, import declarations are saved as is, without being tokenized or analyzed. Class names and method names are converted to lowercase before they are saved.
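To make the difference concrete, here is a minimal sketch of how one value turns into index terms under each treatment. It uses only the Java standard library. TermSketch and both methods are hypothetical, and textTerms is a simple stand-in: the real JavaSourceCodeAnalyzer may tokenize differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Illustrates keyword-style versus analyzed terms; not Lucene code. */
final class TermSketch {

    /** Keyword-style: the whole value becomes a single term, unchanged. */
    static List<String> keywordTerms(String value) {
        return Collections.singletonList(value);
    }

    /**
     * Text-style stand-in for an analyzer: split on anything that is not
     * a letter or digit, then lowercase each piece.
     */
    static List<String> textTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String piece : value.split("[^A-Za-z0-9]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(keywordTerms("org.w3c.dom.Document"));
        System.out.println(textTerms("JavaSourceCodeIndexer"));
    }
}
```

Because the import declaration stays a single unmodified term, a query against that field has to be left unanalyzed as well, or it will not match.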

Query Java source code
After creating a multi-field index, you can use Lucene to query it. Lucene provides two important classes for searching: IndexSearcher and QueryParser. The QueryParser class parses the query expression entered by the user, and the IndexSearcher class searches the index for documents that satisfy the query. The following table lists possible queries and their meanings:

By indexing different syntax elements, you can combine them into precise query conditions when searching for code. The sample search code is listed below.

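import java.io.File;

import org.apache.lucene.analysis.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

public class JavaCodeSearch {
    public static void main(String[] args) throws Exception {
        File indexDir = new File(args[0]);
        String q = args[1]; // e.g. parameter:JGraph code:insert
        // false = open the existing index instead of creating a new one
        Directory fsDir = FSDirectory.getDirectory(indexDir, false);
        IndexSearcher is = new IndexSearcher(fsDir);

        // Default analyzer for every field, with an exception for "import"
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new JavaSourceCodeAnalyzer());
        analyzer.addAnalyzer("import", new KeywordAnalyzer());

        // "code" is the default field for terms without a field prefix
        Query query = QueryParser.parse(q, "code", analyzer);

        long start = System.currentTimeMillis();
        Hits hits = is.search(query);
        long end = System.currentTimeMillis();
        System.err.println("Found " + hits.length() +
            " docs in " + (end - start) + " millisec");

        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(doc.get("filename")
                + " with a score of " + hits.score(i));
        }
        is.close();
    }
}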
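The comment in the listing gives parameter:JGraph code:insert as a sample query. As a rough illustration of how such a string breaks down into field and term pairs, with unprefixed terms falling back to a default field, here is a standard-library sketch. It is not Lucene's QueryParser: FieldQuerySketch, Clause, and split are hypothetical names, and boolean operators, wildcards, and phrases are deliberately ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits "field:term" query strings; a toy, not Lucene's QueryParser. */
final class FieldQuerySketch {

    /** One field/term pair taken from the query string. */
    static final class Clause {
        final String field;
        final String term;

        Clause(String field, String term) {
            this.field = field;
            this.term = term;
        }

        @Override
        public String toString() {
            return field + ":" + term;
        }
    }

    /** Tokens without a "field:" prefix are assigned to defaultField. */
    static List<Clause> split(String query, String defaultField) {
        List<Clause> clauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for an empty query string
            }
            int colon = token.indexOf(':');
            if (colon > 0 && colon < token.length() - 1) {
                clauses.add(new Clause(token.substring(0, colon),
                                       token.substring(colon + 1)));
            } else {
                clauses.add(new Clause(defaultField, token));
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // The second token has no field prefix, so it goes to "code".
        System.out.println(split("parameter:JGraph insert", "code"));
    }
}
```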

The IndexSearcher instance uses FSDirectory to open the directory containing the index. The analyzer is then applied to the query string so that its terms take the same form as the terms in the index (stemming, lowercasing, stop-word removal, and so on). This raises a problem for fields that were indexed as Keyword fields: QueryParser analyzes every field in the query with the single analyzer passed to it, although Keyword fields were never analyzed at index time. To solve this problem, you can use the PerFieldAnalyzerWrapper class provided by Lucene to specify the analyzer for each field in the query. With this setup, in the query string import:org.w3c.* AND code:document the term org.w3c.* is handled by KeywordAnalyzer and the term document by JavaSourceCodeAnalyzer. If a term in the query does not name a field, QueryParser uses the default field, code. QueryParser analyzes the query string with the PerFieldAnalyzerWrapper and returns the resulting Query instance. The IndexSearcher instance runs this Query and returns a Hits instance, which contains the documents that satisfy the query.
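The idea behind per-field analysis can be sketched without Lucene: keep a default analyzer plus a map of per-field overrides, and pick the analyzer by field name. The class below is a simplified model of that dispatch, not the actual PerFieldAnalyzerWrapper. PerFieldSketch, whole, and lowercaseWords are hypothetical names, and lowercaseWords only stands in for JavaSourceCodeAnalyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Chooses an analyzer by field name; a model of the idea, not Lucene code. */
final class PerFieldSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> overrides = new HashMap<>();

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    /** Registers a field-specific analyzer that replaces the default one. */
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        overrides.put(field, analyzer);
    }

    /** Analyzes text with the field's own analyzer, else with the default. */
    List<String> analyze(String field, String text) {
        return overrides.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    /** Keyword-style analysis: the text is kept as one unchanged term. */
    static List<String> whole(String text) {
        return Collections.singletonList(text);
    }

    /** Stand-in default analysis: split on whitespace and lowercase. */
    static List<String> lowercaseWords(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        PerFieldSketch analyzer = new PerFieldSketch(PerFieldSketch::lowercaseWords);
        analyzer.addAnalyzer("import", PerFieldSketch::whole);
        System.out.println(analyzer.analyze("import", "org.w3c.*"));
        System.out.println(analyzer.analyze("code", "Document"));
    }
}
```

The point of the design is that the query side mirrors the index side: a field that was stored as one unanalyzed term is also queried as one unanalyzed term.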

Conclusion

This article introduced Lucene, a text search engine, and showed how a custom analyzer and a multi-field index make source code searchable. It covers only the basic functions of a code search engine. A more sophisticated analyzer can improve search performance and return better query results. A search engine of this kind lets users search and share source code within the software development community.
