import java.io.File;
import java.util.ArrayList;

import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import com.infosys.lucene.code.JavaParser.*;

public class JavaSourceCodeIndexer {
    private static JavaParser parser = new JavaParser();
    private static final String IMPLEMENTS = "implements";
    private static final String IMPORT = "import";
    ...

    public static void main(String[] args) throws Exception {
        File indexDir = new File("C:\\Lucene\\Java");
        File dataDir = new File("C:\\JavaSourceCode");
        // The third argument (true) creates a new index, replacing any existing one
        IndexWriter writer = new IndexWriter(indexDir, new JavaSourceCodeAnalyzer(), true);
        indexDirectory(writer, dataDir);
        writer.close();
    }

    public static void indexDirectory(IndexWriter writer, File dir) throws Exception {
        File[] files = dir.listFiles();
        for (int i = 0; i < files.length; i++) {
            File f = files[i];
            // Create a Lucene Document
            Document doc = new Document();
            // Use JavaParser to parse file
            parser.setSource(f);
            addImportDeclarations(doc, parser);
            addComments(doc, parser);
            // Extract Class elements Using Parser
            JClass cls = parser.getDeclaredClass();
            addClass(doc, cls);
            // Add field to the Lucene Document
            doc.add(Field.UnIndexed(FILENAME, f.getName()));
            writer.addDocument(doc);
        }
    }

    private static void addClass(Document doc, JClass cls) {
        // For each class add Class Name field
        doc.add(Field.Text(CLASS, cls.className));
        String superCls = cls.superClass;
        if (superCls != null)
            // Add the class it extends as extends field
            doc.add(Field.Text(EXTENDS, superCls));
        // Add interfaces it implements
        ArrayList interfaces = cls.interfaces;
        for (int i = 0; i < interfaces.size(); i++)
            doc.add(Field.Text(IMPLEMENTS, (String) interfaces.get(i)));
        // Add details on methods declared
        addMethods(cls, doc);
        // Inner classes are added to the same Document recursively
        ArrayList innerCls = cls.innerClasses;
        for (int i = 0; i < innerCls.size(); i++)
            addClass(doc, (JClass) innerCls.get(i));
    }

    private static void addMethods(JClass cls, Document doc) {
        ArrayList methods = cls.methodDeclarations;
        for (int i = 0; i < methods.size(); i++) {
            JMethod method = (JMethod) methods.get(i);
            // Add method name field
            doc.add(Field.Text(METHOD, method.methodName));
            // Add return type field
            doc.add(Field.Text(RETURN, method.returnType));
            ArrayList params = method.parameters;
            for (int k = 0; k < params.size(); k++)
                // For each method add parameter types
                doc.add(Field.Text(PARAMETER, (String) params.get(k)));
            String code = method.codeBlock;
            if (code != null)
                // Add the method code block (indexed but not stored)
                doc.add(Field.UnStored(CODE, code));
        }
    }

    private static void addImportDeclarations(Document doc, JavaParser parser) {
        ArrayList imports = parser.getImportDeclarations();
        if (imports == null) return;
        for (int i = 0; i < imports.size(); i++)
            // Add import declarations as keyword
            doc.add(Field.Keyword(IMPORT, (String) imports.get(i)));
    }
}
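The indexDirectory method above hands every entry of a single directory to the parser. If the source tree has nested package directories or contains files other than Java sources, the entries could be filtered first. The following is a minimal sketch of that idea using only the Java standard library; the class JavaFileCollector and its collect method are hypothetical names and are not part of the original indexer.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original indexer.
final class JavaFileCollector {

    // Recursively gathers regular files whose names end in ".java".
    static List<File> collect(File dir) {
        List<File> result = new ArrayList<>();
        File[] entries = dir.listFiles(); // null when dir is not a readable directory
        if (entries == null) {
            return result;
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                result.addAll(collect(entry)); // descend into sub-packages
            } else if (entry.isFile() && entry.getName().endsWith(".java")) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        for (File f : collect(root)) {
            System.out.println(f.getPath());
        }
    }
}
```

Each file returned by collect could then be passed to the parser in place of the raw listFiles() entries.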
Lucene has four different field types: Keyword, UnIndexed, UnStored, and Text. The field type determines whether a value is analyzed, indexed, and stored, so choosing the right type for each field keeps the index both searchable and compact.
A Keyword field is not analyzed, but it is indexed and stored in the index verbatim. The JavaSourceCodeIndexer class uses this field type to store import declarations.
An UnIndexed field is neither analyzed nor indexed, but its value is stored in the index verbatim. We usually want to keep the location of a file but rarely search by file name, so this field type is used to store the Java file name.
An UnStored field is the opposite of an UnIndexed field. It is analyzed and indexed, but its value is not stored in the index. Storing the entire source code of every method would take a lot of space, so the UnStored field type is used to index the method source code. The source of a method can be retrieved directly from the Java source file, which keeps the size of the index under control.
A Text field is analyzed, indexed, and stored during indexing. Class names are stored as Text fields. The following table summarizes how the JavaSourceCodeIndexer class uses the field types.
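Table 1: Field types used in the JavaSourceCodeIndexer class
Field constant | Field type | Analyzed | Indexed | Stored
IMPORT | Keyword | no | yes | yes
FILENAME | UnIndexed | no | no | yes
CODE | UnStored | yes | yes | no
CLASS, EXTENDS, IMPLEMENTS, METHOD, RETURN, PARAMETER | Text | yes | yes | yes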
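The differences between the four field types come down to three yes/no properties. The following is a small plain-Java model of those properties, written only to make the trade-offs explicit. It does not call Lucene, and the names FieldKind, choose, and FieldKindDemo are assumptions introduced for this sketch.

```java
// Illustrative model only; this is not Lucene's Field class.
final class FieldKindDemo {
    public static void main(String[] args) {
        // Print one row per field type with its three properties.
        for (FieldKind k : FieldKind.values()) {
            System.out.println(k + " analyzed=" + k.analyzed
                    + " indexed=" + k.indexed + " stored=" + k.stored);
        }
    }
}

enum FieldKind {
    KEYWORD(false, true, true),    // searchable as one exact value, retrievable
    UNINDEXED(false, false, true), // retrievable only, never searched
    UNSTORED(true, true, false),   // searchable by token, not retrievable
    TEXT(true, true, true);        // searchable by token and retrievable

    final boolean analyzed;
    final boolean indexed;
    final boolean stored;

    FieldKind(boolean analyzed, boolean indexed, boolean stored) {
        this.analyzed = analyzed;
        this.indexed = indexed;
        this.stored = stored;
    }

    // Returns the field type matching the requested properties,
    // or throws if no such type exists (for example, analyzed but not indexed).
    static FieldKind choose(boolean analyzed, boolean indexed, boolean stored) {
        for (FieldKind k : values()) {
            if (k.analyzed == analyzed && k.indexed == indexed && k.stored == stored) {
                return k;
            }
        }
        throw new IllegalArgumentException("no field type with that combination");
    }
}
```

For example, a method body that must be searchable by token but should not enlarge the index corresponds to choose(true, true, false), which is the UnStored type used for the code field.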
You can use Luke to browse and modify indexes created with Lucene. Luke is an open-source tool for inspecting indexes. Figure 1 shows an index created by the JavaSourceCodeIndexer class.
Figure 1: Index in Luke
As you can see, import declarations are stored as they are, without being tokenized or analyzed. Class names and method names are converted to lowercase before they are stored.
Querying Java source code
After creating a multi-field index, you can use Lucene to query it. Lucene provides two important classes for searching: IndexSearcher and QueryParser. The QueryParser class parses the query expression entered by the user, and the IndexSearcher class searches the index for documents that satisfy the query. The following table lists possible queries and their meanings:
You can combine the different indexed syntax elements to form a query and search for code. The sample code used for searching is listed below.
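import java.io.File;

import org.apache.lucene.analysis.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

public class JavaCodeSearch {
    public static void main(String[] args) throws Exception {
        File indexDir = new File(args[0]);
        String q = args[1]; // parameter:JGraph code:insert
        // Open the existing index directory (false = do not create a new one)
        Directory fsDir = FSDirectory.getDirectory(indexDir, false);
        IndexSearcher is = new IndexSearcher(fsDir);

        // Analyze every field with JavaSourceCodeAnalyzer, except "import",
        // which was indexed as a keyword and must not be tokenized
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new JavaSourceCodeAnalyzer());
        analyzer.addAnalyzer("import", new KeywordAnalyzer());

        // "code" is the default field for terms without a field prefix
        Query query = QueryParser.parse(q, "code", analyzer);

        long start = System.currentTimeMillis();
        Hits hits = is.search(query);
        long end = System.currentTimeMillis();
        System.err.println("Found " + hits.length() + " docs in "
            + (end - start) + " millisec");

        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(doc.get("filename") + " with a score of " + hits.score(i));
        }
        is.close();
    }
}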
The IndexSearcher instance uses FSDirectory to open the directory containing the index. The Analyzer instance is then used to analyze the query string so that it takes the same form as the indexed terms (stemming, lowercasing, filtering out stop words, and so on). Lucene imposes a limitation on using fields indexed as Keyword in queries: it analyzes every field in the query with the one analyzer passed to QueryParser, so a keyword value would be tokenized and would no longer match the verbatim value in the index. To solve this problem, you can use the PerFieldAnalyzerWrapper class provided by Lucene to specify the analysis required for each field in the query. With it, the query string import:org.w3c.* AND code:Document parses the string org.w3c.* with KeywordAnalyzer and parses Document with JavaSourceCodeAnalyzer. If a term in the query has no field associated with it, QueryParser uses the default field code. It analyzes the query string with the PerFieldAnalyzerWrapper and returns the analyzed Query instance. The IndexSearcher instance takes the Query instance and returns a Hits instance, which contains the files that satisfy the query.
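To make the per-field behavior concrete, the following is a plain-Java sketch of the idea: each clause of a query is routed to the analyzer registered for its field, and terms without a field fall back to a default field. It does not use Lucene and handles only whitespace-separated clauses joined by AND. The class PerFieldAnalysisSketch and its methods are hypothetical, and lowercasing stands in for JavaSourceCodeAnalyzer, whose full behavior is not shown in this article.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Illustration of per-field analysis; this is not Lucene's PerFieldAnalyzerWrapper.
final class PerFieldAnalysisSketch {
    private final UnaryOperator<String> defaultAnalyzer;
    private final Map<String, UnaryOperator<String>> perField = new HashMap<>();

    PerFieldAnalysisSketch(UnaryOperator<String> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    void addAnalyzer(String field, UnaryOperator<String> analyzer) {
        perField.put(field, analyzer);
    }

    // Assumed setup: lowercase everything except the "import" field,
    // which is kept verbatim like a keyword.
    static PerFieldAnalysisSketch withKeywordImport() {
        PerFieldAnalysisSketch s =
            new PerFieldAnalysisSketch(t -> t.toLowerCase(Locale.ROOT));
        s.addAnalyzer("import", UnaryOperator.identity());
        return s;
    }

    // Returns "field:term" strings after applying the analyzer for each field.
    List<String> analyze(String query, String defaultField) {
        List<String> out = new ArrayList<>();
        for (String clause : query.trim().split("\\s+")) {
            if (clause.isEmpty() || clause.equals("AND")) {
                continue; // skip the connective and empty input
            }
            int colon = clause.indexOf(':');
            String field = colon < 0 ? defaultField : clause.substring(0, colon);
            String term = clause.substring(colon + 1);
            UnaryOperator<String> analyzer = perField.getOrDefault(field, defaultAnalyzer);
            out.add(field + ":" + analyzer.apply(term));
        }
        return out;
    }

    public static void main(String[] args) {
        PerFieldAnalysisSketch s = withKeywordImport();
        System.out.println(s.analyze("import:org.w3c.* AND code:Document", "code"));
        System.out.println(s.analyze("JGraph code:insert", "code"));
    }
}
```

In this sketch the import term passes through unchanged while the code term is lowercased, which mirrors why the keyword field needs its own analyzer at query time.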
Conclusion
This article introduced Lucene, a text search engine, and showed how it can be used to search source code by plugging in a custom analyzer and building a multi-field index. It covers only the basic functions of a code search engine. A more sophisticated analyzer can be used in source code search to improve search performance and return better query results. Such a search engine allows users to search and share source code across the software development community.