Integrate Lucene into Web Applications
Next, we will develop a web application that uses Lucene to retrieve HTML documents stored on the file server. Before you begin, prepare the following environment:
- Eclipse integrated development environment
- Tomcat 5.0
- Lucene Library
- JDK 1.5
In this example, we use Eclipse to develop the web application, which ultimately runs on Tomcat 5.0. With the environment prepared, we can begin development.
1. Create a dynamic web project
- In Eclipse, select File > New > Project, and then select Dynamic Web Project, as shown in Figure 2.
Figure 2: Create a dynamic web project
- After creating the dynamic web project, you will see the structure of the created project, as shown in Figure 3. The project name is sample.dw.paper.lucene.
Figure 3: Structure of a dynamic web project
2. Design the architecture of the web project
In our design, the system is divided into the following four subsystems:
- User Interface: This subsystem lets users submit search requests to the web application server and then displays the search results. We use a page named search.jsp to implement this subsystem.
- Request Manager: This subsystem manages the search requests sent from the client and dispatches them to the search subsystem. It then receives the search results from the search subsystem and forwards them to the user interface subsystem. We use a servlet to implement this subsystem.
- Search Subsystem: This subsystem searches the index files and passes the search results to the request manager. We use the API provided by Lucene to implement this subsystem.
- Index Subsystem: This subsystem creates indexes for HTML pages. We use the Lucene API and an HTML parser provided by Lucene to implement this subsystem.
Figure 4 shows the details of our design. We put the user interface subsystem under the WebContent directory, where you will find a page named search.jsp. The request manager subsystem is in the package sample.dw.paper.lucene.servlet, where the class SearchController implements its functions. The search subsystem is in the package sample.dw.paper.lucene.search and contains two classes, SearchManager and SearchResultBean. The first class implements the search function, and the second describes the structure of a search result. The index subsystem is in the package sample.dw.paper.lucene.index, where the class IndexManager creates indexes for HTML files. This subsystem uses the methods getTitle and getContent provided by the class HTMLDocParser in the package sample.dw.paper.lucene.util to parse HTML pages.
Figure 4: Architecture Design of the project
3. Implementation of subsystems
After analyzing the system architecture design, let's look at the system implementation details.
- User Interface: This subsystem consists of a single JSP file named search.jsp, which has two parts. The first part provides a user interface for submitting a search request to the web application server, as shown in Figure 5. Note that the search request is sent to a servlet named SearchController. The mapping between servlet names and their implementing classes is specified in web.xml.
Figure 5: Submit a search request to the Web server
The second part of this JSP displays the search results to the user, as shown in Figure 6:
Figure 6: Display Search Results
- Request Manager: A servlet named SearchController implements this subsystem. Listing 6 shows the source code of this class.
Listing 6: Implementation of the Request Manager
package sample.dw.paper.lucene.servlet;

import java.io.IOException;
import java.util.List;

import javax.servlet.RequestDispatcher;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import sample.dw.paper.lucene.search.SearchManager;

/**
 * This servlet is used to deal with the search request
 * and return the search results to the client
 */
public class SearchController extends HttpServlet {

    private static final long serialVersionUID = 1L;

    public void doPost(HttpServletRequest request, HttpServletResponse response)
            throws IOException, ServletException {
        // Read the search term submitted by search.jsp
        String searchWord = request.getParameter("searchWord");
        SearchManager searchManager = new SearchManager(searchWord);
        List searchResult = null;
        searchResult = searchManager.search();
        // Hand the results back to search.jsp for display
        RequestDispatcher dispatcher = request.getRequestDispatcher("search.jsp");
        request.setAttribute("searchResult", searchResult);
        dispatcher.forward(request, response);
    }

    public void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException, ServletException {
        doPost(request, response);
    }
}
In Listing 6, the doPost method obtains the search term from the client and creates an instance of the class SearchManager, which is defined in the search subsystem. Then the search method of SearchManager is called. Finally, the search result is returned to the client.
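Listing 6 passes the request parameter to SearchManager unchanged, and getParameter returns null when the parameter is absent (for example, when search.jsp is opened directly through doGet). The following is a minimal sketch of a guard the servlet could apply before searching. The class SearchTerms and its methods are hypothetical names introduced for illustration and are not part of the sample project.

```java
/**
 * Hypothetical helper (not part of the sample project) that
 * normalizes the raw "searchWord" request parameter.
 */
class SearchTerms {

    /** Returns the trimmed term, or an empty string if the parameter was missing. */
    static String normalize(String raw) {
        return raw == null ? "" : raw.trim();
    }

    /** Returns true only if there is a non-blank term worth passing to the search subsystem. */
    static boolean isSearchable(String raw) {
        return normalize(raw).length() > 0;
    }
}
```

With such a helper, doPost could skip the call to SearchManager and forward an empty result list whenever isSearchable returns false.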
- Search Subsystem: In this subsystem, we define two classes: SearchManager and SearchResultBean. The first class implements the search function, and the second is a JavaBean that describes the structure of a search result. Listing 7 shows the source code of the class SearchManager.
Listing 7: Implementation of the search function
package sample.dw.paper.lucene.search;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

import sample.dw.paper.lucene.index.IndexManager;

/**
 * This class is used to search the
 * Lucene index and return search results
 */
public class SearchManager {

    private String searchWord;
    private IndexManager indexManager;
    private Analyzer analyzer;

    public SearchManager(String searchWord) {
        this.searchWord = searchWord;
        this.indexManager = new IndexManager();
        this.analyzer = new StandardAnalyzer();
    }

    /**
     * do search
     */
    public List search() {
        List searchResult = new ArrayList();
        // Create the index on first use; return an empty list if that fails
        if (!indexManager.ifIndexExist()) {
            try {
                if (!indexManager.createIndex()) {
                    return searchResult;
                }
            } catch (IOException e) {
                e.printStackTrace();
                return searchResult;
            }
        }

        IndexSearcher indexSearcher = null;
        try {
            indexSearcher = new IndexSearcher(indexManager.getIndexDir());
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }

        // Parse the search term against the "content" field
        QueryParser queryParser = new QueryParser("content", analyzer);
        Query query = null;
        try {
            query = queryParser.parse(searchWord);
        } catch (ParseException e) {
            e.printStackTrace();
        }

        // Search only if both the query and the searcher were created
        if (null != query && null != indexSearcher) {
            try {
                Hits hits = indexSearcher.search(query);
                for (int i = 0; i < hits.length(); i++) {
                    SearchResultBean resultBean = new SearchResultBean();
                    resultBean.setHtmlPath(hits.doc(i).get("path"));
                    resultBean.setHtmlTitle(hits.doc(i).get("title"));
                    searchResult.add(resultBean);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return searchResult;
    }
}
In Listing 7, note that this class has three private attributes. The first, searchWord, holds the search term from the client. The second, indexManager, is an instance of the class IndexManager defined in the index subsystem. The third, analyzer, is the analyzer used to parse the search term. Now let's focus on the search method. This method first checks whether the index already exists. If it does, the method searches the existing index. If it does not, the method first calls the createIndex method of the class IndexManager to create the index and then searches the newly created index. After the search results are returned, the method extracts the required attributes from each result and creates an instance of the class SearchResultBean for it. Finally, these SearchResultBean instances are put into a list and returned to the request manager.
The class SearchResultBean contains two attributes, htmlPath and htmlTitle, together with their get and set methods. This also means that each search result carries two attributes: htmlPath is the path of the HTML file, and htmlTitle is the title of the HTML file.
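The source of SearchResultBean is not listed in this article. A minimal sketch that matches the description above and the setter calls in Listing 7 might look like the following. The package declaration is omitted and the class is left package-private so that the example stands alone, which are assumptions of this sketch rather than details of the original project.

```java
/**
 * Sketch of the result bean: one instance per matching HTML document.
 * Property names follow the setHtmlPath and setHtmlTitle calls in Listing 7.
 */
class SearchResultBean {

    private String htmlPath;   // path of the matching HTML file
    private String htmlTitle;  // title of the matching HTML file

    public String getHtmlPath() {
        return htmlPath;
    }

    public void setHtmlPath(String htmlPath) {
        this.htmlPath = htmlPath;
    }

    public String getHtmlTitle() {
        return htmlTitle;
    }

    public void setHtmlTitle(String htmlTitle) {
        this.htmlTitle = htmlTitle;
    }
}
```

Because the class follows JavaBean naming conventions, a JSP can read the two properties through their getters when it renders the result list.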
- Index Subsystem: The class IndexManager implements this subsystem. Listing 8 shows the source code of this class.
Listing 8: Implementation of the index Subsystem
package sample.dw.paper.lucene.index;

import java.io.File;
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import sample.dw.paper.lucene.util.HTMLDocParser;

/**
 * This class is used to create an index for HTML files
 *
 */
public class IndexManager {

    //the directory that stores HTML files
    private final String dataDir = "c://dataDir";

    //the directory that is used to store a Lucene index
    private final String indexDir = "c://indexDir";

    /**
     * create index
     */
    public boolean createIndex() throws IOException {
        if (ifIndexExist()) {
            return true;
        }
        File dir = new File(dataDir);
        if (!dir.exists()) {
            return false;
        }
        File[] htmls = dir.listFiles();
        Directory fsDirectory = FSDirectory.getDirectory(indexDir, true);
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriter indexWriter = new IndexWriter(fsDirectory, analyzer, true);
        // Index only files with an .html or .htm extension
        for (int i = 0; i < htmls.length; i++) {
            String htmlPath = htmls[i].getAbsolutePath();
            if (htmlPath.endsWith(".html") || htmlPath.endsWith(".htm")) {
                addDocument(htmlPath, indexWriter);
            }
        }
        indexWriter.optimize();
        indexWriter.close();
        return true;
    }

    /**
     * Add one document to the Lucene index
     */
    public void addDocument(String htmlPath, IndexWriter indexWriter) {
        HTMLDocParser htmlParser = new HTMLDocParser(htmlPath);
        String path = htmlParser.getPath();
        String title = htmlParser.getTitle();
        Reader content = htmlParser.getContent();

        Document document = new Document();
        // path: stored but not indexed; title: stored and tokenized; content: read from the Reader
        document.add(new Field("path", path, Field.Store.YES, Field.Index.NO));
        document.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        document.add(new Field("content", content));
        try {
            indexWriter.addDocument(document);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * judge if the index exists already
     */
    public boolean ifIndexExist() {
        File directory = new File(indexDir);
        // listFiles() returns null if the index directory does not exist yet
        File[] files = directory.listFiles();
        return files != null && files.length > 0;
    }

    public String getDataDir() {
        return this.dataDir;
    }

    public String getIndexDir() {
        return this.indexDir;
    }
}
This class contains two private attributes, dataDir and indexDir. dataDir is the path of the directory that holds the HTML pages waiting to be indexed, and indexDir is the path where the Lucene index files are stored. The class IndexManager provides three methods: createIndex, addDocument, and ifIndexExist. If the index does not exist, you can use createIndex to create a new one. addDocument adds a document to an index; in our scenario, a document is an HTML page. The addDocument method calls the methods provided by the class HTMLDocParser to parse the HTML content. You can use the last method, ifIndexExist, to determine whether a Lucene index already exists.
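The two file-system checks described above (does an index already exist, and which files in the data directory are HTML pages) can be sketched with the JDK alone. The class IndexFiles and its method names are hypothetical and not part of the sample project. Unlike createIndex, this sketch tests the file name rather than the absolute path and skips subdirectories, which is an assumption about the intended behavior.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not part of the sample project) showing
 * null-safe versions of the directory checks used by the index subsystem.
 */
class IndexFiles {

    /** Returns true only if indexDir is a directory that contains at least one entry. */
    static boolean indexExists(File indexDir) {
        // listFiles() returns null when the path is missing or is not a directory
        File[] entries = indexDir.listFiles();
        return entries != null && entries.length > 0;
    }

    /** Returns the regular files in dataDir whose names end with .html or .htm. */
    static List<File> listHtmlFiles(File dataDir) {
        List<File> result = new ArrayList<File>();
        File[] entries = dataDir.listFiles();
        if (entries == null) {
            return result;  // missing or unreadable directory: nothing to index
        }
        for (File entry : entries) {
            String name = entry.getName();
            if (entry.isFile() && (name.endsWith(".html") || name.endsWith(".htm"))) {
                result.add(entry);
            }
        }
        return result;
    }
}
```

Treating a missing directory as "no index" and "no files" lets the caller decide what to do next without having to catch a NullPointerException.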
Now let's take a look at the class HTMLDocParser in the package sample.dw.paper.lucene.util. This class extracts text information from HTML files. It contains three methods: getContent, getTitle, and getPath. The first method returns the text content of the HTML file with the HTML markup removed, the second returns the title of the HTML file, and the last returns the path of the HTML file. Listing 9 shows the source code of this class.
Listing 9: HTML Parser
package sample.dw.paper.lucene.util;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;

import org.apache.lucene.demo.html.HTMLParser;

public class HTMLDocParser {

    private String htmlPath;
    private HTMLParser htmlParser;

    public HTMLDocParser(String htmlPath) {
        this.htmlPath = htmlPath;
        initHtmlParser();
    }

    // Opens the HTML file as UTF-8 and wraps it in the Lucene demo parser
    private void initHtmlParser() {
        InputStream inputStream = null;
        try {
            inputStream = new FileInputStream(htmlPath);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        if (null != inputStream) {
            try {
                htmlParser = new HTMLParser(new InputStreamReader(inputStream, "utf-8"));
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
        }
    }

    // Returns the page title, or an empty string if it cannot be read
    public String getTitle() {
        if (null != htmlParser) {
            try {
                return htmlParser.getTitle();
            } catch (IOException e) {
                e.printStackTrace();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        return "";
    }

    // Returns a Reader over the page text, or null if it cannot be read
    public Reader getContent() {
        if (null != htmlParser) {
            try {
                return htmlParser.getReader();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return null;
    }

    public String getPath() {
        return this.htmlPath;
    }
}
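To make concrete what getTitle and getContent are expected to return, here is a rough stand-in written with regular expressions from the JDK. It is not the Lucene demo HTMLParser and does not reproduce its behavior: it ignores scripts, comments, and character entities, and the class name SimpleHtmlText is hypothetical. Treat it only as an illustration of "title" versus "text with the markup removed".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical regex-based stand-in (not the Lucene demo HTMLParser)
 * that illustrates the kind of output the parser methods produce.
 */
class SimpleHtmlText {

    // Captures the text between <title> and </title>, case-insensitively
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any single tag such as <p> or </b>
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Returns the trimmed title text, or an empty string if there is no title element. */
    static String getTitle(String html) {
        Matcher matcher = TITLE.matcher(html);
        return matcher.find() ? matcher.group(1).trim() : "";
    }

    /** Returns the text with every tag replaced by a space and whitespace collapsed. */
    static String getContent(String html) {
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }
}
```

A real parser is preferable for indexing because regular expressions cannot handle malformed or nested markup reliably.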
4. Run the application on Tomcat 5.0
Now we can run the developed application on Tomcat 5.0.
- Right-click search.jsp and then select Run As > Run on Server, as shown in Figure 7.
Figure 7: Configure Tomcat 5.0
- In the displayed window, select Tomcat v5.0 Server as the target web application server, and then click Next, as shown in Figure 8:
Figure 8: Select Tomcat 5.0
- Now you need to specify the paths of Apache Tomcat 5.0 and of the JRE used to run the web application. The JRE version you select here must be the same as the one you used to compile the Java files. After configuration, click Finish. See Figure 9.
Figure 9: Complete Tomcat 5.0 configuration
- After configuration, Tomcat starts automatically, compiles search.jsp, and displays it to the user. See Figure 10.
Figure 10: User Interface
- Enter the keyword "information" in the input box and click the Search button. The search results are then displayed, as shown in Figure 11.
Figure 11: Search Results
- Click the first link in the search results. The content of the linked page is displayed, as shown in Figure 12.
Figure 12: Details
We have now completed the sample project and implemented the search and index functions using Lucene. You can download the source code of this project.
Summary
Lucene provides flexible interfaces that make it easy to design web search applications. If you want to add a search function to your next application, Lucene is a good choice.