Integrate Lucene into Web Applications



Next, we will develop a web application that uses Lucene to retrieve HTML documents stored on the file server. Before you begin, prepare the following environment:

  1. Eclipse integrated development environment
  2. Tomcat 5.0
  3. Lucene Library
  4. JDK 1.5

In this example, we use Eclipse to develop the web application, which ultimately runs on Tomcat 5.0. With the environment prepared, we can start development.

1. Create a dynamic web project

  1. In Eclipse, select File > New > Project and then select Dynamic Web Project, as shown in Figure 2.

Figure 2: Create a dynamic web project

 

  2. After creating the dynamic web project, you will see the project structure shown in Figure 3. The project name is sample.dw.paper.lucene.

Figure 3: Structure of a dynamic web project

 

2. Design the architecture of the web project

In our design, the system is divided into the following four subsystems:

  1. User Interface: This subsystem lets users submit search requests to the web application server and displays the search results. We implement it as a JSP page named search.jsp.
  2. Request Manager: This subsystem receives search requests from the client and dispatches them to the search subsystem. It then takes the results returned by the search subsystem and forwards them to the user interface subsystem. We implement it as a servlet.
  3. Search Subsystem: This subsystem searches the index files and passes the search results to the request manager. We implement it with the Lucene API.
  4. Index Subsystem: This subsystem creates the index for the HTML pages. We implement it with the Lucene API and an HTML parser provided by Lucene.

Figure 4 shows the design in detail. The user interface subsystem lives in the WebContent directory, which contains the page search.jsp. The request manager is implemented by the SearchController class in the package sample.dw.paper.lucene.servlet. The search subsystem is in the package sample.dw.paper.lucene.search and contains two classes, SearchManager and SearchResultBean: the first implements the search function, and the second describes the structure of a search result. The index subsystem is in the package sample.dw.paper.lucene.index, where the IndexManager class creates the index for the HTML files. It uses the getTitle and getContent methods of the HTMLDocParser class in the package sample.dw.paper.lucene.util to parse HTML pages.

Figure 4: Architecture Design of the project

 

3. Implementation of subsystems

After analyzing the system architecture design, let's look at the system implementation details.

  1. User Interface: This subsystem consists of a single JSP file, search.jsp, which has two parts. The first part provides the form used to submit a search request to the web application server, as shown in Figure 5. Note that the search request is sent to a servlet named SearchController. The mapping between servlet names and their implementing classes is specified in web.xml.

Figure 5: Submit a search request to the Web server

 

The second part of the JSP displays the search results to the user, as shown in Figure 6:

Figure 6: Display Search Results

  2. Request Manager: A servlet named SearchController implements this subsystem. Listing 6 shows the source code of this class.

Listing 6: Implementation of the Request Manager

package sample.dw.paper.lucene.servlet;

import java.io.IOException;
import java.util.List;

import javax.servlet.RequestDispatcher;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import sample.dw.paper.lucene.search.SearchManager;

/**
 * This servlet is used to deal with the search request
 * and return the search results to the client
 */
public class SearchController extends HttpServlet {

    private static final long serialVersionUID = 1L;

    public void doPost(HttpServletRequest request, HttpServletResponse response)
            throws IOException, ServletException {
        String searchWord = request.getParameter("searchWord");
        SearchManager searchManager = new SearchManager(searchWord);
        List searchResult = null;
        searchResult = searchManager.search();
        RequestDispatcher dispatcher = request.getRequestDispatcher("search.jsp");
        request.setAttribute("searchResult", searchResult);
        dispatcher.forward(request, response);
    }

    public void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException, ServletException {
        doPost(request, response);
    }
}

In Listing 6, the doPost method obtains the search term from the client and creates an instance of the SearchManager class, which is defined in the search subsystem. It then calls the search method of SearchManager. Finally, the search results are returned to the client.
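Listing 6 passes the searchWord parameter to SearchManager without checking it, and HttpServletRequest.getParameter returns null when the parameter is absent (for example, when the servlet URL is requested directly without a query). A minimal sketch of a guard that doPost could apply first is shown below. The SearchWords class and its normalize method are hypothetical names introduced for this illustration; they are not part of the sample project.

```java
/**
 * Hypothetical helper, not part of the sample project: normalizes the raw
 * "searchWord" request parameter before it is handed to SearchManager.
 */
final class SearchWords {

    private SearchWords() {
        // static helper, no instances
    }

    /**
     * Returns the trimmed search word, or null when the parameter was
     * missing or contained only whitespace.
     */
    static String normalize(String raw) {
        if (raw == null) {
            return null;
        }
        String trimmed = raw.trim();
        return trimmed.length() == 0 ? null : trimmed;
    }
}
```

With such a guard, doPost could skip the search and forward to search.jsp with an empty list whenever normalize returns null.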

  3. Search Subsystem: This subsystem defines two classes, SearchManager and SearchResultBean. The first implements the search function, and the second is a JavaBean that describes the structure of a search result. Listing 7 shows the source code of the SearchManager class.

Listing 7: Implementation of the search function

package sample.dw.paper.lucene.search;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

import sample.dw.paper.lucene.index.IndexManager;

/**
 * This class is used to search the
 * Lucene index and return search results
 */
public class SearchManager {

    private String searchWord;

    private IndexManager indexManager;

    private Analyzer analyzer;

    public SearchManager(String searchWord) {
        this.searchWord   = searchWord;
        this.indexManager = new IndexManager();
        this.analyzer     = new StandardAnalyzer();
    }

    /**
     * do search
     */
    public List search() {
        List searchResult = new ArrayList();
        // create the index first if it does not exist yet
        if (false == indexManager.ifIndexExist()) {
            try {
                if (false == indexManager.createIndex()) {
                    return searchResult;
                }
            } catch (IOException e) {
                e.printStackTrace();
                return searchResult;
            }
        }

        IndexSearcher indexSearcher = null;
        try {
            indexSearcher = new IndexSearcher(indexManager.getIndexDir());
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }

        QueryParser queryParser = new QueryParser("content", analyzer);
        Query query = null;
        try {
            query = queryParser.parse(searchWord);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        // search only if both the query and the searcher were created
        if (null != query && null != indexSearcher) {
            try {
                Hits hits = indexSearcher.search(query);
                for (int i = 0; i < hits.length(); i++) {
                    SearchResultBean resultBean = new SearchResultBean();
                    resultBean.setHtmlPath(hits.doc(i).get("path"));
                    resultBean.setHtmlTitle(hits.doc(i).get("title"));
                    searchResult.add(resultBean);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return searchResult;
    }
}

In Listing 7, note that the class has three private fields. The first, searchWord, holds the search term from the client. The second, indexManager, is an instance of the IndexManager class defined in the index subsystem. The third, analyzer, is the analyzer used to parse the search term. Now let's focus on the search method. It first checks whether the index already exists. If it does, the method searches the existing index; if not, it first calls the createIndex method of IndexManager to create the index and then searches the newly created index. Once the search returns, the method extracts the required fields from each hit and creates a SearchResultBean instance for it. Finally, the SearchResultBean instances are put in a list and returned to the request manager.

The SearchResultBean class contains two properties, htmlPath and htmlTitle, together with their get and set methods. This means that each search result carries two pieces of information: htmlPath is the path of the HTML file, and htmlTitle is its title.
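The source of SearchResultBean is not listed in this article. Based on the description above, a minimal sketch could look like the following. The property names follow the setHtmlPath and setHtmlTitle calls made in Listing 7; the omitted package declaration and the non-public class modifier are simplifications made here so the sketch stands alone.

```java
/**
 * Sketch of the search result JavaBean described in the text.
 * Assumption: in the project this class would be declared public in the
 * package sample.dw.paper.lucene.search; both are omitted here to keep
 * the sketch standalone.
 */
class SearchResultBean {

    // path of the HTML file that matched the query
    private String htmlPath;

    // title of the HTML file that matched the query
    private String htmlTitle;

    public String getHtmlPath() {
        return htmlPath;
    }

    public void setHtmlPath(String htmlPath) {
        this.htmlPath = htmlPath;
    }

    public String getHtmlTitle() {
        return htmlTitle;
    }

    public void setHtmlTitle(String htmlTitle) {
        this.htmlTitle = htmlTitle;
    }
}
```

search.jsp can then read the list stored under the searchResult request attribute and call getHtmlPath and getHtmlTitle on each element to render a link.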

  4. Index Subsystem: The IndexManager class implements this subsystem. Listing 8 shows the source code of this class.

Listing 8: Implementation of the index Subsystem

package sample.dw.paper.lucene.index;

import java.io.File;
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import sample.dw.paper.lucene.util.HTMLDocParser;

/**
 * This class is used to create an index for HTML files
 */
public class IndexManager {

    // the directory that stores HTML files
    private final String dataDir  = "c://dataDir";

    // the directory that is used to store a Lucene index
    private final String indexDir = "c://indexDir";

    /**
     * create index
     */
    public boolean createIndex() throws IOException {
        if (true == ifIndexExist()) {
            return true;
        }
        File dir = new File(dataDir);
        if (!dir.exists()) {
            return false;
        }
        File[] htmls = dir.listFiles();
        Directory fsDirectory = FSDirectory.getDirectory(indexDir, true);
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriter indexWriter = new IndexWriter(fsDirectory, analyzer, true);
        for (int i = 0; i < htmls.length; i++) {
            String htmlPath = htmls[i].getAbsolutePath();
            // index only HTML files
            if (htmlPath.endsWith(".html") || htmlPath.endsWith(".htm")) {
                addDocument(htmlPath, indexWriter);
            }
        }
        indexWriter.optimize();
        indexWriter.close();
        return true;
    }

    /**
     * Add one document to the Lucene index
     */
    public void addDocument(String htmlPath, IndexWriter indexWriter) {
        HTMLDocParser htmlParser = new HTMLDocParser(htmlPath);
        String path    = htmlParser.getPath();
        String title   = htmlParser.getTitle();
        Reader content = htmlParser.getContent();

        Document document = new Document();
        document.add(new Field("path", path, Field.Store.YES, Field.Index.NO));
        document.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        document.add(new Field("content", content));
        try {
            indexWriter.addDocument(document);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * judge if the index exists already
     */
    public boolean ifIndexExist() {
        File directory = new File(indexDir);
        // listFiles() returns null if the directory does not exist,
        // so guard against that before checking the length
        File[] files = directory.listFiles();
        return files != null && files.length > 0;
    }

    public String getDataDir() {
        return this.dataDir;
    }

    public String getIndexDir() {
        return this.indexDir;
    }
}

This class contains two private fields, dataDir and indexDir. dataDir is the path of the directory containing the HTML pages to be indexed, and indexDir is the path where the Lucene index files are stored. IndexManager provides three methods: createIndex, addDocument, and ifIndexExist. createIndex creates a new index if one does not already exist. addDocument adds one document to the index; in our scenario, a document is an HTML page. addDocument calls the methods provided by the HTMLDocParser class to parse the HTML document. The last method, ifIndexExist, determines whether a Lucene index already exists.
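In Listing 8, createIndex selects the files to index by testing whether each path ends with .html or .htm, a check that is case-sensitive. The same selection can be expressed as a java.io.FilenameFilter and passed to File.listFiles. The sketch below is one way to do that; HtmlFileFilter is a hypothetical class name, and the case-insensitive match is an assumption that goes beyond the original code.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Locale;

/**
 * Hypothetical filter, not part of the sample project: accepts file names
 * ending in .html or .htm, ignoring case.
 */
class HtmlFileFilter implements FilenameFilter {

    public boolean accept(File dir, String name) {
        // lower-case with a fixed locale so the check does not depend on
        // the default locale of the JVM
        String lower = name.toLowerCase(Locale.ENGLISH);
        return lower.endsWith(".html") || lower.endsWith(".htm");
    }
}
```

createIndex could then call dir.listFiles(new HtmlFileFilter()) and index every returned file. Note that listFiles returns null when dir is not a readable directory, so the result should be checked before iterating.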

Now let's look at the HTMLDocParser class in the package sample.dw.paper.lucene.util. This class extracts text from HTML files and provides three methods: getContent, getTitle, and getPath. The first returns the text content of the file with the HTML markup removed, the second returns the title of the HTML file, and the last returns its path. Listing 9 shows the source code of this class.

Listing 9: HTML Parser

package sample.dw.paper.lucene.util;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;

import org.apache.lucene.demo.html.HTMLParser;

public class HTMLDocParser {

    private String htmlPath;

    private HTMLParser htmlParser;

    public HTMLDocParser(String htmlPath) {
        this.htmlPath = htmlPath;
        initHtmlParser();
    }

    private void initHtmlParser() {
        InputStream inputStream = null;
        try {
            inputStream = new FileInputStream(htmlPath);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        if (null != inputStream) {
            try {
                htmlParser = new HTMLParser(new InputStreamReader(inputStream, "utf-8"));
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
        }
    }

    public String getTitle() {
        if (null != htmlParser) {
            try {
                return htmlParser.getTitle();
            } catch (IOException e) {
                e.printStackTrace();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        return "";
    }

    public Reader getContent() {
        if (null != htmlParser) {
            try {
                return htmlParser.getReader();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return null;
    }

    public String getPath() {
        return this.htmlPath;
    }
}

4. Run the application on Tomcat 5.0

Now we can run the developed application on Tomcat 5.0.

  1. Right-click search.jsp and select Run As > Run on Server, as shown in Figure 7.

Figure 7: Configure Tomcat 5.0

  2. In the window that appears, select Tomcat v5.0 Server as the target web application server and click Next, as shown in Figure 8:

Figure 8: Select Tomcat 5.0

  3. Now specify the installation path of Apache Tomcat 5.0 and the JRE used to run the web application. The JRE version you select must match the one used to compile the Java files. When the configuration is complete, click Finish. See Figure 9.

Figure 9: Complete Tomcat 5.0 configuration

 

  4. After configuration, Tomcat starts automatically, compiles search.jsp, and displays it to the user. See Figure 10.

Figure 10: User Interface

  5. Enter the keyword "information" in the input box and click the Search button. The search results are displayed, as shown in Figure 11.

Figure 11: Search Results

 

  6. Click the first link in the search results. The content of the linked page is displayed, as shown in Figure 12.

Figure 12: Details

 

We have now completed the sample project and implemented the search and index functions with Lucene. You can download the source code of this project.

Summary

Lucene provides flexible interfaces that make it straightforward to design web search applications. If you need to add search to your next application, Lucene is a good choice.

 
