Make a search engine with Python (pylucene)


1. What is a search engine?

A search engine is "a system that collects and organizes network information resources and provides an information query service, consisting of three parts: information collection, information organization, and user query." Figure 1 shows the general structure of a search engine. The information collection module gathers information from the network into the web database (usually with a crawler); the information organization module then segments the collected text into words, removes stop words, and weights terms to build an index table (usually an inverted index), forming the index database. Finally, the user query module interprets the user's search needs and provides the retrieval service.

Figure 1 General structure of the search engine

2. Using Python to implement a simple search engine

2.1 Problem Analysis

As shown in Figure 1, a complete search engine starts by gathering information from the Internet, and for that we can write a crawler in Python, which is one of Python's strengths.

Next comes the information processing module. Word segmentation? Stop words? Inverted index? What is all this? Don't worry: we have a wheel our predecessors already built, pylucene (the Python wrapper of Lucene; Lucene is an open-source full-text indexing and search library that helps developers add search functionality to software and systems). Pylucene easily handles the processing of the collected information for us, including both indexing and searching.

Finally, so that our search engine can be used on the web, we use flask, a lightweight web application framework, to build a small web page that accepts the search statement and returns the search results.

2.2 Crawler Design

The crawler mainly collects the following content: the title of the target page, the main text content of the target page, and the URLs on the target page that point to other pages. Figure 2 shows the web crawler's workflow. The crawler's main data structure is a queue. First, the starting seed node enters the queue; then a node is taken from the queue and visited, the target information on that node's page is crawled, and the URLs on that page that link to other pages are placed in the queue; a new node is then taken from the queue and visited, and so on until the queue is empty. The queue's FIFO behavior implements a breadth-first traversal that visits each page of the site one by one; a minimal sketch of this loop follows Figure 2.

Figure 2 Web crawler workflow
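To make the workflow concrete, here is a minimal sketch of this breadth-first loop. The fetch_page and extract helpers are hypothetical placeholders for the article's actual download and parsing code; only the queue handling is the point:

from collections import deque

def crawl(seed_url, fetch_page, extract, max_pages=100):
    # fetch_page(url) -> html and extract(html) -> (title, content, links)
    # are hypothetical stand-ins for the real crawler's download/parse code.
    queue = deque([seed_url])  # FIFO queue drives the breadth-first traversal
    visited = set([seed_url])  # never enqueue the same page twice
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()  # take the next node from the queue
        title, content, links = extract(fetch_page(url))
        pages.append({"url": url, "title": title, "content": content})
        for link in links:  # enqueue this page's links to other pages
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return pages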

2.3 Using pylucene

The pylucene classes mainly involved in building an index are Directory, Analyzer, IndexWriter, Document, and Field.

Directory is pylucene's class for file operations. It has eleven subclasses, among them SimpleFSDirectory, RAMDirectory, CompoundFileDirectory, and FileSwitchDirectory; these four are the subclasses concerned with where the index is saved. SimpleFSDirectory saves the built index to the file system; RAMDirectory keeps the index in RAM; CompoundFileDirectory is a compound-file way of saving the index; and FileSwitchDirectory can switch between saving methods on the fly so as to combine the advantages of each.
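For example, the two most common storage choices look like this (a minimal sketch against the flat pylucene 3.x module used throughout this article; the index path is made up):

import lucene
from lucene import SimpleFSDirectory, RAMDirectory, File

lucene.initVM()  # start the embedded JVM before using any Lucene class
disk_index = SimpleFSDirectory(File("./glxy"))  # index persisted to the file system
ram_index = RAMDirectory()  # index held only in memory, lost on exit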

Analyzer, the analyzer. It is the class that processes the text, obtained by the crawler, that is about to be indexed: it segments the text into words, removes stop words, converts case, and so on. Pylucene comes with several analyzers, and third-party or self-written analyzers can also be used when building an index. The choice of analyzer affects the quality of the index and the accuracy and speed of the search service.
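As a rough illustration of what an analyzer does, the sketch below tokenizes a string with the old pylucene 3.x token-stream API. Treat the attribute class as an assumption: TermAttribute is the Lucene 3.0 name, and later versions renamed it CharTermAttribute:

import lucene
from lucene import CJKAnalyzer, Version, StringReader, TermAttribute

lucene.initVM()
analyzer = CJKAnalyzer(Version.LUCENE_30)
stream = analyzer.tokenStream("content", StringReader(u"full-text search sample"))
term = stream.addAttribute(TermAttribute.class_)  # exposes each token's text
while stream.incrementToken():  # step through the tokens one by one
    print term.term()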

IndexWriter, the index-writing class. An IndexWriter can write, modify, add to, and delete from the index in the storage space opened by a Directory, but it cannot read or search the index.

Document, the document class. The basic unit of indexing in pylucene is the document, which can be a web page, an article, or an e-mail message. Document is the unit from which the index is built and the unit of result returned by a search; designing it well makes a personalized search service possible.

Field, the field class. A document can contain multiple fields, and a Field is one component of a Document; for example, an article may be composed of the article title, the article body, the author, the publication date, and many other fields.

Here a page is used as a document containing three fields: the page address (url), the page title (title), and the main text content of the page (content). For index storage, the SimpleFSDirectory class saves the index to files. For the analyzer, pylucene's own CJKAnalyzer is chosen: it has good support for Chinese and is suitable for processing text with Chinese content.

The steps to build an index using Pylucene are as follows:

lucene.initVM()
INDEXIDR = self.__index_dir
indexdir = SimpleFSDirectory(File(INDEXIDR))  # ①
analyzer = CJKAnalyzer(Version.LUCENE_30)  # ②
index_writer = IndexWriter(indexdir, analyzer, True, IndexWriter.MaxFieldLength(512))  # ③
document = Document()  # ④
document.add(Field("content", str(page_info["content"]), Field.Store.NO, Field.Index.ANALYZED))  # ⑤
document.add(Field("url", visiting, Field.Store.YES, Field.Index.NOT_ANALYZED))  # ⑥
document.add(Field("title", str(page_info["title"]), Field.Store.YES, Field.Index.ANALYZED))  # ⑦
index_writer.addDocument(document)  # ⑧
index_writer.optimize()  # ⑨
index_writer.close()  # ⑩

There are 10 main steps to building an index:

① instantiates a SimpleFSDirectory object, which saves the index to local files under the custom path INDEXIDR.

② instantiates a CJKAnalyzer analyzer; the argument Version.LUCENE_30 is the version number of pylucene.

③ instantiates an IndexWriter object with four parameters: the SimpleFSDirectory object and the CJKAnalyzer analyzer instantiated above, the Boolean True indicating that a new index is to be created, and IndexWriter.MaxFieldLength(512), which sets the maximum number of terms indexed per field.

④ instantiates a Document object named document.

⑤ adds a field named "content" to the document. The content of this field is the main text of a web page collected by the crawler. The argument is a Field object that is instantiated and used on the spot; the Field constructor takes four parameters:

(1) "content", the name of the field.

(2) page_info["content"], the main text content of the page collected by the crawler.

(3) the Field.Store variable, which indicates whether the value stored in the field can be restored to the original text: Field.Store.YES means the stored content can be restored to the original text, Field.Store.NO means it cannot.

(4) the Field.Index variable, which indicates whether the analyzer is applied to the field's content: Field.Index.ANALYZED means the analyzer processes the field's characters, Field.Index.NOT_ANALYZED means it does not.

⑥ adds a field named "url" to save the page address.

⑦ adds a field named "title" to save the page title.

⑧ the instantiated IndexWriter writes the document to the index file.

⑨ optimizes the index library file, merging small files in the index library into large files.

⑩ closes the IndexWriter object once the index-building operations of this cycle are complete.

The pylucene classes mainly involved in searching the index are IndexSearcher, Query, and QueryParser[16].

IndexSearcher, the index search class. It is used to perform search operations on the index library built by IndexWriter.

Query, the class that describes a query request. It submits the query request to IndexSearcher to complete the search operation. Query has many subclasses for different kinds of query request. For example, TermQuery searches by term; it is the most basic and simplest query type, used to match documents containing a particular term in a specified field. RangeQuery searches a range, matching documents whose value in a specified field falls within a given range. FuzzyQuery is a fuzzy query that tolerantly matches terms semantically close to the query keyword.
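A few of these subclasses can also be constructed directly (a sketch against the flat pylucene 3.x module; note that in Lucene 3.x the range query class is named TermRangeQuery, and the field names and values here are made up):

import lucene
from lucene import Term, TermQuery, FuzzyQuery, TermRangeQuery

lucene.initVM()
exact = TermQuery(Term("title", "keyword"))  # documents whose title field contains "keyword"
fuzzy = FuzzyQuery(Term("title", "keyword"))  # also match terms close to "keyword"
ranged = TermRangeQuery("id", "123", "456", True, True)  # id values from "123" to "456", inclusive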

QueryParser, the query parser. Implementing different query requirements requires different Query subclasses, which makes querying inconvenient and confusing, so pylucene also provides the query syntax parser QueryParser. QueryParser parses the submitted query statement and selects the appropriate Query subclass according to the query syntax to carry out the corresponding query, so developers need not worry about the underlying query implementation class. For example, QueryParser resolves the query statement "keyword1 AND keyword2" into a search for documents matching both keyword1 and keyword2; it resolves "id:[123 TO 456]" into a search for documents whose value in the field named "id" lies between "123" and "456"; and it resolves "keyword site:www.web.com" into a search with two conditions, a field named "site" whose value is "www.web.com" and a match on "keyword".
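The same three statements can be run through the parser like this (a sketch; the default field, keywords, and version constant are illustrative):

import lucene
from lucene import QueryParser, CJKAnalyzer, Version

lucene.initVM()
analyzer = CJKAnalyzer(Version.LUCENE_30)
parser = QueryParser(Version.LUCENE_30, "content", analyzer)  # "content" is the default search field
q1 = parser.parse("keyword1 AND keyword2")  # both keywords must match
q2 = parser.parse("id:[123 TO 456]")  # range query on the "id" field
q3 = parser.parse("keyword site:www.web.com")  # "site" field condition plus a default-field keyword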

Index search is one of the areas pylucene focuses on. A class named query was written to implement the search over the index; the main steps query performs are as follows:

lucene.initVM()
if query_str.find(":") == -1 and query_str.find(u"：") == -1:  # half-width and full-width colon
    query_str = "title:" + query_str + " OR content:" + query_str  # ①
indir = SimpleFSDirectory(File(self.__indexdir))  # ②
lucene_analyzer = CJKAnalyzer(Version.LUCENE_CURRENT)  # ③
lucene_searcher = IndexSearcher(indir)  # ④
my_query = QueryParser(Version.LUCENE_CURRENT, "title", lucene_analyzer).parse(query_str)  # ⑤
total_hits = lucene_searcher.search(my_query, MAX)  # ⑥ MAX is the maximum number of hits to return
result_urls = []
result_titles = []
for hit in total_hits.scoreDocs:  # ⑦
    print "Hit score:", hit.score
    doc = lucene_searcher.doc(hit.doc)
    result_urls.append(doc.get("url").encode("utf-8"))
    result_titles.append(doc.get("title").encode("utf-8"))
    print doc.get("title").encode("utf-8")
result = {"Hits": total_hits.totalHits, "url": tuple(result_urls), "title": tuple(result_titles)}
return result

There are 7 main steps to searching the index:

① first judges the search statement: if it is not a single-field query against the title or the article content, that is, it contains neither "title:" nor "content:" (the code checks for both the half-width and the full-width colon), then the title and content fields are both searched by default.

② instantiates a SimpleFSDirectory object, specifying its working path as the path where the index was created earlier.

③ instantiates a CJKAnalyzer analyzer; the analyzer used when searching should match the type and version of the analyzer used when the index was built.

④ instantiates an IndexSearcher object, lucene_searcher, whose argument is the SimpleFSDirectory object from step ②.

⑤ instantiates a QueryParser object, my_query, which describes the query request and parses the query statement. The parameter Version.LUCENE_CURRENT is the version number of pylucene, "title" is the default search field, lucene_analyzer specifies the analyzer to use, and query_str is the query statement. The user's search request was already pre-processed in step ①: if the user specified a field, that field is searched; if not, both the "title" and "content" fields are searched.

⑥ lucene_searcher performs the search operation and returns the result set total_hits. total_hits contains the total number of results, totalHits, and the set of result documents, scoreDocs; scoreDocs holds the matched documents together with each document's relevance score against the search statement.

⑦ the result set returned by lucene_searcher cannot be processed by Python directly, so the results are converted from pylucene structures into ordinary Python data structures before the search operation returns. A for loop walks the result documents in order of relevance score, appending the value of each document's address field "url" to the Python list result_urls and the value of its title field "title" to the list result_titles. Finally, a Python dictionary containing the URL list, the title list, and the total number of results is returned as the value of the whole search operation.

When the user enters a search term in the browser's search box and clicks search, the browser issues a GET request, and the Flask route is set so that the result function responds to it. The result function instantiates an object infoso of the search class query, passes the search term to it, and infoso performs the search and returns the results to result. The result function then passes the matched pages and the total number of results to the template result.html, which renders the results.

This is the Python code that uses the Flask module to handle search requests:

from flask import Flask, render_template, request

app = Flask(__name__)  # create the Flask instance

@app.route('/')  # set the default search home page
def index():
    return render_template('index.html')

@app.route("/result", methods=['GET', 'POST'])  # register the route and allow the GET and POST methods
def result():  # the result function
    if request.method == "GET":  # respond to the GET request
        key_word = request.args.get('word')  # get the search statement
        if len(key_word) != 0:
            infoso = query("./glxy")  # create an instance of the search class query
            re = infoso.search(key_word)  # search and get the result set
            so_result = []
            n = 0
            for item in re["url"]:
                temp_result = {"url": item, "title": re["title"][n]}  # collect results to pass to the template
                so_result.append(temp_result)
                n = n + 1
            return render_template('result.html', key_word=key_word, result_sum=re["Hits"], result=so_result)
        else:
            key_word = ""
            return render_template('result.html')

if __name__ == '__main__':
    app.debug = True
    app.run()  # run the web service
