Build a high-speed search engine using Python and Xapian

Source: Internet
Author: User
To build a high-speed search engine with Python and Xapian, we first need to understand a few concepts: documents, terms, and postings. In information retrieval (IR), the items we try to retrieve are called "documents", and each document is described by a set of terms. "Document" and "term" are IR terminology borrowed from library management. A document is usually thought of as a piece of text, most likely in machine-readable form, and a term is a word or phrase used to describe the document. A document usually has many terms. For example, a document about oral hygiene might have the terms "tooth", "teeth", "toothbrush", "decay", "cavity", "plaque", and "diet".

If an IR system contains a document D that is described by a term t, then t is said to index D, written t -> D. An IR system usually holds many documents (D1, D2, D3, ...) and many terms (t1, t2, t3, ...), which are related by ti -> Dj.

An instance of a specific term indexing a specific document is called a posting. Put simply, a posting is a term together with its position information, which can be used in relevance searches.
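To make the idea of a posting with position information concrete, here is a minimal pure-Python sketch. It is not how Xapian stores postings; the function names `build_positional_index` and `contains_phrase` and the sample sentences are made up for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}; one entry per (term, document) pair."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def contains_phrase(index, doc_id, first, second):
    """Return True if `second` occurs immediately after `first` in the document."""
    first_positions = index.get(first, {}).get(doc_id, [])
    second_positions = set(index.get(second, {}).get(doc_id, []))
    return any(p + 1 in second_positions for p in first_positions)


docs = {
    1: "brush your teeth to prevent tooth decay",
    2: "tooth decay and plaque",
}
index = build_positional_index(docs)
print(index["tooth"])                               # positions of "tooth" per document
print(contains_phrase(index, 1, "tooth", "decay"))  # adjacency check using positions
```

Keeping positions alongside each posting is what lets a system tell "tooth decay" as a phrase apart from a document that merely contains both words somewhere.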

Given a document D, the list of terms that index it is called D's term list.

Given a term t, the list of documents it indexes is called t's posting list ("document list" might be a more consistent name, but it sounds too vague).
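The two lists are simply two views of the same ti -> Dj relation. The sketch below builds both from a tiny made-up collection using plain dictionaries; the document IDs and terms are illustrative only.

```python
# Hypothetical collection: document ID -> terms that describe the document.
docs = {
    1: ["tooth", "decay", "cavity"],
    2: ["toothbrush", "plaque", "tooth"],
    3: ["diet", "decay"],
}

# Term list: for each document, the terms that index it.
term_lists = {doc_id: sorted(set(terms)) for doc_id, terms in docs.items()}

# Posting list: for each term, the IDs of the documents it indexes.
posting_lists = {}
for doc_id, terms in term_lists.items():
    for term in terms:
        posting_lists.setdefault(term, []).append(doc_id)

print(term_lists[2])
print(posting_lists["tooth"])
```

Looking up a term in `posting_lists` answers "which documents does t index?", while `term_lists` answers "which terms index D?".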

In a computer-based IR system, terms are stored in an index file, and a term can be used to look up its posting list efficiently. In a posting list, each document is represented by a very short identifier, the document ID. Put simply, a posting list can be thought of as a set of document IDs, while a term list is a set of strings. Some IR systems represent terms internally as numbers, so in those systems a term list is a set of numbers. Xapian does not: it stores the terms themselves and uses prefix compression to keep them compact.
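The general idea behind prefix compression of sorted strings can be sketched as follows. This is a generic front-coding illustration under my own assumptions, not Xapian's actual on-disk format; `front_encode` and `front_decode` are hypothetical helper names.

```python
import os


def front_encode(sorted_terms):
    """Encode each term as (length of prefix shared with the previous term, remaining suffix)."""
    encoded = []
    previous = ""
    for term in sorted_terms:
        shared = len(os.path.commonprefix([previous, term]))
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded


def front_decode(encoded):
    """Rebuild the original terms from (shared_length, suffix) pairs."""
    terms = []
    previous = ""
    for shared, suffix in encoded:
        term = previous[:shared] + suffix
        terms.append(term)
        previous = term
    return terms


terms = sorted(["tooth", "toothbrush", "teeth", "decay", "diet"])
encoded = front_encode(terms)
print(encoded)
print(front_decode(encoded) == terms)
```

Because neighbouring terms in sorted order often share a prefix, only the differing tail needs to be stored, which is why keeping the original strings does not have to be wasteful.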

Terms are not necessarily words that appear in the document. They are usually converted to lowercase and often processed by a stemming algorithm, so a series of words such as "connect", "connects", "connection", and "connected" can all be retrieved through the single term "connect". A word may also produce more than one term: for example, you can index both the stemmed and the unstemmed form. Of course, this may only apply to English, French, Latin, and other European and American languages; Chinese word segmentation is very different. In general, word segmentation for European and American languages differs from word segmentation for Chinese in the following ways:

1. English words are usually separated by spaces, whereas Chinese words are not. An entire article may contain no spaces or punctuation marks at all.

2. As mentioned above, "connect", "connects", "connection", and "connected" are respectively the verb, the third-person verb form, the noun, and the past tense. In Chinese, a single word (translated here as "join") covers all of them, so almost no stemming is required. In other words, most parts of speech in English are expressed through rule-based word forms, while in Chinese the part of speech is simply carried by the word itself.

3. The second point is only a small example of why Chinese word segmentation is difficult. Fully and correctly identifying the meaning of a sentence is very hard. For example, from the sentence "The People's Republic of China was founded", words such as "China", "Chinese", "people", "Republic", and "founding" can all be segmented out, yet "Chinese" has little to do with the sentence. It looks simple at a glance, but is it easy for a machine to work this out?
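The contrast between the two cases can be sketched in a few lines. Everything here is an illustrative assumption: `toy_stem` is a three-rule suffix stripper rather than a real stemming algorithm (a real index would use something like `xapian.Stem("english")`), overlapping character bigrams are a simple fallback and not a proper Chinese segmenter, and the Chinese string is my guess at the original of the example sentence.

```python
SUFFIXES = ("ion", "ed", "s")  # toy rules only, not a real stemming algorithm


def toy_stem(word):
    """Lowercase a word and strip the first matching toy suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def english_terms(text):
    """Split on whitespace, then stem each word."""
    return [toy_stem(word) for word in text.split()]


def char_bigrams(text):
    """Overlapping two-character terms for text that has no word separators."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(english_terms("connect connects connection connected"))

chinese = "中华人民共和国成立了"  # assumed original of the example sentence
print(chinese.split())         # whitespace splitting yields a single token
print(char_bigrams(chinese))   # includes spurious terms such as "华人"
```

The bigram output contains "华人" (roughly "Chinese", as in people of Chinese descent) even though the sentence is not about that, which is the kind of false term the third point describes.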

Values

Values are metadata attached to a document. Each document can have multiple values, which are identified by different numbers. Values are designed to be accessed quickly during the matching process, and they can be used for sorting, collapsing duplicate documents, and range searches. Although values have no length limit, it is best to keep them as short as possible. If you only want to store a field so that it can be displayed in results, store it in the document data instead.
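The three uses can be modelled in plain Python to show what each one means. This is a conceptual sketch, not Xapian code: in Xapian a value is attached with `Document.add_value(slot, value)` and the matcher does this work itself. The slot numbers, field meanings, and helper names below are assumptions made for the example.

```python
# Hypothetical documents: slot 0 holds a year, slot 1 holds a site name.
docs = [
    {"id": 1, "values": {0: 2012, 1: "site-a"}},
    {"id": 2, "values": {0: 2015, 1: "site-b"}},
    {"id": 3, "values": {0: 2014, 1: "site-a"}},
]


def sort_by_value(docs, slot, reverse=False):
    """Order documents by the value stored in one slot."""
    return sorted(docs, key=lambda d: d["values"][slot], reverse=reverse)


def collapse_on_value(docs, slot):
    """Keep only the first document seen for each distinct value in the slot."""
    seen = set()
    kept = []
    for doc in docs:
        key = doc["values"][slot]
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def value_range(docs, slot, low, high):
    """Keep documents whose slot value lies within [low, high]."""
    return [d for d in docs if low <= d["values"][slot] <= high]


newest_first = sort_by_value(docs, 0, reverse=True)
print([d["id"] for d in newest_first])
print([d["id"] for d in collapse_on_value(newest_first, 1)])
print([d["id"] for d in value_range(docs, 0, 2013, 2015)])
```

Each helper only reads one small value per document, which is why short values keep this kind of work cheap during matching.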

Document data

Each document has exactly one data field, which can hold any type of data, although the data must be converted to a string before it is stored. This may sound odd, but in practice it means the following: if the data is text, it can be stored directly; if it is some other kind of object, serialize it to a byte stream before saving and deserialize it when reading it back.
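A minimal sketch of that round trip, assuming JSON as the serialization format (the source does not prescribe one; `pickle` would work similarly for arbitrary Python objects). The record fields and helper names are hypothetical.

```python
import json


def to_document_data(record):
    """Serialize a record to a string suitable for storing as document data."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


def from_document_data(data):
    """Restore the record from the stored string."""
    return json.loads(data)


record = {"title": "Oral hygiene basics", "tags": ["tooth", "plaque"], "year": 2015}
data = to_document_data(record)
restored = from_document_data(data)

print(data)
print(restored == record)
```

The resulting string is what would be passed to `doc.set_data(...)` when indexing, and decoded again when a search result is displayed.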

Posting

A posting is a term together with its position.

The following script builds a small test index from two strings:

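# -*- coding: gb18030 -*-
import xapian

testdatas = [u'abc test python1', u'abcd testing python2']


def buildtest():
    # Create the database under indexes/, or open it if it already exists.
    database = xapian.WritableDatabase('indexes/', xapian.DB_CREATE_OR_OPEN)
    # The stemmer is created here but not applied; terms are indexed as-is.
    stemmer = xapian.Stem("english")
    for data in testdatas:
        doc = xapian.Document()
        # Store the original text as the document data.
        doc.set_data(data)
        # Index each whitespace-separated word as a term.
        for term in data.split():
            doc.add_term(term)
        database.add_document(doc)


if __name__ == '__main__':
    buildtest()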

After execution, the index database is generated in the indexes/ directory under the current directory.

[ec2-user@ip-10-167-6-221 indexes]$ ll
total 52
-rw-r-- 1 ec2-user ec2-user 0 Jul 28 16:06 flintlock
-rw-r-- 1 ec2-user ec2-user Jul 28 16:06 iamchert
-rw-r-- 1 ec2-user ec2-user 13 Jul 28 16:06 postlist.baseA
-rw-r-- 1 ec2-user ec2-user 14 Jul 28 16:06 postlist.baseB
-rw-r-- 1 ec2-user ec2-user 8192 Jul 28 16:06 postlist.DB
-rw-r-- 1 ec2-user ec2-user 13 Jul 28 16:06 record.baseA
-rw-r-- 1 ec2-user ec2-user 14 Jul 28 16:06 record.baseB
-rw-r-- 1 ec2-user ec2-user 8192 Jul 28 16:06 record.DB
-rw-r-- 1 ec2-user ec2-user 13 Jul 28 16:06 termlist.baseA
-rw-r-- 1 ec2-user ec2-user 14 Jul 28 16:06 termlist.baseB
-rw-r-- 1 ec2-user ec2-user 8192 Jul 28 16:06 termlist.DB

Next, we will look at how to query the index.
