Build a high-speed search engine using Python and Xapian

Source: Internet
Author: User
Start with a few concepts from Information Retrieval (IR): documents, terms and postings. In IR, the items we are trying to retrieve are called "documents", and each document is described by a collection of terms. The words "document" and "term" are IR jargon that comes from library science. A document is usually thought of as a piece of text, most likely in machine-readable form, and a term is a word or phrase that describes the document. Most documents have more than one term; for example, a document about oral health might have the terms "tooth", "teeth", "toothbrush", "decay", "cavity", "plaque" and "diet".

If an IR system contains a document D that is described by a term T, then T is said to index D, which can be written as T -> D. In practice an IR system holds many documents, D1, D2, D3, ..., and many terms, T1, T2, T3, ..., which gives relationships of the form Ti -> Dj.

When a particular term indexes a particular document, that pairing is called a posting. A posting can carry position information, which some relevance calculations can make use of.

Given a document D, the list of terms that index it is called the term list of D.

Given a term T, the list of documents that it indexes is called the posting list of T ("document list" might be a more consistent name, but it sounds too vague).
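The relationship between term lists and posting lists can be sketched in a few lines of plain Python. This is only a toy illustration, not Xapian code: the three-document corpus and the function name `build_posting_lists` are made up for the example.

```python
from collections import defaultdict

# Hypothetical toy corpus: document ID -> term list of that document.
term_lists = {
    1: ["tooth", "decay", "diet"],
    2: ["tooth", "toothbrush", "plaque"],
    3: ["diet", "plaque"],
}


def build_posting_lists(term_lists):
    """Invert term lists (doc -> terms) into posting lists (term -> doc IDs)."""
    posting_lists = defaultdict(list)
    for doc_id in sorted(term_lists):
        for term in term_lists[doc_id]:
            posting_lists[term].append(doc_id)
    return dict(posting_lists)


posting_lists = build_posting_lists(term_lists)

print(term_lists[2])           # the term list of document 2
print(posting_lists["tooth"])  # the posting list of the term "tooth"
```

Looking up a term in `posting_lists` answers "which documents does this term index?", while `term_lists` answers the opposite question for a document.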

In a computerised IR system, terms are stored in index files. A term provides an efficient way to look up its posting list, in which each document is represented by a short identifier, the document ID. Put simply, a posting list can be thought of as a collection of document IDs, and a term list as a collection of strings. Some IR systems represent terms internally as numbers, so in those systems a term list is a collection of numbers. Xapian does not: it stores the terms themselves and uses prefix compression to save storage space.
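To give a feel for why prefix compression saves space on a sorted list of string terms, here is a minimal sketch of generic front coding. It is an assumption-laden illustration of the general idea only and does not reproduce Xapian's actual on-disk format; `front_encode` and `front_decode` are hypothetical helper names.

```python
import os


def front_encode(sorted_terms):
    """Encode each term as (length of prefix shared with previous term, suffix)."""
    encoded = []
    previous = ""
    for term in sorted_terms:
        shared = len(os.path.commonprefix([previous, term]))
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded


def front_decode(encoded):
    """Rebuild the original terms from (shared_length, suffix) pairs."""
    terms = []
    previous = ""
    for shared, suffix in encoded:
        term = previous[:shared] + suffix
        terms.append(term)
        previous = term
    return terms


terms = sorted(["tooth", "toothbrush", "toothpaste", "teeth"])
encoded = front_encode(terms)
print(encoded)
print(front_decode(encoded) == terms)
```

Because neighbouring terms in sorted order often share a long prefix, only the differing tail of each term needs to be stored.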

Terms are not necessarily the words that appear in the document. They are usually converted to lowercase and often run through a stemming algorithm, so that a single term such as "connect" can match a whole family of words, for example "connect", "connects", "connection" and "connected". A single word may also produce several terms; for example, you might index both the stemmed and the unstemmed form of each word. Of course, this mainly applies to languages such as English, French or Latin. Chinese word segmentation is very different. In general, word segmentation for European languages and for Chinese differs in the following ways:

1. In English, words are separated by spaces. In Chinese they are not; in extreme cases an entire article may contain no spaces or punctuation at all.

2. As mentioned above, "connect", "connects", "connection" and "connected" are respectively the base verb, the third-person singular verb, the noun and the past tense. In Chinese, a single word meaning "connect" covers all of these, so there is little need for stemming. In other words, English word forms change with part of speech according to rules, while Chinese words carry no such marking.

3. Point 2 is only a small part of what makes Chinese word segmentation hard: correctly identifying the meaning of a sentence is very difficult. For example, the sentence "The People's Republic of China was founded" can be split into candidate words such as "China", "ethnic Chinese", "people", "republic" and "founded", yet "ethnic Chinese" has little to do with the sentence. A human reader sees this at a glance, but it is not so easy for a machine.
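The lowercasing and stemming step described before the list can be sketched as follows. The suffix stripper here is a deliberately naive toy, not a real stemming algorithm such as Snowball, and the helper names are hypothetical. The "Z" prefix on stemmed terms mirrors, as I understand it, the convention used by Xapian's term generator, and it keeps stems distinct from unstemmed words.

```python
# Toy suffix list for illustration only; real stemmers use far more careful rules.
SUFFIXES = ("ion", "ed", "s")


def toy_stem(word):
    """Lowercase a word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms_for(text):
    """Return both the lowercased words and their 'Z'-prefixed stems."""
    terms = set()
    for word in text.split():
        word = word.lower()
        terms.add(word)                   # unstemmed form
        terms.add("Z" + toy_stem(word))   # stemmed form, marked with a prefix
    return terms


words = ["connect", "connects", "Connection", "connected"]
stems = {toy_stem(w) for w in words}
print(stems)
print(sorted(terms_for("Connection connected")))
```

All four word forms collapse to the same stem, so a query for any one of them can retrieve documents containing the others.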

Values

Values are metadata attached to a document. Each document can have multiple values, each identified by a different number. Values are designed to be accessed quickly during the matching process, and they can be used for sorting, for collapsing duplicate documents and for range searches. Although values have no length limit, it is best to keep them as short as possible. If you simply want to store a field to display in the results, it is recommended that you keep it in the document's data instead.
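The three uses of values mentioned above (sorting, collapsing duplicates and range searches) can be illustrated with a small plain-Python model. This is a sketch of the concept only, not of how Xapian implements it; the slot numbers, sample dates, site names and helper functions are all hypothetical.

```python
# Hypothetical value slot numbers chosen for this sketch.
SLOT_DATE = 0
SLOT_SITE = 1

# Hypothetical documents, each carrying short string values in numbered slots.
documents = [
    {"id": 1, "values": {SLOT_DATE: "20120703", SLOT_SITE: "a.example"}},
    {"id": 2, "values": {SLOT_DATE: "20120615", SLOT_SITE: "b.example"}},
    {"id": 3, "values": {SLOT_DATE: "20120710", SLOT_SITE: "a.example"}},
]


def sort_by_value(docs, slot):
    """Order documents by the string stored in the given slot."""
    return sorted(docs, key=lambda d: d["values"][slot])


def value_range(docs, slot, low, high):
    """Keep documents whose value in the slot lies within [low, high]."""
    return [d for d in docs if low <= d["values"][slot] <= high]


def collapse_on_value(docs, slot):
    """Keep only the first document seen for each distinct value in the slot."""
    seen = set()
    kept = []
    for d in docs:
        key = d["values"][slot]
        if key not in seen:
            seen.add(key)
            kept.append(d)
    return kept


print([d["id"] for d in sort_by_value(documents, SLOT_DATE)])
print([d["id"] for d in value_range(documents, SLOT_DATE, "20120701", "20120731")])
print([d["id"] for d in collapse_on_value(documents, SLOT_SITE)])
```

Storing dates as fixed-width strings means plain string comparison gives the right order, which is one reason to keep values short and uniform.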

Document data

Each document has exactly one data item. It can hold content in any format, but it is stored as a string. This may sound a bit odd, but it works like this: if the data to be stored is text, it can be stored directly; if it is some other kind of object, serialize it to a binary stream before saving it and deserialize it after reading it back.
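The serialize-then-store round trip can be sketched with Python's standard `pickle` module. The `record` dictionary is a made-up example; in a real indexer the serialized bytes would presumably be handed to the document's `set_data()` and read back later, but that part is left out here so the sketch runs without Xapian installed.

```python
import pickle

# Hypothetical structured record we would like to keep with a document.
record = {"title": "ABC Test python1", "tags": ["abc", "test"], "views": 3}

# Serialize the object to a byte string before storing it.
stored = pickle.dumps(record)

# Later, after reading the stored bytes back, deserialize them.
restored = pickle.loads(stored)

print(type(stored).__name__)
print(restored == record)
```

Only unpickle data you wrote yourself; for data that other programs need to read, a text format such as JSON may be a better choice.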

Posting

A posting is a term together with its position.
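A small positional index shows why position information is worth keeping: it lets a search check that two terms appear next to each other. This is a plain-Python sketch with hypothetical helper names and sample texts, not Xapian's implementation (in Xapian itself, the `add_posting` method of a document records a term together with a position).

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}; positions start at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_match(index, first, second):
    """Return IDs of documents where `second` directly follows `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical sample texts; document 3 has the same words in another order.
docs = {
    1: "ABC Test python1",
    2: "ABCD testing python2",
    3: "python1 test abc",
}
index = build_positional_index(docs)

print(index["test"])
print(phrase_match(index, "test", "python1"))
```

Documents 1 and 3 both contain "test" and "python1", but only document 1 has them in that order, which a position-free posting list could not tell apart.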

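The following script builds a small index: each test string becomes a document, and each whitespace-separated word in it is added as a term.

# -*- coding: gb18030 -*-
import xapian

# Two sample strings; each one becomes a document in the index.
testdatas = [u'ABC Test python1', u'ABCD testing python2']

def buildtest():
    # Create the database under indexes/, or open it if it already exists.
    database = xapian.WritableDatabase('indexes/', xapian.DB_CREATE_OR_OPEN)
    # English stemmer; created here but not yet applied to the terms below.
    stemmer = xapian.Stem("english")
    for data in testdatas:
        doc = xapian.Document()
        doc.set_data(data)
        # Index every whitespace-separated word as a term, unmodified.
        for term in data.split():
            doc.add_term(term)
        database.add_document(doc)

if __name__ == '__main__':
    buildtest()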

After the script runs, the index database is created in the indexes/ directory under the current directory.

[ec2-user@ip-10-167-6-221 indexes]$ ll
total 52
-rw-rw-r-- 1 ec2-user ec2-user    0 July 16:06 flintlock
-rw-rw-r-- 1 ec2-user ec2-user   28 July 16:06 iamchert
-rw-rw-r-- 1 ec2-user ec2-user   13 July 16:06 postlist.baseA
-rw-rw-r-- 1 ec2-user ec2-user   14 July 16:06 postlist.baseB
-rw-rw-r-- 1 ec2-user ec2-user 8192 July 16:06 postlist.DB
-rw-rw-r-- 1 ec2-user ec2-user   13 July 16:06 record.baseA
-rw-rw-r-- 1 ec2-user ec2-user   14 July 16:06 record.baseB
-rw-rw-r-- 1 ec2-user ec2-user 8192 July 16:06 record.DB
-rw-rw-r-- 1 ec2-user ec2-user   13 July 16:06 termlist.baseA
-rw-rw-r-- 1 ec2-user ec2-user   14 July 16:06 termlist.baseB
-rw-rw-r-- 1 ec2-user ec2-user 8192 July 16:06 termlist.DB

We'll show you how to query the index in the next article.
