Inverted index principle

Source: Internet
Author: User
Word-document matrix: a search engine index is, in effect, a concrete data structure that implements the "word-document matrix", a table recording which words occur in which documents.
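To make the matrix idea concrete, here is a minimal sketch in Python. The three documents, their text, and the helper name `documents_containing` are made up for illustration; a real engine would not store the matrix densely like this, which is exactly why the inverted index exists.

```python
# Hypothetical toy collection; the documents and their text are made up.
docs = {
    1: "google map",
    2: "google search engine",
    3: "map of the search index",
}

# Rows are words, columns are documents (both sorted for a stable layout).
vocabulary = sorted({word for text in docs.values() for word in text.split()})
doc_ids = sorted(docs)

# matrix[i][j] is 1 when vocabulary[i] occurs in document doc_ids[j].
matrix = [
    [1 if word in docs[doc_id].split() else 0 for doc_id in doc_ids]
    for word in vocabulary
]


def documents_containing(word):
    """Read one row of the matrix and return the matching document ids."""
    row = matrix[vocabulary.index(word)]
    return [doc_id for doc_id, flag in zip(doc_ids, row) if flag]
```

Most cells of such a matrix are zero for any realistic collection, so storing only the non-zero entries per word is the natural next step.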

Basic Concepts of the Inverted Index

Document: A typical search engine processes web pages, but the concept of a document is broader. It denotes any storage object that exists in text form, so it covers more than web pages: files in formats such as Word, PDF, HTML, and XML can all be called documents.

Document collection: A set of documents is called a document collection. A huge number of web pages or a large number of e-mails are both concrete examples of document collections.

Document ID: Inside the search engine, each document in the document collection is assigned a unique internal number. This number serves as the document's unique identifier and makes internal processing easier. It is called the "document number", and the rest of this article sometimes uses DocId as shorthand for it.

Word ID: As with document numbers, the search engine represents each word internally by a unique number, and this word number serves as the unique identifier of the word.
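One simple way to hand out such numbers is sketched below. The class name `IdAllocator`, the choice to start numbering at 1, and the example URLs are all assumptions made for illustration; the article does not say how a real engine assigns its numbers.

```python
class IdAllocator:
    """Hands out consecutive integer ids, one per distinct key."""

    def __init__(self):
        self._ids = {}

    def get_id(self, key):
        # Reuse the existing id, or assign the next unused integer.
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]


doc_numbers = IdAllocator()
word_numbers = IdAllocator()

# Hypothetical URLs standing in for documents in the collection.
first = doc_numbers.get_id("http://example.com/a.html")
second = doc_numbers.get_id("http://example.com/b.pdf")
again = doc_numbers.get_id("http://example.com/a.html")

# The same word always maps to the same word number.
word_ids = [word_numbers.get_id(w) for w in "search engine search index".split()]
```

Here `first` and `again` are the same number because they refer to the same document, and the repeated word "search" receives the same word number both times.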

Inverted index: An inverted index is a concrete storage form that implements the word-document matrix. Given a word, the inverted index makes it possible to quickly obtain the list of documents that contain it. It consists mainly of two parts: the word dictionary and the inverted file.
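The word-to-documents lookup can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how a production engine stores its index: the function names, the whitespace tokenization, the lowercasing, and the sample documents are all assumptions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the sorted list of DocIds whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical documents keyed by DocId.
docs = {1: "Google map", 2: "Google search engine", 3: "map of the search index"}
index = build_inverted_index(docs)


def lookup(word):
    """Return the DocIds containing the word, or an empty list if unknown."""
    return index.get(word.lower(), [])
```

The Python dictionary plays the role of the word dictionary here, and each value plays the role of one inverted list.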

Word dictionary (lexicon): The usual index unit of a search engine is the word. The word dictionary is the set of strings made up of all the words that appear in the document collection. Each index entry in the word dictionary records some information about the word itself and a pointer to the word's inverted list.

Inverted list (posting list): The inverted list of a word records all the documents in which that word appears, together with the positions at which the word occurs in each document. Each such record is called a posting. From the inverted list, you can tell which documents contain a given word.
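A posting that carries position information might look like the following sketch. The tuple layout `(doc_id, [positions])`, the zero-based word offsets, and the sample text are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_positional_postings(docs):
    """Return {word: [(doc_id, [positions...]), ...]} ordered by doc_id."""
    postings = defaultdict(dict)
    for doc_id in sorted(docs):
        for position, word in enumerate(docs[doc_id].split()):
            # Positions are zero-based word offsets within the document.
            postings[word].setdefault(doc_id, []).append(position)
    return {word: list(by_doc.items()) for word, by_doc in postings.items()}


# Hypothetical documents keyed by DocId.
docs = {1: "to be or not to be", 2: "to index or not"}
postings = build_positional_postings(docs)
```

Each tuple in a word's list is one posting: the DocId says which document contains the word, and the position list says where in that document it occurs.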

Inverted file: The inverted lists of all words are usually stored one after another in a file on disk. This file is called the inverted file, and it is the physical file that stores the inverted index.
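The relationship between the dictionary's pointers and the inverted file can be sketched as follows. The binary layout (4-byte little-endian DocIds with no compression), the file name, and the function names are assumptions for illustration only; real engines use more compact encodings.

```python
import os
import struct
import tempfile


def write_inverted_file(index, path):
    """Write posting lists back to back; return {word: (offset, count)}."""
    lexicon = {}
    with open(path, "wb") as f:
        for word in sorted(index):
            doc_ids = index[word]
            # The lexicon entry is the "pointer": where the list starts
            # in the file and how many DocIds it holds.
            lexicon[word] = (f.tell(), len(doc_ids))
            # Each DocId is stored as a 4-byte little-endian unsigned integer.
            f.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
    return lexicon


def read_postings(lexicon, path, word):
    """Follow the lexicon pointer and read only that word's posting list."""
    if word not in lexicon:
        return []
    offset, count = lexicon[word]
    with open(path, "rb") as f:
        f.seek(offset)
        return list(struct.unpack(f"<{count}I", f.read(4 * count)))


# Hypothetical index: word -> sorted DocIds.
index = {"google": [1, 2], "map": [1, 3], "search": [2, 3]}
path = os.path.join(tempfile.mkdtemp(), "inverted.bin")
lexicon = write_inverted_file(index, path)
```

Calling `read_postings(lexicon, path, "map")` seeks straight to that word's list, so a lookup touches only the bytes it needs rather than scanning the whole file.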
