Introduction to Information Retrieval

Source: Internet
Author: User

Douban book page: http://book.douban.com/subject/5252170/

Chapter 1 Boolean retrieval

---------------------

1.1 An example of information retrieval

1.2 A first take at building an inverted index

1.3 Processing Boolean queries

1.4 The extended Boolean model versus ranked retrieval

---------------------

Information retrieval is the task of finding material (usually documents) that satisfies a user's information need from within a large collection of unstructured data (usually text, usually stored on computers). Unstructured data is data without a clear, explicit semantic structure, which makes it hard for computers to process. The typical example of structured data is a relational database. Text data is better regarded as semi-structured (semistructured) data.

Given a document collection, clustering is the task of automatically grouping documents by their content without topics being specified in advance, whereas classification assigns documents to topics that are defined in advance.

By the scale of the data they handle, information retrieval systems fall into three groups: large-scale (such as web search), medium-scale (such as enterprise, institutional, and domain-specific search), and small-scale (such as personal information retrieval).

Linear scanning (grepping) is the simplest approach, but it cannot deliver fast search over large document collections, flexible matching, or ranked results. The alternative is to build an index in advance. The simplest such index is the term-document incidence matrix, whose entries are Boolean values: the entry for term t and document d is 1 if t occurs in d and 0 otherwise.
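As a rough illustration of the incidence matrix, the sketch below builds one for a three-document toy collection and answers a query of the form "A AND B AND NOT C" directly on its rows. The documents, the whitespace tokenizer, and the function names are assumptions made for this example only.

```python
# Toy collection; the document texts are invented for illustration.
docs = {
    1: "brutus killed caesar",
    2: "caesar praised brutus",
    3: "calpurnia warned caesar",
}

def build_incidence_matrix(docs):
    """Return (doc_ids, matrix); matrix[term] is a row of 0/1 flags, one per document."""
    doc_ids = sorted(docs)
    vocabulary = sorted({term for text in docs.values() for term in text.split()})
    matrix = {
        term: [1 if term in docs[d].split() else 0 for d in doc_ids]
        for term in vocabulary
    }
    return doc_ids, matrix

def boolean_and_not(matrix, doc_ids, must_have, must_not_have):
    """Return the documents containing every term in must_have and none in must_not_have."""
    hits = []
    for column, doc_id in enumerate(doc_ids):
        has_all = all(matrix[t][column] for t in must_have)
        has_none = not any(matrix[t][column] for t in must_not_have)
        if has_all and has_none:
            hits.append(doc_id)
    return hits

doc_ids, matrix = build_incidence_matrix(docs)
# Query: brutus AND caesar AND NOT calpurnia
result = boolean_and_not(matrix, doc_ids, ["brutus", "caesar"], ["calpurnia"])
print(result)
```

Every term gets a full row even when it occurs in a single document, which is why this representation does not scale to large collections.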


Evaluating search results:

Precision: the fraction of returned documents that are relevant to the information need.

Recall: the fraction of all documents relevant to the information need that the retrieval system returns.
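The two measures can be computed from the set of returned documents and the set of relevant documents. The following is a minimal sketch; the function name, the example document IDs, and the choice to return 0.0 when a set is empty are assumptions of this example.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) from collections of document IDs."""
    returned, relevant = set(returned), set(relevant)
    relevant_returned = len(returned & relevant)
    # Assumed convention: an empty denominator yields 0.0 rather than an error.
    precision = relevant_returned / len(returned) if returned else 0.0
    recall = relevant_returned / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents returned, 3 documents actually relevant.
p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)
```

Here two of the four returned documents are relevant, and two of the three relevant documents were found.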

The term-document incidence matrix is sparse, so storing it in full wastes a large amount of space on zero entries. An inverted index avoids this by recording, for each term, only the documents in which the term occurs.


Procedure for creating an index:

1. Collect the documents; 2. Tokenize the text; 3. Apply linguistic preprocessing to produce normalized tokens, which become the index terms; 4. Build the inverted index over all documents from those terms.
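The four steps above can be sketched in a few lines. This is a simplified illustration, not a production indexer: the regular-expression tokenizer, lowercasing as the only normalization step, and the sample documents are all assumptions of this example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Step 2: split text into tokens (here, runs of letters and digits)."""
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(token):
    """Step 3: linguistic preprocessing, reduced here to lowercasing."""
    return token.lower()

def build_inverted_index(docs):
    """Step 4: map each term to the sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    # Sort the dictionary by term and each postings list by document ID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Step 1: collect the documents (invented sample texts).
docs = {
    1: "Brutus killed Caesar.",
    2: "Caesar praised Brutus!",
    3: "Calpurnia warned Caesar",
}
index = build_inverted_index(docs)
for term, doc_list in index.items():
    print(term, doc_list)
```

Keeping each postings list sorted by document ID is what later allows two lists to be intersected in a single linear pass.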



Both the dictionary and the postings lists (inverted record tables) incur storage costs. Postings lists are typically stored as singly linked lists or variable-length arrays.

A Boolean AND query is answered by using the merge algorithm to compute the intersection of postings lists. "Merge" here refers to the operation of stepping through two sorted lists in parallel; it should not be confused with the merge sort sorting algorithm.
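A minimal sketch of this intersection, assuming each postings list is a Python list of document IDs sorted in ascending order (the function name and the sample lists are illustrative):

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by ascending document ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # The document appears in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that sits on the smaller document ID
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
common = intersect(brutus, calpurnia)
print(common)
```

Each pointer only moves forward, so the running time is linear in the combined length of the two lists.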

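For query optimization, process the terms in order of increasing document frequency, so that the shortest postings lists are intersected first and the intermediate results stay small.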
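One way to sketch this ordering for a conjunctive query over several terms is shown below. It assumes the index is a dict from term to a sorted postings list and takes the length of that list as the term's document frequency; the function names and sample data are illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by ascending document ID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_terms(index, terms):
    """AND together several terms, starting with the rarest one."""
    # Document frequency of a term = length of its postings list.
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    if not ordered:
        return []
    result = list(index.get(ordered[0], []))
    for term in ordered[1:]:
        if not result:
            break  # the intersection is already empty, so stop early
        result = intersect(result, index.get(term, []))
    return result

index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
matches = intersect_terms(index, ["brutus", "caesar", "calpurnia"])
print(matches)
```

Because an intermediate result can never be longer than the shortest list seen so far, starting with the rarest term bounds the work done in every later step.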

