Douban book link: http://book.douban.com/subject/5252170/
Chapter 1 Boolean retrieval
---------------------
1.1 An example of information retrieval
1.2 A first take at building an inverted index
1.3 Processing Boolean queries
1.4 The extended Boolean model versus ranked retrieval
---------------------
Information retrieval is the task of finding information (usually documents) that satisfies a user's information need from within a large collection of unstructured data (usually text, usually stored on computers). Unstructured data is data without a clear, explicit semantic structure, which makes it hard for computers to process. The typical example of structured data is a relational database. Text data is better regarded as semi-structured data.
Given a document set, clustering is the task of automatically grouping documents by their content without any topics supplied in advance, whereas classification assigns documents to topics that are defined in advance.
Information retrieval systems can be divided by the scale of the data they handle: large-scale (such as Web search), small-scale (such as personal information search), and medium-scale (such as search for enterprises, institutions, and specific domains).
Linear scanning (grepping) is the simplest approach, but it cannot support fast search over large document collections, flexible matching operations, or ranking of results. One alternative is to build an index in advance, for example a term-document incidence matrix of Boolean values, in which each entry records whether a given term occurs in a given document.
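To make the incidence matrix concrete, here is a minimal sketch in Python. The three toy documents, the whitespace tokenization, and the helper names `build_incidence_matrix` and `matching_docs` are all assumptions made for illustration. Each row of the matrix is stored as an integer used as a bit vector, so a Boolean query becomes bitwise operations on rows.

```python
# Toy collection; the documents are made up for illustration.
docs = {
    0: "brutus killed caesar",
    1: "caesar praised brutus and calpurnia",
    2: "calpurnia warned caesar",
}

def build_incidence_matrix(docs):
    """Map each term to a bit vector (an int): bit d is set if document d contains the term."""
    matrix = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            matrix[term] = matrix.get(term, 0) | (1 << doc_id)
    return matrix

def matching_docs(bits, n_docs):
    """Turn a bit vector back into a sorted list of document IDs."""
    return [d for d in range(n_docs) if bits & (1 << d)]

matrix = build_incidence_matrix(docs)
all_docs = (1 << len(docs)) - 1  # bit vector with every document set

# Query: brutus AND caesar AND NOT calpurnia
result = matrix["brutus"] & matrix["caesar"] & (all_docs & ~matrix["calpurnia"])
print(matching_docs(result, len(docs)))  # [0]
```

Only document 0 mentions both brutus and caesar without mentioning calpurnia, so the query returns `[0]`.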
Evaluating the search results:
Precision: the fraction of the returned results that are actually relevant to the information need.
Recall: the fraction of all documents relevant to the information need that the retrieval system returns.
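The two measures can be computed directly from the set of returned documents and the set of relevant documents. The following is a small sketch. The function name `precision_recall` and the document IDs are hypothetical, and returning 0.0 for an empty set is a convention chosen here rather than one taken from the source.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents returned, 5 documents truly relevant, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```

Here precision is 2/4 because half of what was returned is relevant, and recall is 2/5 because the system found two of the five relevant documents.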
The term-document incidence matrix is sparse: most of its entries are 0, so storing it in full wastes a large amount of space. An inverted index is used instead: for each term it stores only the list of documents in which that term occurs.
Procedure for building an index:
1. Collect the documents; 2. Tokenize the text into tokens; 3. Apply linguistic preprocessing to produce normalized tokens, which serve as the terms; 4. Build the inverted index over all documents from these terms.
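The four steps above can be sketched as follows. This is a simplified illustration rather than a production indexer: the regular-expression tokenizer and the use of lowercasing as the only normalization step are assumptions, and the names `tokenize` and `build_inverted_index` are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Steps 2 and 3: split into tokens and normalize them (here: lowercase only)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Step 4: map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    # Sort the dictionary by term and each postings list by document ID.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

# Step 1: collect the documents (two toy documents).
docs = {
    1: "I did enact Julius Caesar.",
    2: "So let it be with Caesar.",
}
index = build_inverted_index(docs)
print(index["caesar"])  # [1, 2]
print(index["julius"])  # [1]
```

Keeping each postings list sorted by document ID is what makes the linear-time intersection of two lists possible later on.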
Both the dictionary and the postings lists have storage costs. Postings lists are commonly stored as singly linked lists or variable-length arrays.
A Boolean AND query is answered with a merge algorithm that computes the intersection of two postings lists. "Merge" here means stepping through the two sorted lists together while advancing a pointer in each. It is not the merge sort from sorting algorithms.
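A minimal sketch of the intersection by merging, assuming both postings lists are Python lists sorted by document ID. The function name `intersect` and the sample postings are chosen for illustration.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by document ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document appears in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that sits on the smaller document ID
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each step advances at least one pointer, so the running time is linear in the combined length of the two lists.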
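Query optimization: process the terms in order of increasing document frequency, so that the intersection starts from the shortest postings list and the intermediate results stay as small as possible.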
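The ordering heuristic can be sketched as below, assuming the index is a dictionary from term to sorted postings list, so that the length of a list equals the term's document frequency. The names `intersect` and `intersect_terms` and the sample index are hypothetical.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by document ID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_terms(terms, index):
    """AND together several terms, starting with the rarest one."""
    # Sort postings lists by length, i.e. by increasing document frequency.
    postings = sorted((index.get(t, []) for t in terms), key=len)
    if not postings:
        return []
    result = list(postings[0])
    for plist in postings[1:]:
        if not result:
            break  # the intersection is already empty, nothing more to do
        result = intersect(result, plist)
    return result

index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect_terms(["brutus", "caesar", "calpurnia"], index))  # [2]
```

Starting with calpurnia, the rarest term, means every intermediate result has at most four entries, whichever list is merged next.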