Introduction to Information Retrieval: Chapter 1 Boolean retrieval (1)

Source: Internet
Author: User

The term information retrieval can have a very broad meaning. Just taking a credit card out of your wallet so that you can type in the card number is a form of information retrieval. As an academic field of study, however, information retrieval is defined as follows:

Information retrieval is the process of finding, within a large collection of unstructured documents, the documents that satisfy an information need.

Under this definition, information retrieval was once an activity that only a few people engaged in: librarians, lawyers, and professional searchers. Today, hundreds of millions of people use search engines every day to search web pages and email. Information retrieval is rapidly overtaking traditional database-style searching as the dominant way of accessing information. Information retrieval technology can also address other kinds of data and information problems.

Unstructured data is data that has no clear, semantically explicit structure that a computer can easily work with. Its opposite is structured data, the typical example being a relational database of the kind many companies use to store product inventories and employee records. In the real world, almost no data is completely unstructured, especially once you take the latent linguistic structure of human language into account. Even if we count only explicitly marked structure, most documents have headings, paragraphs, footnotes, and other structure that is labeled in the document. Information retrieval technology can also support semi-structured search, such as finding documents whose title contains Java and whose body contains threading.

Information retrieval also covers helping users browse or filter document collections and further process a set of retrieved documents.

Clustering: the process of grouping documents based on their content. It is similar to arranging books on different shelves according to their topic.

Classification: given a set of categories and a set of documents, the process of deciding which category each document belongs to. Typically some documents are first classified manually so that new documents can then be assigned to categories automatically.

Information retrieval systems can be divided into the following categories based on their scale:

Web search: the system searches billions of documents stored on millions of computers. The main problems faced by web search systems are how to gather the documents to be indexed, how to process such large volumes of data efficiently, and how to handle issues specific to the web, such as exploiting hyperlinks and resisting manipulation (given the commercial value of a high ranking, some sites alter their page content to obtain one).

Personal information retrieval: in recent years, personal computer operating systems have begun to integrate information retrieval. Email programs generally provide not only search but also text classification: they offer at least a spam filter, and usually a manual or automatic way of sorting mail into different folders. The main problems faced by such systems include handling the wide variety of file types found on a personal computer, and keeping the search system maintenance free and lightweight enough in startup time, processing, and disk usage that it does not interfere with normal use.

Enterprise search: search over collections such as a company's internal documents, a patent database, or research papers. In this case, the documents are often stored on a centralized file system, and one or a few dedicated machines provide search over them.

The techniques described in this book cover all of the above areas. However, the parallel and distributed aspects of web-scale search receive relatively little coverage, because little has been published on them. In any case, apart from the few web search companies, most programmers are more likely to work on personal information retrieval and enterprise search systems.

This chapter starts from a simple information retrieval problem and introduces the term-document matrix and the central data structure of the field, the inverted index. We then introduce the Boolean retrieval model and how Boolean queries are processed.

A simple information retrieval problem

Many people own a fat volume of Shakespeare's complete works. Suppose you want to find out which of Shakespeare's plays contain the words Brutus and Caesar but not Calpurnia. One way is to read through all the text from beginning to end, noting for each play whether it contains Brutus and Caesar and excluding it if it contains Calpurnia. This is the simplest form of document retrieval, a linear scan. The process is often called grepping, after the Unix command grep. A linear scan can be very effective, especially given the speed of modern computers, and it often allows wildcard pattern matching. For simple queries over a modest collection such as Shakespeare's complete works, which total a bit under one million words, modern computers are fast enough that nothing more is needed.

For many purposes, however, a linear scan is not enough. We may need to:

1. Process large document collections quickly. The amount of online data has grown at least as quickly as computer speed, and we would like to be able to search collections containing billions or even trillions of words.

2. Support more flexible matching operations. For example, it is impractical to use grep for the query Romans NEAR countrymen, where NEAR might be defined as within five words or within the same sentence.

3. Allow ranked retrieval. In many cases, you want the best answer to an information need among the many documents that contain certain words.

The way to avoid linearly scanning the text for every query is to index the documents in advance. Let us use Shakespeare's complete works to introduce the basics of the Boolean retrieval model. Suppose we record, for each document (here, a play by Shakespeare), whether it contains each of the words Shakespeare used (about 32,000 different words in total). The result is a binary term-document incidence matrix. Terms are the indexed units. They are usually words, and for the moment you can think of them as words, but some indexed units, such as I-9 or Hong Kong, are not what we would normally call words, which is why the information retrieval literature speaks of terms. Looking at a row of the matrix gives a vector for a term, showing which documents it appears in. Looking at a column gives a vector for a document, showing which terms occur in it.
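As a rough illustration of the incidence matrix, the following Python sketch builds one for a toy collection. The three tiny documents and the names `docs` and `build_incidence_matrix` are invented for this example, and tokenization is nothing more than a lowercase split on whitespace.

```python
def build_incidence_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] is 1 if terms[i] occurs in docs[j]."""
    # Treat each document as a set of lowercase words (toy tokenization).
    tokenized = [set(text.lower().split()) for text in docs]
    # One row per distinct term, in alphabetical order.
    terms = sorted(set().union(*tokenized))
    matrix = [[1 if term in doc else 0 for doc in tokenized] for term in terms]
    return terms, matrix


# Hypothetical three-document collection, not taken from the text.
docs = [
    "brutus killed caesar",
    "caesar married calpurnia",
    "brutus spoke",
]
terms, matrix = build_incidence_matrix(docs)

# Row view: the vector for a term shows which documents contain it.
row = dict(zip(terms, matrix))
print(row["caesar"])  # [1, 1, 0]

# Column view: the vector for a document shows which terms occur in it.
column = [r[0] for r in matrix]
print([t for t, bit in zip(terms, column) if bit])  # ['brutus', 'caesar', 'killed']
```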

To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar, and Calpurnia, complement the last one, and then perform a bitwise AND on the three vectors:

110100 AND 110111 AND 101111 = 100100

The answer is two plays: "Antony and Cleopatra" and "Hamlet".
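The same computation can be reproduced with integer bit operations. In this sketch the three vectors come from the example above (the Calpurnia vector is the complement of 101111), while the full list of play titles and their column order is an assumption: the text only confirms that the first and fourth columns are "Antony and Cleopatra" and "Hamlet".

```python
# Assumed column order; only the first and fourth titles are confirmed by the text.
PLAYS = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]
N = len(PLAYS)
MASK = (1 << N) - 1  # keeps the complement within N bits

# One bit per play, leftmost bit = first play.
incidence = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}


def matching_plays(bits):
    """Translate a result vector back into play titles."""
    return [play for i, play in enumerate(PLAYS) if bits & (1 << (N - 1 - i))]


# Brutus AND Caesar AND NOT Calpurnia
result = incidence["Brutus"] & incidence["Caesar"] & (~incidence["Calpurnia"] & MASK)

print(format(result, "06b"))   # 100100
print(matching_plays(result))  # ['Antony and Cleopatra', 'Hamlet']
```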

 

In the Boolean retrieval model, a query is a Boolean expression in which terms are combined with the operators AND, OR, and NOT, and each document is treated simply as a set of words.

Now let us consider a more realistic scenario, and use the opportunity to introduce some terminology and notation. Suppose we have N = 1 million documents. By document we mean whatever unit we have decided to build a retrieval system over. It might be an individual memo, or one or several chapters of a book. The group of documents over which we perform retrieval is called the collection, and is also referred to as a corpus. Suppose each document is about 1,000 words long (two or three book pages). If we assume an average of 6 bytes per word, including spaces and punctuation, then this collection is about 6 GB in size. Typically, there might be about M = 500,000 distinct terms in these documents. There is nothing special about the numbers we have chosen, and they may vary considerably from one collection to another, but they give us an idea of the scale of the problems we need to handle. We will discuss and model these size assumptions in section 5.1.

Our goal is to develop a system for the ad hoc retrieval task. In this most standard information retrieval task, the system aims to return the documents in the collection that are relevant to an arbitrary user information need, communicated to the system through a one-off, user-initiated query. An information need is the topic the user wants to know more about. It is distinct from a query, which is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if the user perceives it as containing information of value with respect to the information need. In the example above the information need was defined in terms of particular words, which made for a somewhat artificial case. In real life, a user may be interested in a topic such as "pipeline leaks" and would want to find relevant documents whether they use those exact words or express the concept in other words, such as "pipeline rupture". To assess the effectiveness of an information retrieval system, a user will usually want to know two key statistics about the results returned for a query:

Precision: the fraction of the returned results that are relevant to the information need.

Recall: the fraction of the relevant documents in the collection that were returned by the system.
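The two measures can be computed directly from sets of document numbers. This is a minimal sketch: the function name `precision_recall` and the sample document numbers are made up for illustration, and returning 0.0 when a set is empty is a convention chosen here rather than something the text prescribes.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) from two collections of document numbers."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were actually returned
    # Convention of this sketch: an empty denominator yields 0.0.
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 5 relevant in the collection, 2 in common.
returned = [1, 2, 3, 4]
relevant = [2, 4, 5, 6, 7]
p, r = precision_recall(returned, relevant)
print(p, r)  # 0.5 0.4
```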

In chapter 8, we will discuss the evaluation of relevance in detail, including precision and recall.

We cannot build the term-document matrix in the naive way. A 500K x 1M matrix has half a trillion 0s and 1s, far too many to fit in memory. The crucial observation, however, is that the matrix is extremely sparse: only a small fraction of its entries are non-zero. Because each document is about 1,000 words long, the matrix has no more than one billion 1s, so at least 99.8% of the cells are 0. A much better representation is to record only the entries whose value is 1.
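The size and sparsity figures follow from simple arithmetic on the assumptions stated in this section (1 million documents, about 1,000 words each, 6 bytes per word, 500,000 distinct terms). The short Python check below uses constant names chosen here for readability.

```python
N_DOCS = 1_000_000      # N, number of documents
WORDS_PER_DOC = 1_000   # average document length in words
BYTES_PER_WORD = 6      # including spaces and punctuation
N_TERMS = 500_000       # M, number of distinct terms

collection_bytes = N_DOCS * WORDS_PER_DOC * BYTES_PER_WORD
matrix_cells = N_TERMS * N_DOCS
# Each word occurrence can set at most one cell to 1.
max_ones = N_DOCS * WORDS_PER_DOC
min_zero_fraction = 1 - max_ones / matrix_cells

print(f"collection size: {collection_bytes / 1e9:.0f} GB")
print(f"matrix cells: {matrix_cells:,}")
print(f"at least {min_zero_fraction:.1%} of the cells are 0")
```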

This idea leads directly to a key concept in information retrieval: the inverted index. The name is actually redundant, because an index always maps back from terms to the documents that contain them. Nevertheless, inverted index, or sometimes inverted file, has become the standard term in information retrieval. Figure 1.3 illustrates the basic idea. We keep a dictionary of terms. For each term, we keep a list of the documents in which it occurs. Each item in the list is called a posting, and the list is called a postings list (or inverted list). The dictionary in Figure 1.3 is sorted alphabetically, and each postings list is sorted by document number. We will see in section 1.3 why this ordering is useful. Section 7.1.5 considers alternatives.
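To make the structure concrete, here is a small Python sketch of a dictionary that maps each term to a list of document numbers kept in increasing order, together with a linear merge that intersects two such lists. The document numbers are illustrative only and `intersect` is a name chosen here. The merge is one hint at why keeping the lists sorted pays off: both lists can be walked once, in step.

```python
# Dictionary: term -> list of document numbers, sorted in increasing order.
# The document numbers are illustrative values.
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}


def intersect(p1, p2):
    """Intersect two sorted lists in a single pass over each."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller document number
        else:
            j += 1
    return answer


print(intersect(index["brutus"], index["calpurnia"]))  # [2, 31]
```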

A first take at building an inverted index

To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. The main steps are:

1. Collect documents to be indexed

Friends, Romans, countrymen. So let it be with Caesar ......

2. Tokenize the text, turning each document into a sequence of tokens.

Friends Romans Countrymen So ......

3. Perform linguistic preprocessing, normalizing the tokens in this sequence to produce the indexing terms.

Friend Roman Countryman So ......

4. Index the documents in which each term occurs by creating an inverted index, consisting of a dictionary and postings lists.

We will define and discuss steps 1 to 3 in section 2.2. Until then, you can think of tokens and normalized tokens as loosely equivalent to words. Here we assume that the first three steps have already been done, and we focus on building a basic inverted index by sort-based indexing.

Within a document collection, we assume that each document has a unique serial number, known as the document number. During index construction, we can simply assign successive integers to each new document as it is first encountered. For each document, the input to indexing is a list of normalized tokens, which we can equally think of as a list of (term, document number) pairs, as shown in Figure 1.4. The core indexing step is to sort this list so that the terms are in alphabetical order, as shown in the middle column of Figure 1.4. Multiple occurrences of the same term in the same document are then merged. Instances of the same term are then grouped, and the result is split into a dictionary and postings lists, as shown in the right column of Figure 1.4. Since a term generally occurs in several documents, this organization already reduces the storage the index requires. The dictionary also records some statistics, such as the number of documents that contain each term (the document frequency). This information is not vital for a basic Boolean search engine, but it allows us to improve efficiency at query time, and it is a statistic used later in many ranked retrieval models. The postings lists are sorted by document number, which provides the basis for efficient query processing. The inverted index is essentially unrivaled as the most efficient structure for supporting ad hoc text search.
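The sort-based construction described above can be sketched in a few lines of Python. The sketch assumes tokenization and normalization have already been done, so each document arrives as a list of terms. The function name `build_index` is chosen here, documents 1 and 2 reuse the normalized example text from the steps above, and document 3 is invented to show merging and grouping.

```python
from itertools import groupby


def build_index(docs):
    """docs: {document number: [normalized terms]} -> {term: (document frequency, postings)}."""
    # 1. Produce the list of (term, document number) pairs.
    pairs = [(term, doc_id) for doc_id, terms in docs.items() for term in terms]
    # 2. Merge repeated pairs, then sort by term and, within a term, by document number.
    pairs = sorted(set(pairs))
    # 3. Group by term: the keys form the dictionary, the groups form the postings lists.
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        postings = [doc_id for _, doc_id in group]
        index[term] = (len(postings), postings)  # store document frequency alongside
    return index


docs = {
    1: ["friend", "roman", "countryman"],
    2: ["so", "let", "it", "be", "with", "caesar"],
    3: ["caesar", "friend", "caesar"],  # hypothetical extra document
}
index = build_index(docs)

print(index["caesar"])  # (2, [2, 3])
print(index["friend"])  # (2, [1, 3])
print(list(index))      # terms come out in alphabetical order
```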

In the resulting index, we pay for storing both the dictionary and the postings lists. The latter are much larger, but the dictionary is commonly kept in memory, while the postings lists are normally kept on disk, so the size of each is important. In chapter 5 we will discuss how to optimize the storage of both for efficient access.

What data structure should be used for a postings list? A fixed-length array would be wasteful, because some words occur in many documents and others in very few. For an in-memory postings list, two good alternatives are a singly linked list and a variable-length array. A singly linked list makes it cheap to insert a document into a postings list, and extends naturally to more advanced indexing strategies such as skip lists, but it needs extra space for pointers. A variable-length array avoids the space overhead of pointers and is faster to process because it uses contiguous memory. In practice, extra pointers can be encoded into the lists as offsets. If updates are relatively infrequent, variable-length arrays are more compact and faster to traverse. We can also use a hybrid scheme with a linked list of fixed-length arrays for each term. When postings lists are stored on disk, they are stored contiguously without explicit pointers, which minimizes both the size of each postings list and the number of disk seeks needed to read it into memory.
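The two in-memory options can be compared with a short Python sketch. `Node` and `LinkedPostings` are hypothetical classes written for this illustration, and the standard-library `array` type stands in for a variable-length array of unsigned integers. The example only shows the shape of each structure and makes no claim about measured speed or memory use.

```python
from array import array


class Node:
    """One posting in a singly linked list: a document number plus a pointer."""
    __slots__ = ("doc_id", "next")

    def __init__(self, doc_id, next=None):
        self.doc_id = doc_id
        self.next = next


class LinkedPostings:
    """Linked list kept in document-number order; inserting only relinks pointers."""

    def __init__(self):
        self.head = None

    def insert(self, doc_id):
        prev, cur = None, self.head
        while cur is not None and cur.doc_id < doc_id:
            prev, cur = cur, cur.next
        if cur is not None and cur.doc_id == doc_id:
            return  # document already present
        node = Node(doc_id, cur)
        if prev is None:
            self.head = node
        else:
            prev.next = node

    def __iter__(self):
        cur = self.head
        while cur is not None:
            yield cur.doc_id
            cur = cur.next


linked = LinkedPostings()
for doc_id in (4, 1, 31, 2, 4):  # arbitrary insertion order, one duplicate
    linked.insert(doc_id)
print(list(linked))  # [1, 2, 4, 31]

# Variable-length array: contiguous unsigned integers, no per-posting pointer.
compact = array("I", [1, 2, 4])
compact.append(31)  # cheap when new documents arrive in increasing order
print(list(compact))  # [1, 2, 4, 31]
```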
