[Reprint] The nature of index in information retrieval

Source: Internet
Author: User

Original: http://my.oschina.net/zjzhai/blog/464446

A fairly good popular-science article that introduces the inverted index.

If anything here is incorrect or poorly understood, corrections are welcome.

Information retrieval problems

First, let's look at the problem domain. Every technology product exists to solve a certain kind of problem. Without starting from the problem domain, it is hard to understand why the product is the way it is. It is like someone who has never studied programming language design: they can only be led along by whatever language they happen to use.

The model behind information retrieval is simple: it is about finding the information you need within a large amount of information. This type of problem has a more professional name: information retrieval (IR). In everyday life, such problems abound:

    • How can we quickly find which page of a book a given word is on?

    • If there are no search engines and catalogs, how do we find the books we want in a large library?

    • Isn't it inefficient for people looking for a house to find their ideal home by browsing every listing one by one?

    • How can we conveniently find all the Chinese restaurants nearby?

Solving information retrieval problems

The examples above show only the simplest side of information retrieval. There are more complex problems behind it.

Imagine that you need to find the largest number in the following list:

1, 23, 56, 3, 40, 41.1, 900, 12

This is simple: you can get the answer at a glance. But what if there are 10 billion numbers? If a person can pick out the largest of 10 numbers in one second and keeps looking 24 hours a day, 365 days a year, it will take about 31.7 years to get the answer.
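The estimate above can be checked with a few lines of Python. This is only a sketch: it scans the example list once to find the largest value, then repeats the arithmetic under the stated assumption of 10 numbers per second and a 365-day year. The variable names are chosen here for illustration.

```python
# Linear scan: look at every number once and remember the largest seen so far.
numbers = [1, 23, 56, 3, 40, 41.1, 900, 12]

largest = numbers[0]
for n in numbers[1:]:
    if n > largest:
        largest = n
print(largest)

# How long would a person need for 10 billion numbers at 10 numbers per second?
total_numbers = 10_000_000_000
numbers_per_second = 10          # assumption taken from the text
seconds_per_year = 365 * 24 * 3600

years = total_numbers / numbers_per_second / seconds_per_year
print(round(years, 1))
```

A machine still has to look at every number once, but it does so many orders of magnitude faster than a person.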

Such a boring and inhumane task is one we hand over to the machine. This leads to the first problem of information retrieval: there is so much information that we cannot find what we really want in a limited time, so we need a machine to help us.

But how does the machine know what we are looking for? This leads to the second problem of information retrieval: how do we make the machine understand what we are looking for? The answer is simple: we just tell it. This idea is right, in my opinion. Along this line of thinking, we will probably encounter the following questions:

1. How do we humans express the question "What are we looking for?"

Our usual practice is to give some of the characteristics of the information we are looking for. In the search field, these partial characteristics are called keywords. In fact, keywords are the characteristics, as they exist in our minds, of the information we are looking for.

2. How do we make the machine understand the "What are we looking for" question?

This problem is very complicated. When we search for "IR" on Google, how does it know whether we mean "information retrieval" or "Ingersoll Rand"?

3. How does the machine know what information we are looking for?

The focus of this article is to answer question 3, although in essence these three questions should be discussed together.

The nature of the index

For the question of how the machine knows what information we are looking for, the solution domain model I have seen is this: tell the machine how to extract (or assign, one by one) the characteristics of the information (the index). The machine then compares these "indexes" with the characteristics (keywords) of the information in the searcher's mind, and so knows what the user is looking for.

Here we have arrived at the essence of the index: the characteristics of information.

Going back to the earlier examples:

    • Dictionaries are ordered alphabetically or by stroke count. Letters and strokes are characteristics of a word, so they can be used to index it.

    • Each book in a library has a number made up of meaningful letters and digits, such as T2300004, which stands for the science and technology category, 2nd floor, row 3, and so on. We can easily find the book based on this number. This also gives us a hint: when the characteristics of the information are not obvious, we can add them manually. For a movie, for example, we can manually add tags such as "action film" or "romance film" so that it can be searched. Some people do not search by the name of the movie; they may search for "romance films" instead.

    • Nearby restaurants can be characterized by geographical coordinates, whether the food is good, and whether the price is reasonable.

Since the index is the characteristics of information, how do we organize the index so that it is easy to use? There are currently two ways to organize an index:

    • Associate a set of features with each piece of information

    • Associate a set of information items with each feature

Forward index: associating a set of features with each piece of information

In my experience, the forward index makes the nature of the index easier to understand than the reverse index does.

In fact, the structure of the forward index is simple: each piece of information is followed by the list of its features.
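As a minimal sketch, a forward index can be modeled as a mapping from each item to its list of features. The movie identifiers and tags below are made up for illustration and do not come from the original article.

```python
# Forward index: each piece of information maps to its features.
# The items and tags are hypothetical examples.
forward_index = {
    "movie_1": ["action", "romance"],
    "movie_2": ["romance"],
    "movie_3": ["action", "comedy"],
}

def find_by_feature(index, feature):
    """Scan every item and keep those whose feature list contains `feature`."""
    return [item for item, features in index.items() if feature in features]

print(find_by_feature(forward_index, "romance"))
```

Note that answering "which items have this feature?" means visiting every entry, which becomes slow as the collection grows.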

Reverse index: associating a set of information items with each feature

In the field of information retrieval, the reverse index is properly called the inverted index. In China the term is usually translated literally, and like most people I was confused by it at first, so I prefer to call it a reverse index. It is presumably called "inverted" because a forward index exists.

However, the solution domain model means that we end up using the reverse index structure rather than the forward one. Maybe that is why most information retrieval books do not mention the forward index at all.
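One way to see the relationship between the two structures is to build a reverse index by inverting a forward index. The sketch below uses a hypothetical set of tagged movies; the `invert` helper is a name introduced here for illustration.

```python
from collections import defaultdict

# Hypothetical forward index: item -> features.
forward_index = {
    "movie_1": ["action", "romance"],
    "movie_2": ["romance"],
    "movie_3": ["action", "comedy"],
}

def invert(index):
    """Turn item -> features into feature -> items."""
    inverted = defaultdict(list)
    for item, features in index.items():
        for feature in features:
            inverted[feature].append(item)
    return dict(inverted)

reverse_index = invert(forward_index)

# A keyword lookup is now a single dictionary access instead of a full scan.
print(reverse_index["romance"])
```

Because a search starts from keywords (features), the feature-to-items layout answers the query directly.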

Implementing a reverse index

Whether you implement a forward or a reverse index, you need to extract characteristics from the information. Different types of information have different kinds of characteristics.

For textual information, we assume that "the importance of a word depends on how often it appears in the document" (Luhn, 1958). This means that at query time, how often the query word appears in a document determines how important that document is to the query.
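A rough sketch of this idea: count how often the query word occurs in each document and rank the documents by that raw count. The two documents are invented for illustration, and real systems normalize and weight these counts in ways not shown here.

```python
from collections import Counter

# Hypothetical documents, used only to illustrate counting.
documents = {
    "doc_1": "the fox saw another fox",
    "doc_2": "the dog chased the fox",
}

def rank_by_term_count(docs, query_word):
    """Return (doc_id, count) pairs, highest count of `query_word` first."""
    counts = {
        doc_id: Counter(text.split())[query_word]
        for doc_id, text in docs.items()
    }
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

print(rank_by_term_count(documents, "fox"))
```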

Based on this, to extract features from text it seems that we simply need to use every word in the text as an index entry (the technical name is a term). But reality is not so simple. This process is called tokenization, and different people and frameworks split even this one process into several stages.

We can picture the process as follows. Suppose there are two pieces of information:

1. The quick brown fox jumped over the lazy dog

2. Quick brown foxes leap over lazy dogs in summer

Running a tokenizer over them produces a reverse index that lists, for each term, the documents that contain it.
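As a minimal sketch of that step, the code below builds a reverse index for the two sentences, in the form they have in Elasticsearch: The Definitive Guide. It assumes the simplest possible tokenizer, one that only splits on whitespace; the function names are introduced here for illustration.

```python
from collections import defaultdict

documents = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def tokenize(text):
    """Simplest possible tokenizer (an assumption): split on whitespace only."""
    return text.split()

def build_reverse_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

reverse_index = build_reverse_index(documents)

# Print one row per term, followed by the documents it appears in.
for term in sorted(reverse_index):
    print(f"{term:8} {sorted(reverse_index[term])}")
```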

When we search for "quick brown", we look up each query term in the index and collect the documents that contain it.
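The lookup can be sketched as follows, using the two sentences in the form they have in Elasticsearch: The Definitive Guide. The scoring here is an assumption made for illustration: it only counts how many query terms each document matches, with a whitespace-only tokenizer.

```python
from collections import Counter

documents = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def build_reverse_index(docs):
    """Map each whitespace-separated term to the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """Count matching query terms per document, best match first."""
    matches = Counter()
    for term in query.split():
        for doc_id in index.get(term, ()):
            matches[doc_id] += 1
    return matches.most_common()

reverse_index = build_reverse_index(documents)
print(search(reverse_index, "quick brown"))
```

Both documents match, but document 1 contains both query terms while document 2 contains only "brown", so document 1 ranks first.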

However, different tokenizers produce different indexes and ultimately affect your search results. A careful reader will notice problems with the reverse index above: for example, "Quick" and "quick" are indexed as different terms, as are "fox" and "foxes". Therefore, when indexing, make sure to choose an appropriate tokenizer.

The examples in this section are from Elasticsearch: The Definitive Guide.
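To illustrate how the choice of tokenizer changes the results, the sketch below compares a whitespace-only tokenizer with one that also lowercases and strips plural endings. The stemming rule is deliberately naive and is an assumption made for this example, not how any particular framework does it.

```python
from collections import Counter

documents = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def whitespace_tokenize(text):
    return text.split()

def normalizing_tokenize(text):
    """Lowercase, then strip a trailing plural ending (naive, for illustration)."""
    terms = []
    for token in text.lower().split():
        if token.endswith("es") and len(token) > 4:
            token = token[:-2]
        elif token.endswith("s") and len(token) > 3:
            token = token[:-1]
        terms.append(token)
    return terms

def build_reverse_index(docs, tokenize):
    index = {}
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(docs, query, tokenize):
    """Index and query must use the same tokenizer for terms to line up."""
    index = build_reverse_index(docs, tokenize)
    matches = Counter()
    for term in tokenize(query):
        for doc_id in index.get(term, ()):
            matches[doc_id] += 1
    return dict(matches)

print(search(documents, "quick fox", whitespace_tokenize))
print(search(documents, "quick fox", normalizing_tokenize))
```

With the whitespace tokenizer only document 1 matches "quick fox"; with the normalizing tokenizer both documents match both terms.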

Summary

Starting from the problem domain, we derived the nature of the index, namely the characteristics of information, step by step. With this understanding, we can easily see why today's search engines work the way they do, without getting lost in a maze of knowledge.

However, for information retrieval, is there really no approach other than the index as a solution domain model? This is a question worth thinking about.

This article talks about information retrieval, but to be more precise it is about text information retrieval. Images and audio cannot be handled with a tokenizer, so what do we do? Don't forget our solution domain model: tell the machine how to extract (or assign, one by one) the characteristics (the index) of the information, and the machine then compares these "indexes" with the characteristics (keywords) of the information in the searcher's mind to know what the user is looking for. For image retrieval and speech retrieval, all we have to do is find a way to extract the appropriate features from pictures and audio.
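As a toy illustration of the same model applied to images, the sketch below uses a brightness histogram as the extracted feature and finds the stored image whose feature is closest to the query's. Everything here is hypothetical: the "images" are short lists of grayscale pixel values, and real systems use far richer features.

```python
def brightness_histogram(pixels, bins=4):
    """Fraction of pixels (values 0-255) that fall into each brightness bin."""
    counts = [0] * bins
    for value in pixels:
        counts[min(value * bins // 256, bins - 1)] += 1
    return [count / len(pixels) for count in counts]

def distance(a, b):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical "images": flat lists of grayscale pixel values.
images = {
    "night_sky": [5, 10, 20, 30, 15, 250],
    "snow_field": [240, 250, 255, 245, 230, 40],
    "dusk": [60, 70, 20, 30, 80, 200],
}

# The index stores the extracted feature of each image, not the image itself.
index = {name: brightness_histogram(pixels) for name, pixels in images.items()}

def most_similar(query_pixels):
    """Extract the same feature from the query and pick the closest image."""
    query_feature = brightness_histogram(query_pixels)
    return min(index, key=lambda name: distance(index[name], query_feature))

print(most_similar([0, 12, 25, 8, 40, 255]))
```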
