Introduction to LSI (Latent Semantic Indexing)

1. Introduction

Words (terms) in natural-language text exhibit polysemy and synonymy.
Because one word can carry several meanings, an exact-match search algorithm returns much that users do not want;
and because one meaning can be expressed by several different words, it also misses much that users do want.

The following is an example:

Doc1, doc2, and doc3 are three documents. The following table lists the occurrences of some terms in them; an asterisk marks a term that also appears in the query discussed below:

             Doc1   Doc2   Doc3
--------------------------------------------------
Access        x
Document      x
Retrieval     x
Information          x*
Theory               x
Database      x
Indexing      x
Computer                    x*
--------------------------------------------------

Now suppose "information" and "computer" are used as the query keywords. Exact matching then selects doc2 and doc3.
However, doc2 is a document the user does not want, while doc1, which the user does want, is never retrieved.
Exact matching thus fails to reflect the user's intent. Is there a better way to solve this problem?

Of course, if retrieval could be based on full natural language understanding, there would be no problem. The difficulties are that
(1) the current level of natural language understanding is still limited, and (2) applying it would be very inefficient.
We therefore want a method that captures the inherent relevance between terms while remaining efficient.

A research team at Bellcore headed by Dumais proposed a method that tries to bypass natural language
understanding and achieve the same goal by statistical means. The method is called
Latent Semantic Indexing, LSI for short.

2. The LSI Method

First, build a large matrix with terms as rows and documents as columns: t rows and d columns.
Call this matrix X; its element (i, j) records how often term i occurs in document j.
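
A minimal sketch in Python with NumPy (the three toy documents are invented to mirror the example table above; any tokenizer could be substituted):

    import numpy as np

    # Three toy documents mirroring the example table above.
    docs = [
        "access document retrieval database indexing",  # doc1
        "information theory",                           # doc2
        "computer",                                     # doc3
    ]

    # The vocabulary, in a fixed order, gives the rows of X.
    terms = sorted({w for d in docs for w in d.split()})

    # X[i, j] = number of times term i occurs in document j.
    X = np.array([[d.split().count(t) for d in docs] for t in terms],
                 dtype=float)
    print(X.shape)  # (8 terms, 3 documents)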

It can be shown mathematically that X can be decomposed into the product of three matrices:

X = T0 * S0 * D0'

where D0' is the transpose of D0, the columns of T0 and D0 are orthonormal, and S0 is a diagonal matrix.
T0 is a t x m matrix, S0 is m x m, and D0 is d x m, where m is the rank of X.
This decomposition is called Singular Value Decomposition (SVD).

In general, T0, S0, and D0 have full rank, and the diagonal elements of S0 can easily be arranged in decreasing order.

Now keep only the largest k of the m singular values in S0 and set the remaining m - k to zero;
deleting the corresponding columns of T0 and D0 yields matrices T, S, and D and a new approximate decomposition:

Xhat = T * S * D'

Remarkably, Xhat is the best rank-k approximation of X in the least-squares sense. We thus have a path
to "dimensionality reduction". The sections below describe the value of the three matrices T, S, and D in document retrieval.

A remaining question is how large k should be. The larger k is, the smaller the distortion, but the greater the cost;
choosing k requires balancing these against the demands of the actual problem.
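
As a sketch of this truncation (continuing the Python snippet above; the choice k = 2 is an illustrative assumption, not prescribed by the method):

    # Continues the previous sketch: X is the term-by-document matrix.
    # NumPy returns X = T0 @ diag(s0) @ D0t with the singular values
    # already sorted in decreasing order, which matches the SVD above.
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

    k = 2                    # illustrative; the right k is problem-dependent
    T = T0[:, :k]            # t x k
    S = np.diag(s0[:k])      # k x k
    Dt = D0t[:k, :]          # k x d  (this is D')

    Xhat = T @ S @ Dt        # best rank-k least-squares approximation of X
    print(np.linalg.norm(X - Xhat))  # distortion shrinks as k grows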

3. Comparisons Based on Xhat

Given the matrix X, three questions closely related to document retrieval can be asked:

(1) How similar are terms i and j?

(2) How similar are documents i and j?

(3) How closely is term i related to document j?

The first kind of question concerns the comparison and clustering of terms;

The second kind concerns the comparison and clustering of documents;

The third kind concerns the association between terms and documents.

We use Xhat to make all three kinds of comparison.

3.1 Comparing two terms

Multiply "forward":

Xhat * Xhat' = T * S * D' * D * S * T' = T * S^2 * T'

(using D' * D = I, since the columns of D are orthonormal). Entry (i, j) of this matrix gives the similarity between terms i and j.
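
Continuing the Python sketches above (reusing T and S; this shortcut is an observation, not part of the original description), there is no need to form the full t x t product: since S is diagonal, row i of T * S is a k-dimensional vector representing term i, and dot products of these rows give the same similarities.

    term_vecs = T @ S                   # t x k; row i represents term i
    term_sim = term_vecs @ term_vecs.T  # entry (i, j): similarity of terms i and j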

3.2 Comparing two documents

Multiply "backward":

Xhat' * Xhat = D * S * T' * T * S * D' = D * S^2 * D'

(using T' * T = I, since the columns of T are orthonormal). Entry (i, j) of this matrix gives the similarity between documents i and j.
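
Likewise, continuing the sketch (Dt as computed there), rows of D * S serve as k-dimensional document vectors:

    doc_vecs = Dt.T @ S               # d x k; row i represents document i
    doc_sim = doc_vecs @ doc_vecs.T   # entry (i, j): similarity of documents i and j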

3.3 Comparing a term and a document

This is simply Xhat itself: entry (i, j) gives the degree of association between term i and document j.

3.4 Querying

A set of keywords can be treated as a virtual document. The task of querying is then to compare this
virtual document with the other documents for similarity and pick out the most similar ones
(how many to pick depends on the requirements of the actual problem).
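
One standard way to carry this out, sketched below under the same assumptions as the earlier snippets (the folding formula comes from the LSI literature rather than the text above; the query terms are those of the running example), is to map the query's term-frequency vector q into the k-dimensional space as q * T * S^-1 and rank documents by cosine similarity against the rows of D:

    # Continues the earlier sketches (terms, T, S, Dt assumed in scope).
    q = np.array([1.0 if t in ("information", "computer") else 0.0
                  for t in terms])      # term-frequency vector of the query

    q_k = q @ T @ np.linalg.inv(S)      # fold the query into the k-dim space
    doc_coords = Dt.T                   # rows of D: document coordinates

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    scores = [cosine(q_k, dv) for dv in doc_coords]
    print(np.argsort(scores)[::-1])     # document indices, most similar first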

*********************

The above briefly introduces the LSI method. Experiments show that it holds good promise
for content-based querying and information filtering.
