From Frequency to Meaning: Vector Space Models of Semantics (Part 4)


Peter D. Turney, Patrick Pantel

Translator: Wu Yunyu, South China Normal University


2. Semantic Vector Space Models

The statistical semantics hypothesis is the unifying theme of the various VSMs discussed in this paper: statistical patterns of human word usage can be used to figure out what people mean. This general hypothesis underlies several more specific hypotheses, such as the bag of words hypothesis, the distributional hypothesis, the extended distributional hypothesis, and the latent relation hypothesis, all of which are discussed below.


2.1 Document Similarity: The Term-Document Matrix

In this article, we use the following notational conventions: a bold capital letter denotes a matrix, A; a bold lowercase letter denotes a vector, b; and a scalar is written in lowercase italics, c.
If we have a large collection of documents, and hence many document vectors, it is convenient to organize these vectors into a matrix. The row vectors of the matrix correspond to terms (usually a term is a word, but we will also consider other possibilities), and the column vectors correspond to documents (such as web pages). This kind of matrix is called a term-document matrix.
In mathematics, a bag (also called a multiset) is much like a set, except that it allows repetition. For example, {a, a, b, c, c, c} is a bag containing a, b, and c. In bags, as in sets, order does not matter: the bags {a, a, b, c, c, c} and {c, a, c, b, a, c} are equal. By stipulating that the first element of x is the number of a's in the bag, the second element is the number of b's, and the third element is the number of c's, we can represent the bag {a, a, b, c, c, c} with the vector x = <2, 1, 3>. A set of bags can likewise be represented by a matrix X, in which each column x:j corresponds to a bag, each row xi: corresponds to a unique member, and the element xij is the frequency of the i-th member in the j-th bag.
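The bag-to-vector encoding described above can be sketched in a few lines of Python. The vocabulary order (a, b, c) is fixed by convention, so equal bags always map to equal vectors:

```python
from collections import Counter

# Two bags with the same members in different orders.
bag1 = ["a", "a", "b", "c", "c", "c"]
bag2 = ["c", "a", "c", "b", "a", "c"]

# Fix a vocabulary order, then count occurrences of each member.
vocab = ["a", "b", "c"]

def bag_to_vector(bag):
    counts = Counter(bag)
    return [counts[w] for w in vocab]

print(bag_to_vector(bag1))  # [2, 1, 3]
print(bag_to_vector(bag2))  # [2, 1, 3] -- order is irrelevant, so the bags are equal
```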
In a term-document matrix, a document vector represents the corresponding document as a bag of words. In information retrieval, the bag of words hypothesis is the hypothesis that we can estimate the relevance of documents to a query by representing the query and the documents as bags of words. The bag of words hypothesis is the basis for applying the VSM to information retrieval (Salton et al., 1975). The hypothesis is that a column vector in a term-document matrix captures, to some degree, an aspect of the meaning of the corresponding document: what the document is about.
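A minimal sketch of the bag of words hypothesis in retrieval: represent the query and each document as bag vectors over a shared vocabulary, then score documents against the query. The corpus and query here are invented for illustration, and the dot product is used as a deliberately crude relevance score:

```python
# Toy corpus and query (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply",
]
query = "cat mat"

# Shared vocabulary covering the documents and the query.
vocab = sorted({w for d in docs for w in d.split()} | set(query.split()))

def bag_vector(text):
    words = text.split()
    return [words.count(w) for w in vocab]

q = bag_vector(query)
# Dot product of query and document vectors as a crude relevance score.
scores = [sum(a * b for a, b in zip(q, bag_vector(d))) for d in docs]
print(scores)  # [2, 1, 0] -- the first document mentions both query words
```

Real retrieval systems normalize these scores (e.g., with tf-idf weighting and cosine similarity), but even this raw count overlap ranks the documents sensibly.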
Let X be a term-document matrix. Suppose our document collection contains n documents and m unique terms. The matrix X then has m rows (one for each term in the vocabulary) and n columns (one for each document). Let wi be the i-th term in the vocabulary and dj the j-th document in the collection. The i-th row of X is the row vector xi: and the j-th column is the column vector x:j. The row vector xi: has n elements, one for each document; the column vector x:j has m elements, one for each term. Suppose X is a simple matrix of frequencies. Then the element xij of X is the frequency of the i-th term wi in the j-th document dj.
In general, most of the elements of X are zero (the matrix is sparse), because most documents use only a small fraction of the whole vocabulary. If we randomly choose a term wi and a document dj, it is likely that wi does not occur in dj, and therefore xij equals 0.
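The construction of X can be sketched directly from these definitions. The toy corpus below is invented for illustration; each row of X corresponds to a term wi and each column to a document dj, with xij the frequency of wi in dj:

```python
# A minimal term-document frequency matrix built from a toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# m rows: one per unique term; n columns: one per document.
vocab = sorted({w for d in docs for w in d.split()})
X = [[d.split().count(w) for d in docs] for w in vocab]

# x_ij is the frequency of term w_i in document d_j.
i = vocab.index("sat")
print(X[i])  # [1, 1, 0] -- "sat" occurs once in each of the first two documents

# Most entries are zero: the matrix is sparse.
zeros = sum(row.count(0) for row in X)
total = len(vocab) * len(docs)
print(zeros, "of", total, "entries are zero")
```

In practice, with vocabularies of hundreds of thousands of terms, X is stored in a sparse format (such as compressed sparse column) rather than as a dense array.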
The pattern of numbers in xi: is a kind of signature of the i-th term wi; likewise, the pattern of numbers in x:j is a signature of the j-th document dj. That is, the pattern of numbers tells us, to some degree, what the term or document is about.
The vector x:j may be a rather crude representation of the document dj. It tells us the frequencies of the words in the document, but the sequential order of the words is lost. The vector does not attempt to capture the structure of the phrases, sentences, paragraphs, and chapters of the document. Despite this crudeness, search engines work surprisingly well; the vectors seem to capture an important aspect of semantics.
Salton's VSM (1975) was arguably the first practical, useful algorithm for extracting semantic information from word usage. An intuitive justification for the term-document matrix is that the topic of a document probabilistically influences the author's choice of words when writing the document (translator's note: the reasoning here is similar to that of topic models). If two documents have similar topics, the two corresponding column vectors will tend to have similar patterns of numbers.
