From Frequency to Meaning: Vector Space Models of Semantics (Part 4)


Peter D. Turney, Patrick Pantel

Translator: Wu Yunyu, South China Normal University


2. Semantic Vector Space Models

The statistical semantics hypothesis is the unifying theme of the various VSMs discussed in this paper: statistical patterns of human word usage can be used to figure out what people mean. This general hypothesis underlies several more specific hypotheses, such as the bag of words hypothesis, the distributional hypothesis, the extended distributional hypothesis, and the latent relation hypothesis, all of which are discussed below.


2.1 Document Similarity: The Term-Document Matrix

In this article, we use the following notational conventions: a bold capital letter denotes a matrix, A; a bold lowercase letter denotes a vector, b; and a scalar is written in lowercase italics, c.
If we have a large collection of documents, and hence many document vectors, it is convenient to organize these vectors into a matrix. The row vectors of the matrix correspond to terms (usually a term is a word, but we will also consider other possibilities), and the column vectors correspond to documents (such as web pages). This kind of matrix is called a term-document matrix.
In mathematics, a bag (also called a multiset) is much like a set, except that it allows repetition. For example, {a, a, b, c, c, c} is a bag containing a, b, and c. In bags, as in sets, order does not matter: the bags {a, a, b, c, c, c} and {c, a, c, b, a, c} are equal. By stipulating that the first element of x is the number of a's in the bag, the second element is the number of b's, and the third element is the number of c's, we can represent the bag {a, a, b, c, c, c} with the vector x = <2, 1, 3>. A set of bags can likewise be represented by a matrix X, in which each column x:j corresponds to a bag, each row xi: corresponds to a unique member, and the element xij is the frequency of the i-th member in the j-th bag.
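The bag-to-vector encoding described above can be sketched in a few lines of Python. The vocabulary order (a, b, c) is fixed by convention, so equal bags always map to equal vectors:

```python
from collections import Counter

# Two bags with the same members in different orders.
bag1 = ["a", "a", "b", "c", "c", "c"]
bag2 = ["c", "a", "c", "b", "a", "c"]

# Fix a vocabulary order, then count occurrences of each member.
vocab = ["a", "b", "c"]

def bag_to_vector(bag):
    counts = Counter(bag)
    return [counts[w] for w in vocab]

print(bag_to_vector(bag1))  # [2, 1, 3]
print(bag_to_vector(bag2))  # [2, 1, 3] -- order is irrelevant, so the bags are equal
```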
In a term-document matrix, a document vector represents the corresponding document as a bag of words. In information retrieval, the bag of words hypothesis is the hypothesis that we can estimate the relevance of documents to a query by representing the query and the documents as bags of words. The bag of words hypothesis is the basis for applying the VSM to information retrieval (Salton et al., 1975). The hypothesis is that a column vector in a term-document matrix captures, to some degree, an aspect of the meaning of the corresponding document: what the document is about.
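A minimal sketch of the bag of words hypothesis in retrieval: represent the query and each document as bag vectors over a shared vocabulary, then score documents against the query. The corpus and query here are invented for illustration, and the dot product is used as a deliberately crude relevance score:

```python
# Toy corpus and query (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply",
]
query = "cat mat"

# Shared vocabulary covering the documents and the query.
vocab = sorted({w for d in docs for w in d.split()} | set(query.split()))

def bag_vector(text):
    words = text.split()
    return [words.count(w) for w in vocab]

q = bag_vector(query)
# Dot product of query and document vectors as a crude relevance score.
scores = [sum(a * b for a, b in zip(q, bag_vector(d))) for d in docs]
print(scores)  # [2, 1, 0] -- the first document mentions both query words
```

Real retrieval systems normalize these scores (e.g., with tf-idf weighting and cosine similarity), but even this raw count overlap ranks the documents sensibly.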
Let X be a term-document matrix. Suppose our document collection contains n documents and m unique terms. The matrix X then has m rows (one for each term in the vocabulary) and n columns (one for each document). Let wi be the i-th term in the vocabulary and dj the j-th document in the collection. The i-th row of X is the row vector xi: and the j-th column is the column vector x:j. The row vector xi: has n elements, one for each document; the column vector x:j has m elements, one for each term. Suppose X is a simple matrix of frequencies. Then the element xij of X is the frequency of the i-th term wi in the j-th document dj.
In general, most of the elements of X are zero (the matrix is sparse), because most documents use only a small fraction of the whole vocabulary. If we randomly choose a term wi and a document dj, it is likely that wi does not occur in dj, and therefore xij equals 0.
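The construction of X can be sketched directly from these definitions. The toy corpus below is invented for illustration; each row of X corresponds to a term wi and each column to a document dj, with xij the frequency of wi in dj:

```python
# A minimal term-document frequency matrix built from a toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# m rows: one per unique term; n columns: one per document.
vocab = sorted({w for d in docs for w in d.split()})
X = [[d.split().count(w) for d in docs] for w in vocab]

# x_ij is the frequency of term w_i in document d_j.
i = vocab.index("sat")
print(X[i])  # [1, 1, 0] -- "sat" occurs once in each of the first two documents

# Most entries are zero: the matrix is sparse.
zeros = sum(row.count(0) for row in X)
total = len(vocab) * len(docs)
print(zeros, "of", total, "entries are zero")
```

In practice, with vocabularies of hundreds of thousands of terms, X is stored in a sparse format (such as compressed sparse column) rather than as a dense array.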
The pattern of numbers in xi: is a kind of signature of the i-th term wi; likewise, the pattern of numbers in x:j is a signature of the j-th document dj. That is, the pattern of numbers tells us, to some degree, what the term or document is about.
The vector x:j may be a rather crude representation of the document dj. It tells us the frequencies of the words in the document, but the sequential order of the words is lost. The vector does not attempt to capture the structure of the phrases, sentences, paragraphs, and chapters of the document. Despite this crudeness, search engines work surprisingly well; the vectors seem to capture an important aspect of semantics.
Salton's VSM (1975) was arguably the first practical, useful algorithm for extracting semantic information from word usage. An intuitive justification for the term-document matrix is that the topic of a document probabilistically influences the author's choice of words when writing the document (translator's note: the reasoning here is similar to that of topic models). If two documents have similar topics, the two corresponding column vectors will tend to have similar patterns of numbers.
