Paper read: A Simple but Tough-to-Beat Baseline for Sentence Embeddings

Source: Internet
Author: User
Tags: idf

This paper presents the SIF sentence embedding method; the authors provide the code on GitHub.

Introduction

SIF sentence embedding is an unsupervised method for computing the similarity between sentences. It uses pre-trained word vectors and a weighted-average scheme: the vectors of all the words in a sentence are combined by a weighted average to obtain the embedding vector of the whole sentence. The sentence vectors are then used to compute similarity.
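For example, once each sentence has its embedding vector, the similarity between two sentences is typically measured by cosine similarity. A minimal sketch (the function name and inputs are illustrative, not from the paper's code):

```python
import numpy as np

def cosine_similarity(s1_vec, s2_vec):
    """Cosine similarity between two sentence embedding vectors."""
    return np.dot(s1_vec, s2_vec) / (np.linalg.norm(s1_vec) * np.linalg.norm(s2_vec))
```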

Before this paper, very similar ideas had already been used: obtain the sentence vector by averaging word vectors, differing only in how the averaging weights are computed. Specifically:

    • Average all the word vectors in the sentence with equal weight for each word to get the sentence embedding.
    • Use the tf-idf value of each word as its weight and take the weighted average to get the sentence embedding (see the sketch below).
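A minimal sketch of these two baselines (here `word_vecs` maps words to pre-trained vectors and `idf` maps words to their inverse document frequency; both names are illustrative assumptions):

```python
import numpy as np

def average_embedding(sentence, word_vecs):
    """Unweighted average of the word vectors in the sentence (a list of tokens)."""
    return np.mean([word_vecs[w] for w in sentence], axis=0)

def tfidf_embedding(sentence, word_vecs, idf):
    """tf-idf weighted average; tf is the count of each word in this sentence."""
    tf = {w: sentence.count(w) for w in set(sentence)}
    weights = np.array([tf[w] * idf[w] for w in sentence])
    vecs = np.array([word_vecs[w] for w in sentence])
    return weights @ vecs / weights.sum()
```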

This paper uses the smooth inverse frequency (SIF) of each word as its weight instead of the tf-idf value, which gives better results. Besides the new word-weighting scheme, after the weighted average the method also removes the projection onto the first principal component, which yields the final sentence embedding.

The paper also discusses the robustness of this method:

    • Word embeddings trained on different corpora (from various domains) all achieve very good results, showing that the method works well across corpora.
    • Using different corpora to estimate the word frequencies that enter the weight computation has little effect on the final result.
    • The results are stable over a wide range of values of the method's hyper-parameters, i.e. the choice of hyper-parameters has little effect.
Theory

1. Building the model

First, we start from a latent variable generative model. This model treats the creation of a corpus as a dynamic process in which the \(t\)-th word is generated at step \(t\).

Each word \(w\) corresponds to a vector in \(\mathbb{R}^d\). The dynamic process is driven by the random walk of a discourse vector \(c_t\in\mathbb{R}^d\). The discourse vector represents what the sentence is talking about; as a latent variable it represents the state of the sentence, and since this state changes over time it is written \(c_t\).

The vector \(v_w\) of word \(w\) is related to the current discourse vector \(c_t\) through their inner product, which indicates the relationship between the word and the whole sentence. We assume that the probability of observing word \(w\) at time \(t\) is log-linear in this inner product:

\[\Pr(\text{$w$ emitted at time $t$}\mid c_t) \propto \exp(\langle c_t, v_w\rangle)\]
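Normalizing over the vocabulary, this is just a softmax of the inner products between the discourse vector and every word vector. A small illustrative snippet (the variable names are assumptions, not from the paper's code):

```python
import numpy as np

def emission_probs(c_t, vocab_vecs):
    """Pr[w emitted at time t | c_t] as a softmax over <c_t, v_w> for all words w.

    c_t        : discourse vector, shape (d,)
    vocab_vecs : matrix of word vectors, shape (V, d)
    """
    scores = vocab_vecs @ c_t      # inner product with every word vector
    scores -= scores.max()         # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```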

Because \(c_t\) performs a slow random walk, \(c_t\) and \(c_{t+1}\) differ only by a small random displacement vector, so adjacent words are generated by similar discourse vectors. The model also allows the random walk to make occasional larger jumps in \(c_t\), which the paper shows has only a negligible effect on the co-occurrence probabilities.

The word vectors produced by this model are similar to those produced by word2vec (CBOW) and GloVe.

2. Improving the random walk model

Given the model above, we would like the sentence embedding of a sentence, namely the maximum likelihood estimate of its discourse vector. To simplify, note that \(c_t\) changes very little while the sentence is generated, so we assume that the discourse vector is a single fixed vector \(c_s\) for all steps. It can be shown that the maximum likelihood estimate of \(c_s\) is the average of the embedding vectors of all the words.
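A brief sketch of why this holds (assuming, as argued later, that the normalizing constant \(Z_c\) is roughly the same for all \(c\)): maximizing the log-likelihood over unit-norm \(c\) reduces to maximizing a sum of inner products,

\[\hat{c}_s=\arg\max\limits_{\|c\|=1}\sum\limits_{w\in s}\log\Pr[w\mid c]=\arg\max\limits_{\|c\|=1}\sum\limits_{w\in s}\langle c, v_w\rangle \;\propto\; \sum\limits_{w\in s} v_w,\]

which is, up to scaling, the average of the word vectors.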

This paper improves the model by adding two smoothing terms, motivated by the following observations:

    • Some words appear out of context and may be emitted even though they are unrelated to the discourse vector.
    • Some frequent words (such as common stop words) appear regardless of the discourse vector.

To account for these two points, two smoothing terms are introduced. First, an additive term \(\alpha p(w)\) is added to the log-linear model, where \(p(w)\) is the probability of word \(w\) in the whole corpus (its unigram frequency) and \(\alpha\) is a hyper-parameter. In this way, even if a word's inner product with \(c_s\) is small, the word still has some probability of appearing.

Second, a correction term is introduced: the common discourse vector \(c_0\in\mathbb{R}^d\), which captures the most common meaning shared across sentences. It can be thought of as the most frequent component of a sentence and is often related to syntax. The paper argues that for a word whose vector has a larger component along the \(c_0\) direction (i.e. a longer projection onto it), this correction increases the probability of the word appearing.

After these corrections, the probability that word \(w\) appears in sentence \(s\), given the discourse vector \(c_s\), is:

\[\Pr(\text{$w$ emitted in sentence $s$}\mid c_s) \propto \alpha p(w) + (1-\alpha)\frac{\exp(\langle \tilde{c}_s, v_w\rangle)}{Z_{\tilde{c}_s}}\]

where \(\tilde{c}_s=\beta c_0+(1-\beta)c_s\) with \(c_0\perp c_s\), \(\alpha\) and \(\beta\) are hyper-parameters, and \(Z_{\tilde{c}_s}=\sum\limits_{w\in V}\exp(\langle \tilde{c}_s, v_w\rangle)\) is the normalizing constant. The formula shows that a word \(w\) that has no relation to \(c_s\) can still appear in a sentence, for two reasons:

    • the \(\alpha p(w)\) term;
    • its correlation with the common discourse vector \(c_0\).
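As a concrete reading of the formula, a small illustrative snippet that evaluates the smoothed emission probability (the dictionaries `word_vecs` and `word_freq` are assumptions for illustration):

```python
import numpy as np

def smoothed_word_prob(word, c_tilde, word_vecs, word_freq, alpha=0.01):
    """Pr[w emitted in s | c_s] ∝ alpha*p(w) + (1-alpha)*exp(<c_tilde, v_w>)/Z."""
    # Normalizing constant Z over the whole vocabulary.
    Z = sum(np.exp(np.dot(c_tilde, v)) for v in word_vecs.values())
    p_w = word_freq[word]          # unigram probability p(w)
    v_w = word_vecs[word]
    return alpha * p_w + (1 - alpha) * np.exp(np.dot(c_tilde, v_w)) / Z
```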
3. Calculating sentence vectors

The sentence vector is the \(c_s\) in the model above, and it is estimated by maximum likelihood. First, assume that the word vectors \(v_w\) are roughly uniformly distributed over the vector space, so that the normalizing term \(Z_{\tilde{c}_s}\) is roughly the same for all sentences; that is, for any \(\tilde{c}_s\) the value of \(Z\) is the same constant. Under this assumption, the likelihood is:

\[p[s\mid c_s]=\prod\limits_{w\in s}p(w\mid c_s)=\prod\limits_{w\in s}\Big[\alpha p(w) + (1-\alpha)\frac{\exp(\langle \tilde{c}_s, v_w\rangle)}{Z}\Big]\]

Taking the logarithm, the contribution of a single word is written as

\[f_w(\tilde{c}_s)=\log\Big[\alpha p(w)+(1-\alpha)\frac{\exp(\langle \tilde{c}_s, v_w\rangle)}{Z}\Big]\]

We then maximize this objective (see the paper for the detailed derivation); the final goal is:

\[\arg\max\limits_{c_s}\sum\limits_{w\in s}f_w(\tilde{c}_s)\]
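The key intermediate step, following the paper's argument: by a first-order Taylor expansion of \(f_w\) around \(\tilde{c}_s=0\),

\[f_w(\tilde{c}_s)\approx f_w(0)+\nabla f_w(0)^{\top}\tilde{c}_s=\text{constant}+\frac{a}{p(w)+a}\langle v_w,\tilde{c}_s\rangle,\qquad a=\frac{1-\alpha}{\alpha Z},\]

so the objective becomes linear in \(\tilde{c}_s\), and its maximizer on the unit sphere is proportional to the weighted sum of the word vectors.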

This gives:

\[\tilde{c}_s\propto \sum\limits_{w\in s}\frac{a}{p(w)+a}\,v_w,\qquad a=\frac{1-\alpha}{\alpha Z}\]

From this we can see:

    • The optimal solution is a weighted average of all the word vectors in the sentence.
    • For a word \(w\) with higher frequency \(p(w)\), the weight is smaller, so the method effectively down-samples frequent words.

Finally, to obtain the final sentence vector \(c_s\), we need to estimate \(c_0\). This is done by computing the first principal component (in the PCA sense) of the vectors \(\tilde{c}_s\) of all sentences and using it as \(c_0\). The final sentence vector is \(\tilde{c}_s\) minus its projection onto this principal component \(c_0\).

4. Algorithm Summary

The entire algorithm can be summarized as: compute the SIF-weighted average of the word vectors of each sentence, then remove from every sentence vector its projection onto the first principal component of all sentence vectors.
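A minimal Python sketch of the whole procedure (the function and argument names are illustrative and this is not the authors' released code; it assumes pre-trained word vectors and estimated unigram probabilities are available):

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    """SIF sentence embeddings as described above.

    sentences : list of tokenized sentences (each a list of words)
    word_vecs : dict mapping word -> pre-trained embedding vector
    word_freq : dict mapping word -> unigram probability p(w)
    a         : weight parameter (the paper reports values around 1e-4 to 1e-3 work well)
    """
    # Step 1: SIF-weighted average of word vectors, weight = a / (a + p(w)).
    emb = np.array([
        np.mean([a / (a + word_freq[w]) * word_vecs[w] for w in s], axis=0)
        for s in sentences
    ])
    # Step 2: estimate the common discourse vector c_0 as the first
    # principal component (first right singular vector) of the sentence matrix.
    u = np.linalg.svd(emb, full_matrices=False)[2][0]
    # Step 3: remove the projection onto c_0 from every sentence vector.
    return emb - np.outer(emb @ u, u)
```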
