Neural Probabilistic Language Model (Neural Networks)

A Neural Probabilistic Language Model

Original paper: http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

Authors:

Yoshua Bengio
Rejean Ducharme
Pascal Vincent
Christian Jauvin

Abstract

A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short, overlapping word sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation (word feature vector) for words, which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously a feature vector for each word and the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets a higher probability if it is made of words that are similar (in meaning) to words forming a sentence we have already seen. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that it allows taking advantage of longer contexts.
Keywords: statistical language modeling, artificial neural networks, word vectors (distributed representation), curse of dimensionality

1. Introduction

A fundamental problem that makes language modeling and other learning problems difficult is the curse of dimensionality. It is especially acute when one wants to model the joint distribution of many discrete random variables (such as the words in a sentence, or the discrete attributes in a data-mining task). For example, if one wants to model the joint distribution of 10 consecutive words in natural language with a vocabulary V of size 100,000, there are potentially 100000^10 - 1 = 10^50 - 1 free parameters. When modeling continuous variables, we obtain generalization more easily (e.g., with smooth functions such as multilayer neural networks or Gaussian mixture models), because the function to be learned can be expected to have some local smoothness properties. For discrete spaces, the generalization structure is not as obvious: any change of these discrete variables may have a drastic impact on the value of the function to be estimated, and when the number of values that each discrete variable can take is large, most observed objects are almost maximally far from each other in Hamming distance.
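For concreteness, here is a quick back-of-the-envelope check of that parameter count in Python (the variable names are ours, chosen purely for illustration):

```python
import math

# Rough parameter count for a full table of the joint distribution over
# 10 consecutive words with a 100,000-word vocabulary (the figures used above).
vocab_size = 100_000
sequence_length = 10

# One probability per possible word sequence, minus one because the
# probabilities must sum to 1.
free_parameters = vocab_size ** sequence_length - 1
print(math.log10(free_parameters + 1))  # 50.0, i.e. about 10^50 free parameters
```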

A useful way to visualize how different learning algorithms generalize, inspired by the view of non-parametric density estimation, is to think of how the probability mass that is initially concentrated on the training points (e.g., training sentences) is distributed in a larger volume, usually in some form of neighborhood around the training points. In high dimensions, it is crucial to distribute probability mass where it matters rather than uniformly in all directions around each training point. We will show in this paper that the approach proposed here generalizes in a way that is fundamentally different from the way previous state-of-the-art statistical language modeling approaches generalize.
A statistical language model can be represented by the conditional probability of the next word given all the previous ones:

P(w_1^T) = ∏_{t=1}^{T} P(w_t | w_1^{t-1}),

where w_t is the t-th word, and the subsequence w_i^j = (w_i, w_{i+1}, ..., w_{j-1}, w_j). Such statistical language models have already been found useful in many technological applications involving natural language, such as speech recognition, language translation, and information retrieval. Improvements in statistical language models could thus have a significant impact on these applications.
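To make the factorization concrete, here is a minimal sketch in Python. The interface `next_word_prob(history, word)` is a hypothetical stand-in for any model of the conditional probability of the next word given its history; it is not something defined in the paper.

```python
import math
from typing import Callable, Sequence

def sentence_log_prob(
    words: Sequence[str],
    next_word_prob: Callable[[Sequence[str], str], float],
) -> float:
    """Chain rule: log P(w_1^T) = sum_t log P(w_t | w_1^{t-1})."""
    log_p = 0.0
    for t, word in enumerate(words):
        history = words[:t]  # w_1 .. w_{t-1} (empty for the first word)
        log_p += math.log(next_word_prob(history, word))
    return log_p
```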

When building statistical models of natural language, one way to reduce the difficulty of this modeling problem is to take advantage of word order, and of the fact that temporally closer words in a word sequence are statistically more dependent. Thus, n-gram models construct tables of conditional probabilities for the next word, conditioned on the previous n - 1 words:

P(w_t | w_1^{t-1}) ≈ P(w_t | w_{t-n+1}^{t-1}).

We only consider those combinations of successive words that actually occur in the training corpus, or that occur frequently enough. What happens when a new combination of n words appears that was not seen in the training corpus? We do not want to assign such cases zero probability, because they can occur, and they will occur even more frequently for larger context sizes. A simple answer is to look at the probability predicted using a smaller context size, as done in back-off trigram models (Katz, 1987) or in smoothed (interpolated) trigram models (Jelinek and Mercer, 1980). So, how do such models generalize from word sequences seen in the training corpus to new word sequences? A way to understand this is to think about the generative model corresponding to these interpolated or back-off n-gram models. Essentially, a new sequence of words is generated by gluing together very short and overlapping pieces of length 1, 2, or up to n words that have been seen frequently in the training data. The rules for obtaining the probability of the next piece are implicit in the details of the back-off or interpolated n-gram algorithm. Typically researchers have used n = 3, i.e. trigrams, and obtained state-of-the-art results, but see Goodman (2001) for how combining many tricks can yield substantial improvements. Obviously, there is much more information in the sequence preceding the word to predict than just the identity of the previous couple of words. There are at least two characteristics of this approach that beg to be improved upon, and that we will focus on in this paper: first, it does not take into account contexts farther than 1 or 2 words; second, it does not take into account the similarity between words. For example, having seen the sentence "The cat is walking in the bedroom" in the training corpus, we would feel that "A dog was running in a room" is also plausible, because "dog" and "cat" (and likewise "the" and "a", "room" and "bedroom") have similar semantic and grammatical roles.
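As an illustration of the smoothed n-gram baselines mentioned above, here is a minimal sketch of a Jelinek-Mercer style interpolated trigram model. It is not the exact estimator used in the paper's baselines; the interpolation weights are fixed placeholders rather than being tuned on held-out data, and start-of-sentence handling is deliberately crude.

```python
from collections import Counter

class InterpolatedTrigram:
    """P(w | w2, w1) as a fixed mixture of unigram, bigram and trigram estimates."""

    def __init__(self, sentences, l_uni=0.1, l_bi=0.3, l_tri=0.6):
        self.l_uni, self.l_bi, self.l_tri = l_uni, l_bi, l_tri
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        self.total = 0
        for sentence in sentences:
            words = ["<s>", "<s>"] + sentence + ["</s>"]
            for i in range(2, len(words)):
                self.uni[words[i]] += 1
                self.bi[(words[i - 1], words[i])] += 1
                self.tri[(words[i - 2], words[i - 1], words[i])] += 1
                self.total += 1

    def prob(self, w2, w1, w):
        p_uni = self.uni[w] / self.total if self.total else 0.0
        p_bi = self.bi[(w1, w)] / self.uni[w1] if self.uni[w1] else 0.0
        p_tri = self.tri[(w2, w1, w)] / self.bi[(w2, w1)] if self.bi[(w2, w1)] else 0.0
        return self.l_uni * p_uni + self.l_bi * p_bi + self.l_tri * p_tri

lm = InterpolatedTrigram([["the", "cat", "is", "walking", "in", "the", "bedroom"]])
print(lm.prob("cat", "is", "walking"))  # seen trigram: high probability
print(lm.prob("dog", "is", "walking"))  # unseen trigram: shorter contexts still give it mass
```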

Many approaches have been proposed to address these two problems, and we will briefly explain in Section 1.2 the relation between the approach proposed here and earlier approaches. We first discuss the basic idea of the proposed method. A more formal presentation follows in Section 2, using an implementation based on a multi-layer neural network with shared parameters. Another contribution of this paper concerns the challenge of training such a very large neural network (with millions of parameters) on a very large data set (with millions or tens of millions of examples). Finally, an important contribution of this paper is to show that training such a large-scale model is expensive but feasible, scales to larger contexts, and yields good comparative results (Section 4).
In much of this paper we use matrix notation: lowercase v denotes a column vector, v' its transpose, A_j the j-th row of a matrix A, and x.y = x'y denotes the dot product.

1.1 Fighting the curse of dimensionality with distributed representations

In a nutshell, the idea of the proposed approach can be summarized as follows: (1) associate each word in the vocabulary with a distributed word feature vector (a real-valued vector in R^m), (2) express the joint probability function of word sequences in terms of the feature vectors of the words in the sequence, and (3) learn simultaneously the word feature vectors and the parameters of that probability function.

The feature vector represents different aspects of a word: each word is associated with a point in a vector space. The number of features (e.g., m = 30, 60 or 100 in the experiments) is much smaller than the size of the vocabulary (e.g., 17,000). The probability function is expressed as a product of conditional probabilities of the next word given the previous ones (e.g., using a multi-layer neural network to predict the next word given the previous ones, in the experiments). This function has parameters that can be iteratively tuned to maximize the log-likelihood of the training data or a regularized criterion, e.g. by adding a weight decay penalty. The feature vectors associated with each word are learned, but they could also be initialized using prior knowledge of semantic features.
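The paragraph above can be made concrete with a small numpy sketch of this kind of architecture: a shared matrix of word feature vectors, a tanh hidden layer over the concatenated context features, and a softmax over the vocabulary. The layer sizes, the random initialization, and the absence of direct input-to-output connections are our own illustrative choices, not the exact architecture of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, h, n = 17_000, 30, 50, 4     # vocabulary size, feature dim, hidden units, n-gram order

C = rng.normal(scale=0.01, size=(V, m))            # word feature vectors (row j = word j)
H = rng.normal(scale=0.01, size=(h, (n - 1) * m))  # hidden-layer weights
d = np.zeros(h)                                    # hidden-layer biases
U = rng.normal(scale=0.01, size=(V, h))            # hidden-to-output weights
b = np.zeros(V)                                    # output biases

def next_word_distribution(context_ids):
    """P(w_t | previous n-1 words), given the indices of the context words."""
    x = C[context_ids].reshape(-1)     # look up and concatenate the context feature vectors
    a = np.tanh(d + H @ x)             # hidden layer
    y = b + U @ a                      # unnormalized scores, one per vocabulary word
    e = np.exp(y - y.max())            # numerically stable softmax
    return e / e.sum()

p = next_word_distribution([12, 7, 3])    # three arbitrary context word indices
print(p.shape, round(float(p.sum()), 6))  # (17000,) 1.0
```

In training, all of these parameters, including the feature matrix C, would be adjusted jointly by gradient ascent on the (regularized) log-likelihood, which is what learning the word feature vectors and the probability function simultaneously means in practice.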

Why does this work? In the previous example, if we know that dog and cat play similar roles (semantically and syntactically) in a sentence, and likewise for (the, a), (bedroom, room), (is, was), (running, walking), we can naturally generalize from
The cat is walking in the bedroom
to (i.e., transfer probability mass to):
A dog was running in a room
and likewise to
The cat is running in a room
A dog is walking in a bedroom
The dog was walking in the room
...
and many other combinations. In the proposed model, it will generalize in this way because similar words are expected to have similar feature vectors, and because the probability function is a smooth function of these feature values, so a small change in the features induces only a small change in the probability. Therefore, the presence of only one of the above sentences in the training data will increase the probability not only of that sentence, but also of its combinatorial number of neighbors in sentence space (as represented by sequences of feature vectors).
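Continuing the numpy sketch from Section 1.1 (the word indices chosen for "cat" and "dog" below are purely hypothetical), the smoothness argument can be seen directly: if two words are given nearby feature vectors, swapping one for the other in the context barely changes the predicted distribution.

```python
# Requires C, m, rng and next_word_distribution from the sketch in Section 1.1.
cat_id, dog_id = 100, 101                            # illustrative vocabulary indices
C[dog_id] = C[cat_id] + 0.001 * rng.normal(size=m)   # force the two feature vectors close

p_cat = next_word_distribution([12, cat_id, 3])
p_dog = next_word_distribution([12, dog_id, 3])
print(float(np.abs(p_cat - p_dog).max()))  # tiny: nearby feature vectors, nearby probabilities
```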
1.2 Relationship to previous work

The idea of using neural networks to model high-dimensional discrete distributions has already been found useful for learning the joint probability of Z_1, ..., Z_n, a set of random variables where each may be of a different nature (Bengio and Bengio, 2000a,b). In that model, the joint probability is decomposed as a product of conditional probabilities:

P(Z_1 = z_1, ..., Z_n = z_n) = ∏_i P(Z_i = z_i | g_i(z_{i-1}, z_{i-2}, ..., z_1)),

where g() is a function represented by a neural network with a special left-to-right architecture, whose i-th output block g_i() computes parameters for expressing the conditional distribution of Z_i given the value of the previous Z's, in some arbitrary order. Experiments on four UCI data sets showed this approach to work comparatively well (Bengio and Bengio, 2000a,b). Here we must deal with data of variable length, such as sentences, so the above approach must be adapted. Another important difference is that here all the Z_i (the word at position i) refer to the same type of object (a word). The model proposed here therefore introduces a sharing of parameters across time: the same g_i is used across words at different positions in the sentence. The same idea, together with the idea of a distributed representation for symbolic data advocated in early connectionist work (Hinton, 1986; Elman, 1990), is applied here in a large-scale setting. More recently, Hinton's approach was improved and successfully demonstrated on learning several symbolic relations (Paccanaro and Hinton, 2000). The idea of using neural networks for language modeling is not new either (e.g., Miikkulainen and Dyer, 1991). In contrast, here we push this idea to a larger scale and concentrate on learning a statistical model of the distribution of word sequences, rather than learning the role of words in a sentence. The approach proposed here is also related to earlier work on character-based text compression using neural networks to predict the probability of the next character (Schmidhuber, 1996). Xu and Rudnicky (2000) have also independently proposed the idea of using neural networks for language modeling, although their experiments use networks without hidden units and a single input word, which essentially limits the model to capturing unigram and bigram statistics.

The idea of discovering similarities between words, in order to generalize from training sequences to new sequences, is not new either. For example, it has been exploited in approaches based on learning word clusters or classes (Brown et al., 1992; Pereira et al., 1993; Niesler et al., 1998; Baker and McCallum, 1998): each word is associated deterministically or probabilistically with a discrete class, and words in the same class are similar in some respect. In the model proposed here, instead of characterizing the similarity between words with discrete random or deterministic variables (corresponding to a soft or hard partition into word classes), we use a continuous real-valued vector, i.e. a word feature vector, to represent similarity between words. The experimental comparisons in this paper include class-based n-gram models (Brown et al., 1992; Ney and Kneser, 1993; Niesler et al., 1998).

The idea of using a vector space to represent words has been well exploited in the area of information retrieval (see, for example, work by Schutze, 1993), where feature vectors for words are learned on the basis of their probability of co-occurring in the same documents (Latent Semantic Indexing, see Deerwester et al., 1990). An important difference is that here we look for a representation of words that helps to compactly represent the probability distribution of word sequences from natural language text. Experiments suggest that learning jointly the representation (word features) and the model is very useful. We tried (unsuccessfully) using as fixed word features for each word w the first principal components of the co-occurrence frequencies of w with the words occurring around it, similarly to how LSI is used for documents in information retrieval. However, Bellegarda (1997), in the context of n-gram based statistical language models, did successfully use LSI to dynamically identify the topic of discourse.
