Constructing a Chinese Probabilistic Language Model Based on a Parallel Neural Network and the Fudan Chinese Corpus


This paper aims to construct a probabilistic language model of Chinese based on the Fudan Chinese corpus and a neural network model.

The goal of a statistical language model is to estimate the joint distribution of the words in a sentence, that is, the probability that a given word sequence occurs. A well-trained statistical language model can be used in speech recognition, Chinese input methods, machine translation and other fields. Before neural network methods were introduced, a very successful way of building a language model was the n-gram model: an n-gram model estimates the conditional probability of a word given the preceding words, and obtains its generalization ability by piecing together a series of short overlapping phrases. However, the n-gram model also has several unsatisfactory aspects. First, n cannot be too large, otherwise the data become sparse; second, it does not take into account the syntactic and semantic similarity between words, which also limits its generalization ability.
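For reference, a language model factors the probability of a word sequence with the chain rule, and an n-gram model approximates each conditional probability using only the previous n-1 words:

    P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \dots, w_{t-1})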

The algorithm used in this paper is based on two papers: "A Neural Probabilistic Language Model" by Bengio et al. (2003) and "Adam: A Method for Stochastic Optimization" by D. P. Kingma and J. Lei Ba (2015). The neural network model learns two things at the same time. First, it learns a distributed representation of each word: multiple units are used to represent a word, i.e. each word is represented by an n-dimensional real-valued vector, whereas the traditional representation treats a word as a feature of a document, with the feature equal to 1 if the document contains the word and 0 otherwise; the traditional representation can only express whether a word occurs, not the distance between words. Second, it learns the joint distribution of word sequences. The reason this model generalizes is that similar words (both syntactically and semantically) have similar distributed representations, so they produce similar outputs when used as input to the model.
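For reference, the architecture of Bengio et al. (2003) maps each of the previous n-1 words to its word vector through a shared lookup table C, concatenates these vectors into x, and scores every word in the vocabulary (the layer sizes used in this paper are given in the experiments section):

    x = \big(C(w_{t-n+1}), \dots, C(w_{t-1})\big)
    y = b + W x + U \tanh(d + H x)
    P(w_t = i \mid w_{t-n+1}, \dots, w_{t-1}) = \frac{e^{y_i}}{\sum_j e^{y_j}}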

This neural network model has a very large number of parameters, so parallel computing is implemented based on MPI (Message Passing Interface, an interface for distributed parallel computing) and a Linux cluster. The Adam algorithm is compared with the optimization algorithm used by Bengio in order to test Adam's performance on this data set.
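The article does not reproduce its MPI code; the following is only a minimal sketch, assuming the mpi4py binding, of how per-worker gradients could be averaged after each mini-batch in a data-parallel setting (the gradient array here is a random stand-in):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Stand-in for the gradient computed on this worker's shard of the data.
    local_grad = np.random.randn(1000)

    # Sum the gradients of all workers, then average, so that every node
    # ends up with the same global gradient.
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    global_grad /= size

    # Each worker then applies the same (e.g. Adam) update to its local copy
    # of the parameters, keeping the replicas synchronized.

Such a script would be launched with, e.g., mpiexec -n 4 python train.py.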

1 Data Set Preprocessing

1.1 Data Set Introduction

This data set is the Chinese corpus of Fudan University. The corpus is divided into two parts, a training set and a test set, with roughly equal numbers of documents in each. After merging the training set and the test set, some basic statistics of the corpus are given below.

Total number of categories: 20
Total number of documents: 19,637
Category name (category code): number of documents
Agriculture (C32): 2,043 articles
Art (C3): 1,482 articles
Communication (C17): 52 articles
Computer (C19): 2,715 articles
Economy (C34): 3,201 articles
Education (C5): 120 articles
Electronics (C16): 55 articles
Energy (C15): 65 articles
Environment (C31): 2,435 articles
History (C7): 934 articles
Law (C35): 103 articles
Literature (C4): 67 articles
Medical (C36): 104 articles
Military (C37): 150 articles
Mine (C23): 67 articles
Philosophy (C6): 89 articles
Politics (C38): 2,050 articles
Space (C11): 1,282 articles
Sports (C39): 2,507 articles
Transport (C29): 116 articles


1.2 Data Set Preprocessing

First of all, we should thank the authors of the Fudan corpus for their hard work, but we also need to point out the flaws in the data set:

(1) Most files use GBK encoding rather than UTF-8, but some files are not GBK-encoded (which complicates the encoding conversion).
(2) The corpus contains a training set and a test set, each with more than 9,000 documents, but many of the documents are duplicated.
(3) Some of the files in the C35-Law category of the training set and the test set have already been word-segmented (the segmentation quality is poor, so they cannot be used directly).
(4) Some articles contain only a header but no actual content.

Therefore, this article preprocesses the data set with the following steps:

(1) Use the Finddupfile tool to find duplicate files in the train and answer folders and delete them; after deduplication the total number of files is 14,894.

(2) Merge the contents of the answer folder and the train folder into the same directory. After deduplication, this article does not keep the original partitioning (training set : test set = 1:1); instead, all files are merged, each file is given a unique identifier, and within each category 70% of the files are used as the training set, 10% as the validation set, and 20% as the test set.

(3) Use Python to convert files from GBK encoding to UTF-8 encoding.

(4) Remove spaces and newline characters from all documents. Sentences in the original documents are often broken up by line breaks and spaces, and some have already been (poorly) word-segmented, which makes sentences unrecognizable, so these characters are removed before segmentation. Removing spaces also makes it easier for regular expressions to match dates exactly.

The preprocessor is shown in the source file pre-proc.py; a simplified sketch of steps (3) and (4) is given below.
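The actual pre-proc.py is not reproduced in this article; the following is only a minimal sketch of the encoding conversion and whitespace removal, with hypothetical file paths:

    import codecs

    src, dst = "train/C32/0001.txt", "clean/C32/0001.txt"   # hypothetical paths

    # Step (3): read the file as GBK, ignoring occasional malformed byte sequences.
    with codecs.open(src, "r", encoding="gbk", errors="ignore") as f:
        text = f.read()

    # Step (4): remove spaces (including full-width spaces), tabs and line breaks
    # so that sentences become contiguous.
    for ch in (" ", "\u3000", "\t", "\r", "\n"):
        text = text.replace(ch, "")

    # Write the cleaned text back out as UTF-8.
    with codecs.open(dst, "w", encoding="utf-8") as f:
        f.write(text)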

PS: the usual pattern for processing text with regular expressions in Python is: regex = re.compile(pattern), then either re.sub(regex, replacement, subject) or regex.sub(replacement, subject) (and analogously for the other operations).

Here re is the module, regex is the compiled regular-expression object, and subject is the target string; sub (or subn) performs substitution, split splits the string, and match performs matching.

If no regular expression is needed, only the corresponding string methods are available, e.g. subject.replace(oldstring, newstring).
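As a concrete illustration of this pattern (the date expression below is only an example, not necessarily the exact one used in pre-proc.py):

    import re

    # Illustrative pattern for dates such as 1998年12月3日.
    date_re = re.compile(r"\d{2,4}年\d{1,2}月(\d{1,2}日)?")

    subject = "本文发表于1998年12月3日。联系方式见下文。"
    subject = date_re.sub("__date__", subject)     # regex.sub(replacement, subject)
    sentences = re.split(r"[。！？]", subject)       # re.split(pattern, subject)

    print(subject)
    print(sentences)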

2 Database Design

This article stores the data in MySQL. The training of the model is based on the view that words appearing in one sentence also influence the words that appear in the next sentence, so this article does not split a document into sentences; instead, the words are stored directly in the order in which they appear in the whole article. The schema for storing the text information is designed as follows:

Doc (name, class): represents a document entity. In this article the class field is mainly used for grouping; within each group, part of the documents is selected as the training set and the rest as the validation set or test set.

Word (term, vector): represents a word entity. The vector field stores the word vector, with the elements separated by the "_" delimiter; in the experiments the vector dimension takes the values 30, 60 and 100.

Doc_term (doc, term, location): records the position of a word in a document. This relation table could also be used to build a search engine.

The schema for storing the neural network parameters is designed as follows:

W1 (start, end, weight)

W2 (start, end, weight)

W3 (start, end, weight)

B1 (end, weight)

B2 (end, weight)

The specific meaning of each table is discussed in detail when the neural network is constructed.
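A minimal sketch of creating the text-storage tables, assuming the pymysql driver and hypothetical column types and sizes (the article does not specify them):

    import pymysql

    conn = pymysql.connect(host="localhost", user="root", password="***",
                           database="fudan_lm", charset="utf8mb4")
    with conn.cursor() as cur:
        cur.execute("""CREATE TABLE IF NOT EXISTS Doc (
                           name  VARCHAR(255) PRIMARY KEY,
                           class VARCHAR(16))""")
        # vector holds the word-vector elements joined by "_".
        cur.execute("""CREATE TABLE IF NOT EXISTS Word (
                           term   VARCHAR(64) PRIMARY KEY,
                           vector TEXT)""")
        cur.execute("""CREATE TABLE IF NOT EXISTS Doc_term (
                           doc      VARCHAR(255),
                           term     VARCHAR(64),
                           location INT)""")
    conn.commit()
    conn.close()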

3 Word Segmentation

There are no separators between words in Chinese text, so special methods are needed to segment it, and the way the text is segmented affects the size of the vocabulary |V|. The segmentation task splits each document into a sequence of words; the vocabulary is also normalized as follows: punctuation is removed, numbers are mapped to the special token *__num__*, particularly rare words are mapped to *__rare__*, URLs are mapped to *__url__*, and dates are mapped to *__date__*.

This paper uses the open-source jieba ("stutter") project to perform Chinese word segmentation (project address: https://github.com/fxsjy/jieba/), and builds further improvements on top of it, for example exact matching of dates and URLs based on regular expressions; a minimal segmentation example is given below.
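The following is only a minimal sketch of segmenting one sentence with jieba and applying part of the token normalization described above (the normalization rules shown are illustrative):

    import jieba

    sentence = "复旦大学语料库共包含19637篇文档。"

    # jieba.lcut returns the list of segmented words.
    words = jieba.lcut(sentence)

    # Map purely numeric tokens to the special token and drop punctuation
    # (illustrative normalization, not the paper's exact rules).
    words = ["__num__" if w.isdigit() else w for w in words]
    words = [w for w in words if w not in "，。！？；：、"]

    print(words)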

3.1 Exact-Match Extraction of Phrases

Remove extraneous characters with regular expressions, and exactly match and extract dates, URLs, e-mail addresses, mathematical expressions, and so on.

3.2 Jieba Word Segmentation

Jieba builds a directed acyclic graph of candidate segmentations from a trie-based dictionary, uses dynamic programming to find the maximum-probability path through this graph, and, for words not in the dictionary, uses an HMM model with the Viterbi algorithm based on the word-forming ability of Chinese characters.

3.3 Segmentation Results

After segmentation, the number of distinct words is , and the vocabulary size is |V| =

4 Building the Neural Network

4.1 Parallel Neural Network Model

A regularization term is added to the objective, and the parameters are updated with the Adam algorithm.
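For reference, the Adam update of Kingma & Ba (2015), for gradient g_t and step size \alpha (standard defaults \beta_1 = 0.9, \beta_2 = 0.999, \epsilon = 10^{-8}):

    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
    v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
    \hat{m}_t = m_t / (1 - \beta_1^t), \qquad \hat{v}_t = v_t / (1 - \beta_2^t)
    \theta_t = \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)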

4.2 Pre-training Word Vectors with word2vec to Shorten Convergence Time

Explore the effect of the number of parameters.

5 Experiments

Learning rate 0.001; m takes the values 3, 5, 10 and 20; word vectors are initialized with word2vec.

When n is 3, the prediction behavior of the model can be visualized directly with a plane (2-D) plot.

6 N-gram

7 Source Code
