A Summary of Chinese Word Vector Papers (II)


Recently I have been working on Chinese word vectors and have read a number of related papers. This series of articles briefly summarizes recent progress on Chinese word vectors and the model structures involved; it will probably run to 3-4 posts, each covering 2-3 papers. See also: A Summary of Chinese Word Vector Papers (I).

First: Improve Chinese Word Embeddings by Exploiting Internal Structure

This paper was published at the NAACL-HLT 2016 conference (Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies); the first author is Jian Xu of the University of Science and Technology of China.


The two papers covered previously showed that Chinese characters carry rich semantic information, which plays an important role in Chinese word vector representations; this paper builds on that line of work.
Specifically, it builds on the earlier CWE model. Although CWE takes the internal composition of a word into account and enriches its semantic representation, it ignores a problem: CWE treats the contribution of every constituent character to the word as identical, whereas this paper argues that their contributions should differ. To capture this, the paper uses an external language (English translations) to obtain semantic information and computes the similarity between a word and each of its characters to express the difference in their contributions, extending the earlier work.
The paper proposes a method of jointly learning word and character embeddings that can disambiguate the senses of Chinese characters and distinguish words whose characters carry no meaning; experiments on word similarity and text classification verify the effectiveness of the method.

Methodology and Model

The method proposed in this paper can be divided into the following stages: (1) obtain translations of Chinese words and characters; (2) perform Chinese character sense disambiguation; (3) learn word and character embeddings with the model.

Obtain translations of Chinese words and characters

The Chinese training corpus is first segmented with a word segmentation tool (JIEBA, ZPAR, or THULAC can be used), and the segmented data is then POS-tagged (part-of-speech tagging). The purpose of POS tagging is to identify all entities (named-entity words), because the characters of an entity word carry no compositional semantic information; these words are marked as non-compositional, meaning their internal characters are meaningless.
A second filter uses character frequency: count how often each character occurs inside the words of the corpus. Words containing characters with a low in-word frequency are identified as single-morpheme multi-character words (such as transliterations or words like 琵琶 "pipa", whose characters are hard to reuse in other words) and are likewise marked as non-compositional.
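As a rough sketch of these two filters (the POS tag set, frequency threshold, and function name are illustrative assumptions, not values taken from the paper):

```python
from collections import Counter

def find_non_compositional(words, pos_tags,
                           entity_tags=("nr", "ns", "nt"),
                           min_char_freq=3):
    """Mark words whose internal characters carry no usable semantics.

    Filter 1: entity words, identified by POS tag (e.g. person/place names).
    Filter 2: words containing characters that rarely occur inside other
    words; a low in-word frequency suggests a single-morpheme word such as
    a transliteration, whose characters are not reused elsewhere.
    """
    # Count how often each character occurs inside any word of the corpus.
    char_freq = Counter(ch for w in words for ch in w)

    non_compositional = set()
    for w, tag in zip(words, pos_tags):
        if tag in entity_tags:                                # filter 1
            non_compositional.add(w)
        elif any(char_freq[ch] < min_char_freq for ch in w):  # filter 2
            non_compositional.add(w)
    return non_compositional
```

For example, in a small corpus containing 音乐, 音响, 乐器, 琵琶, and 北京, the characters 琵 and 琶 occur only once each, so 琵琶 is flagged by the frequency filter, while 北京 would be flagged by its place-name POS tag.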
The next step is somewhat unexpected: the Chinese words, excluding the non-compositional ones, are translated into English in preparation for the next stage, Chinese character sense disambiguation.

Perform Chinese character sense disambiguation

The work here mainly disambiguates polysemous Chinese characters. An English corpus is trained with the CBOW model to obtain English word vectors, and characters whose translations differ little are merged.
In Chinese, the same word or character may be used as different parts of speech while expressing the same semantic information; such uses are merged into one, sharing a single semantic representation. For example, two entries may differ only in part of speech while their semantics are almost identical.

Ambiguity is eliminated by calculating similarity in the English embedding space. In the paper's formula, c_i and c_j denote characters of a word, Trans(c_i) denotes the set of English translations of character c_i, stop-words(en) denotes the English stop words (removed from the translation sets), and x denotes a translation in Trans. Concretely, for the word 音乐 ("music"), c_1 denotes 音, c_2 denotes 乐, Trans(c_2) is the set of English translations of 乐, and x_3 is "pleasure" or "enjoyment".

Based on this formula, the similarity between a word and each of its characters can be calculated; if the value exceeds a certain threshold, the two are merged into the same semantic representation. Because a word may have multiple English translations, the paper considers two simplifications: take the average of the English vectors of all translations, or take the maximum similarity over all candidate translations; experiments show the latter works better. Using this similarity, character ambiguity can be resolved simply.
If max_k Sim(x_t, c_k) > λ (a threshold), the word is defined as compositional and is represented as (x_t, {Sim(x_t, c_1), ..., Sim(x_t, c_N)}). For example, 音乐 is represented as ("music", {Sim("music", 音), Sim("music", 乐)}).
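A minimal sketch of this threshold test, assuming the similarity is the maximum cosine similarity between English embeddings of the word's translations and of the character's translations (the "max" strategy the paper found better); the toy vectors, names, and threshold are illustrative:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sim_word_char(word_trans, char_trans, en_vec):
    """Max cosine similarity between any English translation of the word
    and any English translation of the character."""
    return max(cosine(en_vec[x], en_vec[y])
               for x in word_trans for y in char_trans)

def is_compositional(word_trans, char_trans_list, en_vec, threshold=0.4):
    """Return (flag, per-character similarities): the word is treated as
    compositional if the best word-character similarity beats the threshold."""
    sims = [sim_word_char(word_trans, ct, en_vec) for ct in char_trans_list]
    return max(sims) > threshold, sims
```

With toy embeddings where "music" and "sound" point in nearby directions, a word translated as "music" whose first character translates to "sound" passes the test, while a character translating only to an orthogonal "pleasure" vector contributes a near-zero similarity.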

Learn word and character embeddings with the model: SCWE

The paper gives model diagrams for CWE and SCWE; combined with the stages described above, the intent of the paper should be easy to follow.

In SCWE, a word's vector is represented as the average of its word embedding and the similarity-weighted mean of its character embeddings: x̂_j = 1/2 (x_j + (1/N_j) Σ_k Sim(x_j, c_k) · c_k), where N_j is the number of characters in word j.
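A minimal numpy sketch of this composition (the function name and shapes are illustrative assumptions):

```python
import numpy as np

def scwe_word_vector(word_vec, char_vecs, sims):
    """SCWE-style word representation: average the word's own embedding
    with the similarity-weighted mean of its character embeddings.

    sims[k] is Sim(word, c_k), the word-character similarity obtained
    in the English translation space."""
    char_vecs = np.asarray(char_vecs, dtype=float)
    sims = np.asarray(sims, dtype=float)
    # (1/N) * sum_k sims[k] * char_vecs[k]
    weighted_mean = (sims[:, None] * char_vecs).mean(axis=0)
    return 0.5 * (np.asarray(word_vec, dtype=float) + weighted_mean)
```

A character judged more similar to the word (higher Sim) contributes proportionally more to the composed vector, which is exactly the differing-contribution idea that CWE lacks.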

On the basis of SCWE, the paper also proposes the SCWE+M model. It is almost identical to SCWE, except that, using the similarity sets recorded in the representation above, a polysemous character gets a different character embedding for each of its senses, and the word vector is composed from the sense-specific embeddings.

Experiment Results

The method's effectiveness is validated on word similarity and text classification. Word similarity uses the evaluation sets wordsim-240 and wordsim-296, while text classification uses the Fudan Corpus; the specific experimental results are reported in the paper.

Second: Multi-Granularity Chinese Word Embedding

This paper was published at the EMNLP 2016 conference (Empirical Methods in Natural Language Processing); the authors are from the National Engineering Laboratory for Information Content Security Technology, first author Rongchao Yin.


Compared with English and other Western languages, a Chinese word usually consists of several Chinese characters, and a character can in turn be decomposed into components; the radical (部首) is one such component, and its rich semantic information helps express the meaning of the word. Existing Chinese word vector models do not make full use of this feature. Based on this observation, the paper presents the multi-granularity embedding (MGE) model, whose core idea is to exploit the word-character-radical hierarchy, combining characters and radicals at a finer granularity to enhance the word vector representation. Its effectiveness is validated on word similarity and analogical reasoning tasks.


The purpose of MGE is to jointly learn word, character, and radical embeddings; the model structure is based on CBOW. In the model diagram, the blue part is the context words, the green part is the characters of the context words, and the yellow part is the radicals of the target word. In the figure's example, the given sequence is 回家, 吃饭, 会友 ("go home", "have a meal", "meet friends"), and the target word is 会友.

MGE's objective function follows CBOW: maximize the average log probability of each target word given the hidden representation of its context, L = (1/N) Σ_i log p(w_i | h_i).

h_i is the hidden-layer representation. Specifically, for each context word, the embeddings of all its characters are summed and averaged and then added to the word embedding; the results are averaged over all context words, completing the combination of word and character. For the target word, all radical embeddings are likewise summed and averaged; the word+character part and the radical part are then averaged again, which completes the hidden-layer representation h_i. The operator combining word and character could be addition or concatenation; this paper uses addition.
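A rough numpy sketch of this hidden-layer computation, assuming the addition operator for word+character (per the paper); variable names and shapes are illustrative:

```python
import numpy as np

def mge_hidden(context_word_vecs, context_char_vecs, target_radical_vecs):
    """MGE hidden layer h_i.

    For each context word: word embedding + mean of its character embeddings
    (the paper combines word and character by addition). Average these over
    all context words, average the target word's radical embeddings, then
    average the two parts."""
    combined = []
    for w_vec, chars in zip(context_word_vecs, context_char_vecs):
        chars = np.asarray(chars, dtype=float)
        combined.append(np.asarray(w_vec, dtype=float) + chars.mean(axis=0))
    word_char_part = np.mean(combined, axis=0)      # over context words
    radical_part = np.asarray(target_radical_vecs, dtype=float).mean(axis=0)
    return 0.5 * (word_char_part + radical_part)
```

The resulting h_i is then fed into the usual CBOW softmax (or negative-sampling) prediction of the target word.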

MGE shares a problem with the CWE model: polysemous characters, and words whose characters are meaningless (such as transliterations). Following CWE's approach, the paper proposes the MGE+P model, which, with the same purpose as CWE+P, adds positional information to character embeddings: Begin, Middle, End.
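The Begin/Middle/End idea can be sketched by keying character embeddings on (character, position) pairs, so that the same character receives distinct embeddings at the start, middle, or end of a word; this tagging helper is an illustrative assumption, not code from the paper:

```python
def position_tagged_chars(word):
    """Tag each character of a word with its position (B/M/E), so an
    embedding table keyed on (character, position) can give the same
    character different vectors depending on where it appears.
    A single-character word gets only a Begin tag."""
    n = len(word)
    tags = []
    for i, ch in enumerate(word):
        if i == 0:
            tags.append((ch, "B"))
        elif i == n - 1:
            tags.append((ch, "E"))
        else:
            tags.append((ch, "M"))
    return tags
```

The character-averaging step of the hidden layer then looks up (character, position) keys instead of bare characters, which partially separates the different senses a character takes in different word positions.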

Experiment Results

The model's effectiveness was verified on word similarity and analogical reasoning.
Word similarity again uses the evaluation sets wordsim-240 and wordsim-296, with one limitation: words that did not appear in the training corpus were deleted from the two sets (one word and three words, respectively), yielding two new evaluation sets, wordsim-239 and wordsim-293. The concrete experimental results are reported in the paper.

Analogical reasoning uses the evaluation set constructed by Chen et al. (2015); since all of its words are included in the training corpus, no data are removed. The experimental results are reported in the paper.


[1] Improve Chinese Word Embeddings by Exploiting Internal Structure. NAACL-HLT 2016.
[2] Multi-Granularity Chinese Word Embedding. EMNLP 2016.

Personal information

[1] blog:bamtercelboo.github.io/
[2] Github:github.com/bamtercelboo
[3] Zhihu: www.zhihu.com/people/bamtercelboo/activities
[4] Blog Park: http://www.cnblogs.com/bamtercelboo/

Please indicate the source when reposting.
