HMM-Viterbi Role Labeling in Practice: Place Name Recognition

Source: Internet
Author: User

Http://www.hankcs.com/nlp/ner/place-names-to-identify-actual-hmm-viterbi-role-labeling.html

Named entity recognition (NER) is one of the hard problems in natural language processing, especially in Chinese, which has no fixed surface cues such as capitalization. The previous post covered "HMM-Viterbi role labeling in practice: Chinese person name recognition". This post applies a similar principle to implement automatic recognition of Chinese place names (ns) for HanLP.

Principle

Training

Training automatically labels an annotated corpus with roles, counts the role frequency of each word and the transition probabilities between roles, builds a model from these statistics, and summarizes a set of usable pattern strings.

Recognition

At recognition time, the model above and the HMM-Viterbi algorithm assign roles to the coarse segmentation of raw text. The Aho-Corasick algorithm then pattern-matches the role sequence to find candidate place names, which are fed into the second-layer hidden Markov model.

In practice

Training

Automatic role labeling

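The approach follows the paper "Chinese Named Entity Recognition Based on Cascaded Hidden Markov Models" (PDF) and uses the place name recognition roles defined there. In the examples below, A marks the word preceding a place name, B the word following one, X a word between two place names, S the start of the sentence, and Z everything else.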

On this basis, I extended the role set: C, D, and E stand for the first, second, and third characters of a place name, H for a Chinese place name suffix, and G for a whole place name. In total this can recognize place names of up to six characters (a three-character CDE name plus a suffix of up to three characters), which is an improvement over the paper.

A small amount of code is enough to label the annotated corpus with roles automatically. Take, for example, this sentence from the People's Daily 2014 segmented corpus:

王先东/nr 来自/v 湖北/ns 荆门/ns ,/w 在/p 佛山市/ns [南海区/ns 大沥镇/ns]/nz 某/rz 物业公司/nis 做/v 保安/b

Processing it step by step gives:

Original corpus [未##人/nr, 来自/v, 湖北/ns, 的/ude1, 荆门/ns, ,/w, 在/p, 乌鲁木齐市/ns, [南海区/ns 大沥镇/ns]/ns, 某/rz, 物业公司/nis, 做/v, 保安/b]
Add start and end [始##始/S, 未##人/nr, 来自/v, 湖北/ns, 的/ude1, 荆门/ns, ,/w, 在/p, 乌鲁木齐市/ns, [南海区/ns 大沥镇/ns]/ns, 某/rz, 物业公司/nis, 做/v, 保安/b, 末##末/Z]
Label preceding context [始##始/S, 未##人/nr, 来自/A, 湖北/ns, 的/A, 荆门/ns, ,/w, 在/A, 乌鲁木齐市/ns, [南海区/ns 大沥镇/ns]/ns, 某/rz, 物业公司/nis, 做/v, 保安/b, 末##末/Z]
Label following context [始##始/S, 未##人/nr, 来自/A, 湖北/ns, 的/B, 荆门/ns, ,/B, 在/A, 乌鲁木齐市/ns, [南海区/ns 大沥镇/ns]/ns, 某/B, 物业公司/nis, 做/v, 保安/b, 末##末/Z]
Label in-between words [始##始/S, 未##人/nr, 来自/A, 湖北/ns, 的/X, 荆门/ns, ,/B, 在/A, 乌鲁木齐市/ns, [南海区/ns 大沥镇/ns]/ns, 某/B, 物业公司/nis, 做/v, 保安/b, 末##末/Z]
Split place names [始##始/S, 未##人/nr, 来自/A, 湖北/ns, 的/X, 荆门/ns, ,/B, 在/A, 乌鲁木齐市/ns, 南海区/ns, 大沥镇/ns, 某/B, 物业公司/nis, 做/v, 保安/b, 末##末/Z]
Process whole names [始##始/S, 未##人/Z, 来自/A, 湖北/G, 的/X, 荆/C, 门/H, ,/B, 在/A, 乌鲁木齐/G, 市/H, 南/C, 海/D, 区/H, 大/C, 沥/D, 镇/H, 某/B, 物业公司/Z, 做/Z, 保安/Z, 末##末/Z]
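
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not HanLP's actual implementation: the names `label_roles`, `split_place`, and `SUFFIXES` are assumptions made for this example, and the splitting rule (stems of up to three characters become C, D, E; longer stems and names without a known suffix become G) is inferred from the sample output rather than taken from the source. Bracketed compound names are assumed to be split into their parts already.

```python
# Hypothetical suffix list; a real system would use a much larger one.
SUFFIXES = {"市", "区", "镇", "县", "村", "门"}
STEM_ROLES = "CDE"  # roles for the 1st, 2nd and 3rd character of a short stem


def split_place(name):
    """Split one place name into (text, role) pieces."""
    if len(name) > 1 and name[-1] in SUFFIXES:
        stem, suffix = name[:-1], name[-1]
        if len(stem) <= 3:
            pieces = [(ch, STEM_ROLES[i]) for i, ch in enumerate(stem)]
        else:
            pieces = [(stem, "G")]  # stem too long for CDE: keep it whole
        return pieces + [(suffix, "H")]
    return [(name, "G")]  # no known suffix: the whole name is one unit


def label_roles(tagged):
    """tagged: list of (word, pos) pairs. Returns a list of (text, role)."""
    words = [("始##始", "S")] + list(tagged) + [("末##末", "Z")]
    is_place = [pos == "ns" for _, pos in words]
    labeled = []
    for i, (word, _) in enumerate(words):
        if is_place[i]:
            labeled.extend(split_place(word))
            continue
        if i == 0:
            labeled.append((word, "S"))
            continue
        after_place = is_place[i - 1]
        before_place = i + 1 < len(words) and is_place[i + 1]
        if after_place and before_place:
            role = "X"  # between two place names
        elif before_place:
            role = "A"  # preceding context
        elif after_place:
            role = "B"  # following context
        else:
            role = "Z"  # everything else
        labeled.append((word, role))
    return labeled


sentence = [
    ("未##人", "nr"), ("来自", "v"), ("湖北", "ns"), ("的", "ude1"),
    ("荆门", "ns"), (",", "w"), ("在", "p"), ("乌鲁木齐市", "ns"),
    ("南海区", "ns"), ("大沥镇", "ns"), ("某", "rz"), ("物业公司", "nis"),
    ("做", "v"), ("保安", "b"),
]
print(", ".join(f"{text}/{role}" for text, role in label_roles(sentence)))
```

On the sample sentence this sketch reproduces the role sequence shown in the last processing step.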

Counting word frequencies

Once every sentence in the annotated corpus has been labeled automatically, the role frequencies of each non-Z word can be counted to produce a role dictionary:

位于 A 1660 X 93 B 33
位列 B 17 A 13 X 1
位居 B 25 A 14 X 1
位次 B 1
位置 B 5 A 1
低 B 9
低于 A 18 B 2
低产田 B 1
低价 B 1
低估 A 5
低保 B 3
低保户 B 3
低效 B 1
低温 B 3
低热值 B 1
低碳 B 27
低空 B 2
低调 B 5
低速 B 3
低阶煤 B 1
住 A 81 B 53
住友 B 1
住在 A 271 B 1
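
Counting of this kind might look like the following sketch. The function names `count_role_frequencies` and `format_entry` are made up for this example, and the three tiny input sentences are invented, not taken from the corpus.

```python
from collections import Counter, defaultdict


def count_role_frequencies(labeled_sentences):
    """Count how often each word carries each role, skipping role Z."""
    freq = defaultdict(Counter)
    for sentence in labeled_sentences:
        for word, role in sentence:
            if role != "Z":
                freq[word][role] += 1
    return freq


def format_entry(word, counter):
    """Render one dictionary line: the word, then roles by falling count."""
    parts = [f"{role} {count}" for role, count in counter.most_common()]
    return " ".join([word] + parts)


# Invented toy data, only to show the shape of the input and output.
toy_sentences = [
    [("位于", "A"), ("湖北", "G"), ("的", "X")],
    [("位于", "A"), ("做", "Z")],
    [("位于", "B")],
]
frequencies = count_role_frequencies(toy_sentences)
for word, counter in frequencies.items():
    print(format_entry(word, counter))
```

Each printed line has the same layout as the dictionary entries above: the word followed by role and count pairs.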

Counting the transition matrix

The transition matrix records how often one role tag is followed by another. Together with the role frequencies, it is used to compute the initial, transition, and emission probabilities of the HMM, after which the model can be decoded. For the Viterbi algorithm and its implementation, see "A Java implementation of the general Viterbi algorithm".
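
As a rough sketch of that computation, transition counts can be collected from adjacent role pairs and normalized per row, and emission probabilities can be estimated as the count of a word under a role divided by the total count of that role. The function names and the toy role sequences below are assumptions for illustration; smoothing, which a real system would need, is left out.

```python
from collections import Counter, defaultdict


def count_transitions(role_sequences):
    """Count how often role `prev` is directly followed by role `nxt`."""
    counts = defaultdict(Counter)
    for roles in role_sequences:
        for prev, nxt in zip(roles, roles[1:]):
            counts[prev][nxt] += 1
    return counts


def transition_probabilities(counts):
    """Normalize each row of the count matrix so that it sums to 1."""
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values())
        probs[prev] = {nxt: n / total for nxt, n in row.items()}
    return probs


def emission_probabilities(role_freq):
    """role_freq: word -> {role: count}. Returns role -> {word: P(word | role)}."""
    role_totals = Counter()
    for roles in role_freq.values():
        role_totals.update(roles)
    probs = defaultdict(dict)
    for word, roles in role_freq.items():
        for role, n in roles.items():
            probs[role][word] = n / role_totals[role]
    return probs


# Invented role sequences; the two dictionary entries are from the list above.
sequences = [list("SAGBZ"), list("SACHBZ")]
trans = transition_probabilities(count_transitions(sequences))
emit = emission_probabilities({
    "位于": {"A": 1660, "X": 93, "B": 33},
    "住在": {"A": 271, "B": 1},
})
print(trans["A"])
print(emit["A"]["位于"])
```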

The transition matrix was trained on the People's Daily 2014 segmented corpus.

Recognition

Example

Take the sentence "南翔向宁夏固原市彭阳县红河镇黑牛沟村捐赠了挖掘机" ("Nanxiang donated an excavator to Heiniugou Village, Honghe Town, Pengyang County, Guyuan City, Ningxia") as an example. With place name recognition turned off, the output is:

[南翔/ns, 向/p, 宁夏/ns, 固原市/ns, 彭/nz, 阳/ag, 县/n, 红/a, 河镇/ns, 黑/a, 牛/n, 沟/n, 村/n, 捐赠/v, 了/ule, 挖掘机/n]

In this example, "宁夏" (Ningxia) and "固原市" (Guyuan City) are common place names that are included in the core dictionary, so they are segmented correctly. Small places such as "彭阳县" (Pengyang County), "红河镇" (Honghe Town), and "黑牛沟村" (Heiniugou Village) are not in the dictionary, so they naturally cannot be segmented correctly.

Role labeling

Place name role observation: [  Z 41339414 ][南翔 H 1000 ][向 A 1076 B 115 X 70 C 49 D 5 ][宁夏 H 1000 ][固原市 H 1000 ][彭 C 85 ][阳 D 1255 C 81 B 1 ][县 H 6878 B 25 A 23 D 19 X 3 C 2 ][红 C 1000 B 46 A 3 ][河镇 H 1000 ][黑 C 960 B 25 ][牛 D 24 C 8 B 7 ][沟 H 107 D 90 E 36 C 27 B 14 A 3 ][村 H 4467 D 68 B 28 A 8 C 3 ][捐赠 B 10 A 1 ][了 A 4115 B 97 ][挖掘机 B 1 ][  Z 41339414 ]
Place name role labeling: [ /Z ,南翔/H ,向/B ,宁夏/H ,固原市/H ,彭/C ,阳/D ,县/H ,红/C ,河镇/H ,黑/C ,牛/D ,沟/E ,村/H ,捐赠/B ,了/A ,挖掘机/B , /Z]
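
Note that "向" is labeled B even though its most frequent role in the dictionary is A: Viterbi decoding weighs the transitions between neighboring roles as well as the emission of each word. The sketch below shows a generic log-space Viterbi decoder on a three-word toy model. Every probability in it is invented for illustration and is not a value trained from the corpus; the function name `viterbi` and the `floor` parameter are likewise assumptions of this example.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable state sequence for the observations."""

    def log_p(p):
        # Unseen events get a tiny floor probability instead of log(0).
        return math.log(max(p, floor))

    score = [{s: log_p(start_p.get(s, 0)) + log_p(emit_p[s].get(obs[0], 0))
              for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            # Best previous state for arriving in state s at time t.
            prev = max(states,
                       key=lambda r: score[t - 1][r] + log_p(trans_p[r].get(s, 0)))
            score[t][s] = (score[t - 1][prev]
                           + log_p(trans_p[prev].get(s, 0))
                           + log_p(emit_p[s].get(obs[t], 0)))
            back[t][s] = prev
    # Follow the back pointers from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy model: all numbers below are hypothetical.
states = ["A", "B", "H"]
start_p = {"A": 0.3, "B": 0.1, "H": 0.6}
trans_p = {
    "A": {"A": 0.1, "B": 0.1, "H": 0.8},
    "B": {"A": 0.3, "B": 0.3, "H": 0.4},
    "H": {"A": 0.1, "B": 0.7, "H": 0.2},
}
emit_p = {
    "A": {"向": 0.6},
    "B": {"向": 0.4},
    "H": {"南翔": 0.5, "宁夏": 0.5},
}
print(viterbi(["南翔", "向", "宁夏"], states, start_p, trans_p, emit_p))
# → ['H', 'B', 'H']
```

In this toy model "向" is more likely under A than under B when taken alone, but the high H to B transition probability makes H, B, H the best overall path.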
Pattern matching

Use the Aho-Corasick algorithm to match the following pattern strings against the role sequence:

CH
CDH
CDEH
GH

This yields the following place names:

Recognized place name: 彭阳县 CDH
Recognized place name: 红河镇 CH
Recognized place name: 黑牛沟村 CDEH
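
The matching step can be illustrated with a compact Aho-Corasick automaton over the role string. This is a self-contained sketch, not HanLP's implementation; the function names `build_automaton` and `find_all` are made up for the example. Because every token carries exactly one single-letter role, a match position in the role string is also a position in the token list.

```python
from collections import deque


def build_automaton(patterns):
    """Build the goto, failure and output tables for the given patterns."""
    goto, fail, output = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)
    # Breadth-first pass to fill in failure links below depth 1.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, automaton):
    """Return (start_index, pattern) for every match, in one pass over text."""
    goto, fail, output = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


words = [" ", "南翔", "向", "宁夏", "固原市", "彭", "阳", "县", "红", "河镇",
         "黑", "牛", "沟", "村", "捐赠", "了", "挖掘机", " "]
roles = "ZHBHHCDHCHCDEHBABZ"  # one role letter per token, from the labeling above

automaton = build_automaton(["CH", "CDH", "CDEH", "GH"])
for start, pattern in find_all(roles, automaton):
    name = "".join(words[start:start + len(pattern)])
    print(name, pattern)
# prints:
# 彭阳县 CDH
# 红河镇 CH
# 黑牛沟村 CDEH
```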

Second-layer HMM fine segmentation

Strictly speaking, this is the third HMM layer, because place name recognition itself also uses an HMM, and its output serves as the input here. After fine segmentation, the final result is:

[南翔/ns, 向/p, 宁夏/ns, 固原市/ns, 彭阳县/ns, 红河镇/ns, 黑牛沟村/ns, 捐赠/v, 了/ule, 挖掘机/n]

Summary

HMMs can solve many problems, and several HMMs can be stacked to achieve more accurate results.

However, the bigram model still produces false hits in some cases, and in practice many high-frequency place names are already covered by the core dictionary and user-defined dictionaries. HanLP's default configuration therefore turns place name recognition off and leaves it to users to enable in special cases, such as extracting county-level addresses.

