Http://www.hankcs.com/nlp/ner/place-names-to-identify-actual-hmm-viterbi-role-labeling.html
Named entity recognition is another hard problem in natural language processing, especially for Chinese, which lacks surface cues such as capitalization. The previous article introduced HMM-Viterbi role labeling for Chinese person name recognition in practice; this article applies the same idea in HanLP to recognize Chinese place names (ns) automatically.
Principle
Training
Automatically role-label the annotated corpus, count how often each word takes each role and how often one role follows another, train a model from these counts, and summarize a set of usable pattern strings.
Recognition
Using the model above, the HMM-Viterbi algorithm labels roles on the coarse segmentation of unseen text; the Aho-Corasick algorithm then matches the role sequence against the pattern strings to find candidate place names, which are fed into the second-layer hidden Markov model.
In practice
Training
Automatic role labeling
The paper "Chinese Named Entity Recognition Based on Cascaded Hidden Markov Models" (PDF) uses the following place-name recognition roles:
On top of these, I added C, D, and E for the first, second, and third characters of a place name, H for the Chinese place-name suffix, and G for an entire place name, so the recognizer can generally handle place names of up to six characters (a C-D-E body plus a three-character suffix), an improvement over the paper.
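To make the tag set concrete, here is a minimal sketch of the roles as a Java enum; the enum name and the comments are mine for illustration and do not claim to mirror HanLP's actual tag class.

/**
 * Illustrative role set for place-name (ns) recognition as described above.
 * A sketch for exposition, not HanLP's actual tag class.
 */
enum PlaceRole
{
    A,  // word immediately before a place name (left context)
    B,  // word immediately after a place name (right context)
    X,  // word sandwiched between two place names, e.g. "的"
    C,  // first character of a split place name
    D,  // second character of a split place name
    E,  // third character of a split place name
    H,  // place-name suffix such as 市, 区, 县, 镇, 村
    G,  // an entire place name kept as one token
    S,  // sentence-begin sentinel (始##始)
    Z   // everything else, including the sentence-end sentinel
}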
A small amount of code automates the role labeling of the annotated corpus. Take this sentence from the People's Daily 2014 segmented corpus as an example (a sketch of the splitting rule follows the trace below):
王先东/nr 来自/v 湖北/ns 荆门/ns ,/w 在/p 佛山市/ns [南海区/ns 大沥镇/ns]/nz 某/rz 物业公司/nis 做/v 保安/b
Step-by-step processing yields:
原始语料 [未##人/nr, 来自/v, 湖北/ns, 的/ude1, 荆门/ns, ,/w, 在/p, 乌鲁木齐市/ns, [南海区/ns 大沥镇/ns]/ns, 某/rz, 物业公司/nis, 做/v, 保安/b]
添加首尾 [始##始/S, 未##人/nr, 来自/v, 湖北/ns, 的/ude1, 荆门/ns, ,/w, 在/p, 乌鲁木齐市/ns, [南海区/ns 大沥镇/ns]/ns, 某/rz, 物业公司/nis, 做/v, 保安/b, 末##末/Z]
标注上文 [始##始/S, 未##人/nr, 来自/A, 湖北/ns, 的/A, 荆门/ns, ,/w, 在/A, 乌鲁木齐市/ns, [南海区/ns 大沥镇/ns]/ns, 某/rz, 物业公司/nis, 做/v, 保安/b, 末##末/Z]
标注下文 [始##始/S, 未##人/nr, 来自/A, 湖北/ns, 的/B, 荆门/ns, ,/B, 在/A, 乌鲁木齐市/ns, [南海区/ns 大沥镇/ns]/ns, 某/B, 物业公司/nis, 做/v, 保安/b, 末##末/Z]
标注中间 [始##始/S, 未##人/nr, 来自/A, 湖北/ns, 的/X, 荆门/ns, ,/B, 在/A, 乌鲁木齐市/ns, [南海区/ns 大沥镇/ns]/ns, 某/B, 物业公司/nis, 做/v, 保安/b, 末##末/Z]
拆分地名 [始##始/S, 未##人/nr, 来自/A, 湖北/ns, 的/X, 荆门/ns, ,/B, 在/A, 乌鲁木齐市/ns, 南海区/ns, 大沥镇/ns, 某/B, 物业公司/nis, 做/v, 保安/b, 末##末/Z]
处理整个 [始##始/S, 未##人/Z, 来自/A, 湖北/G, 的/X, 荆/C, 门/H, ,/B, 在/A, 乌鲁木齐/G, 市/H, 南/C, 海/D, 区/H, 大/C, 沥/D, 镇/H, 某/B, 物业公司/Z, 做/Z, 保安/Z, 末##末/Z]
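The last trace line suggests a splitting rule for each place-name word. Below is a minimal, self-contained sketch of one plausible reading of it; the suffix list and the three-character threshold are my assumptions, not HanLP's exact implementation.

import java.util.*;

public class NsSplitSketch
{
    // A few common place-name suffix characters; illustrative only.
    static final Set<Character> SUFFIX =
            new HashSet<>(Arrays.asList('市', '区', '县', '镇', '村', '门'));

    /** Split one ns word into character/role pairs, as in the last trace line above. */
    static List<String> label(String ns)
    {
        List<String> out = new ArrayList<>();
        char last = ns.charAt(ns.length() - 1);
        String body = SUFFIX.contains(last) ? ns.substring(0, ns.length() - 1) : ns;
        if (body.equals(ns) || body.length() > 3)
        {
            out.add(body + "/G");                // keep the whole body as one G token
        }
        else
        {
            char[] role = {'C', 'D', 'E'};       // first/second/third character
            for (int i = 0; i < body.length(); ++i)
                out.add(body.charAt(i) + "/" + role[i]);
        }
        if (!body.equals(ns)) out.add(last + "/H");  // the suffix character
        return out;
    }

    public static void main(String[] args)
    {
        for (String w : new String[]{"湖北", "荆门", "乌鲁木齐市", "南海区", "大沥镇"})
            System.out.println(w + " -> " + label(w));
    }
}

Running it reproduces the splits seen above: 湖北/G, 荆/C 门/H, 乌鲁木齐/G 市/H, 南/C 海/D 区/H, 大/C 沥/D 镇/H.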
Counting role frequencies
After all sentences of the annotated corpus have been labeled automatically, count how often each non-Z entry takes each role; this gives a dictionary like the one below (a parsing sketch follows the listing):
位于 A 1660 X 93 B 33
位列 B 17 A 13 X 1
位居 B 25 A 14 X 1
位次 B 1
位置 B 5 A 1
低 B 9
低于 A 18 B 2
低产田 B 1
低价 B 1
低估 A 5
低保 B 3
低保户 B 3
低效 B 1
低温 B 3
低热值 B 1
低碳 B 27
低空 B 2
低调 B 5
低速 B 3
低阶煤 B 1
住 A 81 B 53
住友 B 1
住在 A 271 B 1
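Each line pairs a word with its role counts. A minimal sketch of parsing such a line, assuming whitespace-separated fields (HanLP stores this in its own dictionary classes, so this is only illustrative):

import java.util.*;

public class NsDictLineSketch
{
    /** Parse a line like "位于 A 1660 X 93 B 33" into word -> {role: count}. */
    static Map.Entry<String, Map<String, Integer>> parse(String line)
    {
        String[] field = line.trim().split("\\s+");
        Map<String, Integer> roleCount = new LinkedHashMap<>();
        for (int i = 1; i + 1 < field.length; i += 2)
            roleCount.put(field[i], Integer.parseInt(field[i + 1]));
        return new AbstractMap.SimpleEntry<>(field[0], roleCount);
    }

    public static void main(String[] args)
    {
        System.out.println(parse("位于 A 1660 X 93 B 33"));  // 位于={A=1660, X=93, B=33}
    }
}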
Counting the transition matrix
The transition matrix records how often one role tag is followed by another. Together with the per-word role frequencies it gives the HMM's initial, transition, and emission probabilities, after which Viterbi decoding completes the labeling. For the algorithm and its implementation, see the separate article on a general-purpose Java implementation of the Viterbi algorithm.
The transition matrix was trained on the People's Daily 2014 segmented corpus.
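As a worked illustration, such counts can be turned into maximum-likelihood HMM parameters roughly as follows; smoothing and HanLP's actual storage format are omitted, the method names are mine, and Viterbi decoding typically works with the negative logarithms of these probabilities.

import java.util.*;

public class NsHmmParamSketch
{
    /**
     * transition[i][j] counts how often role i is immediately followed by role j;
     * the maximum-likelihood transition probability is the row-normalized count.
     * The row of the sentence-begin role S doubles as the initial distribution.
     */
    static double transitionProb(int[][] transition, int i, int j)
    {
        long rowTotal = 0;
        for (int c : transition[i]) rowTotal += c;
        return rowTotal == 0 ? 0 : transition[i][j] / (double) rowTotal;  // P(role_j | role_i)
    }

    /**
     * emission.get(role).get(word) counts how often a word was labeled with a role
     * in the dictionary above; normalizing by the role total gives P(word | role).
     */
    static double emissionProb(Map<String, Map<String, Integer>> emission, String role, String word)
    {
        Map<String, Integer> row = emission.getOrDefault(role, Collections.emptyMap());
        long total = 0;
        for (int c : row.values()) total += c;
        return total == 0 ? 0 : row.getOrDefault(word, 0) / (double) total;
    }
}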
Recognition
Example
Take "Nanxiang to Ningxia Guyuan Pengyang County Red River Town Black cattle gou cun donated excavator" As an example, when the name is not recognized, the following output will be obtained:
[南翔/ns, 向/p, 宁夏/ns, 固原市/ns, 彭/nz, 阳/ag, 县/n, 红/a, 河镇/ns, 黑/a, 牛/n, 沟/n, 村/n, 捐赠/v, 了/ule, 挖掘机/n]
In this output, "宁夏" and "固原市" are common place names already included in the core dictionary, so they are segmented correctly. Small places such as "彭阳县", "红河镇", and "黑牛沟村" are not covered by the dictionary, so they naturally cannot be segmented correctly.
Role labeling
地名角色观察：[ Z 41339414 ][南翔 H 1000 ][向 A 1076 B 115 X 70 C 49 D 5 ][宁夏 H 1000 ][固原市 H 1000 ][彭 C 85 ][阳 D 1255 C 81 B 1 ][县 H 6878 B 25 A 23 D 19 X 3 C 2 ][红 C 1000 B 46 A 3 ][河镇 H 1000 ][黑 C 960 B 25 ][牛 D 24 C 8 B 7 ][沟 H 107 D 90 E 36 C 27 B 14 A 3 ][村 H 4467 D 68 B 28 A 8 C 3 ][捐赠 B 10 A 1 ][了 A 4115 B 97 ][挖掘机 B 1 ][ Z 41339414 ]
地名角色标注:[ /Z ,南翔/H ,向/B ,宁夏/H ,固原市/H ,彭/C ,阳/D ,县/H ,红/C ,河镇/H ,黑/C ,牛/D ,沟/E ,村/H ,捐赠/B ,了/A ,挖掘机/B , /Z]
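For reference, a compact, generic Viterbi decoder of the kind used for this labeling step (negative-log scores over a role alphabet); this is a textbook sketch under my own naming, not HanLP's code.

import java.util.*;

public class ViterbiSketch
{
    /**
     * Generic Viterbi decoding over T observations and N hidden roles.
     * start[s], trans[s][t], and emit.get(i)[t] are negative log probabilities
     * (emit.get(i)[t] = -log P(observation i | role t)).
     * Returns the best role index for each observation.
     */
    static int[] decode(double[] start, double[][] trans, List<double[]> emit)
    {
        int T = emit.size(), N = start.length;
        double[][] cost = new double[T][N];
        int[][] back = new int[T][N];
        for (int s = 0; s < N; ++s) cost[0][s] = start[s] + emit.get(0)[s];
        for (int i = 1; i < T; ++i)
            for (int t = 0; t < N; ++t)
            {
                cost[i][t] = Double.MAX_VALUE;
                for (int s = 0; s < N; ++s)
                {
                    double c = cost[i - 1][s] + trans[s][t] + emit.get(i)[t];
                    if (c < cost[i][t]) { cost[i][t] = c; back[i][t] = s; }
                }
            }
        int[] path = new int[T];
        for (int t = 1; t < N; ++t)                         // pick the cheapest final role
            if (cost[T - 1][t] < cost[T - 1][path[T - 1]]) path[T - 1] = t;
        for (int i = T - 1; i > 0; --i) path[i - 1] = back[i][path[i]];  // backtrack
        return path;
    }
}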
Pattern matching
The Aho-Corasick algorithm matches the role-tag sequence against the place-name pattern strings (patterns such as CH, CDH, and CDEH; a simplified matching sketch follows the output below).
This yields the following place names:
识别出地名:彭阳县 CDH
识别出地名:红河镇 CH
识别出地名:黑牛沟村 CDEH
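As a simplified stand-in for the Aho-Corasick step, the following sketch scans the role string from the labeling above with a plain substring search; the pattern list contains only the three patterns visible in the output, whereas the real list is longer and HanLP uses a genuine AC automaton.

import java.util.*;

public class NsPatternSketch
{
    public static void main(String[] args)
    {
        // Terms and role tags from the labeling above (sentence sentinels dropped).
        String[] words = {"南翔", "向", "宁夏", "固原市", "彭", "阳", "县", "红", "河镇",
                          "黑", "牛", "沟", "村", "捐赠", "了", "挖掘机"};
        String roles = "HBHHCDHCHCDEHBAB";
        // Only the three patterns visible in the output above; the real list is longer.
        for (String pattern : new String[]{"CDH", "CH", "CDEH"})
        {
            for (int i = roles.indexOf(pattern); i >= 0; i = roles.indexOf(pattern, i + 1))
            {
                StringBuilder name = new StringBuilder();
                for (int k = i; k < i + pattern.length(); ++k) name.append(words[k]);
                System.out.println("识别出地名:" + name + " " + pattern);
            }
        }
    }
}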
Re-segmentation by the second-layer hidden Markov model
Strictly speaking this is already the third hidden Markov layer, because the role labeling inside place-name recognition also used an HMM, and each layer's output becomes the next layer's input. After re-segmentation the final result is obtained (a sketch of the merge follows the output):
[南翔/ns, 向/p, 宁夏/ns, 固原市/ns, 彭阳县/ns, 红河镇/ns, 黑牛沟村/ns, 捐赠/v, 了/ule, 挖掘机/n]
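The effect of this stage can be sketched as merging the recognized spans back into the coarse segmentation as single terms; the real implementation inserts them into the word lattice and lets the second-layer HMM re-segment, so the following is only a rough illustration.

import java.util.*;

public class NsMergeSketch
{
    /** Replace each recognized span of coarse terms with a single merged term. */
    static List<String> merge(List<String> coarse, List<String> names)
    {
        List<String> result = new ArrayList<>(coarse);
        for (String name : names)
        {
            for (int i = 0; i < result.size(); ++i)
            {
                StringBuilder span = new StringBuilder();
                int j = i;
                while (j < result.size() && span.length() < name.length())
                    span.append(result.get(j++));
                if (span.toString().equals(name))
                {
                    result.subList(i, j).clear();  // drop the covered coarse terms
                    result.add(i, name);           // insert the whole place name
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        List<String> coarse = new ArrayList<>(Arrays.asList(
                "南翔", "向", "宁夏", "固原市", "彭", "阳", "县", "红", "河镇",
                "黑", "牛", "沟", "村", "捐赠", "了", "挖掘机"));
        System.out.println(merge(coarse, Arrays.asList("彭阳县", "红河镇", "黑牛沟村")));
        // [南翔, 向, 宁夏, 固原市, 彭阳县, 红河镇, 黑牛沟村, 捐赠, 了, 挖掘机]
    }
}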
Summary
HMMs can solve many problems, and stacking several of them yields even more accurate results.
However, the bigram grammar can still win out in unfavorable cases, and many high-frequency place names are already covered by the core dictionary and user dictionaries. HanLP therefore turns place-name recognition off in its default configuration and leaves it for users to enable in special scenarios, such as extracting county-level (and lower) addresses.
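For completeness, enabling it from user code looks roughly like this, based on HanLP's public Java API (the demo class name is mine; exact method availability may vary across versions).

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.Segment;
import com.hankcs.hanlp.seg.common.Term;

import java.util.List;

public class DemoPlaceRecognize
{
    public static void main(String[] args)
    {
        // Place-name recognition is off by default; turn it on explicitly.
        Segment segment = HanLP.newSegment().enablePlaceRecognize(true);
        List<Term> termList = segment.seg("南翔向宁夏固原市彭阳县红河镇黑牛沟村捐赠了挖掘机");
        System.out.println(termList);
        // 彭阳县, 红河镇, 黑牛沟村 should now come out as whole ns terms
    }
}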
For reprints, please credit: 码农场 (hankcs.com) » HMM-Viterbi role labeling for place-name recognition in practice