Chinese Word Segmentation Implementation: Bidirectional Maximum Matching

Source: Internet
Author: User

For background on Chinese word segmentation, see the post "Summary of Chinese word segmentation algorithms" on this blog. I will not repeat the details here.

Bidirectional maximum matching

Bidirectional maximum matching is a dictionary-based word segmentation method. A dictionary-based method matches the Chinese character string to be analyzed against the entries of a sufficiently large machine dictionary according to some strategy. If a string is found in the dictionary, the match succeeds.

By scanning direction, these methods divide into forward matching and reverse matching.

By preferred match length, they divide into maximum matching and minimum matching.

Forward maximum matching (FMM)

1. Scanning the Chinese sentence from left to right, take the next m characters as the matching field, where m is the length of the longest entry in the machine dictionary.

2. Look the matching field up in the machine dictionary. If the match succeeds, the matching field is split off as a word.

If the match fails, remove the last character of the matching field and use the remaining string as the new matching field. Repeat this process until the whole sentence has been split into words.
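
The steps above can be sketched as follows. This is a minimal illustration rather than the implementation given later in the article: the class name FmmSketch, the parameter maxLen, and the toy dictionary are assumptions made for the example, and an unspaced English string is used so that each letter stands in for one Chinese character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class FmmSketch {
    // Forward maximum matching: scan left to right and try the longest candidate first.
    static List<String> fmm(String sentence, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < sentence.length()) {
            int end = Math.min(i + maxLen, sentence.length());
            // Drop the last character of the matching field until it is a
            // dictionary entry or only a single character remains.
            while (end - i > 1 && !dict.contains(sentence.substring(i, end))) {
                end--;
            }
            words.add(sentence.substring(i, end));
            i = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Toy dictionary, chosen only for illustration; its longest entry has 5 letters.
        Set<String> dict = Set.of("the", "theta", "table", "bled", "down", "own", "there");
        System.out.println(fmm("thetabledownthere", dict, 5));
        // prints [theta, bled, own, there]
    }
}
```

Because the longest candidate always wins at the left edge, "theta" is taken first and the rest of the sentence is forced into a poor split. This is the kind of ambiguity that the reverse and bidirectional methods try to catch.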

Reverse maximum matching (BMM)

This algorithm mirrors forward maximum matching: the sentence is scanned from right to left, and when a match fails, the first character of the matching field is removed. Experiments show that reverse maximum matching generally performs better than forward maximum matching.
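
A minimal sketch of the reverse scan, under the same kind of assumptions: the class name BmmSketch, the parameter maxLen, and the toy dictionary are hypothetical, and each English letter stands in for one Chinese character.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

class BmmSketch {
    // Reverse (backward) maximum matching: scan right to left and try the longest candidate first.
    static List<String> bmm(String sentence, Set<String> dict, int maxLen) {
        LinkedList<String> words = new LinkedList<>();
        int i = sentence.length();
        while (i > 0) {
            int start = Math.max(i - maxLen, 0);
            // Drop the first character of the matching field until it is a
            // dictionary entry or only a single character remains.
            while (i - start > 1 && !dict.contains(sentence.substring(start, i))) {
                start++;
            }
            // Words are found from the end, so prepend to keep sentence order.
            words.addFirst(sentence.substring(start, i));
            i = start;
        }
        return words;
    }

    public static void main(String[] args) {
        // Toy dictionary, chosen only for illustration; its longest entry has 5 letters.
        Set<String> dict = Set.of("the", "theta", "table", "bled", "down", "own", "there");
        System.out.println(bmm("thetabledownthere", dict, 5));
        // prints [the, table, down, there]
    }
}
```

With this dictionary, a forward scan of the same input would yield theta / bled / own / there, so the two directions disagree, and here the reverse result is the sensible one.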


Bidirectional matching method (BM)

The bidirectional maximum matching method compares the segmentation produced by forward maximum matching with the one produced by reverse maximum matching in order to decide which segmentation is correct. According to the study by Sun M.S. and Benjamin K.T. (1995), for about 90.0% of Chinese sentences the forward and reverse results coincide and are correct. For only about 9.0% of sentences the two results differ, and in those cases one of them is correct (ambiguity detection succeeds). For fewer than 1.0% of sentences, either the two results coincide but are wrong, or they differ and both are wrong (ambiguity detection fails). This is why bidirectional maximum matching is widely used in practical Chinese information processing systems.


The implementation in this article combines the forward and reverse maximum matching results and adds a few heuristic rules to further resolve ambiguous segmentations.

Heuristic rules:

1. If the forward and reverse results contain different numbers of words, return the one with fewer words.

2. If the two results contain the same number of words:

A. If the two results are identical, there is no ambiguity and either one can be returned.

B. If the two results differ, return the one with fewer single-character words.
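
The rules can be sketched as a small selection function over two already-computed segmentations. The names BiMmChooser, choose, and singles are hypothetical. When both results have the same number of single-character words, this sketch returns the reverse result, which matches the tie-breaking in the implementation that follows.

```java
import java.util.List;

class BiMmChooser {
    // Number of single-character words in a segmentation.
    static long singles(List<String> words) {
        return words.stream().filter(w -> w.length() == 1).count();
    }

    // Pick between a forward (fmm) and a reverse (bmm) segmentation.
    static List<String> choose(List<String> fmm, List<String> bmm) {
        // Rule 1: different word counts, so prefer the segmentation with fewer words.
        if (fmm.size() != bmm.size()) {
            return fmm.size() < bmm.size() ? fmm : bmm;
        }
        // Rule 2A: identical results, no ambiguity.
        if (fmm.equals(bmm)) {
            return fmm;
        }
        // Rule 2B: prefer fewer single-character words; ties go to the reverse result.
        return singles(fmm) < singles(bmm) ? fmm : bmm;
    }

    public static void main(String[] args) {
        // Illustrative inputs only; letters stand in for Chinese characters.
        System.out.println(choose(List.of("ab", "c", "d"), List.of("a", "bcd"))); // prints [a, bcd]
        System.out.println(choose(List.of("ab", "cd"), List.of("a", "bcd")));     // prints [ab, cd]
    }
}
```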


The implementation is as follows:


package Segment;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;
import java.util.Vector;

public class FBSegment {

    private static Set<String> seg_dict;

    // Load the dictionary: one entry per line, blank lines are skipped.
    public static void Init() {
        seg_dict = new HashSet<String>();
        String dicpath = "data/worddic.txt";
        String line = null;
        BufferedReader br;
        try {
            br = new BufferedReader(new InputStreamReader(new FileInputStream(dicpath)));
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty())
                    continue;
                seg_dict.add(line);
            }
            br.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Forward maximum matching.
     * @param phrase sentence to be segmented
     * @return forward segmentation result
     */
    private static Vector<String> FMM2(String phrase) {
        int maxlen = 16; // longest matching field tried
        Vector<String> fmm_list = new Vector<String>();
        int len_phrase = phrase.length();
        int i = 0, j = 0;
        while (i < len_phrase) {
            int end = i + maxlen;
            if (end >= len_phrase)
                end = len_phrase;
            String phrase_sub = phrase.substring(i, end);
            // Try the longest prefix first, dropping the last character on each failure.
            for (j = phrase_sub.length(); j >= 0; j--) {
                if (j == 1)
                    break;
                String key = phrase_sub.substring(0, j);
                if (seg_dict.contains(key)) {
                    fmm_list.add(key);
                    i += key.length() - 1;
                    break;
                }
            }
            // No multi-character match: emit the first character as a word.
            if (j == 1)
                fmm_list.add("" + phrase_sub.charAt(0));
            i += 1;
        }
        return fmm_list;
    }

    /**
     * Reverse (backward) maximum matching.
     * @param phrase sentence to be segmented
     * @return reverse segmentation result
     */
    private static Vector<String> BMM2(String phrase) {
        int maxlen = 16; // longest matching field tried
        Vector<String> bmm_list = new Vector<String>();
        int len_phrase = phrase.length();
        int i = len_phrase, j = 0;
        while (i > 0) {
            int start = i - maxlen;
            if (start < 0)
                start = 0;
            String phrase_sub = phrase.substring(start, i);
            // Try the longest suffix first, dropping the first character on each failure.
            for (j = 0; j < phrase_sub.length(); j++) {
                if (j == phrase_sub.length() - 1)
                    break;
                String key = phrase_sub.substring(j);
                if (seg_dict.contains(key)) {
                    bmm_list.insertElementAt(key, 0);
                    i -= key.length() - 1;
                    break;
                }
            }
            // No multi-character match: emit the last character as a word.
            if (j == phrase_sub.length() - 1)
                bmm_list.insertElementAt("" + phrase_sub.charAt(j), 0);
            i -= 1;
        }
        return bmm_list;
    }

    /**
     * Combines the forward and reverse matching results to produce the final segmentation.
     * @param phrase sentence to be segmented
     * @return final segmentation result
     */
    public static Vector<String> segment(String phrase) {
        Vector<String> fmm_list = FMM2(phrase);
        Vector<String> bmm_list = BMM2(phrase);
        // If the forward and reverse results differ in word count, take the one with fewer words.
        if (fmm_list.size() != bmm_list.size()) {
            if (fmm_list.size() > bmm_list.size())
                return bmm_list;
            else
                return fmm_list;
        }
        // The two results have the same number of words.
        else {
            int i, FSingle = 0, BSingle = 0;
            boolean isSame = true;
            for (i = 0; i < fmm_list.size(); i++) {
                if (!fmm_list.get(i).equals(bmm_list.get(i)))
                    isSame = false;
                if (fmm_list.get(i).length() == 1)
                    FSingle += 1;
                if (bmm_list.get(i).length() == 1)
                    BSingle += 1;
            }
            // Identical results: no ambiguity, return either one.
            if (isSame)
                return fmm_list;
            else {
                // Different results: return the one with fewer single-character words.
                if (BSingle > FSingle)
                    return fmm_list;
                else
                    return bmm_list;
            }
        }
    }

    public static void main(String[] args) {
        // Test sentence as it appears in this translated copy; the segmenter expects Chinese input.
        String test = "I am a student";
        FBSegment.Init();
        System.out.println(FBSegment.segment(test));
    }
}

Output: [I, yes, one student]


For more information, see https://github.com/talentlei

References:

Notes on Chinese word segmentation algorithms: http://www.cnblogs.com/lvpei/archive/2010/08/04/1792409.html

Summary of Chinese word segmentation algorithms: http://blog.csdn.net/chenlei0630/article/details/40710325




Why do mainstream Chinese word segmentation tools not use reverse maximum matching or bidirectional maximum matching?

Speed is the key. Reverse matching requires building a reverse-matching index, which is difficult to operate and maintain.
 
How is a Chinese word segmentation algorithm implemented?

Common word segmentation algorithms include forward maximum matching, reverse maximum matching, bidirectional maximum matching, optimum matching, least-word segmentation, and word-lattice algorithms.
Forward Maximum Matching method (FMM): take a string of 6-8 Chinese characters as the maximum string and match it against the dictionary entries. If there is no match, remove one character from the end and match again, until a corresponding word is found in the dictionary. The matching direction is from left to right.
Backward Maximum Matching method (BMM): the matching direction is opposite to that of FMM, from right to left. Experiments show that for Chinese, reverse maximum matching is more effective than forward maximum matching.
Bi-direction Matching method (BM): compares the segmentation results of FMM and BMM to determine the correct segmentation.
Optimum Matching method (OM): sorts the dictionary entries by how frequently they occur in text, placing high-frequency words first and low-frequency words last, in order to speed up matching.
