Design and Implementation of .NET-based Word Segmentation Software V1.0: Overall Idea and Algorithm Implementation

Source: Internet
Author: User

First, let me introduce the overall idea of this word segmentation software.

According to the material I found on the Internet, the two commonly used approaches are the "forward maximum matching algorithm" and the "reverse maximum matching algorithm". If you are not familiar with them, here are the basic ideas of the two algorithms:

The basic idea of the forward maximum matching algorithm is as follows. First, check whether the whole text exists in the dictionary. If so, output it directly. If not, delete the last character of the text and check again, repeating until what remains is a dictionary word or a single character. Output that word, remove it from the front of the original text, and run the same procedure on the remainder. In this way, a phrase is divided into a sequence of words.
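As a minimal sketch of the forward maximum matching idea (not the software's actual code), the loop could be written as follows. The class name `ForwardMaxMatch`, the method `Segment`, and the use of a `HashSet<string>` as the dictionary are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

static class ForwardMaxMatch
{
    // Segments text by repeatedly taking the longest dictionary word that
    // starts at the current position. Characters that start no dictionary
    // word are emitted as single-character words.
    public static List<string> Segment(string text, ISet<string> dict)
    {
        var result = new List<string>();
        int start = 0;
        while (start < text.Length)
        {
            // Begin with everything that is left, then drop the last
            // character until the candidate is a word or a single character.
            int length = text.Length - start;
            while (length > 1 && !dict.Contains(text.Substring(start, length)))
            {
                length--;
            }
            result.Add(text.Substring(start, length));
            start += length;
        }
        return result;
    }
}
```

With the hypothetical dictionary { 研究, 研究生, 生命, 起源 }, calling `ForwardMaxMatch.Segment("研究生命起源", dict)` yields 研究生 / 命 / 起源: the greedy match on 研究生 strands 命 as a single character, which is the kind of ambiguity forward matching handles poorly.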

The reverse maximum matching algorithm works the other way around. If the text to be segmented does not exist in the dictionary, delete the first character of the text and perform the matching again.

Experiments show that forward maximum matching resolves ambiguity poorly, so its segmentation accuracy is not high. Therefore, my word segmentation software uses the reverse maximum matching algorithm.

------------------------------------------------------------

However, if we apply the reverse maximum matching algorithm to an arbitrary piece of text as a whole, the cost is easy to imagine: each failed match removes only one character, so a long input requires a large number of dictionary lookups. How should we perform word segmentation, then? Here I split the text at punctuation: use a regular expression to find the punctuation marks in the text, use them as markers to break a long piece of text into short fragments, and then run the reverse maximum matching algorithm on each fragment.
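The punctuation-splitting idea can be sketched in a few lines. This is only an illustration under stated assumptions: the class `SentenceSplitter`, its `Split` method, and the hard-coded punctuation set are hypothetical, whereas the software described here loads its punctuation from a dictionary file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class SentenceSplitter
{
    // Assumed punctuation set (full-width and ASCII marks). Inside a
    // character class these marks are literal, so no escaping is needed.
    const string PunctuationClass = "[，。！？：；、,.!?:;]+";

    // Breaks text into fragments at every run of punctuation marks and
    // discards fragments that are empty after trimming whitespace.
    public static List<string> Split(string text)
    {
        return Regex.Split(text, PunctuationClass)
                    .Select(fragment => fragment.Trim())
                    .Where(fragment => fragment.Length > 0)
                    .ToList();
    }
}
```

For the input "Example: This is my word splitting software. Thank you for using it!", `SentenceSplitter.Split` returns the three fragments "Example", "This is my word splitting software" and "Thank you for using it". Each fragment can then be segmented independently, which keeps every matching pass short.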

Now, let's put it into practice:

1. First, I downloaded a "punctuation dictionary" from the Internet, as follows:

Of course, these are Chinese punctuation marks. Because the dictionary is used inside a regular expression, English punctuation marks cannot simply be added as they are, so I escaped them:

This prevents these punctuation marks from being parsed as special characters by the regular expression.
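To illustrate the escaping step, here is a hedged sketch that builds a regular-expression character class from a raw punctuation string. The class `PunctuationClass` and its `Build` method are hypothetical names, not part of the original software. Only the characters that are special inside `[...]` need a backslash.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

static class PunctuationClass
{
    // Characters with a special meaning inside a [...] character class.
    const string SpecialInClass = @"\]^-[";

    // Returns a pattern such as [,.!\-\]] that matches any single one of
    // the given punctuation characters (duplicates are dropped).
    public static string Build(string punctuation)
    {
        var pattern = new StringBuilder("[");
        foreach (char c in punctuation.Distinct())
        {
            if (SpecialInClass.IndexOf(c) >= 0)
            {
                pattern.Append('\\');
            }
            pattern.Append(c);
        }
        return pattern.Append(']').ToString();
    }
}
```

A call such as `new Regex(PunctuationClass.Build("，。！,.!-]"))` then matches each listed mark literally. Note that `Regex.Escape` is designed for text placed outside a character class and does not escape `]` or `-`, which is why this sketch escapes by hand.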

2. Use a regular expression to split sentences: match the text to be segmented against the punctuation dictionary, and treat each matched position as a sentence boundary.

    // Requires: using System.IO; using System.Text;

    /// <summary>
    /// Create a string consisting of punctuation marks
    /// </summary>
    /// <param name="dictPath">Path of the punctuation dictionary</param>
    string GetPunctuationDictionary(string dictPath)
    {
        StringBuilder strBuilder = new StringBuilder();
        foreach (string s in File.ReadAllLines(dictPath))
        {
            strBuilder.Append(s);
        }
        return strBuilder.ToString();
    }

    // Requires: using System.Collections.Generic; using System.Text.RegularExpressions;

    /// <summary>
    /// Punctuation string
    /// </summary>
    string _splitters;

    /// <summary>
    /// Obtain the set of punctuation characters matched in the input string (duplicates removed)
    /// </summary>
    List<string> GetMatchedSpiltters()
    {
        List<string> matchedSplitters = new List<string>();
        // _splitters must already be escaped for use inside a character class
        Regex regex = new Regex("[" + _splitters + "]");
        foreach (Match item in regex.Matches(_inputStr))
        {
            if (!matchedSplitters.Contains(item.Value))
                matchedSplitters.Add(item.Value);
        }
        return matchedSplitters;
    }

For example, for the input "Example: This is my word splitting software. Thank you for using it!", the resulting matchedSplitters contains each distinct punctuation mark found in the text, such as ":" and "!".

3. Use the punctuation marks matched by the regular expression as markers to remove the punctuation from the text.

    // Requires: using System; using System.Collections.Generic; using System.Linq;

    /// <summary>
    /// Remove punctuation marks from the text to be segmented and separate it into independent sentences (empty strings removed)
    /// </summary>
    List<string> GetStripedPunctuationString()
    {
        List<string> matchedSplitters = GetMatchedSpiltters();
        foreach (string str in matchedSplitters)
        {
            // Replace each punctuation mark with a space, the separator used below
            _inputStr = _inputStr.Replace(str, " ");
        }
        return _inputStr.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries).ToList();
    }

For example, after the above operations, "Example: This is my word splitting software. Thank you for using it!" is separated into the punctuation-free fragments "Example", "This is my word splitting software" and "Thank you for using it". Splitting on spaces works because the original Chinese text contains no spaces between words.

4. First, check whether the fragment to be segmented exists as a whole in the word segmentation dictionary. If so, add it directly to the output set. Otherwise, use the reverse maximum matching algorithm to segment it.


    /// <summary>
    /// Use the reverse matching algorithm to split strings
    /// </summary>
    /// <param name="str">String to be segmented</param>
    List<string> Spiltter(string str)
    {
        // Output result set
        List<string> result = new List<string>();
        if (str.Length == 1)
        {
            result.Add(str);
            return result;
        }
        // Delete the first character
        string initStr = str.Remove(0, 1);
        int l = 0;
        while (!string.IsNullOrEmpty(initStr))
        {
            for (int i = 1; i < str.Length; i++)
            {
                // If it is a single character or exists in the dictionary
                if (initStr.Length == 1 || _dict.Keys.Contains(initStr))
                {
                    // Append the word followed by a backslash separator
                    result.Add(initStr + "\\");
                    l += initStr.Length;
                    ...
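The listing above breaks off in the source, so the remainder of the author's `Spiltter` method is not available. As a hedged, self-contained illustration of reverse maximum matching in general, and not a reconstruction of that method, the algorithm could look like this. The names `ReverseMaxMatch` and `Segment`, and the use of a `HashSet<string>` in place of the author's `_dict`, are assumptions.

```csharp
using System;
using System.Collections.Generic;

static class ReverseMaxMatch
{
    // Segments text by repeatedly taking the longest dictionary word that
    // ends at the current position, working from the end of the text
    // toward the start. Unknown characters become single-character words.
    public static List<string> Segment(string text, ISet<string> dict)
    {
        var result = new List<string>();
        int end = text.Length;
        while (end > 0)
        {
            // Begin with everything before 'end', then drop the first
            // character until the candidate is a word or a single character.
            int start = 0;
            while (end - start > 1 && !dict.Contains(text.Substring(start, end - start)))
            {
                start++;
            }
            result.Add(text.Substring(start, end - start));
            end = start;
        }
        // Words were collected from last to first; restore reading order.
        result.Reverse();
        return result;
    }
}
```

With the hypothetical dictionary { 研究, 研究生, 生命, 起源 }, `ReverseMaxMatch.Segment("研究生命起源", dict)` yields 研究 / 生命 / 起源, whereas matching greedily from the front would take 研究生 first and leave 命 as a stray single character.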
