I needed sentence segmentation for a project, so I wrote a regular expression for it that takes various special cases into account.
using System.Text.RegularExpressions;
public static Regex uselesspunctionregex = new Regex(@"'(?!(s|t|re|m)( |$))|\.$|\. |\.{2,}|`|~|!|@|#|\$|%|\^|\*|\(|\)|(^|[^\w])-+|-+($|[^\w])|_|=|\+|\[|\]|\{|\}|<|>|\||\\|/|;|:|""|•|–|,|\?|×|!|·|…|—|(|)|、|:|‘|“|,|。|?");
For example, the following text is split correctly:
@"What's your name? My name is Han Mei-mei. I am from U.S.! Nice to meet you! "
It is split into:
What's your name
My name is Han Mei-mei
I am from U.S.
Nice to meet you
Punctuation that belongs to a word, as in what's, Mei-mei, and U.S., is not treated as a delimiter.
The pattern covers every punctuation mark I could type on the keyboard.
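The rules described above can be sketched with Regex.Split. This is only a minimal sketch: the pattern is a reduced stand-in for the full expression, and the SentenceSplitter class and its Split method are illustrative names, not part of the original code. It uses non-capturing groups, (?:...), because Regex.Split adds the text of capturing groups to its result.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    // Reduced, illustrative pattern (an assumption, not the full expression).
    private static readonly Regex Separator = new Regex(
        @"'(?!(?:s|t|re|m)(?: |$))" +   // apostrophe, unless it starts 's, 't, 're or 'm
        @"|\.(?: |$)|\.{2,}" +          // period before a space or at the end, or a run of periods
        @"|[!?,;:]" +                   // other common ASCII separators
        @"|(?:^|\W)-+|-+(?:$|\W)");     // hyphens that are not between two word characters

    // Splits the text on the separators, trims each piece, and drops empty pieces.
    public static string[] Split(string text)
    {
        return Separator.Split(text)
            .Select(part => part.Trim())
            .Where(part => part.Length > 0)
            .ToArray();
    }
}
```

For the input "What's your name? My name is Han Mei-mei. I am from U.S.! Nice to meet you! ", SentenceSplitter.Split returns four pieces: "What's your name", "My name is Han Mei-mei", "I am from U.S.", and "Nice to meet you".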
Note: This regular expression does not recognize a period as a separator when no space follows it. For example,
Nice to meet you.Nice to meet you,too.
is only split into
Nice to meet you.Nice to meet you
too
To solve this problem, I recommend first running the text through the OpenNLP tool, which contains a class named
EnglishMaximumEntropySentenceDetector
It has a method named
SentenceDetect()
which separates sentences that end with a terminating character (period, question mark, or ellipsis).
Then apply the regular expression above to each detected sentence for a finer split.
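The two-stage approach could look roughly like the sketch below. TwoStageSplitter, detectSentences, and separator are hypothetical names introduced for illustration. The detector is passed in as a delegate so the sketch does not depend on a particular library. It assumes, without the text confirming it, that the detector's SentenceDetect method takes a string and returns a string array, in which case that method could be supplied as detectSentences.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoStageSplitter
{
    // Stage 1: detectSentences cuts the text into coarse sentences.
    // Stage 2: separator splits each sentence further; empty pieces are dropped.
    public static List<string> Split(
        string text,
        Func<string, string[]> detectSentences,
        Regex separator)
    {
        var result = new List<string>();
        foreach (string sentence in detectSentences(text))
        {
            foreach (string part in separator.Split(sentence))
            {
                string trimmed = part.Trim();
                if (trimmed.Length > 0)
                {
                    result.Add(trimmed);
                }
            }
        }
        return result;
    }
}
```

Running the detector first means a period with no following space is already handled before the regular expression sees the text, so the expression only has to deal with punctuation inside each sentence.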