0. Background
I have recently been working on sentence alignment and word alignment for a parallel corpus, and getting the word alignment right requires splitting the text into sentences first.
My first version was a method built on regular expressions plus quotation-mark handling, with one small trick in the middle. It turned out simple and fairly general, so I want to share this small piece of code.
1. Principle
In some cases punctuation is itself a useful feature, so the goal here is to split as correctly as possible without throwing it away.
The main issues to consider include:
- Delimiter retention
- Sentences inside quotation marks
- Several punctuation marks in a row at the same position
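To make the first and third points concrete, here is a minimal sketch that keeps each delimiter attached to its sentence and treats a run of terminators as one unit. It uses a consuming pattern with `re.findall`, which is a different technique from the function in section 2; the helper name `split_keep_delimiters` and the terminator set 。？！ are assumptions chosen for illustration. It does nothing about quotation marks.

```python
import re

# Hypothetical helper, not part of the original code.
# First alternative: optional text followed by a run of terminators.
# Second alternative: trailing text that has no terminator at all.
SENTENCE = re.compile(r'[^。？！]*[。？！]+|[^。？！]+')

def split_keep_delimiters(text):
    """Return sentences with their ending punctuation still attached."""
    return SENTENCE.findall(text)

print(split_keep_delimiters('你好吗？！我很好。再见'))
# ['你好吗？！', '我很好。', '再见']
```

The run `？！` stays with its sentence instead of producing an extra piece, and the unterminated tail is kept rather than dropped.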
Once I had decided not to split inside quotation marks, a small trick made the idea very clear:
Save each quoted or bracketed span as a whole in a queue, and leave a placeholder flag in its place.
After the splitting is done, swap the placeholders back for the saved spans, in order.
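The placeholder idea can be sketched on its own, separate from the splitting step. This is only an illustration under stated assumptions: the names `protect` and `restore` are hypothetical, only two quote styles are covered, and the placeholder `$PACK$` is assumed never to occur in the input.

```python
import re
from collections import deque

PLACEHOLDER = '$PACK$'  # assumed absent from the input text
QUOTED = re.compile(r'“.+?”|".+?"')  # two quote styles only, for illustration

def protect(text):
    """Swap each quoted span for PLACEHOLDER; return the masked text and the spans in order."""
    saved = deque(QUOTED.findall(text))
    # A callable replacement avoids any escape processing of the placeholder.
    return QUOTED.sub(lambda m: PLACEHOLDER, text), saved

def restore(text, saved):
    """Put the saved spans back, filling the placeholders from left to right."""
    return re.sub(re.escape(PLACEHOLDER), lambda m: saved.popleft(), text)

text = '他说：“走吧。我们回家。”然后就走了。'
masked, saved = protect(text)
print(masked)  # 他说：$PACK$然后就走了。
print(restore(masked, saved) == text)  # True
```

Because the masked text no longer contains the punctuation inside the quotes, any splitter run between `protect` and `restore` cannot cut there.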
2. Code
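Note that a zero-width regular expression (a lookbehind plus a lookahead) is used as the split marker. On the Python version I used, re.split() cannot split on a pattern that only matches the empty string and raises ValueError (splitting on empty matches is supported from Python 3.7 on), so the code below walks the string with re.search() instead.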
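As a rough sketch of how to cut at zero-width matches without `re.split()`, the boundaries can be collected with `finditer` and the string sliced at those offsets. The function name `split_at_boundaries` is hypothetical, and the terminator set 。？！ is an assumption; this variant should behave the same on old and new Python 3 versions because it never asks `re.split()` to handle an empty match.

```python
import re

# Zero-width boundary: preceded by a terminator, not followed by another one.
BOUNDARY = re.compile(r'(?<=[。？！])(?![。？！])')

def split_at_boundaries(text):
    """Slice text at every zero-width boundary match."""
    pieces, start = [], 0
    for m in BOUNDARY.finditer(text):
        end = m.start()
        if end > start:  # skip empty slices
            pieces.append(text[start:end])
            start = end
    if start < len(text):  # tail without ending punctuation
        pieces.append(text[start:])
    return pieces

print(split_at_boundaries('你好。再见！！'))
# ['你好。', '再见！！']
```

The negative lookahead is what keeps a run such as `！！` together: the position between the two marks is not a boundary.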
import re


def my_split(string):
    """Split text into sentences without cutting inside quotes or brackets.

    Quoted or bracketed spans are saved whole in a queue and put back after
    splitting. Ellipses are not handled yet.
    TODO: consider reported speech, e.g. 'xxx：“xxx。”xx，xxxx。',
    which could still be split.
    """
    split_sign = '%%%%'  # the input itself must not contain this separator
    sign = '$PACK$'  # placeholder that stands in for a protected span
    search_pattern = re.compile(re.escape(sign))
    # Paired quotes and brackets whose contents stay in one piece;
    # adjust the pairs to your corpus.
    pack_pattern = re.compile(
        r'(“.+?”|（.+?）|《.+?》|〈.+?〉|\[.+?\]|【.+?】|‘.+?’|「.+?」|『.+?』|".+?"|\'.+?\')'
    )
    pack_queue = re.findall(pack_pattern, string)
    string = re.sub(pack_pattern, lambda m: sign, string)

    # Zero-width boundary: right after a run of sentence-ending punctuation.
    pattern = re.compile(r'(?<=[。？！])(?![。？！])')
    result = []
    while string != '':
        s = re.search(pattern, string)
        if s is None:
            # No boundary left: the rest is the last sentence.
            result.append(string)
            break
        loc = s.span()[0]
        result.append(string[:loc])
        string = string[loc:]

    # Join with the separator, restore the protected spans in order, then split.
    result_string = split_sign.join(result)
    while pack_queue:
        pack = pack_queue.pop(0)
        loc = re.search(search_pattern, result_string).span()
        result_string = result_string[:loc[0]] + pack + result_string[loc[1]:]
    return result_string.split(split_sign)
Reference
Sentence splitting for Chinese in Python
GitHub address (I did not delete the clumsy first version. It keeps reminding me of some algorithm problem, but I cannot remember which one.)
Sentence splitting for Chinese and Japanese corpora in Python 3