Sentence Splitting for Chinese and Japanese Corpora in Python 3

Source: Internet
Author: User

0. Background

I have recently been working on sentence alignment and word alignment for a parallel corpus, and getting the alignment right requires splitting the text into sentences first.
I started with a method built on regular expressions and quotation marks and used a small trick along the way. The result is fairly simple and general, so I want to share this short piece of code.

1. Principle

Punctuation is often a good feature in its own right, so the goal here is to split on it as correctly as possible.
The main issues to consider are:

    • Keeping the delimiters
    • Sentences inside quotation marks
    • Several punctuation marks in a row
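
To make these three issues concrete, here is a small sketch of what a naive split on sentence-ending punctuation does. The helper name naive_split and the sample sentence are made up for illustration; they are not part of the original code.

```python
import re

def naive_split(text):
    """Split on every sentence-ending mark (illustrative helper only)."""
    # re.split discards the marks it splits on; empty pieces are filtered out.
    return [piece for piece in re.split('[。？！]', text) if piece]

sample = '他说：“你好。再见！”然后走了？！真的。'
print(naive_split(sample))
# ['他说：“你好', '再见', '”然后走了', '真的']
```

All three problems show up at once: the delimiters are gone, the quoted speech is cut in two, and the run "？！" yields an empty piece that has to be filtered out.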

Once I decided not to split inside quotation marks, a small trick keeps the logic clear:
each quoted or bracketed span is saved as a whole in a queue and replaced in the text by a placeholder.
The saved spans are put back after the splitting is done.
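
The placeholder idea can be sketched on its own, separately from the splitting. This is a minimal sketch under a few assumptions: the names protect, restore, PLACEHOLDER and QUOTED are hypothetical, only two quote pairs are covered for brevity, and the placeholder string must not occur in the input.

```python
import re
from collections import deque

PLACEHOLDER = '$PACK$'                  # assumed not to occur in the input text
QUOTED = re.compile(r'“.+?”|「.+?」')    # only two quote pairs, for brevity

def protect(text):
    """Replace each quoted span with PLACEHOLDER; return (masked_text, queue)."""
    queue = deque(QUOTED.findall(text))
    masked = QUOTED.sub(lambda m: PLACEHOLDER, text)
    return masked, queue

def restore(masked, queue):
    """Put the saved spans back, left to right, in the order they were removed."""
    return re.sub(re.escape(PLACEHOLDER), lambda m: queue.popleft(), masked)

text = '她问：“你来吗？来吧！”他答：「行く。」好。'
masked, queue = protect(text)
print(masked)   # 她问：$PACK$他答：$PACK$好。
```

Because the spans are queued in reading order and the placeholders are replaced in reading order, a first-in, first-out queue is enough to put every span back in its place.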

2. Code

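Note that the split marker is a zero-width regular expression. On the Python version used when this was written, re.split() could not split on such a pattern and raised a ValueError (Python 3.7 and later accept it), so the code searches for each boundary and slices the string by hand.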
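
One way to cut at zero-width boundaries without re.split() is to iterate over the matches and slice the string. The sketch below assumes the terminators 。？！ and uses hypothetical names (BOUNDARY, split_at_boundaries); it illustrates the idea and is not the author's exact loop.

```python
import re

# Zero-width boundary: right after a terminator and not before another one,
# so a run such as "？！" stays in one piece.
BOUNDARY = re.compile(r'(?<=[。？！])(?![。？！])')

def split_at_boundaries(text):
    """Cut text at every BOUNDARY match, keeping punctuation with its sentence."""
    pieces, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.start()
        if end > start:
            pieces.append(text[start:end])
            start = end
    if start < len(text):
        pieces.append(text[start:])   # trailing text without a terminator
    return pieces

print(split_at_boundaries('今天下雨了。你带伞了吗？！没有'))
# ['今天下雨了。', '你带伞了吗？！', '没有']
```

The lookbehind keeps each delimiter attached to the sentence it ends, and the negative lookahead handles several punctuation marks in a row.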

    import re


    def my_split(string):
        """Split text into sentences, keeping quoted spans intact.

        Text inside quotation marks or brackets is treated as a whole: it is
        saved in a queue, replaced by a placeholder, and put back afterwards.
        Ellipses are not handled yet.

        TODO: consider splitting inside quoted speech; for example,
        'xxx：“xxx。xx，xxxx。”' could still be divided.
        """
        split_sign = '%%%%'  # the input must not already contain this separator

        # Placeholder that stands in for each protected span
        sign = '$PACK$'
        search_pattern = re.compile(re.escape(sign))
        # Paired quotation marks and brackets to protect (this set is an
        # assumption; adjust it to your corpus). Square brackets are escaped
        # so they match literally instead of forming a character class.
        pack_pattern = re.compile(
            r'(“.+?”|（.+?）|《.+?》|〈.+?〉|\[.+?\]|【.+?】|‘.+?’|「.+?」|『.+?』|".+?"|\'.+?\')'
        )
        pack_queue = re.findall(pack_pattern, string)
        string = re.sub(pack_pattern, sign, string)

        # Zero-width boundary: after 。？！ and not before another terminator
        pattern = re.compile('(?<=[。？！])(?![。？！])')
        result = []
        while string != '':
            s = re.search(pattern, string)
            if s is None:
                result.append(string)
                break
            loc = s.span()[0]
            result.append(string[:loc])
            string = string[loc:]

        # Restore the protected spans in order, then split on the separator
        result_string = split_sign.join(result)
        while pack_queue:
            pack = pack_queue.pop(0)
            loc = re.search(search_pattern, result_string).span()
            result_string = result_string[:loc[0]] + pack + result_string[loc[1]:]

        return result_string.split(split_sign)
