Natural Language Processing 3.7: Using Regular Expressions for Text Segmentation



1. Simple word segmentation method:

Splitting text on space characters is the simplest method of text segmentation. Consider the following text from Alice's Adventures in Wonderland:

>>> raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
... though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
... well without--Maybe it's always pepper that makes people hot-tempered,'..."""

You can use raw.split() to split the raw text on space characters, and a regular expression can do the same job. However, matching only the space character in the string is not enough, because the result will then contain tokens with embedded '\n' newline characters. Instead, we need to match any number of spaces, tabs, or newlines at once:

>>> import re
>>> re.split(r' ', raw)
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe',
"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
>>> re.split(r'[ \t\n]+', raw)
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe',
"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]

The regular expression "[ \t\n]+" matches one or more spaces, tabs, or newlines. Other whitespace characters, such as carriage returns and form feeds, should also be covered. Therefore, the re library provides the built-in abbreviation '\s', which matches any whitespace character. The second statement in the preceding example can therefore be rewritten as re.split(r'\s+', raw).
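As a quick illustration of the point above (using a small made-up string rather than the Alice text), a pattern of '\s+' treats any mixed run of spaces, tabs, and newlines as a single separator:

```python
import re

text = "one two\tthree\nfour"

# r'\s+' collapses any run of spaces, tabs, and newlines into one split point,
# so no token ends up with an embedded '\n' or '\t'.
tokens = re.split(r'\s+', text)
print(tokens)  # ['one', 'two', 'three', 'four']
```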

Splitting on whitespace gives us tokens such as "(not" and "herself,". An alternative is to use the character class "\w" provided by Python, which matches word characters and is equivalent to [a-zA-Z0-9_]. Python also defines the complement of this class, "\W", i.e. all characters other than letters, digits, and the underscore. We can use \W in a simple regular expression to split the input on anything other than a word character:

>>> re.split(r'\W+', raw)
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in',
'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without',
'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered',
'']

Notice that there is an empty string at the start and at the end. Using re.findall(r'\w+', raw), with a pattern that matches the words rather than the whitespace, produces the same tokens but without the empty strings.
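The split/findall contrast can be seen on a short fragment of the same text (the fragment choice here is just for brevity):

```python
import re

fragment = "(not in a very hopeful tone)"

# Splitting on non-word characters leaves empty strings at the edges
# whenever the text begins or ends with punctuation.
split_tokens = re.split(r'\W+', fragment)
print(split_tokens)  # ['', 'not', 'in', 'a', 'very', 'hopeful', 'tone', '']

# Matching the words themselves avoids the empty strings entirely.
found_tokens = re.findall(r'\w+', fragment)
print(found_tokens)  # ['not', 'in', 'a', 'very', 'hopeful', 'tone']
```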

The regular expression "\w+|\S\w*" first tries to match any sequence of word characters. If no match is found, it tries to match any non-whitespace character followed by further word characters. This means that punctuation is grouped with any following letters (e.g. 's), but sequences of two or more punctuation characters are separated:

>>> re.findall(r'\w+|\S\w*', raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t",
'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does',
'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that',
'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']

Let us generalize the "\w+" in the expression above to permit word-internal hyphens and apostrophes: "\w+([-']\w+)*". This matches words like hot-tempered and it's. We also need to add a pattern to match quote characters, so that they are kept separate from the text they enclose:

>>> print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I',
"won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup',
'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper',
'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']


2. NLTK's Regular Expression Tokenizer

The function nltk.regexp_tokenize() is similar to re.findall(). However, nltk.regexp_tokenize() is more efficient for tokenization, and avoids the need for special treatment of parentheses. For readability, the regular expression is broken up over several lines, with a comment added to each line. The special (?x) "verbose" flag tells Python to strip out the embedded whitespace and comments.

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
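The "special treatment of parentheses" mentioned above can be seen with plain re.findall: when a pattern contains more than one capturing group, findall returns tuples of the group contents rather than the full matches. One workaround (shown here with the same pattern, minus the verbose comments) is to rewrite every group as non-capturing with (?:...):

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# With capturing groups, re.findall returns one tuple of group matches
# per hit, not the matched token text itself.
capturing = r'([A-Z]\.)+|\w+(-\w+)*|\$?\d+(\.\d+)?%?'
tuple_matches = re.findall(capturing, text)
print(tuple_matches)

# Rewriting every group as non-capturing (?:...) restores whole-match output.
non_capturing = r'(?:[A-Z]\.)+|\w+(?:-\w+)*|\$?\d+(?:\.\d+)?%?|\.\.\.'
whole_matches = re.findall(non_capturing, text)
print(whole_matches)  # ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

nltk.regexp_tokenize() handles this internally, which is why the verbose pattern above can use ordinary parentheses freely.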

When using the verbose flag, you can no longer use ' ' to match a space character; use '\s' instead. regexp_tokenize() also has an optional gaps parameter. When set to True, the regular expression specifies the gaps between tokens, just as with re.split().
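The gaps=True behavior can be sketched with the standard re module alone (nltk is not required for the illustration): splitting on a separator pattern and finding the tokens directly are two views of the same segmentation, which is exactly the contrast gaps toggles in regexp_tokenize().

```python
import re

raw = "'When I'M a Duchess,' she said to herself"

# gaps=True behaves like re.split: the pattern describes the material
# BETWEEN tokens. gaps=False behaves like re.findall: the pattern
# describes the tokens themselves.
between = re.split(r'\s+', raw)    # separators matched, tokens returned
tokens = re.findall(r'\S+', raw)   # tokens matched directly
print(between == tokens)  # True: the two views agree for whitespace
```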

