Natural Language Processing 3.7: Using Regular Expressions for Text Segmentation



1. Simple word segmentation method:

Splitting text on space characters is the simplest method of text segmentation. Consider the following text from Alice's Adventures in Wonderland:

>>> raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
... though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
... well without--Maybe it's always pepper that makes people hot-tempered,'..."""

You can use raw.split() to split the raw text on space characters, and a regular expression can do the same job. However, matching only the space character in the string is not enough, because the result will then contain tokens with embedded '\n' newline characters. Instead, we need to match any number of spaces, tabs, or newlines at once:

>>> import re
>>> re.split(r' ', raw)
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe',
"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
>>> re.split(r'[ \t\n]+', raw)
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe',
"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]

The regular expression "[ \t\n]+" matches one or more spaces, tabs, or newlines. Other whitespace characters, such as carriage returns and form feeds, should also be covered. Therefore, the re library provides the built-in abbreviation '\s', which matches any whitespace character. The second statement in the preceding example can therefore be rewritten as re.split(r'\s+', raw).
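As a quick illustration of the point above (using a small made-up string rather than the Alice text), a pattern of '\s+' treats any mixed run of spaces, tabs, and newlines as a single separator:

```python
import re

text = "one two\tthree\nfour"

# r'\s+' collapses any run of spaces, tabs, and newlines into one split point,
# so no token ends up with an embedded '\n' or '\t'.
tokens = re.split(r'\s+', text)
print(tokens)  # ['one', 'two', 'three', 'four']
```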

Splitting on whitespace gives us tokens such as "(not" and "herself,". An alternative is to use the character class "\w" provided by Python, which matches word characters and is equivalent to [a-zA-Z0-9_]. Python also defines the complement of this class, "\W", i.e. all characters other than letters, digits, and the underscore. We can use \W in a simple regular expression to split the input on anything other than a word character:

>>> re.split(r'\W+', raw)
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in',
'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without',
'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered',
'']

Notice that there is an empty string at the start and at the end. Using re.findall(r'\w+', raw), with a pattern that matches the words rather than the whitespace, produces the same tokens but without the empty strings.
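The split/findall contrast can be seen on a short fragment of the same text (the fragment choice here is just for brevity):

```python
import re

fragment = "(not in a very hopeful tone)"

# Splitting on non-word characters leaves empty strings at the edges
# whenever the text begins or ends with punctuation.
split_tokens = re.split(r'\W+', fragment)
print(split_tokens)  # ['', 'not', 'in', 'a', 'very', 'hopeful', 'tone', '']

# Matching the words themselves avoids the empty strings entirely.
found_tokens = re.findall(r'\w+', fragment)
print(found_tokens)  # ['not', 'in', 'a', 'very', 'hopeful', 'tone']
```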

The regular expression "\w+|\S\w*" first tries to match any sequence of word characters. If no match is found, it tries to match any non-whitespace character followed by further word characters. This means that punctuation is grouped with any following letters (e.g. 's), but sequences of two or more punctuation characters are separated:

>>> re.findall(r'\w+|\S\w*', raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t",
'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does',
'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that',
'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']

Let us generalize the "\w+" in the expression above to permit word-internal hyphens and apostrophes: "\w+([-']\w+)*". This matches words like hot-tempered and it's. We also need to add a pattern to match quote characters, so that they are kept separate from the text they enclose:

>>> print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I',
"won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup',
'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper',
'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']


2. NLTK's Regular Expression Tokenizer

The function nltk.regexp_tokenize() is similar to re.findall(). However, nltk.regexp_tokenize() is more efficient for tokenization, and avoids the need for special treatment of parentheses. For readability, the regular expression is broken up over several lines, with a comment added to each line. The special (?x) "verbose" flag tells Python to strip out the embedded whitespace and comments.

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
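The "special treatment of parentheses" mentioned above can be seen with plain re.findall: when a pattern contains more than one capturing group, findall returns tuples of the group contents rather than the full matches. One workaround (shown here with the same pattern, minus the verbose comments) is to rewrite every group as non-capturing with (?:...):

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# With capturing groups, re.findall returns one tuple of group matches
# per hit, not the matched token text itself.
capturing = r'([A-Z]\.)+|\w+(-\w+)*|\$?\d+(\.\d+)?%?'
tuple_matches = re.findall(capturing, text)
print(tuple_matches)

# Rewriting every group as non-capturing (?:...) restores whole-match output.
non_capturing = r'(?:[A-Z]\.)+|\w+(?:-\w+)*|\$?\d+(?:\.\d+)?%?|\.\.\.'
whole_matches = re.findall(non_capturing, text)
print(whole_matches)  # ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

nltk.regexp_tokenize() handles this internally, which is why the verbose pattern above can use ordinary parentheses freely.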

When using the verbose flag, you can no longer use ' ' to match a space character; use '\s' instead. regexp_tokenize() also has an optional gaps parameter. When set to True, the regular expression specifies the gaps between tokens, just as with re.split().
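The gaps=True behavior can be sketched with the standard re module alone (nltk is not required for the illustration): splitting on a separator pattern and finding the tokens directly are two views of the same segmentation, which is exactly the contrast gaps toggles in regexp_tokenize().

```python
import re

raw = "'When I'M a Duchess,' she said to herself"

# gaps=True behaves like re.split: the pattern describes the material
# BETWEEN tokens. gaps=False behaves like re.findall: the pattern
# describes the tokens themselves.
between = re.split(r'\s+', raw)    # separators matched, tokens returned
tokens = re.findall(r'\S+', raw)   # tokens matched directly
print(between == tokens)  # True: the two views agree for whitespace
```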

