Natural Language Processing 3.4: Using Regular Expressions to Detect Word Patterns

Source: Internet
Author: User
Tags: nltk

Many language processing tasks involve pattern matching. Previously we used 'startswith(str)' or 'endswith(str)' to find specific words. Regular expressions give us a more powerful and flexible way to describe the character patterns we are interested in. They are not tied to any particular programming language, which makes them a general-purpose tool for language processing.

To use regular expressions in Python, import the re module with 'import re'. We also need a list of words to search. Here we again use the word corpus from earlier, keeping only the lowercase entries in order to filter out proper names.

>>> import re
>>> import nltk
>>> wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

1. Using basic metacharacters

Use the regular expression "ed$" to find the words that end with "ed". The function re.search(p, s) checks whether the pattern p occurs anywhere in the string s. The dollar sign has a special meaning in regular expressions: it matches the end of the string, which here is the end of the word.

>>> print([w for w in wordlist if re.search('ed$', w)])
['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', 'abridged', 'abscessed', ...]
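As a small self-contained check, the sketch below applies the same pattern to a hand-picked word list (an assumption for illustration, standing in for the NLTK corpus) and compares the result with the 'endswith' approach used earlier.

```python
import re

# Hypothetical sample list standing in for the NLTK word list.
sample_words = ['abandoned', 'editor', 'bed', 'redo', 'walked', 'edge']

# '$' anchors the match at the end of the string.
with_regex = [w for w in sample_words if re.search('ed$', w)]

# The same filter expressed with the string method used earlier.
with_endswith = [w for w in sample_words if w.endswith('ed')]

print(with_regex)
print(with_regex == with_endswith)
```

For a fixed suffix the two approaches select the same words; the regular expression becomes the better choice once the pattern is more than a literal prefix or suffix.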

The wildcard character '.' matches any single character. Suppose we are solving a crossword and need an 8-letter word with j as its third letter and t as its sixth letter.

>>> print([w for w in wordlist if re.search('^..j..t..$', w)])
['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', 'objectee', 'objector', 'rejecter', ...]

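The caret '^' at the start of the pattern matches the beginning of the string, just as '$' matches the end. Without these two anchors, the pattern would also match longer words that merely contain such an 8-letter sequence.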
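The effect of the anchors can be seen in a short sketch. The word list here is a hypothetical sample chosen for illustration, not output from the corpus.

```python
import re

# Hypothetical sample list; only the first two entries have exactly 8 letters.
sample_words = ['abjectly', 'majestic', 'conjecture', 'objectionable']

# Anchored: the whole word must be 8 characters with j third and t sixth.
anchored = [w for w in sample_words if re.search('^..j..t..$', w)]

# Unanchored: any 8-character stretch inside the word may match.
unanchored = [w for w in sample_words if re.search('..j..t..', w)]

print(anchored)
print(unanchored)
```

The unanchored version also accepts the longer words, because re.search looks for the pattern anywhere in the string.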

2. Ranges and closures

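Predictive text input on a nine-key phone keypad (T9) maps each digit to several letters, so one key sequence can stand for several words. For example, the sequence 4653 produces both hole and golf. What other words does the same sequence produce? The following regular expression answers the question: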

>>> print([w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)])
['gold', 'golf', 'hold', 'hole']
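The same idea can be generalized to any key sequence. The sketch below is one possible way to do it: the helper names t9_pattern and t9_matches, the KEYPAD table, and the sample word list are all assumptions introduced for illustration.

```python
import re

# Standard phone keypad letter groups (digits 2-9).
KEYPAD = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
          '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}

def t9_pattern(digits):
    """Build an anchored regex with one character class per key press."""
    return '^' + ''.join('[' + KEYPAD[d] + ']' for d in digits) + '$'

def t9_matches(digits, words):
    """Return the words that the given key sequence could spell."""
    pattern = re.compile(t9_pattern(digits))
    return [w for w in words if pattern.search(w)]

# Hypothetical sample list standing in for the NLTK word list.
sample_words = ['gold', 'golf', 'hold', 'hole', 'home', 'golfer']

print(t9_pattern('4653'))
print(t9_matches('4653', sample_words))
```

Each digit becomes one bracketed character set, so the pattern length always equals the number of key presses.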

The '+' symbol in a regular expression means 'one or more instances of the preceding item', and '*' means 'zero or more instances of the preceding item'. The '^' has a different function when it appears as the first character inside square brackets: it negates the set. For example, "[^aeiou]" matches any character other than a lowercase vowel.
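A brief sketch of these three operators follows, using a made-up word list (an assumption for illustration only).

```python
import re

# Hypothetical sample list chosen to show the difference between the operators.
sample_words = ['mine', 'mmmmine', 'ine', 'hmm', 'grr', 'tree', 'sky']

# '+' requires at least one 'm' before 'ine'.
one_or_more = [w for w in sample_words if re.search('^m+ine$', w)]

# '*' also accepts zero occurrences of 'm'.
zero_or_more = [w for w in sample_words if re.search('^m*ine$', w)]

# '[^aeiou]+' between anchors: words made up entirely of non-vowel characters.
no_vowels = [w for w in sample_words if re.search('^[^aeiou]+$', w)]

print(one_or_more)
print(zero_or_more)
print(no_vowels)
```

Note that "[^aeiou]" is not limited to consonants: it would also match digits, punctuation, or uppercase letters.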

Here are some more examples of regular expressions, which use a few new symbols: \, {}, (), and |

>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> [w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)]
['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5', '0.50', '0.54', '0.56', '0.60', '0.7', '0.82', '0.84', '0.9', '0.95', '0.99', '1.01', '1.1', '1.125', '1.14', '1.1650', '1.17', '1.18', '1.19', '1.2', ...]
>>> [w for w in wsj if re.search(r'^[A-Z]+\$$', w)]
['C$', 'US$']
>>> [w for w in wsj if re.search('^[0-9]{4}$', w)]
['1614', '1637', '1787', '1901', '1903', '1917', '1925', '1929', '1933', ...]
>>> [w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]
['10-day', '10-lap', '10-year', '100-share', '12-point', '12-year', ...]
>>> [w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]
['black-and-white', 'bread-and-butter', 'father-in-law', 'machine-gun-toting', 'savings-and-loan']
>>> [w for w in wsj if re.search('(ed|ing)$', w)]
['62%-owned', 'absorbed', 'according', 'adopting', 'advanced', 'advancing', ...]

The regular expressions are summarized as follows:

Table 3-3. Basic Regular expression metacharacters, including wildcards, ranges, and closures

Operator    Behavior
.           Wildcard; matches any single character
^abc        Matches the pattern abc at the start of a string
abc$        Matches the pattern abc at the end of a string
[abc]       Matches one of a set of characters
[A-Z0-9]    Matches one of a range of characters
ed|ing|s    Matches one of the specified strings (disjunction)
*           Zero or more of the preceding item (Kleene closure)
+           One or more of the preceding item
?           Zero or one of the preceding item (optional)
{n}         Exactly n repeats, where n is a non-negative integer
{n,}        At least n repeats
{,n}        No more than n repeats
{m,n}       At least m and no more than n repeats
a(b|c)+     Parentheses indicate the scope of the operators
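A few of the operators in the table can be tried out in isolation. The sketch below uses a hypothetical list of strings chosen for illustration.

```python
import re

# Hypothetical sample strings, one or two per operator being shown.
sample = ['color', 'colour', 'colouur', 'abcbc', 'a', '2024', '12345']

# '?' makes the preceding 'u' optional.
optional_u = [s for s in sample if re.search('^colou?r$', s)]

# '{4}' requires exactly four digits.
four_digits = [s for s in sample if re.search('^[0-9]{4}$', s)]

# Parentheses limit '|' to 'b' or 'c', and '+' repeats the whole group.
grouped = [s for s in sample if re.search('^a(b|c)+$', s)]

print(optional_u)
print(four_digits)
print(grouped)
```

Without the parentheses, 'ab|c+' would mean "'ab', or one or more 'c'", which is a different pattern.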

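When we write patterns for the re module, we can prefix the string with 'r' to make it a raw string. In a raw string, Python leaves backslashes untouched and passes them straight to the regular expression engine. For example, the raw string r'\band\b' contains two '\b' sequences that re interprets as word boundaries, whereas in the ordinary string '\band\b' Python would first convert each '\b' into a backspace character.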
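The difference is easy to observe directly. This is a minimal sketch; the sample sentence is made up for illustration.

```python
import re

text = 'bring sand and a band'

# Raw string: '\b' reaches the re module as a word-boundary marker,
# so only the standalone word 'and' is found.
raw_hits = re.findall(r'\band\b', text)

# Plain string: Python turns '\b' into a backspace character (\x08) first,
# so the pattern looks for backspace + 'and' + backspace.
plain_hits = re.findall('\band\b', text)

print(raw_hits)
print(plain_hits)

# The raw form keeps two characters (backslash and 'b'); the plain form has one.
print(len(r'\b'), len('\b'))
```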
