Many language processing tasks involve pattern matching. Previously we used ' Stsrtswith (str) ' or ' EndsWith (str) ' to find a specific word. But following the introduction of a regular expression, the regular expression is a powerful module, which he does not belong to a particular language, is a powerful language processing tools.
Using regular expressions in Python requires importing the RE module using the import re. You also need a glossary of words to search for. Here again we use the previously used corpus and preprocess it to eliminate some names.
>>>import re>>>wordlist=[w for W in Nltk.corpus.words.words (' en ') if w.islower ()]
1. Use basic meta-characters
Use the regular expression "ed$" to find the words that end with Ed. Use the function Re.search (p,s) to check if there is a pattern p in the string s. Use the dollar sign, which is used in regular expressions to match the end of a word.
>>>print ([w for W wordlist if Re.search (' ed$ ', W)]) [[' abaissed ', ' abandoned ', ' abased ', ' abashed ', ' abatised ' , ' Abed ', ' aborted ', ' abridged ', ' abscessed ', ...]
wildcard character '. ' Used to match any single character. Suppose there is a 8-character crossword, J is the third letter, and T is the sixth letter.
>>>print ([w for W in Wordlist if Re.search (' ^.. T.. T.. $ ', W)]) [' abjectly ', ' adjuster ', ' dejected ', ' dejectly ', ' injector ', ' majestic ', ' objectee ', ' objector ', ' Rejecter ',... ]
Inserts the character ' ^ ' to match the beginning of the string.
2. Range and closures
In the phone input method Lenovo hint, nine Gongge, input sequence 4633 can get hole and golf, can also produce what characters? Use the following regular expression to determine:
>>>print ([w for W-wordlist if Re.search (' ^[ghi][mno][jlk][def]$ ', W)]) [' Gold ', ' golf ', ' hold ', ' hole ']
The ' + ' sign in the regular expression represents ' one or more instances of the preceding item '. ' * ' denotes ' 0 or more instances of the preceding item '. There are other functions when ' ^ ' appears in square brackets at the first character position. For example, "[^aeiou]" matches all letters except the vowel letters.
Here are some examples of other regular expressions. Use some new symbols: |, {}, and |
>>> WSJ = sorted (Set (Nltk.corpus.treebank.words ())) >>> [w for W in WSJ if Re.search (' ^[0-9]+\.[ 0-9]+$ ', W)] [' 0.0085 ', ' 0.05 ', ' 0.1 ', ' 0.16 ', ' 0.2 ', ' 0.25 ', ' 0.28 ', ' 0.3 ', ' 0.4 ', ' 0.5 ', ' 0.50 ', ' 0.54 ', ' 0.56 ', ' 0.60 ' ', ' 0.7 ', ' 0.82 ', ' 0.84 ', ' 0.9 ', ' 0.95 ', ' 0.99 ', ' 1.01 ', ' 1.1 ', ' 1.125 ', ' 1.14 ', ' 1.1650 ', ' 1.17 ', ' 1.18 ', ' 1.19 ', ' 1.2 ', ...] >>> [w for W in WSJ if Re.search (' ^[a-z]+\$$ ', W)] [' C $ ', ' US $ '] >>> [w for W in WSJ if Re.search (' ^[0 -9]{4}$ ', W)] [' 1614 ', ' 1637 ', ' 1787 ', ' 1901 ', ' 1903 ', ' 1917 ', ' 1925 ', ' 1929 ', ' 1933 ', ...] >>> [w for W in WSJ if Re.search (' ^[0-9]+-[a-z]{3,5}$ ', W)] [' 10-day ', ' 10-lap ', ' 10-year ', ' 100-share ', ' 12-poi NT ', ' 12-year ', ...] >>> [w for W in WSJ if Re.search (' ^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$ ', W)] [' black-and-white ', ' bread-and-butter ' , ' father-in-law ', ' machine-gun-toting ', ' Savings-and-loan ') >>> [w for W in WSJ if Re.search (' (ed|ing) $ ', W)] [' 62%-owned ', ' absorbed ', ' AcCording ', ' adopting ', ' advanced ', ' advancing ', ...]
The regular expressions are summarized as follows:
Table 3-3. Basic Regular expression metacharacters, including wildcards, ranges, and closures
Operator |
Behavior |
. |
wildcard character, matching all characters |
^abc |
Matches a string starting with ABC |
abc$ |
Match a string ending with ABC |
[ABC] |
Match Character Set fit |
[A-z0-9] |
Match character Range |
Ed|ing|s |
Matches the specified string |
* |
0 or more of the preceding items (Kleene closures)
|
+ |
One or more of the preceding items |
|
one or 0 of the items in front (optional) |
N |
Repeats n times, N is a non-negative integer |
{N,} |
Repeat at least n times |
{, n} |
Repeat n times at most |
{M,n} |
Repeat more than m times no more than n times |
A (B|C) + |
Parentheses denote the range of operators |
When we use the RE regular expression, we can use the string plus a prefix ' r ' to represent an original string.
Natural language processing 3.4--using regular expressions to detect phrase collocation