Regular expressions and text mining

Last Update:2016-12-05 Source: Internet

Author: User


In text mining, the wildcard characters available in T-SQL are often not powerful enough. In that case, combining the CLR with regular expressions is a very good choice. Regular expressions look complex, but they all rest on the same small set of rules: once you are proficient with the metacharacters, you can use regular expressions skillfully and flexibly to carry out complex text-mining work.

One, special characters in regular expressions

1, common metacharacters

Metacharacters match specific kinds of characters (letters, digits, symbols). Note that the letters are case-sensitive, for example, \d and \D mean different things:

. : matches any character except a line break
\w: matches a letter, digit, underscore, or Chinese character
\s: matches any whitespace character
\d: matches a digit
\b: matches the beginning or end of a word
^: matches the start of the string
$: matches the end of the string
\k<group_name>: references a group by name, for example: \k<group_name> refers to the group named group_name
\group_number: references a group by number, where group_number is the group number, for example: \1, \2, and so on
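
The metacharacters above can be tried out quickly. The following is a minimal sketch using Python's re module as a stand-in for the CLR regular expression engine; the sample strings are made up for illustration, and the metacharacters shown behave the same way in both flavors for these inputs.

```python
import re

# \d matches one digit; findall returns every non-overlapping match
digits = re.findall(r"\d", "a1b22")

# \w matches letters, digits, underscore and, for str patterns, Unicode
# letters such as Chinese characters
words = re.findall(r"\w+", "user_1 北京")

# \s matches a whitespace character (space, tab, ...)
parts = re.split(r"\s", "a b\tc")

# \b anchors at a word boundary, so "cat" inside "concat" is skipped
whole_words = re.findall(r"\bcat\b", "cat concat cat")

# ^ and $ anchor the pattern at the start and end of the string
three_digits = re.search(r"^\d\d\d$", "123") is not None
not_three_digits = re.search(r"^\d\d\d$", "1234") is not None

# . matches any character except a line break
dot = re.findall(r"a.c", "abc a\nc a-c")

print(digits, words, parts, whole_words, three_digits, not_three_digits, dot)
```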
2, repeating characters or groups

Specify the number of times the preceding character or group repeats:

*: repeat 0 or more times
+: repeat 1 or more times
?: repeat 0 or 1 time
{n}: repeat exactly n times
{n,}: repeat n or more times
{n,m}: repeat n to m times
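
To see how the repetition characters differ, the sketch below tests each one against the same four sample strings using Python's re module. The samples list and the matching helper are hypothetical names introduced only for this illustration.

```python
import re

# Sample strings with 0, 1, 2 and 3 occurrences of "b" between "a" and "c"
samples = ["ac", "abc", "abbc", "abbbc"]


def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]


star = matching(r"ab*c")         # b repeated 0 or more times
plus = matching(r"ab+c")         # b repeated 1 or more times
optional = matching(r"ab?c")     # b repeated 0 or 1 time
exact = matching(r"ab{2}c")      # b repeated exactly 2 times
at_least = matching(r"ab{2,}c")  # b repeated 2 or more times
between = matching(r"ab{1,2}c")  # b repeated 1 to 2 times

print(star, plus, optional, exact, at_least, between)
```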
3, grouping, escaping, branching, character lists

These characters have specific meanings and uses:

(): parentheses define a group
<>: angle brackets define a group name; the string between < and > is the group name
\: escape character, turns a special character into an ordinary character, for example: \( matches a literal "(", so the parenthesis is no longer treated as a special character
| : branch, the expressions on either side are in an "or" relationship
[]: specifies a list of allowed characters inside the brackets; one character of the text must match any one character in the list, for example: [aeiou] matches one character that must be one of a, e, i, o, u;
[^]: specifies a list of excluded characters inside the brackets; one character of the text must not be any character in the list, for example: [^aeiou] matches one character that is none of a, e, i, o, u;
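
The special characters in this list can be combined in small patterns. The sketch below uses Python's re module and made-up sample strings; it only illustrates the behavior described above.

```python
import re

# \( and \) match literal parentheses; the unescaped ( ) form a group
m = re.search(r"\((\d+)\)", "call f(42) now")
inside = m.group(1)

# | separates branches: either side may match
pets = re.findall(r"cat|dog", "cat, bird, dog")

# [aeiou] matches one character that is any one of the listed vowels
vowels = re.findall(r"[aeiou]", "regex")

# [^aeiou] matches one character that is none of the listed vowels
non_vowels = re.findall(r"[^aeiou]", "regex")

print(inside, pets, vowels, non_vowels)
```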
Two, group references

A group is a sub-expression enclosed in parentheses. A group reference reuses, later in the same expression, the text that a group matched, which makes the regular expression more concise. By default, the regular expression engine automatically assigns a number to each group: numbering starts at 1 (it is 1-based) and increases by 1 from left to right, so the first group is number 1, the second group is number 2, and so on.

There are three ways to define a group:

(exp): numbered group; the group number is assigned automatically, and the group is referenced by that number;
(?<name>exp): named group; the group is referenced by its name;
(?:exp): non-capturing group; it only matches text at the current position, cannot be referenced afterwards, and has neither a group name nor a group number;
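
As a small illustration of automatic group numbering, the sketch below (Python's re module, with a made-up date string) reads each numbered group from a match. Group 0 is the whole match; groups 1, 2 and 3 follow the order of the opening parentheses from left to right.

```python
import re

# Three groups, numbered 1 to 3 from left to right
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Last update: 2016-12-05")

whole = m.group(0)  # the entire matched text
year = m.group(1)   # first group
month = m.group(2)  # second group
day = m.group(3)    # third group

print(whole, year, month, day)
```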
1, referencing a group by number

Once a group (exp) has been defined in a regular expression, later parts of the expression can refer to it by its group number. The syntax for the reference is: \group_number;

For example: \b(\w+)\b\s+\1\b. This regular expression contains only one group, (\w+), whose group number is 1. After the group, \1 refers back to it and matches the same text that the group captured, so the expression finds a word that is immediately repeated, such as "the the". Note that \1 repeats the captured text, not the sub-expression: the pattern is not equivalent to \b(\w+)\b\s+(\w+)\b, which would match any two consecutive words.
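
The sketch below runs the numbered back-reference pattern in Python's re module on a made-up sentence. For comparison it also runs a pattern with a second (\w+) group in place of \1, to show what each form matches on this input.

```python
import re

text = "This is is a test of the the parser"

# \1 must match the same text that group 1 captured
repeated_word = re.compile(r"\b(\w+)\b\s+\1\b")
repeated = [m.group(1) for m in repeated_word.finditer(text)]

# A second (\w+) group matches any word, not the captured text
two_words = re.compile(r"\b(\w+)\b\s+(\w+)\b")
first_pair = two_words.search(text).group(0)

print(repeated)    # words that occur twice in a row
print(first_pair)  # simply the first two words of the text
```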

2, referencing a group by name

In a regular expression you can give a group a name. The format of a named group is (?<name>exp), where name is the group name, and the format for referencing the group by name is \k<name>. Referencing a group by name and referencing it by number behave identically when matching text.

For example: \b(?<word>\w+)\b\s+\k<word>\b. After the group, \k<word> refers back to the group named word and matches the same text that the group captured, so this expression behaves exactly like \b(\w+)\b\s+\1\b.
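
A named back-reference can be sketched in Python's re module as follows. Note the syntax difference: Python writes a named group as (?P<name>exp) and refers to it with (?P=name), whereas the CLR flavor described in this article uses (?<name>exp) and \k<name>. The group name word and the sample sentence are illustrative assumptions.

```python
import re

# Python syntax: (?P<word>...) defines the group, (?P=word) refers back to it
pattern = re.compile(r"\b(?P<word>\w+)\b\s+(?P=word)\b")

m = pattern.search("Paris in the the spring")

by_name = m.group("word")  # look the group up by its name
by_number = m.group(1)     # a named group still receives a number

print(m.group(0), by_name, by_number)
```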

3, groups that cannot be referenced

(?:exp): a group defined with this syntax cannot be referenced. It only matches text at the current position, and the regular expression engine does not assign it a group number.
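
The effect of (?:exp) on group numbering can be checked with a short sketch in Python's re module, where the syntax is the same. The date-like sample string is made up for illustration.

```python
import re

capturing = re.compile(r"(\d{4})-(\d{2})")
non_capturing = re.compile(r"(?:\d{4})-(\d{2})")

# Number of numbered groups in each pattern
capturing_count = capturing.groups          # both groups are numbered
non_capturing_count = non_capturing.groups  # (?:...) receives no number

m = non_capturing.search("2016-12")
whole = m.group(0)  # the non-capturing part still takes part in the match
first = m.group(1)  # group 1 is now the month, not the year

print(capturing_count, non_capturing_count, whole, first)
```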

Three, assertion lookups

An assertion is a logical condition: the match succeeds only if the condition is true. When the match succeeds, the matched text is returned, but the returned text does not include the prefix or suffix that the assertion tested. In other words, an assertion is used to find the text before or after a specific piece of text. There are four kinds of assertion syntax:

(?=exp): the text must be followed by something that matches exp; returns the text before exp
(?<=exp): the text must be preceded by something that matches exp; returns the text after exp
(?!exp): the text must not be followed by exp; returns text whose suffix is not exp
(?<!exp): the text must not be preceded by exp; returns text whose prefix is not exp
1, suffix matching

(?=exp): the text must be followed by something that matches exp, and the text before exp is returned. This is suffix matching, similar to '%ing' in T-SQL;

For example, the regular expression: \b\w+(?=ing\b)

Analysis: the assertion requires the suffix ing at the end of a word (\b). The expression matches words ending with ing, but returns only the part of the word before ing;

For example, searching "I'm reading a book" matches "reading" because the word ends with ing. The regular expression returns read; the returned text does not include the suffix.
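
The suffix assertion can be sketched in Python's re module, which uses the same (?=exp) syntax. For contrast, the second pattern puts ing inside the match instead of inside an assertion.

```python
import re

text = "I'm reading a book"

# The assertion (?=ing\b) tests the suffix but does not consume it
stems = re.findall(r"\b\w+(?=ing\b)", text)

# Without the assertion, ing is part of the returned text
full_words = re.findall(r"\b\w+ing\b", text)

print(stems, full_words)
```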

2, prefix matching

(?<=exp): the text must be preceded by something that matches exp, and the text after exp is returned. This is prefix matching, similar to 're%' in T-SQL;

For example, the regular expression: (?<=\bre)\w+\b

Analysis: the assertion requires the prefix re at the beginning of a word (\b). The expression matches words starting with re, but returns only the part of the word after re;

For example, searching "I am reading a book" matches "reading" because the word starts with re. The regular expression returns ading; the returned text does not include the prefix.
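
The prefix assertion can be sketched the same way in Python's re module. One caveat: Python only accepts fixed-width expressions inside (?<=...); \bre has a fixed width of two characters, so this pattern is accepted, while the CLR flavor is more permissive here.

```python
import re

text = "I am reading a book"

# (?<=\bre) tests the prefix but does not include it in the result
tails = re.findall(r"(?<=\bre)\w+\b", text)

# Without the assertion, re is part of the returned text
full_words = re.findall(r"\bre\w+\b", text)

print(tails, full_words)
```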

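3, finding text whose prefix or suffix is not a specific text

These two assertions are the negations of the previous two, so a brief look is enough:

(?!exp): the text must not be followed by exp; returns text whose suffix is not exp
(?<!exp): the text must not be preceded by exp; returns text whose prefix is not exp
3.1 For example, the regular expression: \b\w+(?!ing\b)

Analysis: this looks as if it should skip words ending with ing, but it does not. \w+ is greedy and first consumes the whole word, and the assertion is then tested at the end of the word, where the following characters are never ing. Searching "I am reading a book" therefore returns: I, am, reading, a, book. To skip words ending with ing, put the assertion at the start of the word: \b(?!\w*ing\b)\w+\b returns: I, am, a, book

3.2 For example, the regular expression: (?<!\bre)\w+\b

Analysis: similarly, the assertion is tested at the position where the match starts, which is the beginning of each word, so no word is excluded. Searching "I am reading a book" returns: I, am, reading, a, book. To skip words starting with re, use a negative assertion at the start of the word: \b(?!re)\w+\b returns: I, am, a, book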
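
Negative assertions are easy to misplace, so it is worth running the patterns to see which words they return. The sketch below uses Python's re module, which shares this syntax. The two alternative patterns that anchor the assertion at the start of the word are suggestions for excluding words, not part of the original examples.

```python
import re

text = "I am reading a book"

# 3.1 as written: greedy \w+ takes the whole word before the assertion is tested
suffix_as_written = re.findall(r"\b\w+(?!ing\b)", text)

# 3.2 as written: the assertion is tested where each word begins
prefix_as_written = re.findall(r"(?<!\bre)\w+\b", text)

# Alternative: test the whole word from its start to skip words ending with ing
skip_ing = re.findall(r"\b(?!\w*ing\b)\w+\b", text)

# Alternative: test the start of the word to skip words starting with re
skip_re = re.findall(r"\b(?!re)\w+\b", text)

print(suffix_as_written, prefix_as_written, skip_ing, skip_re)
```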



This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
