Regular expressions and text mining

Source: Internet
Author: User
During text mining, the wildcards in T-SQL are often insufficient. In that case, using "CLR + regular expressions" is a good choice. Regular expressions look very complex, but once you are familiar with their metacharacters, you can use them skillfully and flexibly to complete complex text mining work.

I. Special characters of a regular expression

1. Common metacharacters

Metacharacters match specific kinds of characters (letters, digits, symbols) or positions. Note that metacharacters are case sensitive; for example, \w and \W have different meanings:

.: matches any character except a line break
\w: matches a letter, digit, underscore, or Chinese character
\s: matches any whitespace character
\d: matches a digit
\b: matches the start or end of a word
^: matches the start of the string
$: matches the end of the string
\k<name>: references a named group; for example, \k<group_name> references the group named group_name
\group_number: references a group by its number, such as \1, \2, or \3
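The metacharacters above can be tried out with a short sketch. It assumes Python's re module as a stand-in for the .NET regex engine that a CLR function would use; the sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample text containing a line break.
text = "Order 66 shipped\nto Unit_7 today"

# \d matches a single digit.
digits = re.findall(r"\d", text)

# ^ anchors at the start of the string; \w+ takes the first run of word characters.
first_word = re.search(r"^\w+", text).group()

# \b marks a word boundary, so "to" is found only as a whole word (not in "today").
whole_to = re.findall(r"\bto\b", text)

# . matches any character except a line break, so .+ stops at the newline.
first_line = re.search(r".+", text).group()

# $ anchors at the end of the string.
last_word = re.search(r"\w+$", text).group()

print(digits, first_word, whole_to, first_line, last_word)
```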

2. Repeated characters or groups

Specify the number of times the previous character or group is repeated:

*: repeat zero or more times
+: repeat one or more times
?: repeat zero or one time
{n}: repeat exactly n times
{n,}: repeat n or more times
{n,m}: repeat n to m times
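A minimal sketch of each quantifier, assuming Python's re module (the quantifier syntax is the same in the .NET engine); the sample strings are invented for illustration.

```python
import re

# * : zero or more "b" after "a".
star = re.findall(r"ab*", "a ab abb")

# + : one or more "b" after "a", so a bare "a" is not matched.
plus = re.findall(r"ab+", "a ab abb")

# ? : the "u" is optional.
optional = re.findall(r"colou?r", "color colour")

numbers = "7 2024 123456"

# {n} : exactly four digits between word boundaries.
exact = re.findall(r"\b\d{4}\b", numbers)

# {n,} : four or more digits.
at_least = re.findall(r"\b\d{4,}\b", numbers)

# {n,m} : between one and four digits.
between = re.findall(r"\b\d{1,4}\b", numbers)

print(star, plus, optional, exact, at_least, between)
```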

3. Grouping, escaping, branching, and character lists

These characters have specific meanings and purposes:

(): parentheses define a group
<>: defines the group name, as in (?<name>exp). The string between < and > is the group name
\: escape character; turns a special character into an ordinary character. For example, \( represents a literal "(", so the parenthesis no longer acts as a special character
|: branch; the expressions on either side are in an "or" relationship
[]: character list; the character must match one of the characters listed in the brackets. For example, [aeiou] matches any one of a, e, i, o, u
[^]: exclusion character list; the character must not be any of the characters listed in the brackets. For example, [^aeiou] matches any character other than a, e, i, o, u
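The following sketch exercises grouping, escaping, branching, and character lists. It assumes Python's re module, and the input strings are hypothetical.

```python
import re

# | : match either "cat" or "dog".
pets = re.findall(r"cat|dog", "cat, dog, bird")

# \( and \) are escaped, so they match literal parentheses.
in_parens = re.findall(r"\(\w+\)", "call f(x) and g(y)")

# [aeiou] : any one vowel.
vowels = re.findall(r"[aeiou]", "regex")

# [^aeiou] : any one character that is not a vowel.
non_vowels = re.findall(r"[^aeiou]", "regex")

# () : with one group in the pattern, findall returns what the group captured.
keys = re.findall(r"(\w+)=\d+", "a=1, bc=22")

print(pets, in_parens, vowels, non_vowels, keys)
```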

II. Group reference

A group is a subexpression enclosed in parentheses. A group reference reuses the text matched by a group later in the same expression, which keeps the regular expression concise. By default, the regular expression engine assigns a number to each group: numbering starts at 1 and increases by 1 from left to right. For example, the first group is number 1, the second group is number 2, and so on.

Three forms of group definition:

(exp): the group number is assigned automatically, and the group is referenced by that number;
(?<name>exp): a named group, which is referenced by the group name;
(?:exp): this group only matches text at the current position and cannot be referenced afterwards; it has no group name or group number.

1. Reference a group by group number

A group (exp) is defined earlier in the regular expression; later in the expression, the group can be referenced by its group number. The syntax for referencing a group is \group_number.

For example, in \b(\w+)\b\s+\1\b there is only one group, (\w+), and its group number is 1. After the group, \1 references it. Note that \1 matches the same text that group 1 captured, not any text that fits the subexpression, so this expression finds a word that is immediately repeated, such as "the the". It is therefore not equivalent to \b(\w+)\b\s+(\w+)\b, which matches any two consecutive words.
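A short sketch of a numbered group reference, assuming Python's re module (the \1 syntax is the same in .NET). It contrasts the backreference with simply repeating the subexpression; the sample sentence is made up.

```python
import re

sentence = "This is is a test"

# \1 must match the same text that group 1 captured.
repeated = re.search(r"\b(\w+)\b\s+\1\b", sentence)

# Repeating the subexpression instead matches any two consecutive words.
any_two = re.search(r"\b(\w+)\b\s+(\w+)\b", sentence)

# With no repeated word, the backreference pattern finds nothing.
no_repeat = re.search(r"\b(\w+)\b\s+\1\b", "one two three")

print(repeated.group(), "|", any_two.group(), "|", no_repeat)
```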

2. Reference a group by group name

In a regular expression, you can name a group with the syntax (?<name>exp), where name is the group name, and reference it with \k<name>. Referencing a group by name or by number produces the same matching behavior.

For example, in \b(?<Word>\w+)\b\s+\k<Word>\b the group is named Word (a name chosen here for illustration), and \k<Word> after the group references it. \k<Word> matches the same text that the group captured, so the expression behaves the same as \b(\w+)\b\s+\1\b.
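A sketch of a named group reference. One assumption matters here: Python's re module is used as a stand-in, and it spells named groups as (?P<name>exp) and named references as (?P=name), whereas the .NET syntax is (?<name>exp) and \k<name>. The group name word and the sample sentence are illustrative only.

```python
import re

# Python spelling of a named group and a reference to it
# (.NET would use (?<word>\w+) and \k<word>).
named = re.compile(r"\b(?P<word>\w+)\b\s+(?P=word)\b")

match = named.search("it was the the best")

# The group can be read back by name or by number; both give the same text.
by_name = match.group("word")
by_number = match.group(1)

print(match.group(), by_name, by_number)
```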

3. Groups that cannot be referenced

(?:exp): a group defined with this syntax cannot be referenced. It only matches text at the current position, and the regular expression does not assign a group number to it.
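The difference between a capturing and a non-capturing group shows up in the group numbering. A minimal sketch, assuming Python's re module and an invented input string:

```python
import re

# (ab) is capturing: it takes group number 1, and (\d) becomes group 2.
capturing = re.search(r"(ab)+(\d)", "ababab7")

# (?:ab) is non-capturing: it gets no number, so (\d) is group 1.
non_capturing = re.search(r"(?:ab)+(\d)", "ababab7")

print(capturing.groups(), non_capturing.groups())
```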

III. Lookaround assertions

An assertion is a condition that must hold at the current position for the match to succeed. The text matched by the assertion itself is not included in the returned text, so assertions are used to find the text before or after a specific piece of text. There are four forms of assertion syntax:

(?=exp): the text must be followed by exp; returns the text before exp.
(?<=exp): the text must be preceded by exp; returns the text after exp.
(?!exp): the text must not be followed by exp.
(?<!exp): the text must not be preceded by exp.

1. Suffix matching

(?=exp): the text must be followed by exp, and the text before exp is returned. This is suffix matching, similar to "%ing" in T-SQL.

For example, the regular expression \b\w+(?=ing\b)

Analysis: the assertion requires the suffix ing followed by the end of the word (\b). The expression matches a word ending in ing but returns only the first part of the word, the part before ing.

For example, searching "I'm reading a book" finds "reading" because the word ends in ing, and the regular expression returns "read"; the returned text does not contain the suffix.
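The suffix example can be checked with a short sketch, assuming Python's re module (the lookahead syntax is the same in .NET):

```python
import re

# (?=ing\b) requires "ing" plus a word boundary after the match,
# but the "ing" itself is not part of the returned text.
stems = re.findall(r"\b\w+(?=ing\b)", "I'm reading a book")

print(stems)
```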

2. Prefix matching

(?<=exp): the text must be preceded by exp, and the text after exp is returned. This is prefix matching, similar to "re%" in T-SQL.

For example, the regular expression (?<=\bre)\w+\b

Analysis: the assertion requires the start of a word (\b) followed by the prefix re. The expression matches a word starting with re but returns only the second part of the word, the part after re.

For example, searching "I am reading a book" finds "reading" because the word starts with re, and the regular expression returns "ading"; the returned text does not contain the prefix.
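The prefix example can be checked the same way. This sketch assumes Python's re module, which only accepts fixed-width lookbehind expressions; \bre qualifies because \b has zero width.

```python
import re

# (?<=\bre) requires "re" at the start of a word before the match,
# but the "re" itself is not part of the returned text.
tails = re.findall(r"(?<=\bre)\w+\b", "I am reading a book")

print(tails)
```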

3. Text whose prefix or suffix is not a specific text

These two assertions are the opposites of the previous two and are used less often. Briefly:

(?!exp): the text must not be followed by exp.
(?<!exp): the text must not be preceded by exp.

3.1 For example, the regular expression \b\w+(?!ing\b)

Analysis: the intent is to skip words ending in ing. As written, however, \w+ consumes the whole word before the assertion is tested, and the text after "reading" is not ing, so searching "I am reading a book" still returns reading along with I, am, a, and book. To exclude such words, test the whole word from its start: \b(?!\w*ing\b)\w+\b returns I, am, a, book.

3.2 For example, the regular expression (?<!\bre)\w+\b

Analysis: the intent is to skip words starting with re. As written, the assertion is tested at the start of "reading", where the preceding text is not re, so "reading" is still matched. To exclude such words, use \b(?!re)\w+\b instead; searching "I am reading a book" then returns I, am, a, book.
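A sketch that runs negative assertions against the sample sentence, assuming Python's re module. It compares the patterns \b\w+(?!ing\b) and (?<!\bre)\w+\b with variants that test the whole word from its start; the variant patterns are one possible way to exclude such words and are not taken from the original text.

```python
import re

text = "I am reading a book"

# Negative lookahead placed after \w+: the greedy \w+ consumes "reading"
# first, and what follows it is not "ing", so the word is still returned.
after_word = re.findall(r"\b\w+(?!ing\b)", text)

# Negative lookbehind tested at the start of the word: nothing before
# "reading" is "re", so the word is still returned.
before_word = re.findall(r"(?<!\bre)\w+\b", text)

# Testing the whole word from its start excludes words ending in "ing".
not_ing = re.findall(r"\b(?!\w*ing\b)\w+\b", text)

# A negative lookahead at the word start excludes words starting with "re".
not_re = re.findall(r"\b(?!re)\w+\b", text)

print(after_word, before_word, not_ing, not_re)
```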
