Getting started with regular expressions

Source: Internet
Author: User

Cjx has recently been working on a crawler project and is eager to capture just the wanted content from a page. Doing that with hand-written conditional logic is too complicated. Fortunately, regular expressions make many of these tasks easy. Cjx had some exposure to regular expressions before, but only ever half understood them, and found it hard to write a satisfactory regular expression alone. I recently found Jeffrey E. F. Friedl's book Mastering Regular Expressions on the internet. After reading the first chapter, I suddenly found that I could write a few regular expressions myself ~~~ Cjx suddenly feels upgraded from a diaosi to a rich handsome guy... Next, I will summarize the first chapter of the book ~

Start and end of a line

Perhaps the easiest metacharacters to understand are the caret ^ and the dollar sign $. When checking a line of text, ^ represents the beginning of the line and $ represents the end.

It is best to get into the habit of reading regular expressions character by character. For example, do not read it as:

^cat matches a line that starts with cat

Read it as:

^cat matches if the line begins with a c, followed immediately by an a, followed immediately by a t.

The two readings describe the same matches, but reading character by character makes the internal logic of a new regular expression easier to follow.
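
As a small sketch of the two anchors, the Python snippet below filters a few lines with ^cat, cat$, and ^cat$. The sample lines are made up for illustration and do not come from the book.

```python
import re

# Hypothetical sample lines, chosen only for illustration.
lines = ["cat food", "concat", "the cat", "cat"]

# ^cat : a c at the start of the line, then an a, then a t.
starts_with_cat = [s for s in lines if re.search(r"^cat", s)]

# cat$ : a c, an a, a t, and then the end of the line.
ends_with_cat = [s for s in lines if re.search(r"cat$", s)]

# ^cat$ : the line consists of exactly c, a, t.
only_cat = [s for s in lines if re.search(r"^cat$", s)]

print(starts_with_cat)
print(ends_with_cat)
print(only_cat)
```

Note that "concat" ends with cat but does not start with it, which is exactly what the character-by-character reading predicts.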

 

Match one of several characters

If we need to search for the word "grey" but are not sure whether it is spelled "gray", we can use the regular expression construct [...]. It lets the user list the characters allowed at a given position, and is usually called a character group (or character class).

So gr[ea]y means: find a g, followed by an r, then an a or an e, and finally a y.

Inside a character group, the character-group metacharacter '-' (hyphen) indicates a range: <H[1-6]> is exactly the same as <H[123456]>. We can also freely combine ranges with ordinary characters:

[0-9A-Z_!.?] matches a digit, an uppercase letter, an underscore, an exclamation mark, a period, or a question mark.
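
The three character groups above can be tried out with Python's re module. This is only a sketch; the test strings are invented for the example.

```python
import re

# gr[ea]y : g, r, then a or e, then y.
gray_or_grey = re.compile(r"gr[ea]y")
found = gray_or_grey.findall("gray, grey, groy")

# H[1-6] is the same as H[123456].
heading = re.compile(r"H[1-6]")
headings = heading.findall("H1 H6 H7")

# A range mixed with ordinary characters; inside [...] the
# characters . and ? are literal, not metacharacters.
mixed = re.compile(r"[0-9A-Z_!.?]")
mixed_hits = mixed.findall("a_B9?z.")

print(found)       # ['gray', 'grey']
print(headings)    # ['H1', 'H6']
print(mixed_hits)  # ['_', 'B', '9', '?', '.']
```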

 

Negated character groups

Using [^...] instead of [...] matches any character that is not listed. For example, [^1-6] matches any character other than 1 through 6. The leading ^ inside the group means "negate": the group lists the characters you do not want to match, rather than the ones you do.
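
One point that is easy to miss: a negated group still has to match one character, it does not mean "nothing here". The sketch below shows both behaviours. The q[^u] pattern and the sample strings are illustrative assumptions, not taken from the text above.

```python
import re

# [^1-6] : any single character that is not 1, 2, 3, 4, 5 or 6.
not_one_to_six = re.compile(r"[^1-6]")
hits = not_one_to_six.findall("a17b6")

# q[^u] needs a character after the q, so a q at the very end
# of the string cannot match.
at_end = re.search(r"q[^u]", "Iraq")
followed = re.search(r"q[^u]", "Iraqi")

print(hits)               # ['a', '7', 'b']
print(at_end)             # None
print(followed.group(0))  # qi
```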

 

Match any character with the dot

The metacharacter . (usually called dot or period) is shorthand for a character group that matches any character. When an expression needs an "any character goes here" placeholder, the dot is very convenient.
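
As a sketch, suppose we want to find a date written with any separator. The pattern 03.19.76 and the sample strings are assumptions made for this example. In Python specifically, the dot does not match a newline unless the re.DOTALL flag is given.

```python
import re

# Each dot stands for exactly one character of any kind.
date_like = re.compile(r"03.19.76")

samples = ["03/19/76", "03-19-76", "03.19.76", "03 19 76", "0319976"]
matched = [s for s in samples if date_like.fullmatch(s)]

# By default the dot in Python's re module skips newlines.
across_newline = re.search(r"a.b", "a\nb")

print(matched)
print(across_newline)  # None
```

The last sample is rejected only because it is one character too short; the dot itself is not picky about which character it sees.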


Match any one of several subexpressions

The metacharacter | is a very concise metacharacter that means "or". With it, we can combine several subexpressions into one overall expression, and this overall expression matches whenever any one of the subexpressions matches.
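
A short sketch of alternation in Python follows. The patterns and test strings are invented for illustration. The last two lines show why parentheses matter: without them, the ^ and the ": " each belong to only one alternative.

```python
import re

# Either whole subexpression may match.
bob_or_robert = re.compile(r"Bob|Robert")
names = bob_or_robert.findall("Bob met Robert and Rob")

# Parentheses limit how far the alternation reaches.
spellings = [m.group(0) for m in re.finditer(r"gr(a|e)y", "gray grey gruy")]

# Unscoped: means "^From" OR "Subject" OR "Date: ".
unscoped = re.search(r"^From|Subject|Date: ", "Re: Subject line")
# Scoped: line start, one of the three words, then ": ".
scoped = re.search(r"^(From|Subject|Date): ", "Re: Subject line")

print(names)              # ['Bob', 'Robert']
print(spellings)          # ['gray', 'grey']
print(unscoped.group(0))  # Subject
print(scoped)             # None
```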

Optional items

Now let's look at matching color and colour. The only difference is that the second word has one more u than the first. We can use colou?r to handle this. The metacharacter ? (question mark) means optional: placed after a character, it says that the character is allowed at that position, but that it does not have to appear for the match to succeed.
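
The colou?r expression can be checked with a few lines of Python. The second pattern, July 4(th)?, is an extra assumption added here to show ? applied to a parenthesized unit rather than a single character.

```python
import re

# u? : the u may appear once or not at all.
colour = re.compile(r"colou?r")
words = colour.findall("color colour colouur")

# (th)? makes the whole two-character unit optional.
with_suffix = re.fullmatch(r"July 4(th)?", "July 4th")
without_suffix = re.fullmatch(r"July 4(th)?", "July 4")

print(words)  # ['color', 'colour']
print(with_suffix is not None, without_suffix is not None)  # True True
```

"colouur" is not matched because ? allows at most one u.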

Other quantifiers: repetition

+ (plus) and * (asterisk) are used much like the question mark. The metacharacter + means the immediately preceding item appears one or more times, while * means the immediately preceding item appears any number of times, including not at all.
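
To compare the three quantifiers side by side, here is a minimal sketch that applies b+, b*, and b? to the same made-up input.

```python
import re

text = "ac abc abbbc"

one_or_more = re.findall(r"ab+c", text)   # at least one b
zero_or_more = re.findall(r"ab*c", text)  # any number of b, even none
zero_or_one = re.findall(r"ab?c", text)   # at most one b

print(one_or_more)   # ['abc', 'abbbc']
print(zero_or_more)  # ['ac', 'abc', 'abbbc']
print(zero_or_one)   # ['ac', 'abc']
```
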
Next, let's look at a TAG like

Parentheses and backreferences

So far, we have seen two uses of parentheses: 1. limiting the scope of alternation; 2. grouping several characters into one unit, so that quantifiers such as the question mark or asterisk apply to the whole unit. Now I want to introduce another use of parentheses: backreferences. Although this feature is not common in versions of egrep (the popular GNU version does support it), it is common in other tools.
In tools that support backreferences, parentheses "remember" the text matched by the subexpression they enclose. Whatever that text turns out to be, the metasequence \1 refers back to it.

Of course, an expression can contain more than one set of parentheses. Use \1, \2, \3, and so on to refer to the text matched by the first, second, and third set. Sets are numbered by the position of their opening parenthesis '(', counting from left to right, so in ([a-z])([0-9])\1\2, \1 refers to the text matched by [a-z], while \2 refers to the text matched by [0-9].
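
Python's re module supports backreferences, so the idea can be sketched as below. The doubled-word pattern and the sample sentences are assumptions chosen for illustration. \b is Python's word-boundary assertion and is not covered in the summary above.

```python
import re

# ([a-z]+) remembers a word; \1 then demands the same text again.
doubled = re.compile(r"\b([a-z]+) +\1\b")
m = doubled.search("this is the the best")

# Two groups, numbered by their opening parenthesis, left to right.
pair = re.search(r"([a-z])([0-9])\1\2", "x a1a1 y")

print(m.group(0))     # the the
print(m.group(1))     # the
print(pair.group(0))  # a1a1
print(pair.groups())  # ('a', '1')
```

Without the leading \b, the "is is" hidden inside "this is" would be reported first, which is why the boundary is included.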

Escaping metacharacters
Sometimes we need to match characters such as . + * ? literally, but they are metacharacters. To do so, put the escape character \ in front of them. For example, to match the Internet host name ega.att.com, write ega\.att\.com.
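
The sketch below shows what goes wrong without the backslashes, using the host name ega.att.com. The "megawatt computing" test string is an assumption picked to trigger the false match. re.escape is a standard Python helper that adds the backslashes for you.

```python
import re

unescaped = re.compile(r"ega.att.com")   # each . matches any character
escaped = re.compile(r"ega\.att\.com")   # each \. matches a literal period

false_hit = unescaped.search("megawatt computing")
no_hit = escaped.search("megawatt computing")
real_hit = escaped.search("host: ega.att.com")

# re.escape builds the escaped form from a literal string.
auto = re.escape("ega.att.com")

print(false_hit.group(0))  # egawatt com
print(no_hit)              # None
print(real_hit.group(0))   # ega.att.com
```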

 
Some useful shorthands
\t tab
\n newline (line feed)
\r carriage return
\s any whitespace character, such as a space, newline, or tab
\S any character not matched by \s
\w [a-zA-Z0-9_] in most tools; useful as \w+ to match a word
\W any character not matched by \w
\d [0-9], that is, a digit
\D any character not matched by \d, that is, [^0-9]
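
A quick way to get a feel for these shorthands is to run them over one small string. The sketch below uses made-up input. In Python 3, \w, \d and \s also match non-ASCII letters, digits, and whitespace unless the re.ASCII flag is passed; the sample text is plain ASCII, so that does not affect it.

```python
import re

text = "id_42\tok\n"

words = re.findall(r"\w+", text)          # letters, digits and underscore
digits = re.findall(r"\d+", text)         # runs of digits
blanks = re.findall(r"\s", text)          # the tab and the newline
non_blank = re.findall(r"\S+", text)      # runs of non-whitespace
non_digits = re.findall(r"\D+", "a1b22c") # runs of non-digits

print(words)       # ['id_42', 'ok']
print(digits)      # ['42']
print(blanks)      # ['\t', '\n']
print(non_blank)   # ['id_42', 'ok']
print(non_digits)  # ['a', 'b', 'c']
```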

 
