Regular expressions


A regular expression ("regex") is an enhanced find-and-replace operation on strings. In a text editor, regular expressions are typically used to:

1. Check whether the text contains a given pattern
2. Find every match of the pattern
3. Extract information from the text (for example, a substring)
4. Modify the text

As well as text editors, most high-level programming languages support regular expressions. There, the "text" is just a string variable, but the available operations are the same. Some languages (Perl, JavaScript) even provide dedicated syntax for regular expressions.

But what is a regular expression?


A regular expression is just a string. There is no length limit, but it is usually short. Here are a few examples:
I had a \S+ day today
[A-Za-z0-9\-_]{3,16}
\d\d\d\d-\d\d-\d\d
v(\d+)(\.\d+)*
TotalMessages="(.*?)"
<[^<>]*>


Each of these strings is actually a tiny computer program, and regular expression syntax is a small, terse, domain-specific programming language. With that in mind, none of the following should surprise you:

Each regular expression can be decomposed into a sequence of instructions: "find this, then find that, then find one of these ..."
A regular expression has input (the text) and output (the matches, and sometimes modified text).
Syntax errors are possible: not every string is a valid regular expression!
The syntax is a bit odd, some would say scary.
A regular expression can sometimes be compiled to run faster.
Implementations differ significantly. This article concentrates on the core syntax that almost every implementation shares.

Practice
Get a text editor that supports regular expressions. I recommend Notepad++.

Download a long prose text from Project Gutenberg, such as H. G. Wells's "The Time Machine", and open it.

Download a dictionary, such as this, unzip it, and open it.

With everything ready, we can use both files in the exercises that follow.

Tip: Regular expressions are completely incompatible with file wildcard syntax, such as *.xml.

Regular expression base syntax

Literal value (literals)
A regular expression consists of literals, which stand for themselves, and metacharacters, which have a special meaning.

Here are the examples again. Each one mixes literals with metacharacters.

I had a \S+ day today
[A-Za-z0-9\-_]{3,16}
\d\d\d\d-\d\d-\d\d
v(\d+)(\.\d+)*
TotalMessages="(.*?)"
<[^<>]*>

Most characters, including all alphanumeric characters, are literals. This means that they match themselves. For example, the regular expression cat means "match c, then a, then t".

So far, so good. This is just like:

A normal Find dialog box
The String.indexOf() function in Java
The strpos() function in PHP
And so on

Tip: Regular expressions are case-sensitive unless stated otherwise. However, most implementations provide a flag that turns on case-insensitive matching.
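
As a rough illustration of literals and the case-insensitivity flag, here is a minimal Python sketch. The sample text is made up for this example, and re.IGNORECASE is the name Python gives the flag; other implementations spell it differently.

```python
import re

# Hypothetical sample text, not taken from the article.
text = "The Cat sat on the catalogue."

# A pattern made only of literals behaves like a plain substring search.
literal_match = re.search(r"cat", text)
print(literal_match.start() == text.find("cat"))

# Matching is case-sensitive by default; re.IGNORECASE turns that off.
case_sensitive = re.findall(r"cat", text)
case_insensitive = re.findall(r"cat", text, re.IGNORECASE)
print(case_sensitive, case_insensitive)
```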

Period (dot)
Our first metacharacter is the period, written as a dot (.). It means "match any single character". The regular expression c.t means "match c, then any single character, then t".

In a piece of text, this expression finds cat, cot, czt, and even the literal string c.t (c, period, t), but not ct or coot.

Spaces are significant in regular expressions. The regular expression c t means "match c, then a space, then t".

Any metacharacter becomes a literal when escaped with a backslash. So the regular expression c\.t means "match c, then a period, then t".

The backslash is itself a metacharacter, so it too can be escaped with a backslash. The regular expression c\\t means "match c, then a backslash, then t".

Attention! In some implementations, . matches any character except a line break. What counts as a line break also varies between implementations. Check your documentation. In this article, I assume that . matches any character.

Where . does exclude line breaks, there is usually a flag to change this behaviour, called DOTALL or something similar.
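
The behaviour of the period can be checked with a short Python sketch. Python happens to be one of the implementations where . skips line breaks by default, and re.DOTALL is its name for the flag. The word list is just the examples from the text above.

```python
import re

words = ["cat", "cot", "czt", "c.t", "ct", "coot"]

# An unescaped dot stands for exactly one character of any kind.
dot_hits = [w for w in words if re.fullmatch(r"c.t", w)]

# An escaped dot is a literal period.
escaped_hits = [w for w in words if re.fullmatch(r"c\.t", w)]

# In Python the dot does not match "\n" unless re.DOTALL is given.
newline_default = re.fullmatch(r"c.t", "c\nt")
newline_dotall = re.fullmatch(r"c.t", "c\nt", re.DOTALL)

print(dot_hits, escaped_hits, newline_default is None, newline_dotall is not None)
```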

Practice
Using what you have learned so far, write a regular expression that finds dictionary words containing two z's as far apart from each other as possible.

Answer
Your final regular expression should be z.......z, which matches four words: razzamatazz, razzamatazzes, zwischenzug and zwischenzugs.

Practice
In "The Time Machine", use a regular expression to find sentences that end with a preposition.

Answer
Your regular expression should look something like up\.

Character class (Character classes)
A character class is a set of characters in square brackets. It means "match any one of the characters in the set".

The regular expression c[aeiou]t means "match c, then a vowel, then t". In a piece of text, it matches cat, cet, cit, cot and cut.
The regular expression [0123456789] means "match a digit".
The regular expressions [a] and a mean the same thing: "match a".

Some examples of escaping:

\[a\] means "match an opening square bracket, then a, then a closing square bracket".
[\[\]ab] means "match an opening or closing square bracket, or a, or b".
[\\\[\]] means "match a backslash, an opening square bracket, or a closing square bracket".

Order and repetition of characters inside a character class do not matter. [dabaaabcc] is the same as [abcd].

Important tip
The rules inside a character class are different from the rules outside it. Some characters are metacharacters inside a character class but literals outside it. Some do the opposite. Some are metacharacters in both places but mean different things in each.

In particular, . means "match any character", but [.] means "match a period". These are not the same thing.
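
A small Python sketch of the points above, using a made-up word list: a class matches one character from its set, a dot inside a class is a literal period, and order and repetition inside the brackets make no difference.

```python
import re

words = ["cat", "cet", "cit", "cot", "cut", "cbt", "c.t"]

# c[aeiou]t: c, then exactly one vowel, then t.
vowel_hits = [w for w in words if re.fullmatch(r"c[aeiou]t", w)]

# Inside a class the dot loses its special meaning.
dot_class_hits = [w for w in words if re.fullmatch(r"c[.]t", w)]

# [dabaaabcc] and [abcd] accept exactly the same characters.
same_set = all(
    (re.fullmatch(r"[dabaaabcc]", ch) is None) == (re.fullmatch(r"[abcd]", ch) is None)
    for ch in "abcdexyz"
)

print(vowel_hits, dot_class_hits, same_set)
```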

Practice
Using what you have learned so far, write regular expressions that find the dictionary words with the most consecutive vowels and the words with the most consecutive consonants.

Answer
[aeiou][aeiou][aeiou][aeiou][aeiou][aeiou] matches the six-vowel words euouae and euouaes, while the horrible [bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz] finds the splendid sulphhydryls, with ten consecutive consonants. We will soon see how such horrors can be simplified.

Character class interval (ranges)
You can use a hyphen inside a character class to denote a range of letters or digits:

[b-f] is the same as [bcdef]: "match b or c or d or e or f".
[A-Z] is the same as [ABCDEFGHIJKLMNOPQRSTUVWXYZ]: "match an uppercase letter".
[1-9] is the same as [123456789]: "match a non-zero digit".

Outside a character class, a hyphen has no special meaning. The regular expression a-z means "match a, then a hyphen, then z".

Ranges and individual characters can coexist in the same character class:

[0-9.,] means "match a digit or a period or a comma".
[0-9a-fA-F] means "match a hexadecimal digit".
[a-zA-Z0-9\-] means "match an alphanumeric character or a hyphen".

Although you can try to use non-alphanumeric characters as range endpoints (as in abc[!-/]def), this is not valid syntax in every implementation. Even where it is valid, it is hard to tell which characters the range includes. Use with caution (by which I mean, don't do it).

Equally, both endpoints of a range should come from the same group of characters. Even if an expression such as [A-z] is legal in your implementation of choice, it may not do what you expect. (Hint: there are characters between Z and a.)

Attention: ranges are ranges of characters, not ranges of numbers. The regular expression [1-31] means "match a 1 or a 2 or a 3", not "match an integer from 1 to 31".
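
The two range pitfalls can be made concrete with a short Python sketch. It assumes ASCII ordering of the characters, which is what Python uses for these code points.

```python
import re
import string

# [A-z] covers every code point from "A" (65) to "z" (122), not just letters.
unexpected = [ch for ch in string.punctuation if re.fullmatch(r"[A-z]", ch)]
print(unexpected)

# [1-31] is the range 1-3 plus a redundant literal 1, so it matches one digit.
digits = [d for d in "0123456789" if re.fullmatch(r"[1-31]", d)]
print(digits)

# Ranges and single characters can be mixed in one class.
is_hex = all(re.fullmatch(r"[0-9a-fA-F]", ch) for ch in "c0ffEE")
print(is_hex)
```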

Practice
Using what you have learned so far, write a regular expression that finds dates in YYYY-MM-DD format.

Answer
The best we can write for now is [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]. Again, we will soon be able to simplify this expression.

Negation of the character class (negation)
You can negate a character class by putting a caret (^) at the very start.

[^a] means "match any character other than a".
[^a-zA-Z0-9] means "match any character that is not a letter or a digit".
[\^abc] means "match a caret, or a, or b, or c".
[^\^] means "match any character other than a caret".
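
The negation examples above, tried out in a minimal Python sketch on made-up input strings:

```python
import re

# Characters that are neither letters nor digits.
non_alnum = re.findall(r"[^a-zA-Z0-9]", "a1-b2_c3!")

# An escaped caret inside a class is just a caret.
caret_or_abc = re.findall(r"[\^abc]", "a^b^z")

# A leading caret negates; the escaped caret after it is the thing excluded.
not_caret = re.findall(r"[^\^]", "x^y")

print(non_alnum, caret_or_abc, not_caret)
```
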

Practice
In the dictionary, use regular expressions to find counterexamples to the rule "i before e except after c".

Answer
cie and [^c]ei both find plenty of counterexamples, such as ancient, science, veil and weigh.

Character class Supplement
The regular expression \d means the same as [0-9]: "match a digit". (To match a backslash followed by a d, use \\d.)

\w means the same as [0-9A-Za-z_]: "match a word character" (a letter, a digit, or an underscore).

\s means "match any whitespace character (space, tab, carriage return, or line feed)".

In addition:

\D means the same as [^0-9]: "match any non-digit character".
\W means the same as [^0-9A-Za-z_]: "match any non-word character".
\S means "match any character that is not whitespace".

These character classes are extremely common, and you must learn them.

As you may have noticed, the period . is essentially a character class containing every character.

Many implementations provide additional character classes, or flags that extend the existing classes to cover characters outside the ASCII range. Tip: Unicode contains more "digit characters" than just 0 to 9, and the same is true of "word characters" and "whitespace". Check your documentation.
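
Here is a hedged Python sketch of the shorthand classes. Python is one of the implementations where \d, \w and \s cover Unicode by default for text strings, and re.ASCII is its flag for restricting them to the ASCII definitions given above. The sample string is invented.

```python
import re

sample = "id_42 = 7\t!"

digits = re.findall(r"\d", sample)
word_chars = "".join(re.findall(r"\w", sample))
spaces = re.findall(r"\s", sample)
non_word = "".join(re.findall(r"\W", sample))
print(digits, word_chars, spaces, repr(non_word))

# U+0663 is ARABIC-INDIC DIGIT THREE: a digit to Unicode, but not in [0-9].
arabic_three = "\u0663"
unicode_digit = re.fullmatch(r"\d", arabic_three) is not None
ascii_digit = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None
print(unicode_digit, ascii_digit)
```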

Practice
Simplify the regular expression [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].

Answer
\d\d\d\d-\d\d-\d\d.

Multiplier (multipliers)
You can use curly braces to put a multiplier after a literal or a character class.

The regular expression a{1} is the same as a: "match an a".
a{3} means "match an a, then another a, then another a".
a{0} means "match the empty string". On its own this is not very useful. Used against any text, it matches immediately at the point where the search starts, even if the text is the empty string.
a\{2\} means "match an a, then an opening curly brace, then a 2, then a closing curly brace".
Inside a character class, curly braces have no special meaning. [{}] means "match an opening or a closing curly brace".

Attention: a multiplier has no memory. The regular expression [abc]{2} means "match a or b or c, then match a or b or c". That is the same as "match aa or ab or ac or ba or bb or bc or ca or cb or cc". It does not mean "match aa or bb or cc"!
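
To see that a multiplier really has no memory, this Python sketch enumerates every two-letter string over a, b and c and checks which ones [abc]{2} accepts. The helper names are invented for the example.

```python
import re
from itertools import product

# All nine two-letter combinations: aa, ab, ac, ba, ...
pairs = ["".join(p) for p in product("abc", repeat=2)]

# [abc]{2} accepts every one of them, not only the doubled letters.
accepted = [s for s in pairs if re.fullmatch(r"[abc]{2}", s)]
print(accepted)

# a{0} matches the empty string, even in an empty text.
empty_match = re.search(r"a{0}", "")
print(empty_match.span())

# Escaped braces are literal characters.
literal_braces = re.fullmatch(r"a\{2\}", "a{2}") is not None
print(literal_braces)
```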

Practice
Simplify the following regular expressions:

z.......z
\d\d\d\d-\d\d-\d\d
[aeiou][aeiou][aeiou][aeiou][aeiou][aeiou]
[bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz][bcdfghjklmnpqrstvwxyz]

Answer
z.{7}z
\d{4}-\d{2}-\d{2}
[aeiou]{6}
[bcdfghjklmnpqrstvwxyz]{10}

Multiplier interval
Multipliers may have ranges:

x{4,4} is the same as x{4}.
colou{0,1}r means "match colour or color".
a{3,5} means "match aaaaa or aaaa or aaa".

Note that a longer match is preferred, because multipliers are greedy. If the input text is I had an aaaaawful day, then a{3,5} matches the aaaaa in aaaaawful. It does not stop after the third a.

Multipliers are greedy, but they will not pass over an earlier match for the sake of a longer one. If the input text is I had an aaawful daaaaay, the first match is the aaa in aaawful. Only when you ask for another match does the search continue and find the aaaaa in daaaaay.

Multiplier ranges may be open-ended:

a{1,} means "match one or more a's in a row". The multiplier is still greedy, though: after finding the first a, it matches as many more a's as it can.
.{0,} means "match anything at all". Whatever your input text is, even if it is empty, this regular expression matches the whole text and returns it to you.
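
The greedy behaviour described above can be reproduced with the article's own example sentences. This is a minimal Python sketch; the single-line inputs sidestep Python's default of not letting . match a line break.

```python
import re

pattern = re.compile(r"a{3,5}")

# Greedy: all five a's are taken, not just the first three.
first = pattern.search("I had an aaaaawful day").group()
print(first)

# The earlier, shorter match is found first; the longer one only on the next search.
all_matches = pattern.findall("I had an aaawful daaaaay")
print(all_matches)

# .{0,} matches a whole (single-line) text, including an empty one.
whole = re.fullmatch(r".{0,}", "anything at all") is not None
empty = re.fullmatch(r".{0,}", "") is not None
print(whole, empty)
```
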

Practice
Write a regular expression that finds double-quoted strings. The string may contain any number of characters.

Using what you have learned, modify that regular expression to find double-quoted strings with no further double quotes inside them.

Answer
".{0,}", followed by "[^"]{0,}".

Multiplier supplement
? means the same as {0,1}. For example, colou?r means "match colour or color".

* means the same as {0,}. For example, .* means "match anything", exactly as above.

+ means the same as {1,}. For example, \w+ means "match a word", where a word is a sequence of one or more word characters, such as _var or AccountName1.

These multipliers are extremely common, and you must learn them. Also:

\?\*\+ means "match a question mark, then an asterisk, then a plus sign".
[?*+] means "match a question mark, or an asterisk, or a plus sign".
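
A quick Python sketch of the three shorthand multipliers and their escaped forms. The input strings are invented for illustration.

```python
import re

# ? makes the u optional.
spellings = [w for w in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", w)]

# + needs at least one word character, so each run of them is one "word".
words = re.findall(r"\w+", "_var = AccountName1 + 2")

# Inside a class, ? * + are plain characters.
symbols = re.findall(r"[?*+]", "a?b*c+d")

# Outside a class they must be escaped to be literal.
sequence = re.search(r"\?\*\+", "x?*+y").group()

print(spellings, words, symbols, sequence)
```
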

Practice
Simplify the following regular expressions:

".{0,}" and "[^"]{0,}"
x?x?x?
y*y*
z+z+z+z+

Answer
".*" and "[^"]*"
x{0,3}
y*
z{4,}

Practice
Write an expression that finds two words separated by non-word characters. What about three words? Six words?

Answer
\w+\W+\w+, then \w+\W+\w+\W+\w+, then \w+\W+\w+\W+\w+\W+\w+\W+\w+\W+\w+. We will learn how to simplify these shortly.

Laziness (non-greedy matching)
The regular expression ".*" means "match a double quote, then as many characters as possible, then a double quote". Note that the characters matched by .* may well include further double quotes. This is usually not what you want.
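
The over-matching just described is easy to reproduce. This Python sketch, on a made-up line, puts the greedy form next to two ways of avoiding it: the negated class from the earlier exercise, and the lazy form .*? that the following paragraphs introduce.

```python
import re

line = 'say "hello" and "goodbye"'

# Greedy: runs from the first quote to the last one on the line.
greedy = re.findall(r'".*"', line)

# Negated class: cannot cross a quote, so each string is found separately.
negated = re.findall(r'"[^"]*"', line)

# Lazy: stops at the first closing quote it can.
lazy = re.findall(r'".*?"', line)

# A lazy range prefers the shortest count, so \d{4,5}? acts like \d{4} here.
lazy_digits = re.search(r"\d{4,5}?", "123456").group()

print(greedy, negated, lazy, lazy_digits)
```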

Multipliers can be made lazy by appending a question mark. This reverses the order of preference, so that shorter matches are tried first:

\d{4,5}? means "match \d\d\d\d or \d\d\d\d\d". On its own, this behaves exactly like \d{4}.
colou??r is colou{0,1}?r, which means "match color or colour". This behaves exactly like colou?r.
".*?" means "match a double quote, then as few characters as possible, then a double quote". Unlike the two examples above, this one is actually useful.

Branch (alternation)
You can use the pipe symbol to match any one of several alternatives:

cat|dog means "match cat or dog".
red|blue| and red||blue and |red|blue all mean the same thing: "match red or blue or the empty string".
a|b|c is the same as [abc].
cat|dog|\| means "match cat or dog or a pipe symbol".
[cat|dog] means "match a or c or d or g or o or t or a pipe symbol".
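
A minimal Python sketch of alternation, showing in particular how different the pipe is inside and outside a character class. The inputs are invented.

```python
import re

# Outside a class the pipe separates whole alternatives.
animals = re.findall(r"cat|dog", "cat, dog, cow")

# An escaped pipe is a literal pipe.
with_pipe = re.findall(r"cat|dog|\|", "dog|cat")

# Inside a class the pipe is just one more character in the set.
class_hits = re.findall(r"[cat|dog]", "c|z")

# A trailing empty alternative lets the whole expression match nothing at all.
matches_empty = re.fullmatch(r"red|blue|", "") is not None

print(animals, with_pipe, class_hits, matches_empty)
```
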

Practice
Simplify the following regular expressions as far as you can:

s|t|u|v|w
aa|ab|ba|bb
[abc]|[^abc]
[^ab]|[^bc]
[ab][ab][ab]?[ab]?

Answer
[s-w]
[ab]{2}
.
[^b]
[ab]{2,4}

Practice
Write a regular expression that matches an integer between 1 and 31 inclusive. Remember, [1-31] is not the right answer.

Answer
There are several ways to do this. I think [1-9]|[12][0-9]|3[01] is the most readable.
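
One way to gain confidence in an answer like this is to test it exhaustively. The Python sketch below checks every integer from 0 to 99, using fullmatch so that the whole string has to match rather than just a prefix.

```python
import re

day = re.compile(r"[1-9]|[12][0-9]|3[01]")

# Keep every number whose decimal form is matched in full.
accepted = [n for n in range(0, 100) if day.fullmatch(str(n))]
print(accepted == list(range(1, 32)))
```

When searching inside longer text rather than validating a whole string, the order of the alternatives would matter, because the first alternative that succeeds wins.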

Combination (Grouping)
You can use parentheses to group expressions:

To find a day of the week, use (Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day.
(\w*)ility is equivalent to \w*ility. Both mean "match a word ending in ility". You will see later why the first form is more useful.
\(\) means "match an opening parenthesis, then a closing parenthesis".
[()] means "match an opening parenthesis or a closing parenthesis".

Practice
In "The Time Machine", use a regular expression to find passages enclosed in parentheses. Then modify your answer to find passages that are enclosed in parentheses but contain no parentheses themselves.

Answer
\(.*\), followed by \([^()]*\).

A group may contain the empty string:

(red|blue|) means "match red or blue or the empty string".
abc()def is equivalent to abcdef.

You can also use multipliers on groups:

(red|blue)? is equivalent to (red|blue|).
\w+(\s+\w+)* means "match one or more words separated by whitespace".
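
The grouping examples can be tried out as follows. This is a small Python sketch with invented inputs; fullmatch is used so that each pattern has to account for the entire string.

```python
import re

weekday = re.compile(r"(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day")
is_weekday = weekday.fullmatch("Wednesday") is not None
not_weekday = weekday.fullmatch("Someday") is None

# A multiplier after a group repeats the whole group.
phrase = re.fullmatch(r"\w+(\s+\w+)*", "one two  three") is not None

# (red|blue)? also accepts the empty string, like (red|blue|).
optional_empty = re.fullmatch(r"(red|blue)?", "") is not None

# Parentheses without nested parentheses.
inner = re.findall(r"\([^()]*\)", "f(a) and g((b))")

print(is_weekday, not_weekday, phrase, optional_empty, inner)
```
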

Practice
Simplify \w+\W+\w+\W+\w+ and \w+\W+\w+\W+\w+\W+\w+\W+\w+\W+\w+.

Answer
\w+(\W+\w+){2} and \w+(\W+\w+){5}.

Word boundary (word boundaries)
A word boundary is the position between a word character and a non-word character. Remember, a word character is \w, that is [0-9A-Za-z_], and a non-word character is \W, that is [^0-9A-Za-z_].

The beginning and end of the text also count as word boundaries when the character next to them is a word character.

The input text it's a cat has eight word boundaries. If we append a space after cat, most implementations still find eight, because the end of the text is then next to a non-word character.

The regular expression \b means "match a word boundary".
\b\w\w\w\b means "match a three-letter word".
a\ba means "match a, then a word boundary, then a". Whatever the input text, this regular expression never matches.

Word boundaries are not characters. They have zero width. The following regular expressions all behave identically:

(\bcat)\b
(\bcat\b)
\b(cat)\b
\b(cat\b)
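
Word boundaries can be counted directly. The Python sketch below does so for the example text; the counts are what Python's re module reports, and other implementations may treat the ends of the text differently.

```python
import re

# Each zero-width match of \b is one boundary.
boundaries = len(re.findall(r"\b", "it's a cat"))
boundaries_with_space = len(re.findall(r"\b", "it's a cat "))
print(boundaries, boundaries_with_space)

# Three-letter words only: on, a and today are the wrong length.
three_letter = re.findall(r"\b\w\w\w\b", "the cat sat on a mat today")
print(three_letter)

# There is never a boundary between two word characters.
never = re.search(r"a\ba", "aa a a") is None
print(never)
```
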

Practice
Find the longest word in the dictionary.

Answer
After some trial and error, the regular expression \b.{45,}\b finds a single result in the dictionary: pneumonoultramicroscopicsilicovolcanoconiosis.

Line boundaries
Every piece of text breaks down into one or more lines, separated by line breaks, like this:

Line
Line break
Line
Line break
...
Line break
Line

Note that the text always ends with a line, not with a line break. However, any line, including the last, may contain zero characters.

A start-of-line is the position between a line break and the first character of the next line. As with word boundaries, the beginning of the text also counts as a start-of-line.

An end-of-line is the position between the last character of a line and the line break. As with word boundaries, the end of the text also counts as an end-of-line.

So the picture becomes:

Start-of-line, line, end-of-line
Line break
Start-of-line, line, end-of-line
Line break
...
Line break
Start-of-line, line, end-of-line

Based on this:

The regular expression ^ means "match a start-of-line".
The regular expression $ means "match an end-of-line".
^$ means "match an empty line".
^.*$ matches the entire text, because a line break is a character and so . matches it. To match a single line, use a lazy multiplier: ^.*?$.
\^\$ means "match a caret followed by a dollar sign".
[$] means "match a dollar sign". However, [^] is not a valid regular expression. Remember that the caret has a different special meaning inside square brackets. To put a caret in a character class, use [\^].

Like word boundaries, line boundaries are not characters. They have zero width. The following regular expressions all behave identically:

(^cat)$
(^cat$)
^(cat)$
^(cat$)
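
Python is an example of an implementation whose defaults differ from the assumptions in this article: ^ and $ mean start and end of the whole text unless re.MULTILINE is given, and . skips line breaks unless re.DOTALL is given. The sketch below, on a made-up three-line text, shows both flags.

```python
import re

text = "first line\n\nthird line"

# With re.MULTILINE, ^ and $ match at every line boundary.
lines = re.findall(r"^.*$", text, re.MULTILINE)
empty_lines = re.findall(r"^$", text, re.MULTILINE)

# With re.DOTALL the dot crosses line breaks, so ^.*$ swallows the whole text.
whole_text = re.findall(r"^.*$", text, re.DOTALL)

# A literal caret and dollar need escaping outside a class.
literal = re.search(r"\^\$", "cost^$").group()

print(lines, empty_lines, whole_text, literal)
```
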

Practice
Use a regular expression to find the longest line in "The Time Machine".

Answer
Using the regular expression ^.{73,}$, we find that the longest line in this Project Gutenberg edition is 73 characters long. Several lines have this length.

Text boundaries
Many implementations provide a flag that changes the meaning of ^ and $ from "start of line" and "end of line" to "start of text" and "end of text".

Other implementations provide separate metacharacters, \A and \z, for this purpose.

Capture and replace

This is where the regular expression begins to become exceptionally powerful.

Capturing groups
As you already know, parentheses are used to denote groups. They can also be used to capture substrings. If a regular expression is a tiny computer program, the capturing groups are (part of) its output.

The regular expression (\w*)ility means "match a word ending in ility". Capturing group 1 is the part of the text matched by \w*. For example, if the text contains the word accessibility, capturing group 1 is accessib. If the text contains just ility, capturing group 1 is the empty string.

You can have multiple capturing groups, and they can even be nested. Capturing groups are numbered from left to right: just count the opening parentheses.

Suppose our regular expression is (\w+) had a ((\w+) \w+). If our input text is I had a nice day, then

Capturing group 1 is I.
Capturing group 2 is nice day.
Capturing group 3 is nice.

In some implementations, you can also access capturing group 0, which is the whole match: I had a nice day.

Yes, this does mean that parentheses are somewhat overloaded. Some implementations provide a separate syntax for declaring "non-capturing groups", but that syntax is not standardized, so we will not cover it here.

The number of capturing groups returned by a successful match is always equal to the number of capturing groups in the original regular expression. Remember this, because it explains some otherwise confusing cases.

The regular expression ((cat)|dog) means "match cat or dog". There are always two capturing groups. If our input text is dog, then capturing group 1 is dog and capturing group 2 is empty (an empty string or a null value, depending on the implementation), because that alternative was not used.

The regular expression a(\w)* means "match a word beginning with a". There is always exactly one capturing group (not counting group 0):

If the input text is a, capturing group 1 is empty.
If the input text is ad, capturing group 1 is d.
If the input text is avocado, capturing group 1 holds a single character, because the group is overwritten on each repetition. Most implementations keep the last repetition, o. Capturing group 0, however, is the whole word, avocado.
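
The group numbering rules can be inspected directly in Python, as sketched below. Note two Python-specific details: an unused group is reported as None rather than an empty string, and a repeated group keeps its last repetition.

```python
import re

m = re.fullmatch(r"(\w+) had a ((\w+) \w+)", "I had a nice day")
# Groups are numbered by their opening parenthesis, left to right.
numbered = (m.group(0), m.group(1), m.group(2), m.group(3))
print(numbered)

# The group count is fixed by the pattern, even when an alternative is unused.
unused = re.fullmatch(r"((cat)|dog)", "dog").groups()
print(unused)

# A repeated group holds only one repetition (the last one, in Python).
repeated = re.fullmatch(r"a(\w)*", "avocado").group(1)
never_ran = re.fullmatch(r"a(\w)*", "a").group(1)
print(repeated, never_ran)
```
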

Replacement
Once you have used a regular expression to find a string, you can specify another string to replace it with. This second string is the replacement expression. At first, this is just like:

A conventional Replace dialog box
Java's String.replace() function
PHP's str_replace() function
And so on

Practice
Replace every vowel in "The Time Machine" with an r. Make sure you use the correct case!

Answer
Use the regular expressions [aeiou] and [AEIOU], with the replacement expressions r and R respectively.

However, you can refer to capturing groups in your replacement expression. This is the only special thing you can do in a replacement expression, and it is incredibly powerful, because it means you do not have to throw away everything you just found.

For example, suppose you want to convert American-style dates (MM/DD/YY) into ISO 8601 format (YYYY-MM-DD).

Start with the regular expression (\d\d)/(\d\d)/(\d\d). Note that there are three capturing groups: the month, the day and the two-digit year.
Refer to a capturing group using a backslash and the group number. So, your replacement expression is 20\3-\1-\2.
If our input text is 03/04/05 (meaning March 4, 2005), then:
Capturing group 1 is 03
Capturing group 2 is 04
Capturing group 3 is 05
The replacement string is 2005-03-04

You can refer to a capturing group more than once in the replacement expression.

To double every vowel, use the regular expression ([aeiou]) and the replacement expression \1\1.

Literal backslashes in the replacement expression must be escaped. For example, suppose you have some text that you want to use as a string literal in a computer program. That means putting a backslash in front of every double quote or backslash in the original text.

In the regular expression ([\\"]), capturing group 1 is the double quote or backslash.
The replacement expression is \\\1: a literal backslash, followed by the double quote or backslash that was matched.
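
The three replacements above can be run as written using Python's re.sub, which happens to use the same backslash-number syntax for group references. This is only a sketch; the input strings other than 03/04/05 are invented.

```python
import re

# Reorder the captured month, day and year into ISO 8601 order.
iso_date = re.sub(r"(\d\d)/(\d\d)/(\d\d)", r"20\3-\1-\2", "03/04/05")
print(iso_date)

# Refer to the same group twice to double each vowel.
doubled = re.sub(r"([aeiou])", r"\1\1", "time machine")
print(doubled)

# Put a backslash in front of every double quote or backslash.
escaped = re.sub(r'([\\"])', r"\\\1", 'say "hi" \\ bye')
print(escaped)
```
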

Back-references
You can also refer to a capturing group later in the same regular expression. This is called a back-reference.

For example, recall that [abc]{2} means "match aa or ab or ac or ba or bb or bc or ca or cb or cc". But ([abc])\1 means "match aa or bb or cc".
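
The difference between a multiplier and a back-reference shows up clearly when both are run over the same nine strings. A minimal Python sketch, with a tiny invented word list standing in for the dictionary:

```python
import re
from itertools import product

pairs = ["".join(p) for p in product("abc", repeat=2)]

# \1 must repeat exactly what group 1 captured, so only doubled letters pass.
doubled = [s for s in pairs if re.fullmatch(r"([abc])\1", s)]
print(doubled)

# Words made of one string repeated twice.
repeats = [w for w in ["papa", "coco", "pasta"] if re.fullmatch(r"(.{2,})\1", w)]
print(repeats)
```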

Practice
In the dictionary, find the longest word that consists of the same string repeated twice (for example, papa, coco).

Answer
\b(.{6,})\1\b matches chiquichiqui. If we don't care about whole words, we can drop the word boundary assertions: (.{7,})\1 finds countercountermeasure and countercountermeasures.

Programming with regular expressions

Some specific points to note:

Excessive backslash syndrome
In some programming languages, such as Java, there is no special support for strings containing regular expressions. Strings have their own escaping rules, which stack on top of the escaping rules for regular expressions, often resulting in backslash overload. For example (still in Java):

To match a digit, the regular expression \d becomes String re = "\\d"; in source code.
To match a double-quoted string, "[^"]*" becomes String re = "\"[^\"]*\"";.
To match a backslash or an opening or closing square bracket, the regular expression [\\\[\]] becomes String re = "[\\\\\\[\\]]";.
String re = "\\s"; and String re = "[ \t\r\n]"; are equivalent. Note the different levels of escaping.

In other programming languages, regular expressions are marked by a special delimiter, usually the forward slash /. Here are some JavaScript examples:

To match a digit, \d becomes var regExp = /\d/;.
To match a backslash or an opening or closing square bracket, var regExp = /[\\\[\]]/;.
var regExp = /\s/; and var regExp = /[ \t\r\n]/; are equivalent.
Of course, this means that forward slashes, rather than double quotes, must be escaped. To match the first part of a URL: var regExp = /https?:\/\//;.

With all this in mind, I hope you can see why I keep going on about backslashes.
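
Python sits between the two approaches above: it has no regex literal, but its raw string literals (r"...") switch off the string-level escaping, so the regular expression can usually be written as-is. The sketch below is one illustration of that, not the only way to do it.

```python
import re

# A raw string and a doubly escaped normal string hold the same two characters.
same_pattern = r"\d" == "\\d"

# The bracket-and-backslash class needs no extra layer of escaping in a raw string.
specials = re.findall(r"[\\\[\]]", "a\\b[c]")

# Double quotes are no problem inside a single-quoted raw string.
quoted = re.findall(r'"[^"]*"', 'x = "one", y = "two"')

# Forward slashes are not special at all here.
scheme = re.match(r"https?://", "https://example.com").group()

print(same_pattern, specials, quoted, scheme)
```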

Offset (offsets)
In a text editor, a search starts wherever your cursor is. The editor searches forward through the text and stops at the first match. The next search starts immediately after the end of that match.

When programming, you need to keep track of an offset into the text. The offset is supplied explicitly in your code, or stored in the object containing the text (as in Perl), or in the object containing the regular expression (as in JavaScript). (In Java, it is held in a composite object built from the regular expression and the string.) In every case the default is 0, the beginning of the text. After a search, the offset is either updated automatically or returned as part of the output.

Either way, it is usually easy to wrap this in a loop to find every match.

Attention: it is entirely possible for a regular expression to match the empty string. A simple example you can try right away is a{0}. In that case the new offset equals the old offset, which results in an infinite loop.

Some implementations may protect you from this, but check the documentation.
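
The loop described above, including a guard against empty matches, might look like the following Python sketch. The function name find_all is invented for this example; Python's own re.finditer already does this job, so the explicit loop is only there to make the offset handling visible.

```python
import re

def find_all(pattern, text):
    """Return the (start, end) span of every match, tracking the offset by hand."""
    compiled = re.compile(pattern)
    spans = []
    offset = 0
    while offset <= len(text):
        match = compiled.search(text, offset)
        if match is None:
            break
        spans.append(match.span())
        if match.end() > match.start():
            # Normal case: resume right after the match.
            offset = match.end()
        else:
            # Empty match: step forward one character to avoid looping forever.
            offset = match.end() + 1
    return spans

print(find_all(r"a{3,5}", "I had an aaawful daaaaay"))
print(find_all(r"a{0}", "ab"))
```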

Dynamic Regular Expressions
Be careful when constructing a regular expression string dynamically. If the string you use is not fixed, it may contain unexpected metacharacters. This can cause a syntax error. Worse, it can produce a regular expression that is syntactically valid but behaves unexpectedly.

Buggy Java code:

// filePath is assumed to be a String holding a file system path
String sep = System.getProperty("file.separator");
String[] directories = filePath.split(sep);

The bug: String.split() treats sep as a regular expression. On Windows, sep is a string consisting of a single backslash, written "\\" in Java source. That is not a syntactically valid regular expression. The result: a PatternSyntaxException.

Any good programming language provides a mechanism for escaping all the metacharacters in a string. In Java, you can do this:

// requires: import java.util.regex.Pattern;
String sep = System.getProperty("file.separator");
// Pattern.quote() turns sep into a pattern that matches it literally
String[] directories = filePath.split(Pattern.quote(sep));
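
The same bug and the same fix can be reproduced in Python, where the escaping function is re.escape. In this sketch the separator and the path are hard-coded, hypothetical Windows-style values, so that it behaves the same on any operating system.

```python
import re

sep = "\\"                           # a single backslash, as on Windows
path = "C:\\Users\\me\\notes.txt"    # hypothetical path

# Using the raw separator as a pattern is a syntax error.
try:
    re.split(sep, path)
    failed = False
except re.error:
    failed = True

# Escaping the metacharacters first gives the intended literal split.
directories = re.split(re.escape(sep), path)

print(failed, directories)
```

When the separator is a fixed string like this, a plain string split (path.split(sep)) avoids regular expressions altogether.
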

Regular expressions in a loop
Compiling a regular expression string into a runnable "program" is an expensive operation. You can improve performance by not doing it inside a loop.
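
In Python, this advice translates into compiling the pattern once and reusing the compiled object, as in the sketch below. The names DATE and dated_lines are invented. Python's re module also keeps an internal cache of recently compiled patterns, so the saving is smaller there than in some other languages, but the explicit form makes the intent clear.

```python
import re

# Compiled once, outside any loop.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def dated_lines(lines):
    """Return the lines that contain a YYYY-MM-DD date."""
    return [line for line in lines if DATE.search(line)]

print(dated_lines(["released 2005-03-04", "no date here", "due 1999-12-31!"]))
```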

Various recommendations

Input validation
Regular expressions can be used to validate user input. But overly strict validation makes life miserable for users. Some examples follow.

Payment card number

On a website, I enter my card number as 1234 5678 8765 4321. The site rejects it, because it validates the input with \d{16}.

The regular expression should allow for spaces and hyphens.

In fact, why not simply strip out all non-digit characters first and then validate? To do this, use the regular expression \D with the empty string as the replacement expression.

Practice
Write a regular expression that validates my card number without stripping out the non-digit characters first.

Answer
\D*(\d\D*){16} is one of several ways to do it.
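
Both approaches, stripping first or validating in place, are sketched below in Python. The function names are invented, and the check only counts digits; it says nothing about whether the number is a real card number.

```python
import re

def digits_only(raw):
    """Remove every non-digit character."""
    return re.sub(r"\D", "", raw)

def has_sixteen_digits(raw):
    """Strip first, then apply the strict \\d{16} check."""
    return re.fullmatch(r"\d{16}", digits_only(raw)) is not None

# The in-place alternative from the answer above.
IN_PLACE = re.compile(r"\D*(\d\D*){16}")

card = "1234 5678 8765 4321"
print(digits_only(card), has_sixteen_digits(card), IN_PLACE.fullmatch(card) is not None)
```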

Name

Do not use regular expressions to validate people's names. In fact, avoid validating names at all if you possibly can.

"Falsehoods Programmers Believe About Names" lists, among others:

Names do not contain spaces.
Names do not contain punctuation.
Names use only ASCII characters.
Names are restricted to some particular character set.
Names are always at least M characters long.
People have exactly one given name.
People have exactly one middle name.
People have exactly one surname.
...

Email address

Do not use regular expressions to validate email addresses.

First, it is very hard to get right. Email addresses do conform to a regular expression, but that expression is apocalyptically long and complex. Anything shorter is likely to produce false negatives. (Did you know that email addresses can contain comments?)

Second, even if an email address conforms to the regular expression, that does not prove it exists. The only way to validate an email address is to send an email to it.

Markup

In serious applications, do not use regular expressions to parse HTML or XML. Parsing HTML/XML is:

Impossible using simple regular expressions
Difficult in general
A solved problem

Find an existing parsing library that will do the work for you.

That's the 55 minutes of content.

Summary:

Literals: a b c d 1 2 3 4 and so on.
Character classes: . [abc] [a-z] \d \w \s
. means "any character"
\d means "a digit"
\w means "a word character", [0-9A-Za-z_]
\s means "a space, tab, carriage return or line feed"
Negated character classes: [^abc] \D \W \S
Multipliers: {4} {3,16} {1,} ? * +
? means "zero or one"
* means "zero or more"
+ means "one or more"
Multipliers are greedy unless you put a ? after them
Alternation and grouping: (Septem|Octo|Novem|Decem)ber
Word, line and text boundaries: \b ^ $ \A \z
Back-references to capturing groups: \1 \2 \3 and so on (valid in both replacement expressions and matching expressions)
Metacharacters: . \ [ ] { } ? * + | ( ) ^ $
Metacharacters inside a character class: [ ] \ - ^
You can always escape a metacharacter with a backslash: \

Thanks for reading
Regular expressions are ubiquitous and incredibly useful. Anyone who spends much time editing text or writing computer programs should learn how to use them. So far, we have only scratched the surface.

Practice
Go on to read the documentation for the regular expression implementation of your choice. I guarantee it has features beyond the ones covered here.


