The Story of Senior Architect Sum: Regular Expressions! Getting Started Is Easy

Source: Internet
Author: User
Tags: control characters, printable characters

| Story Background

The workplace is like a battlefield! Sum led three teammates for two weeks and finally got the agent feature finished. If the product manager is the devil's commander, then the testers are the devil's drill instructors. Those two weeks gave Sum a deep appreciation of what sunrise in X city looks like.

But then again, the days of hard fighting are both sour and sweet. Looking back on them often brings a sugar-sweet smile to the corners of the mouth: they are the silhouette of one's younger years, the outline of the rising sun seen when one was young.

I don't know whether other people who love programming feel the same way, but Sum does.

Sum keeps marching on.

That day Sum arrived at the company early. It was 8 o'clock and the sun was just getting warm. The morning mist had not yet dispersed; seen from the floor-to-ceiling window, the world looked like a freshly opened steamer of buns, billowing hot steam.

Sum turned on the computer and used the boot time to brew a cup of coffee. By the time he was back at his seat, the computer was ready.

He looked at the work schedule: as usual, code review, requirements grooming, meetings, meetings, and more meetings.

He started with the review of yesterday's code. Sum pulled yesterday's commits from git (company X has fixed release dates, so code submitted one day is reviewed the next day by the project lead, Sum, and the others responsible). Once the download finished, he opened Zend Studio (Sum is in charge of the company's PHP module development, so he reviews that part of the code), imported the latest project, checked it against the update log, opened the modified files, took another sip of coffee, and began reading.

While going through a function library file, Sum found a function curlapi($url, $param, $type = 1) whose implementation contained the following piece of code, which looked very awkward:


    if (strpos($response, "__callback(") !== false) {
        $response = str_replace(array("__callback(", ")"), array("", ""), $response);
    }
    $returnData = json_decode($response, true);
    return $returnData;

At first glance it looks fine: the function calls an API through curl, and the returned $response may contain the string __callback( rather than being pure JSON text. So the teammate added a check before returning: if the response contains __callback(, that string and the closing ) are replaced with an empty string, and then the JSON is converted to an array. But in Sum's eyes there is a landmine buried here. What landmine? If the interface provider is in a bad mood one day and changes __callback to __query without notifying the technical department, this approach stops working. What then? Sum's first idea was to use a regular expression to match this kind of special response value.
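One way to make the check independent of the callback name is to match any identifier followed by a parenthesized payload. The sketch below is only an illustration of Sum's idea, not code from the original project: the helper name decodeJsonOrJsonp and the sample responses are hypothetical. It also avoids a second weakness of the str_replace version, which strips every ")" in the response, including ones inside JSON string values.

```php
<?php
/**
 * Decode a response that is either plain JSON or JSON wrapped in a callback.
 * Hypothetical helper for illustration; not part of the original curlapi().
 */
function decodeJsonOrJsonp(string $response)
{
    $response = trim($response);
    // Optional wrapper: an identifier (__callback, __query, ...), "(",
    // the payload, ")" and an optional trailing ";".
    // The s modifier lets "." span line breaks inside the payload.
    if (preg_match('/^[A-Za-z_$][\w$.]*\s*\((.*)\)\s*;?$/s', $response, $m) === 1) {
        $response = $m[1];
    }
    // json_decode() returns null when the payload is not valid JSON.
    return json_decode($response, true);
}

var_dump(decodeJsonOrJsonp('__callback({"code":0,"msg":"ok (cached)"})'));
var_dump(decodeJsonOrJsonp('__query({"code":0})'));
var_dump(decodeJsonOrJsonp('{"code":0}'));
```

Because the pattern is anchored with ^ and $, a plain JSON body that merely contains parentheses is left untouched.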

With the idea in mind, Sum wanted to turn it into a tutorial for the team, so he opened PowerPoint and started preparing the training materials...

| Requirements Analysis

  Sum wants a simple regular expression tutorial that is easy to teach.

| Getting to Work

  Sum's thinking was this: to reach for regular expressions instinctively, you must first memorize the regular expression rule tables until they are in your bones, and then work through a number of demo exercises, so that whenever you need to process strings or text (for a crawler, say), regular expressions are the first thing that comes to mind. So in the first chapter of the PPT, Sum posted the regular expression rule tables.

  Ordinary characters: ordinary characters include all printable and nonprinting characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some other symbols. For example, /demo/ tests whether a string or text contains the word demo.

  Nonprinting characters: some characters exist in the computer but cannot be displayed or printed. In the ASCII table, for example, the values 0-31 are control characters and cannot be displayed or printed.

  Nonprinting characters can also be part of a regular expression. The following table lists the escape sequences that represent nonprinting characters:

Character    Description
\cx          Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return. The value of x must be one of A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\f           Matches a form feed (page break). Equivalent to \x0c and \cL.
\n           Matches a line break. Equivalent to \x0a and \cJ.
\r           Matches a carriage return. Equivalent to \x0d and \cM.
\s           Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v]. Note that Unicode-aware regular expressions also match full-width whitespace characters.
\S           Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t           Matches a tab character. Equivalent to \x09 and \cI.
\v           Matches a vertical tab. Equivalent to \x0b and \cK.
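To see a few of these escapes at work, here is a minimal sketch in PHP (the sample text is made up for illustration). Double-quoted PHP strings turn "\t", "\r" and "\n" into real control characters, while the single-quoted patterns pass the backslash sequences through to the regex engine.

```php
<?php
// Sample text with a tab, a carriage return and a line feed (made up for the demo).
$text = "name\tSum\r\nrole\tarchitect";

// \s+ splits on any run of whitespace (tab, CR, LF, space, ...).
$words = preg_split('/\s+/', $text);
print_r($words);

// \S+ collects the runs of non-whitespace characters instead.
preg_match_all('/\S+/', $text, $m);
print_r($m[0]);

// \cM, \x0d and \r all describe the same carriage-return character.
$sameCr = preg_match('/\cM/', "\r") === 1
    && preg_match('/\x0d/', "\r") === 1
    && preg_match('/\r/', "\r") === 1;
var_dump($sameCr);

// \t and \x09 both match the tab character.
$sameTab = preg_match('/\t/', "a\tb") === 1 && preg_match('/\x09/', "a\tb") === 1;
var_dump($sameTab);
```

Both arrays contain the same four words, which shows that \s and \S are complements of each other.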



Special characters: special characters are characters that carry a special meaning, such as the * in runoo*b, which means that the preceding character may appear zero or more times. To find a literal * in a string, you need to escape it by putting a \ in front of it: runo\*ob matches the string runo*ob.

Many metacharacters need this special treatment when you want to match them literally. To match these special characters, you must first escape them, that is, place a backslash (\) in front of them.

The following table lists the special characters in regular expressions:

Character    Description
$            Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( )          Mark the start and end of a subexpression. The subexpression can be captured for later use. To match these characters, use \( and \).
*            Matches the preceding subexpression zero or more times. To match the * character, use \*.
+            Matches the preceding subexpression one or more times. To match the + character, use \+.
.            Matches any single character except the newline character \n. To match ., use \.
[            Marks the beginning of a bracket expression. To match [, use \[.
?            Matches the preceding subexpression zero or one time, or marks a qualifier as non-greedy. To match the ? character, use \?.
\            Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n', '\n' matches a line break, the sequence '\\' matches '\', and '\(' matches '('.
^            Matches the starting position of the input string, unless it is used inside a bracket expression, where it negates the character set. To match the ^ character itself, use \^.
{            Marks the start of a qualifier expression. To match {, use \{.
|            Indicates a choice between two items. To match |, use \|.
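The difference between a metacharacter and its escaped form can be checked with a short PHP sketch. The helper name containsLiteral and the sample strings are hypothetical; preg_quote() is PHP's built-in function for escaping every special character in a piece of text.

```php
<?php
// In runoo*b the * is a quantifier: "runo" + zero or more "o" + "b".
$star1 = preg_match('/^runoo*b$/', 'runob');     // matches (zero extra "o")
$star2 = preg_match('/^runoo*b$/', 'runoooob');  // matches (several "o")
$star3 = preg_match('/^runoo*b$/', 'runoo*b');   // no match: "*" is not a literal here

// Escaped, \* matches the literal asterisk.
$literal = preg_match('/^runo\*ob$/', 'runo*ob');

/**
 * Test whether $needle occurs in $haystack as plain text.
 * Hypothetical helper: preg_quote() escapes all regex metacharacters,
 * and its second argument also escapes the "/" delimiter.
 */
function containsLiteral(string $haystack, string $needle): bool
{
    $pattern = '/' . preg_quote($needle, '/') . '/';
    return preg_match($pattern, $haystack) === 1;
}

var_dump($star1, $star2, $star3, $literal);
var_dump(containsLiteral('The price: $5 (approx.) today', 'price: $5 (approx.)'));
```

When the text to search for comes from user input or another system, letting preg_quote() do the escaping is usually safer than adding backslashes by hand.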

Qualifiers: a qualifier specifies how many times a given component of a regular expression must appear for the match to succeed. There are six kinds in total: *, +, ?, {n}, {n,}, and {n,m}.

The qualifiers for a regular expression are:

Character    Description
*            Matches the preceding subexpression zero or more times. For example, zo* can match "z" and "zoo". * is equivalent to {0,}.
+            Matches the preceding subexpression one or more times. For example, 'zo+' can match "zo" and "zoo", but not "z". + is equivalent to {1,}.
?            Matches the preceding subexpression zero or one time. For example, do(es)? can match "do", the "does" in "does", and the "do" in "doxy". ? is equivalent to {0,1}.
{n}          n is a non-negative integer. Matches exactly n times. For example, 'o{2}' cannot match the 'o' in "Bob", but can match the two o's in "food".
{n,}         n is a non-negative integer. Matches at least n times. For example, 'o{2,}' cannot match the 'o' in "Bob", but can match all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}        Both m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there can be no spaces between the comma and the two numbers.
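The examples in the table can be verified directly. The following PHP sketch uses a small hypothetical helper, firstMatch, that returns the text of the first match or null when nothing matches.

```php
<?php
/** Return the first match of $pattern in $subject, or null (hypothetical helper). */
function firstMatch(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

var_dump(firstMatch('/zo*/', 'z'));           // * allows zero "o"
var_dump(firstMatch('/zo*/', 'zoo'));
var_dump(firstMatch('/zo+/', 'z'));           // + needs at least one "o"
var_dump(firstMatch('/do(es)?/', 'doxy'));    // ? makes "es" optional
var_dump(firstMatch('/do(es)?/', 'does'));
var_dump(firstMatch('/o{2}/', 'Bob'));        // only one "o" available
var_dump(firstMatch('/o{2}/', 'food'));
var_dump(firstMatch('/o{2,}/', 'foooood'));   // takes all five "o"
var_dump(firstMatch('/o{1,3}/', 'fooooood')); // stops after three "o"
```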

  Locators (anchors): locators let you pin a regular expression to the beginning or end of a line. They also let you create regular expressions that match within a word, at the beginning of a word, or at the end of a word. A locator describes the boundary of a string or a word: ^ and $ refer to the beginning and end of a string, respectively.

  \b describes the boundary before or after a word, and \B represents a non-word boundary.

The locators for regular expressions are:

Character    Description
^            Matches the starting position of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after \n or \r.
$            Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before \n or \r.
\b           Matches a word boundary, that is, the position between a word and a space.
\B           Matches a non-word boundary.
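A short PHP sketch of the locators follows. The sample strings are made up; note that in PHP the multiline behaviour is switched on with the m pattern modifier rather than a RegExp property, and this sketch only uses \n as the line separator.

```php
<?php
// ^ and $ pin the pattern to the start and end of the subject.
$allDigits = preg_match('/^\d+$/', '12345');    // every character is a digit
$notAll    = preg_match('/^\d+$/', '123a45');   // the "a" breaks the match

// With the m (multiline) modifier, ^ also matches right after each "\n".
preg_match_all('/^\w+/m', "alpha one\nbeta two", $lineStarts);
print_r($lineStarts[0]); // first word of each line

// \b matches at a word boundary, \B at any position that is not one.
$wholeWord   = preg_match('/\bcat\b/', 'a cat sat');     // "cat" stands alone
$embedded    = preg_match('/\bcat\b/', 'concatenate');   // "cat" is inside a word
$nonBoundary = preg_match('/\Bcat\B/', 'concatenate');   // so \B matches here

var_dump($allDigits, $notAll, $wholeWord, $embedded, $nonBoundary);
```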

Be sure to memorize!! Be sure to memorize!!! Be sure to memorize!!!!!!!!!!

Sum wrote this line in the PPT with a flourish, and highlighted it specially so that the words would stand out even more!

Having dealt with code for many years, Sum knew that the most persuasive approach is to write demos that break the knowledge points down. The most common scenario for regular expressions is the crawler (at the application level, that is; compilers and the like are deeper uses), so he chose the source of a web page (link).

After saving the page source, Sum decided to write the demos with regular expressions, and typed the sequence of demos into the PPT:

"Get the title of a webpage: getting started is that easy"

"Delete useless tags: if you take one, take the first"

"Get tags by class name: where to take them from?"

"Batch-fetch matching tags: my subset, I'm the boss"

"Paired tags: making data more accurate"

With the demo outline written in the PPT, Sum needed to code up each item, and he could not wait to get started.

  Get the title of a webpage: getting started is that easy

  Anyone who has worked with static pages knows that a page's title is defined in the head and is enclosed between the <title></title> tags.


Also, in standards-compliant HTML the title is unique; in other words, there is only one title tag. That makes crawling the HTML title easy: write one rule that matches the HTML title tag, then add a sub-rule (a capture group) to the regular expression to extract the title's content.

Sum then opened the editor and wrote a PHP function that matches that part of the HTML code and extracts the title.


The figure above shows the getTitle() function Sum wrote. The core regular expression is /\<title\>(.*)\<\/title\>/. Because / is the pattern delimiter, it must be escaped when PHP builds the pattern; < and > are not regular expression metacharacters, so escaping them is harmless but not required. (PS: which other characters need to be escaped? You can reply in the comments section.)
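Sum's screenshot is not reproduced here, so the following is only a reconstruction of what such a function might look like, built around the regular expression quoted in the text. The return convention (null when no title is found) and the sample HTML are assumptions.

```php
<?php
/**
 * Extract the text between <title> and </title> from an HTML string.
 * A reconstruction for illustration; the original code is not available.
 */
function getTitle(string $html): ?string
{
    // "/" is the pattern delimiter, so the "/" in "</title>" is escaped.
    if (preg_match('/<title>(.*)<\/title>/', $html, $matches) === 1) {
        // $matches[0] is the whole match, $matches[1] the sub-pattern (.*).
        return $matches[1];
    }
    return null; // no <title>...</title> found on a single line
}

// Made-up sample page.
$html = '<html><head><title>Hello regex</title></head><body></body></html>';
var_dump(getTitle($html));
var_dump(getTitle('<html><head></head></html>'));
```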

Sum annotated the rule to explain why it matches the title.

1. <title>: find the string <title> in the text and start matching from that position.

2. (.*) is the sub-match rule (capture group). The . stands for any character other than a line break. Looking at the HTML we want to crawl, <title> The Story of Senior Architect Sum: (MySQL) InnoDB, transaction handling in stored procedures - Programmer VIP - Blog Park </title>, the content we want sits on a single line, so . is a suitable wildcard. The * means zero or more of the characters matched by the dot. In other words, as long as the content is on one line, it can be matched.

3. </title>: the rule ends with this literal string, so the match ends where the string </title> is found.

4. As we all know, qualifiers have a greedy mode and a non-greedy (lazy) mode. The .* used here is greedy, not lazy: it extends as far as it can, so the match runs from <title> to the last </title> on the line. preg_match itself simply stops after the first match. Because a page has only one title, the result is exactly what we want; to stop explicitly at the first </title>, write the lazy form (.*?).
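The difference between greedy and lazy matching only shows up when the closing tag appears more than once. The sketch below uses a made-up snippet with two <b> elements to make it visible.

```php
<?php
// Made-up snippet containing the same pair of tags twice.
$html = '<b>first</b> and <b>second</b>';

// Greedy: .* grabs as much as possible, so the match runs to the LAST </b>.
preg_match('/<b>(.*)<\/b>/', $html, $greedy);
var_dump($greedy[1]);

// Lazy: .*? grabs as little as possible, so the match stops at the FIRST </b>.
preg_match('/<b>(.*?)<\/b>/', $html, $lazy);
var_dump($lazy[1]);

// preg_match_all() with the lazy pattern returns every pair.
preg_match_all('/<b>(.*?)<\/b>/', $html, $all);
print_r($all[1]);
```

With a single <title> per page both forms give the same result, which is why the greedy pattern works in the title demo.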


The above matching results are printed as follows:


After writing these comments, Sum looked at the code again and noticed a problem: what if the content of the title in the fetched text contains a line break? Like the code variant below ↓


The question is: how should the regular expression for this code be written?

This is the problem Sum posed to the team members. And you, reading "The Story of Senior Architect Sum": do you have an answer? You are welcome to reply in the comments.
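One possible answer, offered as a sketch rather than as the author's solution: PCRE's s modifier makes the dot match line breaks as well, so the same rule works across several lines. The function name getTitleMultiline, the whitespace clean-up, and the sample HTML are assumptions made for this example.

```php
<?php
/**
 * Extract a <title> that may span several lines.
 * Hypothetical variant of getTitle(); not taken from the original article.
 */
function getTitleMultiline(string $html): ?string
{
    // s: "." also matches "\n"; i: tolerate <TITLE>;
    // .*? stops at the first closing tag.
    if (preg_match('/<title>(.*?)<\/title>/si', $html, $m) === 1) {
        // Collapse line breaks and runs of spaces into single spaces.
        return trim(preg_replace('/\s+/', ' ', $m[1]));
    }
    return null;
}

// Made-up page whose title is broken across lines.
$html = "<head><title>\n  The story of\n  senior architect Sum\n</title></head>";
var_dump(getTitleMultiline($html));
```

Another common spelling of the same idea is the character class [\s\S]*? in place of .*?, which matches any character without needing the s modifier.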

(To be continued)
