Build regular expressions in 20 minutes without a search engine

Source: Internet
Author: User

Regular expressions are everywhere in modern programming languages. They tend to be dense with symbols, and I often see people switching back and forth between a search engine and their code editor just to assemble a reliable one. My goal in writing this article is that, by the time you finish reading, you will no longer need a search engine to build a regular expression.

The heart of a regular expression is its symbols; searching for plain letters is not hard for anyone. The symbols are what most people end up looking up, but if you stop and think about it, there are only so many common symbols on a computer. I did a rough tally of the symbols used in gtest, leaving out the newline character, and most of them turn up constantly in any computing context. Enough preamble; on to the real topic.

One, ordinary characters

Ordinary character search is something even people outside computing use all the time. The usual steps are: open the application's search function, type the string you want, and click Search. After a moment you either find what you need or you don't.

Let's think about this for a moment. Suppose I want to search for big. The computer's reasoning goes roughly like this: first find a b, then an i, then a g; if all three are found in sequence, report the match, otherwise give up. Of course, some string-search algorithms start comparing from the last character of the pattern instead. Either way, a search has a natural, implicit comparison order. Keep this in mind as you read the rest of the article.
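The comparison order described above can be sketched as a naive left-to-right search. This is only an illustration in Python (a language chosen for these examples, not one the article prescribes); the function name naive_find is hypothetical, and real search tools use faster algorithms.

```python
def naive_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    for start in range(len(text) - len(pattern) + 1):
        # Compare the pattern one character at a time, left to right.
        if all(text[start + i] == ch for i, ch in enumerate(pattern)):
            return start
    return -1


print(naive_find("a big dog", "big"))  # found: index of the b
print(naive_find("a bag", "big"))      # not found: -1
```

The inner comparison stops at the first mismatch, which is exactly the "find b, then i, then g, otherwise give up" order.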

Two, the three bracket brothers

If you ask me which symbols are most common in programming, I would pick brackets of every kind: curly braces, square brackets, and parentheses. The same holds for regular expressions, and it is no exaggeration to say that once you master these three, you have mastered more than half of regular expressions. Going in order of size, let's start with curly braces.

Curly braces, as in a{3}, act as multipliers (the usual English term is quantifier): they repeat whatever comes immediately before them. For example, a{3} is the same as aaa; the number inside the braces is the repeat count. A multiplier can also give a range, such as a{1,3}, which matches a, aa, or aaa. A curious reader will ask what happens when the lower or upper limit is left out, as in a{,3} or a{1,}. Leaving out the lower limit is not portable: some engines read a{,3} as 0 to 3 times, while others treat the braces as literal text, so always write the lower limit explicitly. Leaving out the upper limit means "at least once, with no cap". The technical term for how the engine handles this is greedy: match one a, then another, and keep going for as long as the input allows. Put plainly, a{1,} matches any run of one or more a's.
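The multiplier forms can be checked with a short sketch. It assumes Python's built-in re module and uses fullmatch so that the whole sample must fit the pattern; the helper name accepted and the sample strings are made up for illustration.

```python
import re

samples = ["", "a", "aa", "aaa", "aaaa"]


def accepted(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]


print(accepted(r"a{3}"))    # exactly three a's
print(accepted(r"a{1,3}"))  # one to three a's
print(accepted(r"a{1,}"))   # one or more a's, no upper limit

# Greedy: with no upper limit the engine takes as many a's as it can.
longest = re.match(r"a{1,}", "aaaab").group()
print(longest)
```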

Square brackets, as in [aeiou], are the most widely used symbols in regular expressions. The formal name is character class, but I find it easiest to think of it as a set. Written bare, aeiou means search for the string aeiou, whereas [aeiou] means search for a, or e, or i, and so on: any single one of the listed characters. By the same logic, [aeeiuooau] is exactly equivalent to [aeiou], which is why "set" describes it better than "class".
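A small sketch of the set idea, again assuming Python's re module; the sample word is arbitrary.

```python
import re

word = "regular"

as_set = re.findall(r"[aeiou]", word)          # any one of the listed letters
with_repeats = re.findall(r"[aeeiuooau]", word)  # repeats inside [] change nothing
as_string = re.findall(r"aeiou", word)           # the literal string aeiou

print(as_set)
print(with_repeats)
print(as_string)
```

The first two results are identical, and the third is empty because "regular" never contains the five letters in a row.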

Finally, parentheses, as in ((one)|two). In most of mathematics parentheses express precedence; in regular expressions they express grouping. So what is a group? The most important job of grouping is to provide a numbering mechanism that is handy in many places, such as replacement. I will come back to this later, because several symbols have not been introduced yet.

That covers what each of the three bracket brothers does. Remember the natural, implicit comparison order I asked you to keep in mind in the first part? This is where you will first feel it. When working with regular expressions, always keep it in mind. Take the following search pattern:

a[aei]{3}

This pattern reads as: search for a, then search for [aei]{3}. The second part means: match a or e or i, and do so three times in a row. That is a bit convoluted, so take an example: aaaa satisfies it. The first a is matched literally, and each of the next three a's is one of the letters in aei. By the same reasoning aeai, aaee and so on also match. axxx does not, because the second step fails; aei does not, because only two characters follow the a and the third repetition has nothing to match; eeee does not, because the very first step fails.
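The walk-through can be replayed in code. This sketch assumes Python's re module and whole-string matching via fullmatch; the helper name fits is hypothetical.

```python
import re

pattern = re.compile(r"a[aei]{3}")


def fits(word):
    """True when the whole word is an a followed by three of a/e/i."""
    return pattern.fullmatch(word) is not None


for word in ["aaaa", "aeai", "aaee", "axxx", "aei", "eeee"]:
    print(word, fits(word))
```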

So a regular expression search is the same as an ordinary search, just in a different form. Remember two points:

First, the pattern is processed in order, one element at a time.

Second, each symbol binds only once, to the single element next to it, and matching proceeds one character at a time.

Keep both in mind; you will soon see why the second point matters.

To sum up, the three brackets provide three functions: multiplier (curly braces), any-one-of (square brackets), and grouping (parentheses).

Three, three kinds of lines

Programming uses many kinds of lines: the underscore, the slash, and so on. Regular expressions mainly use three: the dash, the backslash, and the vertical bar.

The vertical bar has already appeared above. As in many programming languages it means "or": one|two searches for the string one or the string two.

The dash is easy to remember because we use it the same way in everyday life, to indicate a range: 1-9 naturally reads as "1 to 9" and a-z as "a to z". In regular expressions it must be used together with the square brackets introduced above, because only then does the engine know that you are not searching for the literal text 1-9 but for any one character in that range. This is the set concept again, so to search for any digit from 1 to 9 you write [1-9].

But there is a trap here:

[1-30]

Although I read this as "1 to 30" in my head, remember the two principles from the previous section: the pattern is processed in order, and each symbol binds only once. The 1-30 above actually splits into two parts, 1-3 and 0. Thinking of it as "(1-3) plus 0" may make this clearer, though that is only a mental grouping and not regex syntax, since parentheses inside square brackets are literal characters. So the class matches any one digit from 1 to 3, or 0; in other words, any single digit from 0 to 3. If you really want to match any number from 1 to 30, the knowledge introduced so far is enough to write a workable solution:

[0-9]|[1-2][0-9]|30

Let's parse this pattern. The vertical bar means "or", so it reads: a single digit from 0 to 9; or a first digit of 1-2 followed by a second digit of 0-9, which gives 10-29; or, finally, 30. Note that the first alternative also accepts 0; write [1-9] there if 0 must be excluded.
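Both the trap and a working range check can be verified with a sketch. It assumes Python's re module, assumes the whole string must be the number (fullmatch), and assumes 0 should be excluded, so the first alternative is narrowed to [1-9]; the variable names are made up.

```python
import re

# [1-30] is a set: the range 1-3 plus the single digit 0.
trap = re.compile(r"[1-30]")
trap_digits = [d for d in "0123456789" if trap.fullmatch(d)]
print(trap_digits)

# 1-9, or 10-29, or 30.
one_to_thirty = re.compile(r"[1-9]|[12][0-9]|30")
in_range = [n for n in range(0, 41) if one_to_thirty.fullmatch(str(n))]
print(in_range)
```

With a plain search instead of fullmatch, the engine tries the alternatives from left to right and would stop at the 3 in 30, so anchoring or whole-string matching matters here.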

Finally, the backslash, \. It means the same thing in regular expressions as everywhere else you have seen it: escape. Escaping changes the meaning of the next character, and it works in two directions: it turns a symbol with special meaning into an ordinary character, or it turns an ordinary character into a symbol with special meaning. For example, searching for \|a means searching for the literal string |a, with the vertical bar no longer acting as "or". Anyone with a basic computing background will be familiar with this. The other direction turns ordinary characters into special ones, for example:

\w stands for any word character. It is roughly equivalent to [a-zA-Z0-9_] (some engines also include non-ASCII letters and digits); w stands for word.

\d stands for any digit, equivalent to [0-9]; d stands for digit.

What I find most interesting is the uppercase version of the same letters: \W (note the capital) means any non-word character, and \D means any non-digit. Typing in capitals always feels like shouting a refusal to me, so it is easy to remember: the uppercase form means the opposite of the lowercase one.
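A sketch of both escape directions, assuming Python's re module. The re.ASCII flag is used here so that the word and digit shorthands are limited to ASCII letters, digits, and the underscore; the sample string is arbitrary.

```python
import re

sample = "ab_1 2-C"

word_chars = re.findall(r"\w", sample, re.ASCII)      # letters, digits, underscore
non_word_chars = re.findall(r"\W", sample, re.ASCII)  # everything else
digits = re.findall(r"\d", sample, re.ASCII)          # 0-9
non_digits = re.findall(r"\D", sample, re.ASCII)      # everything else

# Escaping a special symbol makes it an ordinary character.
literal_bar = re.findall(r"\|a", "x|a")

print(word_chars, non_word_chars, digits, non_digits, literal_bar)
```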

Four, plus, multiplication, and question marks

These three symbols all mean the same kind of thing in regular expressions: each is a multiplier. At first glance it may seem odd to put them in one category, but once you think through what they do, there is an intuitive reason behind each choice of symbol.

Let's start with the question mark, ?. It means 0 or 1 times, equivalent to {0,1}. So ab?c and ab{0,1}c are the same: both search for abc or ac, with b appearing once or not at all. I think the question mark is well chosen; it carries a feeling of "eh, is it there?", which makes it quite vivid.

Next is the plus sign, +, which means one or more occurrences, equivalent to {1,}. ab+c searches for an a followed by at least one b and then a c.

Then the multiplication sign, *, which means zero or more occurrences, equivalent to {0,}. ab*c searches for either ac, or an a followed by at least one b and then a c.

Plus and multiplication are easy to remember too. Think of the special number 0: adding changes it, since 0+1 is 1, while multiplying does not, since 0*1 is 0. So + means at least one, and * means zero or more.
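The three equivalences can be checked side by side. This sketch assumes Python's re module and whole-string matching; the helper name accepted and the sample strings are made up.

```python
import re

samples = ["ac", "abc", "abbc", "abbbc"]
pairs = [
    (r"ab?c", r"ab{0,1}c"),  # zero or one b
    (r"ab+c", r"ab{1,}c"),   # one or more b
    (r"ab*c", r"ab{0,}c"),   # zero or more b
]


def accepted(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]


for short_form, brace_form in pairs:
    same = accepted(short_form) == accepted(brace_form)
    print(short_form, accepted(short_form), "same as braces:", same)
```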

Five, other strange symbols

Finally, there are a few symbols I don't know what to call collectively. They come last in this tour of symbols, but neither that nor the title I gave this section means they are rarely used in regular expressions. They are ., ^ and $.

First, the dot, ., which stands for any single character. This is similar to the wildcard role that * plays in many other contexts, but in regular expressions * already has another meaning. For example, a.c means: find an a, then any one character, then a c.
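A quick sketch of the dot, assuming Python's re module. In Python, as in most engines, the dot does not match a newline unless a flag such as re.DOTALL is given; the sample text is arbitrary.

```python
import re

text = "abc a-c ac a\nc"

default_hits = re.findall(r"a.c", text)
dotall_hits = re.findall(r"a.c", text, re.DOTALL)

print(default_hits)  # "ac" alone fails: there is no character between a and c
print(dotall_hits)   # with DOTALL the newline also counts as "any character"
```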

Next come ^ and $. I put them together because their meanings are closely related: ^ marks the beginning of a line and $ marks the end of a line, so ^$ matches an empty line. With these two symbols you can search against whole lines.
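The two anchors can be tried on a handful of lines. This sketch assumes Python's re module and treats each list entry as one line; the sample lines are made up.

```python
import re

lines = ["cat", "concat", "", "category"]

starts_with_cat = [s for s in lines if re.search(r"^cat", s)]
ends_with_cat = [s for s in lines if re.search(r"cat$", s)]
exactly_cat = [s for s in lines if re.search(r"^cat$", s)]
empty_lines = [s for s in lines if re.search(r"^$", s)]

print(starts_with_cat, ends_with_cat, exactly_cat, empty_lines)
```

When searching one multi-line string instead of separate lines, Python needs the re.MULTILINE flag for ^ and $ to apply at every line rather than only at the start and end of the whole string.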

There is one special case, though: when ^ appears as the first character inside [], it means "not", like ! in many languages. For example, [^c] matches any character other than c, and [^abc] matches any character that is not a, not b, and not c. If you remember the binding rule from earlier, this may look like a violation of it, but it is not: since [] means "any one of", you can treat the whole bracket as a single character, so ^ still binds only once.
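A sketch of the negated set, assuming Python's re module; the sample strings are arbitrary. The last line shows that ^ only negates when it comes first inside the brackets.

```python
import re

not_c = re.findall(r"[^c]", "cab")         # anything except c
not_abc = re.findall(r"[^abc]", "abcdef")  # anything except a, b, c
caret_literal = re.findall(r"[a^]", "a^b")  # ^ not in first place is an ordinary character

print(not_c, not_abc, caret_literal)
```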

Six, grouping and substitution

That is it for the symbols. This last part is a special one that builds on the earlier introduction to parentheses. Say I want to search with the pattern (\d)01, meaning a three-digit number that ends in 0 and 1. If you write \d01 instead, the search finds exactly the same text.

With grouping, though, there is a difference. Take 001: the engine labels the result by group, and the parentheses above form group 1, which here is '0'. A typical programmer will naturally ask why numbering starts at 1, which goes against the usual computing convention. You are right to ask: there really is a group 0, and it holds the entire match, here 001. With \d01 there is no group 1 at all. Groups are numbered from the outside in, following the position of each opening parenthesis. What does that mean? Suppose you search with ((one)|two) and the target text contains two. Then group 0 is the whole match, two; group 1 is also two, because the outer parentheses enclose the whole alternation; and group 2 is empty, because it is the inner pair of parentheses around one, and one was not matched at all.
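The group numbering can be inspected directly. This sketch assumes Python's re module, where an unmatched group is reported as None; the variable names are made up.

```python
import re

digit_match = re.fullmatch(r"(\d)01", "001")
whole, first = digit_match.group(0), digit_match.group(1)
print(whole, first)  # group 0 is the full match, group 1 the digit in parentheses

nested = re.fullmatch(r"((one)|two)", "two")
outer, inner = nested.group(1), nested.group(2)
print(nested.group(0), outer, inner)  # the inner (one) group did not take part
```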

With this mechanism in place, replacement becomes convenient: you use the group number as a placeholder in the replacement text, in whatever syntax the language at hand provides. Most languages support this.
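As one concrete instance of group placeholders, here is a sketch using re.sub from Python's re module, where \1 refers to group 1 and \g<0> to the whole match. Other languages use different placeholder syntax, and the sample text is made up.

```python
import re

text = "001 and 701"

# Keep only the captured digit, wrapped in square brackets.
digit_only = re.sub(r"(\d)01", r"[\1]", text)

# Wrap the whole match (group 0) in angle brackets.
whole_wrapped = re.sub(r"(\d)01", r"<\g<0>>", text)

print(digit_only)
print(whole_wrapped)
```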

Those are the basic symbols of regular expressions, and I hope this article achieves what I set out to do. It is only a simple, entry-level article. If you know something about state machines, you may have noticed from the descriptions above that a regular expression is in effect implemented as a state machine, with the symbols driving its state transitions. Regular expressions are simple to describe but not trivial to use well, and there is a lot more to know; for example \b, which marks a word boundary. You will remember such things once you have used them a few times, and as long as you hold on to the basic principles, I believe even a newcomer can reason step by step toward the pattern they need.

I timed it: reading this article takes me about 15 minutes, which is why the title says 20 minutes.
