String Operations: Regular Expressions

Source: Internet
Author: User
Tags: posix, expression, engine, egrep
Regular Expressions

A regular expression, also known as a formal or general representation (English: Regular Expression, often abbreviated in code as regex, regexp, or RE), is a concept from computer science. A regular expression uses a single string to describe and match a whole set of strings that conform to a certain syntactic rule. In many text editors, regular expressions are used to find and replace text that fits a pattern. Many programming languages support regular expressions for string manipulation; Perl, for example, has a powerful regular expression engine built in. The concept was first popularized by Unix tools such as sed and grep. The usual abbreviation is "regex": the singular forms are regexp and regex, and the plural forms are regexps, regexes, and regexen.

Origin

The origins of regular expressions [1] can be traced back to scientists' early studies of how the human nervous system works. Warren McCulloch, of New Jersey, and Walter Pitts, of Detroit, developed a mathematical way of describing neural networks, modeling the neurons of the nervous system as small, simple automata. In 1956, the mathematician Stephen Kleene, born in Hartford (a city Mark Twain described as one of the most beautiful in America), built on the early work of McCulloch and Pitts and published a paper titled "Representation of Events in Nerve Nets and Finite Automata". The paper described these models with a mathematical notation called regular sets and so introduced the concept of the regular expression: an expression describing what Kleene called the "algebra of regular sets", hence the term "regular expression". Some time later, it was found that this work could be applied to other areas.
Ken Thompson, the principal inventor of Unix and widely known as the father of Unix, applied these results to early research on computational search algorithms. He introduced the notation into the editor QED, then into the Unix editor ed, and eventually into grep. Jeffrey Friedl explains this further in his book "Mastering Regular Expressions" (2nd edition; a Chinese translation is available, and the book has since reached its third edition). If you want to learn more about the theory and history of regular expressions, this book is recommended. Since then, regular expressions have been widely used in Unix and Unix-like tools, including the well-known Perl. Perl's regular expressions derive from the regex library written by Henry Spencer and later evolved into PCRE (Perl Compatible Regular Expressions), a library developed by Philip Hazel and used by many modern tools.

The first practical application of regular expressions was the QED editor. From there, regular expressions were adopted and developed in all kinds of programming languages and applications. Today they still hold a very important place in text-based editors and search tools.

Over the last 60 years, regular expressions have evolved from an obscure mathematical concept into a core feature of many tools and software packages. Many Unix tools support regular expressions, and over the last 20 years most Windows developer toolkits have come to support them as well. From the early support in Microsoft Visual Basic 6 and Microsoft VBScript to the .NET Framework, regular expression support on Windows has grown to the point where almost all Microsoft developers and all .NET languages can use regular expressions. If you work with programming languages, you will find regular expressions in the mainstream operating systems (*nix [Linux, Unix, etc.], Windows, HP, BeOS, etc.), in the mainstream development languages (Delphi, Scala, PHP, C#, Java, C++, Objective-C, Swift, VB, JavaScript, Ruby, and Python), and in countless applications.

Concept

A regular expression is a logical formula for operating on strings: a "rule string", built from predefined special characters and combinations of those characters, that expresses a filtering logic for strings. Given a regular expression and another string, we can do the following:

1. Determine whether the given string conforms to the filtering logic of the regular expression (called a "match").
2. Extract the specific part of the string that we want.

Regular expressions have the following characteristics:

1. They are very flexible, logical, and functional.
2. They allow complex control over strings to be achieved quickly and in a very compact way.
3. They are rather obscure and hard to understand for newcomers.

Because regular expressions are applied primarily to text, they are available in all kinds of text editors, from small ones such as the well-known EditPlus to large ones such as Microsoft Word and Visual Studio, all of which can use regular expressions to process text.

Engines

Regular expression engines fall into two main categories: DFA and NFA. Both have a long history (more than 20 years), and both have produced many variants. POSIX was introduced to stop further variants from appearing.
As a result, mainstream regular expression engines fall into three categories: DFA, traditional NFA, and POSIX NFA.

A DFA engine runs in linear time because it does not need to backtrack (and therefore never tests the same character twice). A DFA engine can also guarantee that it matches the longest possible string. However, because a DFA engine contains only finite state, it cannot match a pattern with backreferences, and because it does not construct an explicit expansion, it cannot capture subexpressions.

A traditional NFA engine runs a so-called "greedy" backtracking matching algorithm, which tests all possible expansions of the regular expression in a specified order and accepts the first match. Because a traditional NFA constructs a specific expansion of the regular expression for a successful match, it can capture subexpression matches and match backreferences. However, because a traditional NFA backtracks, it can visit exactly the same state multiple times (if the state is reached through different paths), so in the worst case it can run very slowly. Because a traditional NFA accepts the first match it finds, it may also leave other (possibly longer) matches undiscovered.

A POSIX NFA engine is similar to a traditional NFA engine, except that it continues to backtrack until it can guarantee that it has found the longest possible match. As a result, a POSIX NFA engine is slower than a traditional NFA engine, and with a POSIX NFA you cannot favor a shorter match over a longer one by changing the order of the backtracking search.

Programs that use a DFA engine include awk, egrep, flex, lex, MySQL, and procmail. Programs that use a traditional NFA engine include GNU Emacs, Java, grep, less, more, the .NET languages, the PCRE library, Perl, PHP, Python, Ruby, sed, and vi. Programs that use a POSIX NFA engine include mawk, Mortice Kern Systems' utilities, and GNU Emacs (when explicitly requested). Programs that use a hybrid DFA/NFA engine include GNU awk, GNU grep/egrep, and Tcl.

A simple example illustrates the difference between an NFA and a DFA. Take the string "this is yansen's blog" and the regular expression /ya(msen|nsen|nsem)/ (the expression itself is not important; it only serves to show how the engines work).

The NFA works as follows. It looks for y in the string, then checks whether the next character is a. If it is, the engine takes the first alternative, msen, and checks whether the next character is m. It is not, so the engine backtracks and takes the second alternative, nsen: the next character is n, followed by s and e, and then the final n, so the match succeeds. Had that final n failed, the engine would have gone back and tested the third alternative, nsem, against the same characters again. Because the NFA is driven by the regular expression, the same part of the string can be tested many times.

The DFA does not work this way. It scans the string from the t of "this" until it finds y, then reads a and checks whether the expression allows a there, which it does. The next character in the string is n, so the DFA checks the alternatives against it: msen does not fit and is eliminated, while nsen and nsem both fit. The DFA then keeps reading the string in order: s and e fit both remaining alternatives, and when it reads the final n only nsen still fits, so the match succeeds.

As you can see, the two engines work in completely different ways: the NFA is expression-driven, and the DFA is text-driven. In general, a DFA engine searches faster. An NFA, being driven by the expression, is easier to control, so most programmers prefer NFA engines. Both engines have their strengths, and which one you actually use depends on your needs and the language you work in.
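The contrast between the two approaches can be sketched in Python. The first part uses the standard `re` module, which the list above places among the traditional NFA engines. The second part is a hypothetical toy function, `text_driven_search`, written only for this illustration: it handles a literal prefix followed by non-empty literal alternatives, and it imitates the text-driven elimination described above rather than implementing a real DFA.

```python
import re

TEXT = "this is yansen's blog"

# Expression-driven (backtracking) matching: alternatives are tried left to
# right, and the first one that succeeds is accepted.
m = re.search(r"ya(msen|nsen|nsem)", TEXT)
nfa_result = (m.group(0), m.group(1))          # ("yansen", "nsen")

# The first successful alternative wins, even when a longer match exists.
first_wins = re.match(r"a|ab", "ab").group(0)  # "a"
reordered = re.match(r"ab|a", "ab").group(0)   # "ab"


def text_driven_search(text, prefix, alternatives):
    """Toy text-driven matcher for prefix followed by one of several literals.

    Assumes every alternative is a non-empty literal string. After the
    prefix, each character is read once and alternatives that disagree with
    it are eliminated; the longest completed alternative is returned.
    """
    for start in range(len(text)):
        if not text.startswith(prefix, start):
            continue
        pos = start + len(prefix)
        alive = list(alternatives)
        longest = None
        offset = 0
        while alive and pos + offset < len(text):
            ch = text[pos + offset]
            # Eliminate every alternative that does not fit this character.
            alive = [alt for alt in alive if alt[offset] == ch]
            offset += 1
            if any(len(alt) == offset for alt in alive):
                longest = text[pos:pos + offset]
            # Keep only alternatives that still have characters left.
            alive = [alt for alt in alive if len(alt) > offset]
        if longest is not None:
            return text[start:pos] + longest
    return None


dfa_like = text_driven_search(TEXT, "ya", ["msen", "nsen", "nsem"])  # "yansen"
longest_wins = text_driven_search("ab", "", ["a", "ab"])             # "ab"
```

Both approaches find "yansen" in the example string. The difference shows up with `a|ab` against "ab": the backtracking engine stops at "a", while the text-driven sketch keeps reading and reports the longer "ab".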
Symbols

(Excerpted from "Regular Expressions".) A regular expression [2] consists of ordinary characters and metacharacters. Ordinary characters include uppercase and lowercase letters and digits, while metacharacters have special meanings, explained below. In the simplest case, a regular expression looks like an ordinary search string. For example, the regular expression "testing" contains no metacharacters; it matches strings such as "testing" and "testing123", but does not match "Testing". To use regular expressions well, a correct understanding of the metacharacters is the most important thing. The following table lists all the metacharacters with a short description of each.
Metacharacter   Description
\               Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, "\\n" matches the two characters "\n", while "\n" matches a line break. The sequence "\\" matches "\", and "\(" matches "(". This is equivalent to the concept of "escape characters" in many programming languages.
^               Matches the start position of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after "\n" or "\r".
$               Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before "\n" or "\r".
*               Matches the preceding subexpression zero or more times. For example, "zo*" matches "z", "zo", and "zoo". * is equivalent to {0,}.
+               Matches the preceding subexpression one or more times. For example, "zo+" matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?               Matches the preceding subexpression zero times or once. For example, "do(es)?" matches "do" or "does". ? is equivalent to {0,1}.
{n}             n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob", but matches the two o's in "food".
{n,}            n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob", but matches all the o's in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*".
{n,m}           m and n are non-negative integers with n <= m. Matches at least n times and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". "o{0,1}" is equivalent to "o?". Note that there must be no space between the comma and the two numbers.
?               When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match is non-greedy. A non-greedy pattern matches as little of the searched string as possible, while the default greedy pattern matches as much as possible. For example, against the string "oooo", "o+?" matches a single "o", while "o+" matches all the o's.
.               Matches any single character except the line-break characters "\r" and "\n". To match any character including "\r" and "\n", use a pattern such as "[\s\S]".
(pattern)       Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0...$9 properties in JScript. To match the parenthesis characters themselves, use "\(" or "\)".
(?:pattern)     Non-capturing match: matches pattern but does not capture the match, so it is not stored for later use. This is useful for combining parts of a pattern with the "or" character "|". For example, "industr(?:y|ies)" is a more compact expression than "industry|industries".
(?=pattern)     Non-capturing positive lookahead: matches the search string at any position where a string matching pattern begins. The match is not captured for later use. For example, "Windows(?=95|98|NT|2000)" matches the "Windows" in "Windows2000", but not the "Windows" in "Windows3.1". Lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)     Non-capturing negative lookahead: matches the search string at any position where a string not matching pattern begins. The match is not captured for later use. For example, "Windows(?!95|98|NT|2000)" matches the "Windows" in "Windows3.1", but not the "Windows" in "Windows2000".
(?<=pattern)
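Several rows of the table can be checked with a short script. The sketch below uses Python's `re` module as a stand-in; the table itself describes the VBScript/JScript RegExp object, so the assumption here is that Python behaves the same way for these particular constructs. The helper `first_match` is a name invented for this example, and the dot example uses only "\n" because flavors differ in how they treat "\r".

```python
import re


def first_match(pattern, text, flags=0):
    """Return the first substring of text that pattern matches, or None."""
    m = re.search(pattern, text, flags)
    return m.group(0) if m else None


# Quantifiers
star = first_match(r"zo*", "z")                    # "z": zero o's are allowed
plus = first_match(r"zo+", "z")                    # None: at least one o needed
optional = first_match(r"do(es)?", "does")         # "does"
exactly_two = first_match(r"o{2}", "food")         # "oo"
no_pair = first_match(r"o{2}", "Bob")              # None
at_most_three = first_match(r"o{1,3}", "f" + "o" * 6 + "d")  # "ooo"

# Greedy versus non-greedy
greedy = first_match(r"o+", "oooo")                # "oooo"
lazy = first_match(r"o+?", "oooo")                 # "o"

# Capturing versus non-capturing groups
captured = re.search(r"industr(y|ies)", "industries").groups()      # ("ies",)
uncaptured = re.search(r"industr(?:y|ies)", "industries").groups()  # ()

# Positive and negative lookahead: the version number is checked, not consumed
pos_hit = first_match(r"Windows(?=95|98|NT|2000)", "Windows2000")   # "Windows"
pos_miss = first_match(r"Windows(?=95|98|NT|2000)", "Windows3.1")   # None
neg_hit = first_match(r"Windows(?!95|98|NT|2000)", "Windows3.1")    # "Windows"
neg_miss = first_match(r"Windows(?!95|98|NT|2000)", "Windows2000")  # None

# The dot stops at a line break; [\s\S] does not
dot_miss = first_match(r"a.c", "a\nc")             # None
any_char = first_match(r"a[\s\S]c", "a\nc")        # "a\nc"

# ^ matches after a line break only in multiline mode
start_default = re.findall(r"^b", "a\nb")          # []
start_multiline = re.findall(r"^b", "a\nb", re.M)  # ["b"]
```

Note that `first_match` returns the whole match (group 0), so the lookahead examples return "Windows" alone: the characters tested by the lookahead are never part of the result.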
