Regular Expressions in Python

Source: Internet
Author: User

When writing a program or web page that processes strings, you often need to find strings that follow some complex rule. Regular expressions are a tool for describing such rules. In other words, a regular expression is code that records a text-matching rule.

You may have used wildcards for file searches in Windows or DOS, that is, * and ?. To find all the Word documents in a directory, you would search for *.doc, where * stands for any string. Like wildcards, regular expressions are a tool for text matching, but they can describe what you want far more precisely, at the cost of being more complex. For example, you can write a regular expression that finds every string consisting of a 0, followed by 2 or 3 digits, then a hyphen "-", and finally 7 or 8 digits (such as 010-12345678 or 0376-7654321).
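As a rough illustration, the phone-number rule described above could be written in Python like this. It is only a sketch: the names `PHONE_PATTERN` and `looks_like_phone` are made up for this example, and it assumes the whole string must be the number.

```python
import re

# A 0 followed by 2 or 3 digits, then a hyphen, then 7 or 8 digits.
# (PHONE_PATTERN and looks_like_phone are hypothetical names for this sketch.)
PHONE_PATTERN = re.compile(r'0\d{2,3}-\d{7,8}')

def looks_like_phone(text):
    """Return True if the whole string follows the phone-number rule."""
    return PHONE_PATTERN.fullmatch(text) is not None

print(looks_like_phone('010-12345678'))   # True
print(looks_like_phone('0376-7654321'))   # True
print(looks_like_phone('10-12345678'))    # False (does not start with 0)
```

A wildcard such as * could not express "2 or 3 digits" or "7 or 8 digits"; the `{2,3}` and `{7,8}` counts are what make the rule precise.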

Let's talk about regular expressions in Python. When processing strings (searching, replacing, and parsing), we often run into case-sensitivity issues.

The search methods look for a single, hard-coded substring, and they are always case-sensitive. To do case-insensitive searches of a string s, you must call s.lower() or s.upper() and make sure your search strings are the appropriate case to match. The replace and split methods have the same limitations.
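A small sketch of that limitation, using a made-up mixed-case address (the string is an assumption for illustration, not data from the original problem):

```python
s = '100 North Main Road'

# str.find is case-sensitive, so an uppercase search string is not found.
print(s.find('ROAD'))            # -1

# Normalizing the string to one case makes the search case-insensitive.
print(s.upper().find('ROAD'))    # 15

# replace has the same limitation: nothing changes without normalizing.
print(s.replace('ROAD', 'RD.'))  # 100 North Main Road
```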

If what you're trying to do can be accomplished with string functions, you should use them. They're fast and simple and easy to read, and there's a lot to be said for fast, simple, readable code. But if you find yourself using a lot of different string functions with if statements to handle special cases, or if you're combining them with split and join and list comprehensions in weird unreadable ways, you may need to move up to regular expressions.

Although the regular expression syntax is tight and unlike normal code, the result can end up being more readable than a hand-rolled solution that uses a long chain of string functions. There are even ways of embedding comments within regular expressions to make them practically self-documenting.
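One way to embed comments is the `re.VERBOSE` flag from the standard `re` module. The following is a minimal sketch; the pattern and the sample address are chosen for illustration only:

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and everything
# after a '#' is treated as a comment.
pattern = re.compile(r'''
    ROAD    # the literal text ROAD
    $       # ...but only at the very end of the string
    ''', re.VERBOSE)

print(pattern.sub('RD.', '100 NORTH BROAD ROAD'))  # 100 NORTH BROAD RD.
```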

Example:

Case Study: Street Addresses

This series of examples was inspired by a real-life problem I had in my day job several years ago, when I needed to scrub and standardize street addresses exported from a legacy system before importing them into a newer system. (See, I don't just make this stuff up; it's actually useful.) This example shows how I approached the problem.

Example 7.1. Matching at the End of a String
>>> s = '100 NORTH MAIN ROAD'
>>> s.replace('ROAD', 'RD.')
'100 NORTH MAIN RD.'
>>> s = '1970 NORTH BROAD ROAD'
>>> s.replace('ROAD', 'RD.')
'1970 NORTH BRD. RD.'
>>> s[:-4] + s[-4:].replace('ROAD', 'RD.')
'1970 NORTH BROAD RD.'
>>> import re
>>> re.sub('ROAD$', 'RD.', s)
'1970 NORTH BROAD RD.'

My goal is to standardize a street address so that 'ROAD' is always abbreviated as 'RD.'. At first glance, I thought this was simple enough that I could just use the string method replace. After all, all the data was already uppercase, so case mismatches would not be a problem. And the search string, 'ROAD', was a constant. And in this deceptively simple example, s.replace does indeed work.

Life, unfortunately, is full of counterexamples, and I quickly discovered this one. The problem here is that 'ROAD' appears twice in the address, once as part of the street name 'BROAD' and once as its own word. The replace method sees these two occurrences and blindly replaces both of them; meanwhile, I see my addresses getting destroyed.

To solve the problem of addresses with more than one 'ROAD' substring, you could resort to something like this: only search and replace 'ROAD' in the last four characters of the address (s[-4:]), and leave the rest of the string alone (s[:-4]). But you can see that this is already getting unwieldy. For example, the pattern is dependent on the length of the string you're replacing (if you were replacing 'STREET' with 'ST.', you would need to use s[:-6] and s[-6:].replace(...)). Would you like to come back in six months and debug this? I know I wouldn't.
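To make the length dependence concrete, here is the same slicing trick applied to the 'STREET' case mentioned above. The address is a made-up example:

```python
s = '100 NORTH MAIN STREET'

# 'STREET' is six characters long, so every slice index has to change
# from 4 to 6 by hand.
print(s[:-6] + s[-6:].replace('STREET', 'ST.'))  # 100 NORTH MAIN ST.

# Reusing the indexes from the 'ROAD' version silently does nothing.
print(s[:-4] + s[-4:].replace('STREET', 'ST.'))  # 100 NORTH MAIN STREET
```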

It's time to move up to regular expressions. In Python, all functionality related to regular expressions is contained in the re module.

Take a look at the first parameter: 'ROAD$'. This is a simple regular expression that matches 'ROAD' only when it occurs at the end of a string. The $ means "end of the string". (There is a corresponding character, the caret ^, which means "beginning of the string".)

Using the re.sub function, you search the string s for the regular expression 'ROAD$' and replace it with 'RD.'. This matches the ROAD at the end of the string s, but does not match the ROAD that's part of the word BROAD, because that's in the middle of s.
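The two anchors can be compared side by side. This is an illustrative sketch; the second sample string is invented to show the caret:

```python
import re

s = '100 NORTH BROAD ROAD'

# '$' anchors the match to the end of the string.
print(re.sub('ROAD$', 'RD.', s))                # 100 NORTH BROAD RD.

# '^' anchors the match to the beginning of the string.
print(re.sub('^ROAD', 'RD.', 'ROAD TO ROAD'))   # RD. TO ROAD

# With '^', a ROAD that is not at the start never matches.
print(re.search('^ROAD', s))                    # None
```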

Continuing with my story of scrubbing addresses, I soon discovered that the previous example, matching 'ROAD' at the end of the address, was not good enough, because not all addresses included a street designation at all; some just ended with the street name. Most of the time, I got away with it, but if the street name was 'BROAD', then the regular expression would match 'ROAD' at the end of the string as part of the word 'BROAD', which is not what I wanted.

Example 7.2. Matching Whole Words
>>> s = '100 BROAD'
>>> re.sub('ROAD$', 'RD.', s)
'100 BRD.'
>>> re.sub('\\bROAD$', 'RD.', s)
'100 BROAD'
>>> re.sub(r'\bROAD$', 'RD.', s)
'100 BROAD'
>>> s = '2017 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD$', 'RD.', s)
'2017 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD\b', 'RD.', s)
'2017 BROAD RD. APT. 3'

What I really wanted was to match 'ROAD' when it was at the end of the string and it was its own whole word, not a part of some larger word. To express this in a regular expression, you use \b, which means "a word boundary must occur right here". In Python, this is complicated by the fact that the '\' character in a string must itself be escaped. This is sometimes referred to as the backslash plague, and it is one reason why regular expressions are easier in Perl than in Python. On the down side, Perl mixes regular expressions with other syntax, so if you have a bug, it may be hard to tell whether it's a bug in syntax or a bug in your regular expression.

To work around the backslash plague, you can use what is called a raw string, by prefixing the string with the letter r. This tells Python that nothing in this string should be escaped; '\t' is a tab character, but r'\t' is really the backslash character \ followed by the letter t. I recommend always using raw strings when dealing with regular expressions; otherwise, things get too confusing too quickly (and regular expressions get confusing quickly enough all by themselves).
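The difference between ordinary and raw strings can be checked directly. A short sketch, with an address made up for the demonstration:

```python
import re

print(len('\t'))        # 1: a single tab character
print(len(r'\t'))       # 2: a backslash followed by the letter t
print('\\b' == r'\b')   # True: both spell backslash + b

# Without the r prefix, '\b' is a backspace character, so the pattern
# never matches and the address comes back unchanged.
print(re.sub('\bROAD$', 'RD.', '100 BROAD ROAD'))    # 100 BROAD ROAD

# With a raw string, \b reaches the re module as a word boundary.
print(re.sub(r'\bROAD$', 'RD.', '100 BROAD ROAD'))   # 100 BROAD RD.
```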

*sigh* Unfortunately, I soon found more cases that contradicted my logic. In this case, the street address contained the word 'ROAD' as a whole word by itself, but it wasn't at the end, because the address had an apartment number after the street designation. Because 'ROAD' isn't at the very end of the string, it doesn't match, so the entire call to re.sub ends up replacing nothing at all, and you get the original string back, which is not what you want.

To solve this problem, I removed the $ character and added another \b. Now the regular expression reads "match 'ROAD' when it's a whole word by itself anywhere in the string," whether at the end, the beginning, or somewhere in the middle.
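Putting the final pattern into a small helper might look like the sketch below. The names `ROAD_WORD` and `abbreviate_road` are hypothetical, introduced only for this illustration, and the addresses are sample data:

```python
import re

# Match ROAD only when it stands alone as a whole word.
# (ROAD_WORD and abbreviate_road are hypothetical names for this sketch.)
ROAD_WORD = re.compile(r'\bROAD\b')

def abbreviate_road(address):
    """Replace every whole-word ROAD in the address with RD."""
    return ROAD_WORD.sub('RD.', address)

for address in ('100 NORTH MAIN ROAD', '100 BROAD', '100 BROAD ROAD APT. 3'):
    print(abbreviate_road(address))
# 100 NORTH MAIN RD.
# 100 BROAD
# 100 BROAD RD. APT. 3
```

Compiling the pattern once is a design choice for readability when the same expression is reused; calling re.sub with the pattern string each time gives the same results.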
