Incomplete substitution with re.sub: the symptom and its root cause

Last Update:2018-09-18 Source: Internet

Author: User

Tags function prototype


Title: Incomplete substitution with re.sub: the symptom and its root cause
Date:2018-08-27 21:48:22
Tags: ["Python", "Regular expression"]
Category: ["Python"]
---

Problem description

The problem started with a regular expression replacement. To extract the text from a piece of HTML code by removing all HTML tags and their attributes, you can write a Python function like this:

import re

def remove_tag(html):
    text = re.sub('<.*?>', '', html, re.S)
    return text

This code uses re.sub, the substitution function of the re module. The first parameter is the regular expression that matches the content to be replaced. Since HTML tags are wrapped in angle brackets, <.*?> matches every opening tag such as <xxx yyy="zzz"> and every closing tag such as </xxx>.

The second parameter is the replacement for each match. Since I only need the text, every HTML tag is replaced with an empty string. The third parameter is the text to process, in this case the HTML source snippet.
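As a small illustration of why the pattern uses the non-greedy `.*?`, the sketch below contrasts `<.*?>` with its greedy form `<.*>`. The string `sample` is made up for this example and does not come from the article.

```python
import re

# Hypothetical sample, not taken from the article.
sample = '<b>bold</b> and <i>italic</i>'

# Non-greedy: each match stops at the nearest '>', so every tag is matched on its own.
non_greedy = re.sub('<.*?>', '', sample)

# Greedy: one match runs from the first '<' to the last '>', swallowing the text as well.
greedy = re.sub('<.*>', '', sample)

print(repr(non_greedy))  # 'bold and italic'
print(repr(greedy))      # ''
```

With the greedy form the whole string counts as a single "tag", which is why the question mark matters here.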

As for re.S, I covered its usage in an article four years ago: re.S in Python regular expressions.

Now use a piece of HTML code to test:

import re

def remove_tag(html):
    text = re.sub('<.*?>', '', html, re.S)
    return text

source_1 = '''<div class="content">今天的主角是<a href="xxx">kingname</a>，我们掌声欢迎！</div>'''
text = remove_tag(source_1)
print(text)

The code runs correctly and the output is fully as expected.

Now let's test HTML that contains line breaks:

import re

def remove_tag(html):
    text = re.sub('<.*?>', '', html, re.S)
    return text

source_2 = '''<div class="content">
    今天的主角是
    <a href="xxx">kingname</a>
    ，我们掌声欢迎！
</div>'''
text = remove_tag(source_2)
print(text)

Again, the result fully matches expectations.

After testing, in the vast majority of cases, the body can be extracted from the HTML code snippet. But there are exceptions.

Exceptional cases

Here is a longer piece of HTML code:

</span><span>遇见kingname</span></a ><a  ><span class='url-icon'>< img '></span><span >温柔</span></a ><a  ><span >#青南#</span></a > <br />就在这里…<br />我的小侯爷呢？？？

After running the function, the last two HTML tags were not replaced.

At first I thought the spaces or the quotation marks inside the HTML caused the problem, so I simplified the HTML code:

</span><span>遇见kingname</span></a><a><span></span><span>温柔</span></a><a><span>#青南#</span></a><br/>就在这里…<br/>我的小侯爷呢

The problem persists.

Even more surprisingly, if the first tag is deleted, only one tag is left in the result instead of two.

In fact, it does not have to be the first tag: deleting any one of the earlier tags leaves one fewer tag in the result. If you delete two or more of the earlier tags, the result is normal.

Explanation

This looks like a very strange problem, but the root cause lies in the fourth parameter of re.sub. From the function prototype you can see:

def sub(pattern, repl, string, count=0, flags=0)

The fourth parameter is count, the maximum number of replacements. If re.S is to be used, it must be passed as the fifth parameter, flags. So if we change the remove_tag function slightly, the result is correct:

import re

def remove_tag(html):
    # Pass re.S by keyword so it reaches flags instead of count.
    text = re.sub('<.*?>', '', html, flags=re.S)
    return text
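Another way to avoid mixing up the two arguments, shown here only as a sketch, is to compile the pattern first. `re.compile` takes the flags, and the compiled pattern's `sub` method has no `flags` parameter at all, so a flag cannot land in `count` by accident. The name `TAG_RE` and the sample string are assumptions made for this example.

```python
import re

# Flags are attached once, when the pattern is compiled.
TAG_RE = re.compile('<.*?>', flags=re.S)

def remove_tag(html):
    # Pattern.sub(repl, string, count=0): there is no flags argument to misplace.
    return TAG_RE.sub('', html)

# Hypothetical sample with 20 tags in front of the text.
sample = '<br/>' * 20 + 'text'
print(remove_tag(sample))  # text
```

Whether to compile the pattern or keep the `flags=re.S` keyword is a matter of taste; both keep the flag out of the `count` position.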

So the question is: with re.S in the position of count, why does the code not raise an error? Is re.S a number? In fact, if you print it, you will find that re.S can indeed be treated as a number:

>>> import re
>>> print(int(re.S))
16
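The reason the call is accepted can be checked directly. In Python 3.6 and later, as far as I know, `re.S` is a member of `re.RegexFlag`, an `enum.IntFlag`, so it is also an `int`. The sketch below uses a made-up string of 20 `<br/>` tags to show that passing `re.S` in the fourth position behaves like `count=16`.

```python
import re

# re.S (alias re.DOTALL) is an IntFlag member, hence also an int.
print(isinstance(re.S, int))  # True
print(re.S == 16)             # True

# Hypothetical input: 20 tags and no other text.
sample = '<br/>' * 20

# Passing re.S as the fourth positional argument fills count, not flags.
positional = re.sub('<.*?>', '', sample, re.S)
explicit = re.sub('<.*?>', '', sample, count=16)

print(positional == explicit)  # True
print(positional)              # <br/><br/><br/><br/>
```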

Now go back and count the tags in the problematic HTML: the two trailing <br> tags happen to be the 17th and 18th tags. Because the re.S passed as count is treated as 16, Python replaces the first 16 tags with an empty string and leaves the last two.
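The arithmetic can be checked with a synthetic snippet. The sketch below builds a hypothetical string with exactly 18 tags (it is not the article's original HTML), counts the tags with `re.findall`, and uses `count=16` to mimic the buggy call. The helper name `strip_first_16` is made up for this example.

```python
import re

TAG = '<.*?>'

def strip_first_16(html):
    # count=16 mimics what happens when re.S lands in the count position.
    return re.sub(TAG, '', html, count=16)

# Hypothetical snippet: 8 <span>x</span> pairs (16 tags) followed by two <br/> tags.
html = '<span>x</span>' * 8 + '<br/>here<br/>end'
tag_count = len(re.findall(TAG, html))
print(tag_count)  # 18

# 18 tags: the 17th and 18th survive.
full = strip_first_16(html)
print(full)  # xxxxxxxx<br/>here<br/>end

# Drop one leading tag: 17 tags remain, so only one survives.
minus_one = strip_first_16(html[len('<span>'):])
print(minus_one)  # xxxxxxxxhere<br/>end

# Drop the first two tags: 16 tags remain, so the result looks normal.
minus_two = strip_first_16(html[len('<span>x</span>'):])
print(minus_two)  # xxxxxxxhereend
```

This matches the symptoms described earlier: removing one earlier tag leaves one tag behind, and removing two or more makes the bug disappear.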

The cause of the problem is clear.

This problem was not detected earlier for several reasons:

The HTML being processed is a code snippet, and in most cases it contains fewer than 16 HTML tags, so the problem stays hidden.
re.S is an object, but it is also a number, and count happens to accept a number. In many programming languages, constants are numbers that are given meaningful uppercase names.
re.S matters for a tag that contains a line break, such as <div class="123" \n>, not for a line break between tags, as in <div class="123">\n</div>. Most code snippets fall into the second case, so for them the result is the same with or without re.S.
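The last point can be illustrated with two made-up inputs, one with a line break inside a tag and one with a line break between tags. Without `re.S`, `.` does not match a newline, so only the first input is affected by the flag.

```python
import re

# Hypothetical inputs for illustration.
newline_inside_tag = '<div class="123"\n>text</div>'
newline_between_tags = '<div class="123">\ntext</div>'

# Line break inside the tag: without re.S the opening tag is not matched.
inside_plain = re.sub('<.*?>', '', newline_inside_tag)
inside_dotall = re.sub('<.*?>', '', newline_inside_tag, flags=re.S)
print(repr(inside_plain))   # '<div class="123"\n>text'
print(repr(inside_dotall))  # 'text'

# Line break between tags: the flag makes no difference.
between_plain = re.sub('<.*?>', '', newline_between_tags)
between_dotall = re.sub('<.*?>', '', newline_between_tags, flags=re.S)
print(between_plain == between_dotall)  # True
```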

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.