Title: Why re.sub leaves some matches unreplaced: the symptom and its root cause
Toc:true
Comment:true
Date:2018-08-27 21:48:22
Tags: ["Python", "Regular expression"]
Category: ["Python"]
---
Problem description
The problem started with a regular expression substitution. To extract the text from a piece of HTML code by removing all the HTML tags and their attributes, you can write a Python function:
    import re

    def remove_tag(html):
        text = re.sub('<.*?>', '', html, re.S)
        return text
This code uses re.sub, the substitution function of the regular expression module. Its first parameter is the regular expression that describes the content to be replaced. Since HTML tags are wrapped in angle brackets, <.*?> matches every <xxx yyy="zzz"> and </xxx>.
The second parameter is the string that replaces each match. Since I need to extract the text, I just replace every HTML tag with an empty string. The third parameter is the text to run the substitution on, in this case the HTML source snippet.
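To make the pattern's behavior concrete, here is a small sketch that lists what <.*?> matches, next to the greedy variant <.*> for comparison. It is only an illustration: the sample string and variable names are made up and do not come from the original code.

```python
import re

# Hypothetical sample input, used only for illustration.
sample = '<p class="x">hello <b>world</b></p>'

# Non-greedy: each match stops at the first closing angle bracket,
# so every tag is matched separately.
non_greedy = re.findall('<.*?>', sample)

# Greedy: a single match runs from the first '<' to the last '>',
# swallowing the text between the tags as well.
greedy = re.findall('<.*>', sample)

print(non_greedy)  # ['<p class="x">', '<b>', '</b>', '</p>']
print(greedy)      # ['<p class="x">hello <b>world</b></p>']
```

Without the ? the substitution would delete the text together with the tags, which is why the non-greedy form is used here.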
As for re.S, I covered its usage in an article four years ago: "re.S in Python regular expressions".
Now use a piece of HTML code to test:
    import re

    def remove_tag(html):
        text = re.sub('<.*?>', '', html, re.S)
        return text

    source_1 = '''<div class="content">今天的主角是<a href="xxx">kingname</a>,我们掌声欢迎!</div>'''
    text = remove_tag(source_1)
    print(text)
It runs as expected: all the tags are removed and only the text remains.
Now test HTML that contains line breaks:
    import re

    def remove_tag(html):
        text = re.sub('<.*?>', '', html, re.S)
        return text

    source_2 = '''<div class="content">
    今天的主角是
    <a href="xxx">kingname</a>
    ,我们掌声欢迎!</div>'''
    text = remove_tag(source_2)
    print(text)
The result again fully matches expectations.
In testing, the text could be extracted from the HTML snippet in the vast majority of cases. But there are exceptions.
Exceptional cases
There is a section of HTML code that is long and reads as follows:
</span><span>遇见kingname</span></a ><a ><span class='url-icon'>< img '></span><span >温柔</span></a ><a ><span >#青南#</span></a > <br />就在这里…<br />我的小侯爷呢???
When this HTML is run through the function, the last two HTML tags are not replaced.
At first I thought the spaces or the quotation marks inside the HTML caused the problem, so I simplified the HTML code:
</span><span>遇见kingname</span></a><a><span></span><span>温柔</span></a><a><span>#青南#</span></a><br/>就在这里…<br/>我的小侯爷呢
The problem persists.
Even more surprisingly, if the first tag is deleted, one fewer tag is left over in the result.
In fact, this is not limited to the first tag: deleting any one of the earlier tags leaves one fewer tag in the result. If you delete two or more of the earlier tags, the result is correct.
Explanation
The root cause of this strange-looking problem is the fourth parameter of re.sub. You can see it in the function prototype:
    def sub(pattern, repl, string, count=0, flags=0)
The fourth parameter is count, the maximum number of replacements. If re.S is to be used, it has to be passed as the fifth parameter, flags. So if we change the remove_tag function slightly, the result is correct:
    def remove_tag(html):
        text = re.sub('<.*?>', '', html, flags=re.S)
        return text
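Another way to rule out this mix-up, shown here only as a sketch and not as the fix used in this post, is to compile the pattern first. The flags are then fixed at compile time, and the compiled pattern's sub method has no flags parameter that could be confused with count. The name TAG_RE and the test input are made up for illustration.

```python
import re

# Flags are bound when the pattern is compiled, so they cannot end up
# in the count slot of a later substitution call by accident.
TAG_RE = re.compile('<.*?>', re.S)

def remove_tag(html):
    """Remove everything that looks like an HTML tag from html."""
    return TAG_RE.sub('', html)

# Hypothetical input with 20 tags followed by some text.
html = '<br/>' * 20 + 'text'
print(remove_tag(html))  # text
```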
So the question is: with re.S in the position of count, why doesn't the code raise an error? Is re.S a number? In fact, if you print it, you will find that re.S can indeed be treated as a number:
    >>> import re
    >>> print(int(re.S))
    16
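The same check can be extended to a few other flags. The sketch below assumes Python 3.6 or later, where the flags are members of an integer-based enum; the dictionary name flag_values is made up for illustration.

```python
import re

# The flag is an instance of int, so it is accepted wherever an int is.
print(isinstance(re.S, int))  # True
print(re.S == 16)             # True

# Integer values of some common flags. Passed in the count position,
# each one would silently cap the number of replacements at its value.
flag_values = {name: int(getattr(re, name)) for name in ('I', 'M', 'S', 'X')}
print(flag_values)  # {'I': 2, 'M': 8, 'S': 16, 'X': 64}
```

So the same mistake with re.I would stop after 2 replacements, and with re.M after 8.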
Now go back and count the tags in the problematic HTML: the two trailing <br> tags happen to be the 17th and 18th tags. Because the re.S passed in the count position is treated as 16, Python replaces only the first 16 tags with an empty string and leaves the last two.
The cause of the problem is clear.
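The effect can be reproduced with synthetic input. The following is a minimal sketch: the generated tags <t1>, <t2>, ... and the helper leftover_tags are made up for illustration, and count=re.S is written as a keyword argument, which has the same effect as passing re.S as the fourth positional argument.

```python
import re

def leftover_tags(n_tags):
    """Build n_tags dummy tags, strip them the buggy way, return what is left."""
    html = ''.join('<t{}>x'.format(i) for i in range(1, n_tags + 1))
    # Same effect as re.sub('<.*?>', '', html, re.S): at most 16 replacements.
    result = re.sub('<.*?>', '', html, count=re.S)
    return re.findall('<.*?>', result)

print(leftover_tags(18))  # ['<t17>', '<t18>']
print(leftover_tags(17))  # ['<t17>']
print(leftover_tags(16))  # []

# With flags=re.S there is no limit, and re.subn reports all 18 replacements.
html_18 = ''.join('<t{}>x'.format(i) for i in range(1, 19))
fixed, n_fixed = re.subn('<.*?>', '', html_18, flags=re.S)
print(n_fixed, fixed)
```

This also matches the earlier observation: removing one tag from an 18-tag input leaves one leftover tag instead of two, and removing two or more leaves none.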
This problem went undetected for a long time, for several reasons:
- The HTML being processed is a code snippet, and in most cases it contains fewer than 16 HTML tags, so the problem stays hidden.
- re.S is an object, but it is also a number, and count happens to accept a number. In many programming languages, constants are numbers that are given a meaningful uppercase name.
- re.S matters for the case <div class="123" \n>, where the line break is inside a tag, not for <div class="123">\n</div>, where it is between tags. The tags in the code snippets are of the second kind, so for these snippets the result is the same with or without re.S.
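The last point can be checked directly. The sketch below uses two made-up strings, one with the line break inside the opening tag and one with it between the tags, and runs the substitution with and without flags=re.S.

```python
import re

inside = '<div class="123"\n>text</div>'    # line break inside the opening tag
between = '<div class="123">\ntext</div>'   # line break between the tags

# Without re.S, '.' cannot cross the line break, so the opening tag
# of `inside` is never matched and survives the substitution.
inside_plain = re.sub('<.*?>', '', inside)
inside_dotall = re.sub('<.*?>', '', inside, flags=re.S)

# For `between`, no tag contains a line break, so the flag changes nothing.
between_plain = re.sub('<.*?>', '', between)
between_dotall = re.sub('<.*?>', '', between, flags=re.S)

print(repr(inside_plain))    # '<div class="123"\n>text'
print(repr(inside_dotall))   # 'text'
print(repr(between_plain))   # '\ntext'
print(repr(between_dotall))  # '\ntext'
```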