This article describes how to filter or replace HTML tags in Python using regular expressions. It briefly reviews the relevant Python regular expression syntax, then walks through a concrete example of regex-based HTML tag filtering and substitution. Readers who need this technique can use it as a reference.
The details are as follows:
Python regular expression key points:
Python regular expression metacharacters and escape sequences:
. matches any character except a line break
\w matches a letter, digit, or underscore (for str patterns in Python 3, also other Unicode word characters such as Chinese characters)
\s matches any whitespace character
\d matches a digit
\b matches a word boundary (the start or end of a word)
^ matches the start of the string
$ matches the end of the string
\W matches any character that is not a letter, digit, underscore, or other word character
\S matches any character that is not whitespace
\D matches any character that is not a digit
\B matches any position that is not a word boundary
[^x] matches any character except x
[^aeiou] matches any character except the letters a, e, i, o, u
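The metacharacters listed above can be tried out directly with the re module. The following is a minimal sketch; the sample strings and variable names are made up for illustration and do not come from the original article.

```python
import re

# Illustrative sample text (an assumption, not from the article).
text = "Order 42 shipped to Bob_7 on 2024-01-15"

# \d+ : runs of digits
digits = re.findall(r"\d+", text)

# \w+ : runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", text)

# \S+ : runs of non-whitespace characters, i.e. whitespace-separated tokens
tokens = re.findall(r"\S+", text)

# ^ and $ anchor the pattern at the start and end of the string
starts_with_order = re.search(r"^Order", text) is not None
ends_with_15 = re.search(r"15$", text) is not None

# \b : match "to" only as a whole word, not inside "stop" or "tomorrow"
whole_word_to = re.findall(r"\bto\b", "to stop tomorrow, go to")

# [^aeiou] : every character except the lowercase vowels
non_vowels = re.findall(r"[^aeiou]", "audio")

print(digits)
print(words)
print(tokens)
print(starts_with_order, ends_with_15)
print(whole_word_to)
print(non_vowels)
```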
Common Python regular expression quantifiers (syntax and description):
* repeats 0 or more times
+ repeats 1 or more times
? repeats 0 or 1 time
{n} repeats exactly n times
{n,} repeats n or more times
{n,m} repeats n to m times
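A short sketch of how each quantifier behaves, using re.fullmatch so that the whole string has to match. The test strings are illustrative assumptions. The last two lines also show the lazy form (a ? placed after a quantifier), which matters when matching tags because the default greedy form consumes as much as it can.

```python
import re

def matches(pattern, candidates):
    """Return, for each candidate, whether the whole string matches pattern."""
    return [re.fullmatch(pattern, c) is not None for c in candidates]

star = matches(r"ab*", ("a", "ab", "abbb"))                    # * : zero or more b
plus = matches(r"ab+", ("a", "ab", "abbb"))                    # + : one or more b
optional = matches(r"colou?r", ("color", "colour", "colouur")) # ? : zero or one u
exact = matches(r"\d{4}", ("123", "1234", "12345"))            # {n} : exactly 4 digits
at_least = matches(r"\d{2,}", ("1", "12", "12345"))            # {n,} : 2 or more digits
between = matches(r"\d{2,4}", ("1", "123", "12345"))           # {n,m} : 2 to 4 digits

# Greedy vs. lazy: .* takes the longest match, .*? the shortest.
greedy = re.findall(r"<.*>", "<b>bold</b>")
lazy = re.findall(r"<.*?>", "<b>bold</b>")

print(star, plus, optional)
print(exact, at_least, between)
print(greedy, lazy)
```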
About named groups in Python regular expressions:
Named group: (?P<name>...)
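As a small illustration of the named-group syntax, the sketch below uses the same entity pattern that appears in the sample code later in this article. The input strings are assumptions chosen for the demonstration.

```python
import re

# Matches an HTML character entity such as &lt; or &#160;
# and captures the part between "&" (or "&#") and ";" as the group "name".
re_entity = re.compile(r"&#?(?P<name>\w+);")

named = re_entity.search("a &lt; b")
whole_match = named.group()          # the full matched text
entity_name = named.group("name")    # only the named group
as_dict = named.groupdict()          # all named groups as a dict

numeric = re_entity.search("non-breaking&#160;space")
numeric_name = numeric.group("name")

print(whole_match, entity_name, as_dict, numeric_name)
```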
Lookaround assertions also begin with a question mark. A '<' after the question mark makes the assertion look behind the current position, and a '!' in place of '=' negates it:
Positive lookbehind (?<=...)
Positive lookahead (?=...)
Negative lookbehind (?<!...)
Negative lookahead (?!...)
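The four lookaround forms can be sketched as follows. Each assertion checks the text next to the match without including it in the result. The sample strings are illustrative assumptions; note that Python requires a lookbehind pattern to have a fixed width.

```python
import re

prices = "cost $30 or 40 EUR"

# Positive lookbehind: digits that are preceded by "$"
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Negative lookbehind: whole numbers that are NOT preceded by "$"
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)

# Positive lookahead: names that are followed by "("
called = re.findall(r"\w+(?=\()", "call foo(1) and bar")

# Negative lookahead: "<" not followed by "/", i.e. opening tags only
opening_tags = re.findall(r"<(?!/)\w+", "<p>hi</p><br>")

print(after_dollar, not_after_dollar, called, opening_tags)
```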
Sample code: removing (filtering) HTML tags with regular expressions in Python
# -*- coding: utf-8 -*-
import re


# Filter the tags out of an HTML string.
# Removes tags, scripts, styles, and comments from the HTML.
# @param htmlstr  the HTML string.
def filter_tags(htmlstr):
    # Compile the patterns first.
    re_cdata = re.compile(r'//<!\[CDATA\[[^>]*//\]\]>', re.I)  # CDATA
    re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)  # script
    re_style = re.compile(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', re.I)  # style
    re_br = re.compile(r'<br\s*?/?>')  # line breaks
    re_h = re.compile(r'</?\w+[^>]*>')  # HTML tags
    re_comment = re.compile(r'<!--[^>]*-->')  # HTML comments
    s = re_cdata.sub('', htmlstr)  # remove CDATA
    s = re_script.sub('', s)  # remove script
    s = re_style.sub('', s)  # remove style
    s = re_br.sub('\n', s)  # convert <br> to a newline
    s = re_h.sub('', s)  # remove HTML tags
    s = re_comment.sub('', s)  # remove HTML comments
    # Collapse runs of blank lines into a single newline.
    blank_line = re.compile(r'\n+')
    s = blank_line.sub('\n', s)
    s = replaceCharEntity(s)  # replace character entities
    return s


# Replace common HTML character entities.
# Replaces special character entities in the HTML with normal characters.
# Add new entries to CHAR_ENTITIES to handle more HTML character entities.
# @param htmlstr  the HTML string.
def replaceCharEntity(htmlstr):
    CHAR_ENTITIES = {'nbsp': ' ', '160': ' ',
                     'lt': '<', '60': '<',
                     'gt': '>', '62': '>',
                     'amp': '&', '38': '&',
                     'quot': '"', '34': '"'}
    re_charEntity = re.compile(r'&#?(?P<name>\w+);')

    def substitute(match):
        # match.group() is the full entity, such as &gt;
        # match.group('name') is the entity without & and ;, such as gt for &gt;
        key = match.group('name')
        # Unknown entities are replaced with an empty string.
        return CHAR_ENTITIES.get(key, '')

    # A single pass, so text produced by one replacement is not decoded again.
    return re_charEntity.sub(substitute, htmlstr)


def replace(s, re_exp, repl_string):
    return re_exp.sub(repl_string, s)


if __name__ == '__main__':
    # Assumes test.html is UTF-8 encoded.
    with open('test.html', encoding='utf-8') as f:
        s = f.read()
    news = filter_tags(s)
    print(news)
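The entity table in the sample above covers only a handful of entities and drops any it does not know. As a possible alternative for the entity step only (not part of the original article's method), Python 3's standard library provides html.unescape, which handles the full set of named and numeric character references. A minimal sketch, with an illustrative input string:

```python
import html

# Illustrative input (an assumption): named, numeric, and accented entities.
encoded = "&lt;p&gt;Tom &amp; Jerry&#160;&quot;live&quot; caf&eacute;"
decoded = html.unescape(encoded)

# Decoding happens in a single pass, so "&amp;lt;" becomes "&lt;", not "<".
double_encoded = html.unescape("&amp;lt;")

print(repr(decoded))
print(double_encoded)
```

Note that html.unescape turns &#160; into the non-breaking space character U+00A0, whereas the table in the sample maps it to an ordinary space.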