Using Python regular expressions to filter or replace HTML tags

Last Update: 2017-09-26 Source: Internet

Author: User

Tags cdata html comment


This article describes how to filter or replace HTML tags in Python with regular expressions. It briefly introduces the relevant regular expression syntax and then works through a concrete example of regex-based HTML tag filtering and substitution. Readers who need this technique can use it as a reference.

The example is shared here for reference, as follows:

Key points of Python regular expressions:

Python regular expression metacharacters:

. matches any character except a newline
\w matches a letter, digit, or underscore (for str patterns in Python 3, also other Unicode word characters such as Chinese characters)
\s matches any whitespace character
\d matches a digit
\b matches a word boundary (the start or end of a word)
^ matches the start of the string
$ matches the end of the string (or just before a trailing newline)
\W matches any character that is not a letter, digit, or underscore
\S matches any non-whitespace character
\D matches any non-digit character
\B matches a position that is not a word boundary
[^x] matches any character except x
[^aeiou] matches any character except the letters a, e, i, o, u
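As a quick illustration of the metacharacters listed above, here is a minimal sketch. The sample strings and variable names are made up for this example and are not part of the original article.

```python
import re

# Made-up sample text for illustration only
sample = "Tel: 555-0199, id_42 <b>ok</b>"

# \d+ matches runs of digits
digits = re.findall(r"\d+", sample)
print(digits)  # ['555', '0199', '42']

# \w+ matches runs of letters, digits, and underscores
words = re.findall(r"\w+", sample)
print(words)  # ['Tel', '555', '0199', 'id_42', 'b', 'ok', 'b']

# \bok\b matches "ok" only as a whole word, so "okay" is skipped
whole_word = re.findall(r"\bok\b", sample + " okay")
print(whole_word)  # ['ok']

# [^aeiou] matches any single character that is not one of a, e, i, o, u
consonants = re.findall(r"[^aeiou]", "regex")
print(consonants)  # ['r', 'g', 'x']

# ^ and $ anchor the pattern to the start and end of the string
is_all_digits = bool(re.search(r"^\d+$", "20170926"))
print(is_all_digits)  # True
```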

Common Python regular expression quantifiers:

* repeats zero or more times
+ repeats one or more times
? repeats zero or one time
{n} repeats exactly n times
{n,} repeats n or more times
{n,m} repeats n to m times
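
The quantifiers can be tried out with a short sketch like the following. The input strings are invented for illustration; only the standard `re` module is used.

```python
import re

# * allows zero or more 'b'; + requires at least one
star = re.findall(r"ab*", "a ab abb")
plus = re.findall(r"ab+", "a ab abb")
print(star)  # ['a', 'ab', 'abb']
print(plus)  # ['ab', 'abb']

# ? makes the '/' optional, so both <br> and <br/> style tags match
br_tags = re.findall(r"<br\s*/?>", "a<br>b<br/>c<br />d")
print(br_tags)  # ['<br>', '<br/>', '<br />']

# {n} repeats exactly n times: four digits, two digits, two digits
date = re.search(r"\d{4}-\d{2}-\d{2}", "Last Update:2017-09-26").group()
print(date)  # 2017-09-26

# {n,} means n or more; {n,m} means between n and m (greedy, so it takes 3 of the 4 digits)
three_plus = re.findall(r"\d{3,}", "1 22 333 4444")
two_to_three = re.findall(r"\d{2,3}", "1 22 333 4444")
print(three_plus)    # ['333', '4444']
print(two_to_three)  # ['22', '333', '444']
```
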
Named groups in Python regular expressions:
Named group: (?P<name>...)
Regular expressions also support lookaround assertions. They all begin with a question mark; a '<' marks a lookbehind and a '!' marks a negation:
Positive lookbehind (?<=...)
Positive lookahead (?=...)
Negative lookbehind (?<!...)
Negative lookahead (?!...)
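The named group and the four lookaround forms above can be sketched as follows. The sample strings and variable names are assumptions chosen for illustration.

```python
import re

# Named group: capture the entity name between '&' (or '&#') and ';'
entity_re = re.compile(r"&#?(?P<name>\w+);")
m = entity_re.search("1 &lt; 2")
entity_text = m.group()        # the whole match
entity_name = m.group("name")  # only the named group
print(entity_text, entity_name)  # &lt; lt

# Positive lookbehind + positive lookahead: text between <b> and </b>,
# without including the tags themselves in the match
bold_text = re.findall(r"(?<=<b>).*?(?=</b>)", "<b>bold</b> and <b>more</b>")
print(bold_text)  # ['bold', 'more']

# Negative lookahead: strip every tag except <br>
no_tags_but_br = re.sub(r"</?(?!br\b)\w+[^>]*>", "", "<p>a<br>b</p>")
print(no_tags_but_br)  # a<br>b

# Negative lookbehind + negative lookahead: numbers with exactly two digits
two_digit = re.findall(r"(?<!\d)\d{2}(?!\d)", "7 42 365")
print(two_digit)  # ['42']
```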

Sample code for removing (filtering) HTML tags with Python regular expressions:

# -*- coding: utf-8 -*-
import re


# Filter the tags out of HTML.
# Removes tags and similar markup from an HTML string.
# @param htmlstr The HTML string.
def filter_tags(htmlstr):
    # Filter CDATA first
    re_cdata = re.compile(r'//<!\[CDATA\[[^>]*//\]\]>', re.I)  # matches CDATA
    re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)  # script
    re_style = re.compile(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', re.I)  # style
    re_br = re.compile(r'<br\s*?/?>')  # handles line breaks
    re_h = re.compile(r'</?\w+[^>]*>')  # HTML tags
    re_comment = re.compile(r'<!--[^>]*-->')  # HTML comments
    s = re_cdata.sub('', htmlstr)  # remove CDATA
    s = re_script.sub('', s)  # remove script
    s = re_style.sub('', s)  # remove style
    s = re_br.sub('\n', s)  # convert br to a newline
    s = re_h.sub('', s)  # remove HTML tags
    s = re_comment.sub('', s)  # remove HTML comments
    # Remove extra blank lines
    blank_line = re.compile(r'\n+')
    s = blank_line.sub('\n', s)
    s = replaceCharEntity(s)  # replace entities
    return s


# Replace common HTML character entities.
# Replaces special character entities in HTML with normal characters.
# You can add new entities to CHAR_ENTITIES to handle more of them.
# @param htmlstr The HTML string.
def replaceCharEntity(htmlstr):
    CHAR_ENTITIES = {'nbsp': ' ', '160': ' ',
                     'lt': '<', '60': '<',
                     'gt': '>', '62': '>',
                     'amp': '&', '38': '&',
                     'quot': '"', '34': '"'}
    re_charEntity = re.compile(r'&#?(?P<name>\w+);')

    def _replace(match):
        # The key is the entity without '&' and ';', e.g. 'gt' for '&gt;'.
        # Unknown entities are replaced with an empty string.
        return CHAR_ENTITIES.get(match.group('name'), '')

    # A single pass, so text produced by one replacement is not expanded again
    return re_charEntity.sub(_replace, htmlstr)


def replace(s, re_exp, repl_string):
    return re_exp.sub(repl_string, s)


if __name__ == '__main__':
    # Assumes test.html is UTF-8 encoded
    with open('test.html', encoding='utf-8') as f:
        s = f.read()
    news = filter_tags(s)
    print(news)
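One limitation worth noting: the `[^<]*` in the script and style patterns of the sample code cannot cross a `<` character, so a script body containing a comparison such as `a < b` is not removed as a block. The sketch below contrasts that pattern with a non-greedy variant. The variant and the sample HTML are assumptions added for illustration, not part of the original article, and regular expressions remain a rough tool for arbitrary HTML.

```python
import re

# Made-up input: the script body contains a '<' character
html = '<p>Hi</p><script>if (a < b) { run(); }</script><p>Bye</p>'

# Pattern from the sample code: [^<]* stops at the '<' inside the script body
re_script_strict = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)

# Hypothetical variant: non-greedy .*? with re.S so '.' also matches newlines
re_script_lazy = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.I | re.S)

strict_result = re_script_strict.sub('', html)
lazy_result = re_script_lazy.sub('', html)

print(strict_result == html)  # True: nothing was removed
print(lazy_result)            # <p>Hi</p><p>Bye</p>
```

The non-greedy form stops at the first closing `</script>` it finds, which is usually what is wanted here, although a literal `</script>` inside a JavaScript string would still end the match early.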

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
