How Python uses regular expressions to filter or replace HTML tags

Source: Internet
Author: User
Tags: cdata, html, comment
This article describes how to filter or replace HTML tags in Python with regular expressions. It briefly introduces the relevant regular expression syntax and then walks through a concrete example of regex-based HTML tag filtering and character entity replacement.

The example in this article shows how to filter or replace HTML tags with Python regular expressions. It is shared here for reference; the details are as follows:

Key points of Python regular expressions:

Python regular expression metacharacters and escape sequences:

. matches any character except a line break
\w matches a letter, digit, or underscore (for str patterns in Python 3, also other Unicode word characters such as Chinese characters)
\s matches any whitespace character
\d matches a digit
\b matches the beginning or end of a word (a word boundary)
^ matches the start of the string
$ matches the end of the string
\W matches any character that is not a letter, digit, or underscore
\S matches any character that is not whitespace
\D matches any non-digit character
\B matches a position that is not the beginning or end of a word
[^x] matches any character except x
[^aeiou] matches any character except the letters a, e, i, o, u
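
The character classes and anchors listed above can be tried out with a short script. This is only a minimal sketch; the sample strings and variable names are made up for illustration and do not come from the original article.

```python
import re

text = "Order 66: ship 12 boxes, not 3_crates!"

# \d+ matches each run of digits.
digits = re.findall(r"\d+", text)

# \b\w+\b matches whole words: letters, digits, and underscores
# between two word boundaries.
words = re.findall(r"\b\w+\b", text)

# \s+ matches runs of whitespace (spaces, tabs, newlines).
fields = re.split(r"\s+", "a  b\tc\nd")

# \D matches every non-digit, so deleting \D keeps only the digits.
only_digits = re.sub(r"\D", "", "tel: 555-0199")

# [^aeiou] matches any single character that is not a lowercase vowel.
consonants = re.findall(r"[^aeiou]", "regex")

# ^ and $ anchor the pattern to the start and end of the string.
is_number = re.search(r"^\d+$", "2024") is not None
is_not_number = re.search(r"^\d+$", "2024a") is None

print(digits)       # ['66', '12', '3']
print(words)        # ['Order', '66', 'ship', '12', 'boxes', 'not', '3_crates']
print(fields)       # ['a', 'b', 'c', 'd']
print(only_digits)  # 5550199
print(consonants)   # ['r', 'g', 'x']
```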

Common Python regular expression quantifiers and their meaning:

* repeats zero or more times
+ repeats one or more times
? repeats zero or one time
{n} repeats exactly n times
{n,} repeats n or more times
{n,m} repeats n to m times
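
A quick way to see how the quantifiers differ is to run them against the same input. The sketch below uses made-up sample strings. The last two lines also show the lazy form `*?`, which is not in the list above but matters for HTML, because a greedy `.*` runs through to the last `>` in the string.

```python
import re

# *  : zero or more "b" after "a"
star = re.findall(r"ab*", "a ab abbb")
# +  : one or more "b" after "a"
plus = re.findall(r"ab+", "a ab abbb")
# ?  : the "u" is optional
optional = re.findall(r"colou?r", "color colour")
# {n}   : exactly four digits
exact = re.findall(r"\b\d{4}\b", "7 2024 123456")
# {n,}  : three or more digits
at_least = re.findall(r"\b\d{3,}\b", "7 2024 123456")
# {n,m} : two to four digits
ranged = re.findall(r"\b\d{2,4}\b", "7 42 2024 123456")

# Greedy versus lazy matching on a tag-like string.
greedy = re.findall(r"<.*>", "<b>x</b>")
lazy = re.findall(r"<.*?>", "<b>x</b>")

print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['color', 'colour']
print(exact)     # ['2024']
print(at_least)  # ['2024', '123456']
print(ranged)    # ['42', '2024']
print(greedy)    # ['<b>x</b>']
print(lazy)      # ['<b>', '</b>']
```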

Named groups in Python regular expressions:
Named group: (?P<name>...)
Lookaround assertions are also worth knowing. They all begin with a question mark; a '<' marks a lookbehind (checking the text before the current position), and a '!' marks a negative assertion:
Positive lookbehind (?<=...)
Positive lookahead (?=...)
Negative lookbehind (?<!...)
Negative lookahead (?!...)
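
The named group and the four lookaround forms can be illustrated as follows. This is a hedged sketch with invented sample strings. Lookarounds check the surrounding text without consuming it, so only the part outside the assertion appears in the match.

```python
import re

# Named group: capture the entity name between "&" and ";".
m = re.search(r"&(?P<name>\w+);", "5 &lt; 7")
entity_name = m.group("name")

sample = "cost $40, weight 12kg"

# Positive lookbehind: digits that are preceded by "$".
prices = re.findall(r"(?<=\$)\d+", sample)

# Positive lookahead: digits that are followed by "kg".
weights = re.findall(r"\d+(?=kg)", sample)

# Negative lookbehind: digit runs not preceded by "$" or by another digit.
not_prices = re.findall(r"(?<![$\d])\d+", sample)

# Negative lookahead: opening tags whose name is not "br".
tags = re.findall(r"<(?!br\b)\w+[^>]*>", "<p>line<br>next<b>bold</b></p>")

print(entity_name)  # lt
print(prices)       # ['40']
print(weights)      # ['12']
print(not_prices)   # ['12']
print(tags)         # ['<p>', '<b>']
```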

Sample code: removing (filtering) HTML tags with Python regular expressions

# -*- coding: utf-8 -*-
import re


## Filter the tags in HTML.
# Removes tags and similar markup from an HTML string.
# @param htmlstr The HTML string.
def filter_tags(htmlstr):
    # Filter CDATA first; this matches script-style //<![CDATA[ ... //]]> wrappers
    re_cdata = re.compile(r'//<!\[CDATA\[[^>]*//\]\]>', re.I)  # match CDATA
    re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)  # script
    re_style = re.compile(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', re.I)  # style
    re_br = re.compile(r'<br\s*?/?>')  # line breaks
    re_h = re.compile(r'</?\w+[^>]*>')  # HTML tags
    re_comment = re.compile(r'<!--[^>]*-->')  # HTML comments
    s = re_cdata.sub('', htmlstr)  # remove CDATA
    s = re_script.sub('', s)  # remove script
    s = re_style.sub('', s)  # remove style
    s = re_br.sub('\n', s)  # convert br to a newline
    s = re_h.sub('', s)  # remove HTML tags
    s = re_comment.sub('', s)  # remove HTML comments
    # Remove extra blank lines
    blank_line = re.compile(r'\n+')
    s = blank_line.sub('\n', s)
    s = replaceCharEntity(s)  # replace entities
    return s


## Replace common HTML character entities.
# Replaces special character entities in HTML with normal characters.
# You can add new entities to CHAR_ENTITIES to handle more of them.
# @param htmlstr The HTML string.
def replaceCharEntity(htmlstr):
    CHAR_ENTITIES = {'nbsp': ' ', '160': ' ',
                     'lt': '<', '60': '<',
                     'gt': '>', '62': '>',
                     'amp': '&', '38': '&',
                     'quot': '"', '34': '"'}
    re_charEntity = re.compile(r'&#?(?P<name>\w+);')

    def _replace(match):
        # The entity without "&" and ";", such as "gt" for "&gt;"
        key = match.group('name')
        # Unknown entities are replaced with an empty string
        return CHAR_ENTITIES.get(key, '')

    # A single pass replaces each entity once, so "&amp;lt;" becomes "&lt;"
    # instead of being decoded a second time into "<".
    return re_charEntity.sub(_replace, htmlstr)


def replace(s, re_exp, repl_string):
    return re_exp.sub(repl_string, s)


if __name__ == '__main__':
    # Assumes test.html exists in the current directory and is UTF-8 encoded
    with open('test.html', encoding='utf-8') as f:
        s = f.read()
    news = filter_tags(s)
    print(news)
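
The entity table in the code above covers only a handful of entities. As an alternative for that one step, Python 3.4 and later ship `html.unescape` in the standard library, which decodes the full set of named and numeric entities. The sketch below is not the article's method; it swaps in the standard library function, and the sample strings are made up. Note that it decodes `&nbsp;` to a non-breaking space (U+00A0) rather than an ordinary space.

```python
import html
import re

# Decode named and numeric character references in one call.
sample = "&lt;p&gt;Tom &amp; Jerry&#33; &copy; 2024&lt;/p&gt;"
decoded = html.unescape(sample)

# Strip tags first and decode entities afterwards, so that a decoded "<"
# is never mistaken for the start of a tag.
re_tag = re.compile(r"</?\w+[^>]*>")
stripped = re_tag.sub("", "<p>1 &lt; 2 &amp;&amp; 3 &gt; 2</p>")
plain_text = html.unescape(stripped)

print(decoded)     # <p>Tom & Jerry! © 2024</p>
print(plain_text)  # 1 < 2 && 3 > 2
```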