Converting HTML to plain text in Python
This article describes how to convert HTML to plain text in Python. It is shared here for your reference. The details are as follows:
Today a project of mine needed to convert HTML to plain text, so I searched online for a solution. It turns out that Python offers quite a few ways to do this.
Here are the two methods I tried myself today, written down for the benefit of later readers:
Method 1:
1. Install nltk. You can install it from PyPI.
(Note: it depends on the following packages: numpy, PyYAML)
2. Test code:
>>> import nltk
>>> aa = r'''
<html>
<body>
<b>Project:</b> DeHTML<br>
<b>Description</b>:<br>
This small script is intended to allow conversion from HTML markup to
plain text.
</body>
</html>
'''
>>> aa
'\n<html>\n<body>\n<b>Project:</b> DeHTML<br>\n<b>Description</b>:<br>\nThis small script is intended to allow conversion from HTML markup to\nplain text.\n</body>\n</html>\n'
>>> print nltk.clean_html(aa)
Project: DeHTML
Description:
This small script is intended to allow conversion from HTML markup to
plain text.
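The session above is Python 2 with an older nltk. As far as I know, nltk 3.x no longer implements clean_html and points users to BeautifulSoup's get_text() instead. If you only need the basic behaviour, the idea can be approximated with the standard library. The following is a rough Python 3 sketch, not nltk's actual implementation; strip_tags is a hypothetical helper name, and regular expressions are only adequate for simple, well-formed markup.

```python
import re
from html import unescape


def strip_tags(html):
    """Hypothetical helper: crude regex-based HTML-to-text conversion."""
    # Remove <script> and <style> blocks together with their contents.
    text = re.sub(r'(?is)<(script|style)\b.*?>.*?</\1\s*>', '', html)
    # Remove HTML comments.
    text = re.sub(r'(?s)<!--.*?-->', '', text)
    # Replace every remaining tag with a space so adjacent words stay apart.
    text = re.sub(r'(?s)<[^>]*>', ' ', text)
    # Decode character references such as &amp; and &lt;.
    text = unescape(text)
    # Collapse spaces and tabs, trim each line, and drop empty lines.
    lines = (re.sub(r'[ \t]+', ' ', line).strip() for line in text.splitlines())
    return '\n'.join(line for line in lines if line)


sample = '''
<html>
<body>
<b>Project:</b> DeHTML<br>
<b>Description</b>:<br>
This small script is intended to allow conversion from HTML markup to
plain text.
</body>
</html>
'''

if __name__ == '__main__':
    print(strip_tags(sample))
```

Unlike the parser-based approach, this keeps the line breaks of the source text and ignores the meaning of tags such as <br> and <p>.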
Method 2:
If you find nltk too heavyweight for something you rarely use, you can write your own code. The code is as follows:
from HTMLParser import HTMLParser  # Python 2; on Python 3 use: from html.parser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            # Collapse runs of whitespace into a single space
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        # <p> starts a new paragraph, <br> a new line
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <br/>
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()

def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:
        # On a parse error, print the traceback and return the input unchanged
        print_exc(file=stderr)
        return text

def main():
    text = r'''
<html>
<body>
<b>Project:</b> DeHTML<br>
<b>Description</b>:<br>
This small script is intended to allow conversion from HTML markup to
plain text.
</body>
</html>
'''
    print(dehtml(text))

if __name__ == '__main__':
    main()
Running result:
>>> ================================ RESTART ================================
>>>
Project: DeHTML
Description:
This small script is intended to allow conversion from HTML markup to plain text.
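For readers on Python 3, the same idea can be sketched with html.parser from the standard library. The version below is an illustrative variation rather than the code above: the names TextExtractor and html_to_text are made up for this example, and it additionally skips the contents of <script> and <style>, which the original parser would emit as text. It does not override handle_startendtag, so <br/> is treated like <br>.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Illustrative Python 3 variant of the parser above (hypothetical name)."""

    SKIPPED = {'script', 'style'}  # tags whose contents are not page text

    def __init__(self):
        super().__init__()  # character references are decoded by default
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == 'p':
            self._parts.append('\n\n')  # paragraph break
        elif tag == 'br':
            self._parts.append('\n')    # line break

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # inside <script> or <style>
        text = data.strip()
        if text:
            # Collapse internal whitespace and keep words separated.
            self._parts.append(re.sub(r'[ \t\r\n]+', ' ', text) + ' ')

    def text(self):
        return ''.join(self._parts).strip()


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flushes any buffered trailing text
    return parser.text()


if __name__ == '__main__':
    print(html_to_text('<style>p{color:red}</style><p>Tom &amp; Jerry</p>'))
```

Like the original, this appends a space after every text fragment, so a line can end with a trailing space before its newline.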
I hope this article will help you with Python programming.