This article shows how to convert HTML to plain text in Python. It is shared here for your reference; the details are as follows:
Today a project needed to convert HTML to plain text. After searching the web, I found that Python is powerful and offers several different ways to do this.
Here are the two methods I tried today, recorded for the benefit of those who come later:
Method One:
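1. Install NLTK; you can install it from PyPI.
(Note: it depends on the following packages: NumPy, PyYAML. The clean_html function used below exists only in NLTK 2.x; NLTK 3 removed it, and calling it raises NotImplementedError with a pointer to BeautifulSoup's get_text().)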
2. Test code:
>>> import nltk
>>> aa = r'''
<body>
<b>Project:</b> dehtml<br>
<b>Description</b>:<br>
This small script is intended to allow conversion from HTML markup to
plain text.
</body>
'''
>>> aa
'\n<body>\n<b>Project:</b> dehtml<br>\n<b>Description</b>:<br>\nThis small script is intended to allow conversion from HTML markup to\nplain text.\n</body>\n'
>>> print nltk.clean_html(aa)
Project: dehtml
Description :
This small script is intended to allow conversion from HTML markup to
plain text.
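For readers who cannot install NLTK, the tag-stripping idea can be roughly approximated with the standard library alone. The following is only a minimal sketch: strip_tags_regex is a hypothetical helper name, the regular expressions are assumptions made for illustration and are not NLTK's implementation, and character entities such as &amp; are left undecoded.

```python
import re


def strip_tags_regex(html):
    """Hypothetical helper: crude regex-based HTML-to-text conversion."""
    # Drop <script> and <style> elements together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", "", html)
    # Drop HTML comments.
    text = re.sub(r"(?s)<!--.*?-->", "", text)
    # Replace every remaining tag with a space so neighbouring words stay apart.
    text = re.sub(r"(?s)<[^>]*>", " ", text)
    # Collapse runs of spaces and tabs, then trim each line.
    text = re.sub(r"[ \t]+", " ", text)
    lines = [line.strip() for line in text.splitlines()]
    # Keep only the non-empty lines.
    return "\n".join(line for line in lines if line)


sample = """
<body>
<b>Project:</b> dehtml<br>
<b>Description</b>:<br>
This small script is intended to allow conversion from HTML markup to
plain text.
</body>
"""

print(strip_tags_regex(sample))
```

Regular expressions cannot handle every malformed or nested piece of markup, so this is suitable only for simple, well-behaved input.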
Method Two:
If you think NLTK is too heavyweight for such a small job, you can write your own code, as follows:
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
from re import sub
from sys import stderr
from traceback import print_exc


class _DeHTMLParser(HTMLParser):
    """Collects the text content of an HTML document."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            # Collapse internal runs of whitespace into a single space.
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == 'p':
            self.__text.append('\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <br/>.
        if tag == 'br':
            self.__text.append('\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:
        # On a parsing error, log the traceback and return the input unchanged.
        print_exc(file=stderr)
        return text


def main():
    text = r'''
<body>
<b>Project:</b> dehtml<br>
<b>Description</b>:<br>
This small script is intended to allow conversion from HTML markup to
plain text.
</body>
'''
    print(dehtml(text))


if __name__ == '__main__':
    main()
Run Result:
>>> ================================ Restart ================================
>>>
Project: dehtml
Description :
This small script is intended to allow conversion from HTML markup to plain text.
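To see why the parser above builds its output from handle_data and handle_starttag, it can help to log which callbacks HTMLParser fires for one line of the sample. The following is a minimal sketch for Python 3; EventLogger and its events list are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records parser callbacks in the order they fire."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only chunks, as the dehtml parser does.
        if data.strip():
            self.events.append(("data", data.strip()))


logger = EventLogger()
logger.feed("<b>Project:</b> dehtml<br>")
logger.close()

for event in logger.events:
    print(event)
```

The text arrives in separate chunks ("Project:" and "dehtml") split at the tag boundaries, which is why each chunk gets a trailing space appended before the pieces are joined, and why the br tag is the place to emit a line break.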
I hope this article will help you with your Python programming.