This example describes how to convert HTML to plain text in Python. It is shared here for your reference. The details are as follows:
A project I was working on today needed to convert HTML to plain text. After searching the Internet, I found that Python really is powerful and offers several ways to do this.
Here are two of them, recorded for future reference:
Method One:
1. Install NLTK, which you can get from PyPI
(Note: it depends on the following packages: NumPy, PyYAML)
2. Test the code:
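>>> import nltk
>>> # The tags of the original sample were lost from this page; the markup below is a reconstruction.
>>> aa = r'''
... <html>
... <body>
... <b>Project:</b> DeHTML<br>
... <b>Description:</b><br>
... This small script was intended to allow conversion from HTML markup to
... plain text.
... </body>
... </html>
... '''
>>> aa
'\n<html>\n<body>\n<b>Project:</b> DeHTML<br>\n<b>Description:</b><br>\nThis small script was intended to allow conversion from HTML markup to\nplain text.\n</body>\n</html>\n'
>>> print nltk.clean_html(aa)
Project: DeHTML
Description:
This small script was intended to allow conversion from HTML markup to
plain text.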
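Note that nltk.clean_html only works in NLTK 2.x. In NLTK 3 the function raises NotImplementedError, and its message recommends BeautifulSoup's get_text() instead. As a rough, dependency-free illustration of the same idea (removing scripts, comments, and tags with regular expressions), a minimal sketch for Python 3 might look like the following. The name strip_tags and the sample string are assumptions made for this illustration. This is not NLTK's implementation, and regex-based stripping is fragile on malformed HTML.

```python
import re
from html import unescape


def strip_tags(html):
    """Crude regex-based tag stripper (hypothetical helper, not NLTK's code)."""
    # Drop <script> and <style> elements together with their contents.
    cleaned = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", "", html)
    # Drop HTML comments.
    cleaned = re.sub(r"(?s)<!--.*?-->", "", cleaned)
    # Replace every remaining tag with a space so adjacent words stay separated.
    cleaned = re.sub(r"(?s)<[^>]*>", " ", cleaned)
    # Decode character references such as &amp; and &lt;.
    cleaned = unescape(cleaned)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", cleaned).strip()


# Illustrative input, not taken from the original article.
sample = "<p><b>Project:</b> DeHTML<br>Converts &lt;HTML&gt; markup to plain text.</p>"
print(strip_tags(sample))
```

Unlike the session above, this sketch collapses all whitespace, so line breaks in the input are not preserved.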
Method Two:
If you feel that NLTK is too heavyweight and overkill for this task, you can write your own code, as follows:
# Python 3 import; on Python 2 use: from HTMLParser import HTMLParser
from html.parser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc


class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        # Keep non-empty text, collapsing internal whitespace to one space.
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        # The parser reports tag names in lowercase.
        if tag == 'p':
            self.__text.append('\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <br/>.
        if tag == 'br':
            self.__text.append('\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:
        # On any parsing failure, log the traceback and return the input unchanged.
        print_exc(file=stderr)
        return text


def main():
    # The tags of the original sample were lost from this page;
    # the markup below is a reconstruction.
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description:</b><br>
                This small script was intended to allow conversion from HTML markup to
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()
Output:
>>> ================================ RESTART ================================
>>>
Project: DeHTML
Description:
This small script was intended to allow conversion from HTML markup to plain text.
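One limitation of a hand-written parser like this is that it treats the contents of script and style elements as ordinary text. A minimal sketch of one way to skip them, assuming Python 3's html.parser, is to keep a counter while inside such elements and ignore any data seen there. The names TextExtractor and html_to_text, and the sample page, are hypothetical and chosen only for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring <script> and <style> contents."""

    SKIPPED = {"script", "style"}
    LINE_BREAKS = {"br", "p"}

    def __init__(self):
        # The default convert_charrefs=True decodes entities such as &amp;.
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.LINE_BREAKS:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside script/style elements.
        if self._skip_depth == 0:
            text = " ".join(data.split())
            if text:
                self._parts.append(text + " ")

    def text(self):
        # Trim the trailing space left on each line, then the whole result.
        lines = "".join(self._parts).splitlines()
        return "\n".join(line.strip() for line in lines).strip()


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Illustrative input, not taken from the original article.
page = ("<style>b {color: red}</style><b>Project:</b> DeHTML<br>"
        "Tom &amp; Jerry<script>x = 1;</script>")
print(html_to_text(page))
```

The same counter approach could be extended to other elements whose contents should not appear in the output.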
I hope this article helps you with your Python programming.