Chapter 3 of the book (p. 87) has a piece of code that deals with HTML:
>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens
But running it raises the following error:
>>> raw = nltk.clean_html(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/nltk/util.py", line 356, in clean_html
    raise NotImplementedError("To remove HTML markup, use BeautifulSoup's get_text() function")
NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function
The module source on the official website (http://www.nltk.org/_modules/nltk/util.html) shows why:
def clean_html(html):
    raise NotImplementedError("To remove HTML markup, use BeautifulSoup's get_text() function")

def clean_url(url):
    raise NotImplementedError("To remove HTML markup, use BeautifulSoup's get_text() function")
From Stack Overflow (http://stackoverflow.com/questions/10524387/beautifulsoup-get-text-does-not-strip-all-tags-and-javascript):
Later versions of NLTK no longer seem to support the two functions clean_html() and clean_url():
Support for clean_html and clean_url is dropped for the future versions of NLTK. Please use BeautifulSoup for now...it's very unfortunate.
To work with HTML, you can use the Beautiful Soup package, available at http://www.crummy.com/software/BeautifulSoup/.
Installation: sudo pip install beautifulsoup4
Then replace the code in the book with:
from __future__ import division
import nltk, re, pprint
from urllib import urlopen
from bs4 import BeautifulSoup

def read_html():
    url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
    html = urlopen(url).read()
    soup = BeautifulSoup(html)
    text = soup.get_text()
    print text
    tokens = nltk.word_tokenize(text)
    print tokens

def main():
    read_html()

if __name__ == '__main__':
    main()
The script above can be run on its own, and the result is consistent with the book.
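The traceback shows this note was written against Python 2.7, where `from urllib import urlopen` and the `print` statement are valid. As a rough sketch only, the same idea might look like this on Python 3. The names `fetch_html` and `clean_html` here are hypothetical helpers written for this note, not part of NLTK or Beautiful Soup, and the explicit removal of `<script>` and `<style>` is an added assumption based on the Stack Overflow thread title, since some bs4 releases include their contents in `get_text()`.

```python
from urllib.request import urlopen  # Python 3 home of urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Hypothetical helper: download a page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def clean_html(html):
    """Hypothetical stand-in for the removed nltk.clean_html()."""
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(html, "html.parser")
    # Some bs4 releases keep <script>/<style> contents in get_text(),
    # so drop those tags before extracting text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the remaining text fragments with single spaces.
    return soup.get_text(separator=" ", strip=True)


# Small inline sample so the sketch runs without network access.
sample = (
    "<html><head><title>Demo</title>"
    "<script>var x = 1;</script></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
raw = clean_html(sample)
print(raw)
```

With NLTK and its tokenizer data installed, the cleaned text could then be passed on as in the book, for example `nltk.word_tokenize(clean_html(fetch_html(url)))`.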
Python Natural Language Processing - Learning Notes: Chapter 3 error correction