(1) Webpage Denoising
Webpage denoising means removing text that is irrelevant to a page's main content, such as advertisements and comments. Many applications already do this for blog and news pages, for example the widely used Evernote and Youdao Note, which rely on related techniques.
Our project also needs to denoise webpages and keep only the useful content, so I looked for open-source webpage denoising projects on the Internet.
(2) Reference Links
The main reference is the "Webpage Text Extraction Tool" page, which appears to be content collected from Sina Weibo. It lists the addresses of extraction projects written in Java, C++, C#, Perl, and Python.
Because our project is written in Python, I selected decruft, python-readability, python-boilerpipe, and python-goose.
(3) Practice
Use of Python Readability:
    import urllib  # Python 2; on Python 3 use urllib.request.urlopen instead
    from readability.readability import Document

    # url holds the address of the page to extract
    html = urllib.urlopen(url).read()
    doc = Document(html)                 # parse once, reuse for both calls
    readable_article = doc.summary()     # main content, still HTML
    readable_title = doc.short_title()

The extracted readable_article is still text with HTML tags, so the HTML has to be cleaned in a further step. If you want plain text, that extra work is up to you.
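The cleaning step mentioned above could look roughly like the following. This is only a minimal sketch using the Python 3 standard library, not what python-readability itself does; the names `html_to_text` and `_TextCollector` are made up for illustration, and the set of tags treated as line breaks is an assumption.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}
    # Tags assumed to start a new line in the plain-text output.
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace and drop empty lines.
        raw_lines = "".join(self._chunks).splitlines()
        lines = (" ".join(line.split()) for line in raw_lines)
        return "\n".join(line for line in lines if line)


def html_to_text(fragment):
    """Return the visible text of an HTML fragment, one block per line."""
    parser = _TextCollector()
    parser.feed(fragment)
    parser.close()
    return parser.text()
```

Calling `html_to_text(readable_article)` on the summary HTML would then give one line of text per paragraph-like element.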
"Decruft is a fork of Python-readability to make it faster. It also has some logic corrections and improvements along the way."(From: http://www.minvolai.com/blog/decruft-arc90s-readability-in-python)
Decruft is a fork of python-readability that aims to make it faster. Its source code is hosted on Google Code, where I found only version 0.1, released in September. Python-readability, on the other hand, is still being updated: its core file, readability.py, was changed seven months ago. So there is no guarantee that decruft performs better than the current python-readability. I did not download decruft for testing; if you are interested, please try it yourself.
Python-boilerpipe: this is the Python version of boilerpipe. It depends on jpype and chardet. When constructing an extractor, you can choose the kind you need from the following:
DefaultExtractor
ArticleExtractor
ArticleSentencesExtractor
KeepEverythingExtractor
KeepEverythingWithMinKWordsExtractor
LargestContentExtractor
NumWordsRulesExtractor
CanolaExtractor
The extracted body can be returned in either of two formats: plain text or HTML.
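As a rough sketch of how choosing an extractor and an output format might look, the helper below wraps python-boilerpipe's `Extractor` class with its `getText()` and `getHTML()` methods. The function name `extract_body` and its parameters are hypothetical, and the import is deferred so that the name check works even where jpype and a Java runtime are not installed.

```python
# Extractor names offered by boilerpipe, as listed in the text.
EXTRACTORS = (
    "DefaultExtractor",
    "ArticleExtractor",
    "ArticleSentencesExtractor",
    "KeepEverythingExtractor",
    "KeepEverythingWithMinKWordsExtractor",
    "LargestContentExtractor",
    "NumWordsRulesExtractor",
    "CanolaExtractor",
)


def extract_body(url, extractor="ArticleExtractor", as_html=False):
    """Extract the main body of the page at `url` (hypothetical helper).

    Returns plain text by default, or HTML when `as_html` is true.
    """
    if extractor not in EXTRACTORS:
        raise ValueError("unknown extractor: %s" % extractor)
    # Imported lazily: python-boilerpipe needs jpype and a Java runtime.
    from boilerpipe.extract import Extractor
    ex = Extractor(extractor=extractor, url=url)
    return ex.getHTML() if as_html else ex.getText()
```

For news or blog pages, `ArticleExtractor` is the usual starting point; the other names can be swapped in to compare results on the same URL.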
Python-Goose:
After testing, I decided to use Goose. Its extraction quality can be tried out online at http://jimplush.com/blog/goose. Goose can also obtain the page's meta description.
In the end, Goose returns the extracted content as plain text.
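A minimal sketch of pulling these fields out with python-goose might look like this. It assumes the `goose` package's `Goose().extract(url=...)` call and the `title`, `meta_description`, and `cleaned_text` attributes of the returned article; the helper names `fetch_article` and `article_fields` are invented for illustration.

```python
def fetch_article(url):
    """Download and extract the page at `url` with python-goose."""
    # Imported lazily so the rest of this sketch works without goose installed.
    from goose import Goose
    return Goose().extract(url=url)


def article_fields(article):
    """Collect the parts of an extracted article that the text mentions."""
    return {
        "title": article.title,
        "description": article.meta_description,  # the page's meta description
        "text": article.cleaned_text,             # plain text, tags removed
    }
```

Typical use would be `article_fields(fetch_article(url))`, which keeps the network call separate from the field selection so the latter is easy to check on its own.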