Webpage Denoising: Open-Source Projects for Extracting the Main Body of a Webpage

Source: Internet
Author: User

(1) Webpage Denoising

Webpage denoising means removing text that is irrelevant to the main content of a page, such as advertisements and comments. Many applications already do this for blog and news pages; frequently used examples include Evernote and Youdao Note, which rely on related techniques.

Our project also needs to denoise webpages and keep only the useful content, so I searched online for open-source projects that do webpage denoising.

(2) Reference Links

The main reference is a post titled "webpage Text Extraction Tool", which appears to have come from Sina Weibo. It lists the addresses of projects implemented in Java, C++, C#, Perl, and Python.

Because our project is written in Python, I selected decruft, python-readability, python-boilerpipe, and python-goose.

(3) Practice

Using python-readability:

    from readability.readability import Document
    from urllib.request import urlopen  # Python 3; the Python 2 equivalent is urllib.urlopen

    # url is the address of the page to extract
    html = urlopen(url).read()
    doc = Document(html)
    readable_article = doc.summary()
    readable_title = doc.short_title()

The extracted readable_article is still text with HTML tags, so the HTML needs to be cleaned as well. To obtain plain text content, further processing is required.
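As a rough illustration of that further processing, the sketch below strips tags from an HTML fragment using only the Python 3 standard library. The names html_to_text and _TextCollector are hypothetical helpers introduced here, not part of python-readability, and the set of tags treated as block-level is an assumption that may need adjusting for real pages.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}
    # Assumed block-level tags: each one forces a line break in the output.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        raw_lines = "".join(self._parts).splitlines()
        lines = (" ".join(line.split()) for line in raw_lines)
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return the visible text of an HTML fragment, one block per line."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With the readability result from above, this could be called as html_to_text(readable_article). A dedicated HTML library would handle malformed markup more robustly; this is only meant to show the idea.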

 

"Decruft is a fork of Python-readability to make it faster. It also has some logic corrections and improvements along the way."(From: http://www.minvolai.com/blog/decruft-arc90s-readability-in-python)

Decruft is a fork of python-readability that aims to make it faster. Its source code is hosted on Google Code, where I found only one release, version 0.1, dated September. python-readability, on the other hand, is still being updated, and its core file readability.py was updated seven months ago, so there is no guarantee that decruft performs better than the current python-readability. I did not download decruft to test it. If you are interested, please try it yourself.

python-boilerpipe: this is the Python version of boilerpipe. It depends on JPype and chardet. When constructing an extractor, you can choose the extraction strategy you need. The available extractors are:

  • DefaultExtractor
  • ArticleExtractor
  • ArticleSentencesExtractor
  • KeepEverythingExtractor
  • KeepEverythingWithMinKWordsExtractor
  • LargestContentExtractor
  • NumWordsRulesExtractor
  • CanolaExtractor

With this project you can choose the format of the extracted body content: either plain text or HTML.
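A minimal sketch of how this might look in code, assuming the python-boilerpipe package is installed along with JPype and a Java runtime. The helper extract_body and the KNOWN_EXTRACTORS tuple are hypothetical names introduced for illustration; the Extractor class and its getText and getHTML methods are taken from the python-boilerpipe README as I understand it.

```python
# Extractor names as listed for boilerpipe.
KNOWN_EXTRACTORS = (
    "DefaultExtractor",
    "ArticleExtractor",
    "ArticleSentencesExtractor",
    "KeepEverythingExtractor",
    "KeepEverythingWithMinKWordsExtractor",
    "LargestContentExtractor",
    "NumWordsRulesExtractor",
    "CanolaExtractor",
)


def extract_body(html, extractor_name="ArticleExtractor", as_html=False):
    """Extract the main body of an HTML page with python-boilerpipe.

    Returns plain text by default, or HTML when as_html is True.
    """
    if extractor_name not in KNOWN_EXTRACTORS:
        raise ValueError("unknown extractor: %s" % extractor_name)

    # Imported lazily because it starts a JVM through JPype.
    from boilerpipe.extract import Extractor

    extractor = Extractor(extractor=extractor_name, html=html)
    return extractor.getHTML() if as_html else extractor.getText()
```

For example, extract_body(page_html) would return plain text, and extract_body(page_html, as_html=True) the HTML form. Validating the name up front avoids starting the JVM for a typo.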

 

python-goose:

After testing, I decided to use Goose. You can try its extraction results online at http://jimplush.com/blog/goose. Besides the body, Goose can also obtain the page's meta description.

The final output of Goose is the extracted plain text.
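The following is a small sketch of reading those fields, assuming the python-goose package is installed and that its article object exposes title, meta_description, and cleaned_text as described in its README. The helpers article_fields and extract_with_goose are hypothetical names introduced here.

```python
def article_fields(article):
    """Pick out the fields of interest from a Goose article object."""
    return {
        "title": article.title,
        "description": article.meta_description,  # the page's meta description
        "text": article.cleaned_text,             # extracted plain-text body
    }


def extract_with_goose(url):
    """Fetch a page with Goose and return its title, description, and text."""
    # Imported lazily so this module loads even where goose is not installed.
    from goose import Goose

    article = Goose().extract(url=url)
    return article_fields(article)
```

Calling extract_with_goose with a page address would return a dictionary whose "text" entry is the denoised plain text.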

 
