A comparison of web page body-text extraction algorithms published on the Internet

Source: Internet
Author: User

The body text is generally the longest part of a web page, and extracting it is the core of the task. If the original article content and its styling cannot be extracted, the search results built on it will be poor and of little practical use.
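The "body is the longest part" observation can be sketched in a few lines of Python. This is only a minimal illustration and not the method used by any of the demos listed in this article. The class and function names (`BlockCollector`, `longest_block`) and the choice of which tags count as containers are assumptions made for the example. It relies only on the standard-library `html.parser` module.

```python
from html.parser import HTMLParser

# Assumption of this sketch: these tags delimit independent text blocks.
CONTAINER_TAGS = {"div", "td", "th", "li", "article", "section",
                  "nav", "header", "footer", "aside", "ul", "ol", "table"}
# Tags that only separate words inside one block (paragraphs stay together).
SOFT_BREAK_TAGS = {"p", "br", "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose contents are never visible text.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects visible text, split into blocks at container tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished text blocks
        self._current = []    # text pieces of the block being built
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def flush(self):
        # Join the collected pieces and normalize runs of whitespace.
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def _on_tag(self, tag):
        if tag in CONTAINER_TAGS:
            self.flush()
        elif tag in SOFT_BREAK_TAGS:
            self._current.append(" ")

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        else:
            self._on_tag(tag)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        else:
            self._on_tag(tag)

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._current.append(data)


def longest_block(html):
    """Return the longest visible text block, or "" if the page has none."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()  # keep any text that follows the last container tag
    return max(collector.blocks, key=len, default="")
```

Calling `longest_block(page_source)` on a page whose article sits in one container returns that container's text with the navigation and ad blocks dropped. Note that the sketch returns plain text only, so it does not preserve the original styling, images, or links, and a long comment thread or link list could still beat a short article.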

Several extraction approaches were considered for the text extraction module, including configured templates, visual matching, and keyword recognition. After analysis, configuring templates up front is unrealistic: when searching for technical information there is no telling which website will turn up, and I do not have the energy to configure a template for each one, so this approach does not work. Analysis based on visual layout is difficult and only suits well-structured websites. At present many websites are not standardized and ad links are everywhere, with the best positions on the page given to advertisements. I have always doubted the feasibility of this approach; it looks more like a nice hypothesis, so I did not experiment with it much.


==========================================================

The following text extraction algorithms are published on the Internet; you can test them to see which works better.


Word network-Beijing word Network Technology Co., Ltd.
http://demo.cikuu.com/cgi-bin/cgi-contex


Rabbit hunting webpage Text Extraction
http://www.lietu.com/extract/


PHP web page text extraction
http://www.woniu.us/get_content_demo/


Webpage body extraction analysis (DEMO)
http://61.128.196.27/txt


Personally, I think http://61.128.196.27/txt gives the best extraction: it handles basically any page and effectively preserves the original styling, images, and links.

 

 

 
