The body is generally the longest part of a web page, and extracting it is the core of the whole task. If the original article content and its style cannot be extracted, the search results will be poor and have no practical value.
The text extraction module drew on several extraction approaches, including configured templates, visual matching, and keyword recognition. After analysis, configuring templates up front turned out to be unrealistic: when searching for technical information, I cannot know in advance which websites will turn up, and I do not have the energy to configure a template for each one. So this does not work. Analysis based on visual layout is difficult and only suits well-structured websites.
At present many websites are not standardized, ad links are everywhere, and the best positions on the page are given to advertisements. I have always doubted the feasibility of the visual approach. It is more of a nice conjecture, so I have not tried it much.
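To make the template problem concrete, here is a minimal sketch of template-based extraction in Python. It is not the method used by any of the demos listed below. The TEMPLATES table, the example.com entry, and the names ContainerText and extract_body are all hypothetical, and the parser assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical per-site templates: host -> (tag, attribute, value) that
# identifies the element holding the article body on that site.
TEMPLATES = {
    "example.com": ("div", "id", "article"),
}

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContainerText(HTMLParser):
    """Collect the text inside the first element matching (tag, attr, value)."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.target = (tag, attr, value)
        self.depth = 0        # 0 means we are outside the container
        self.done = False     # set once the container has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1   # nested element inside the container
        elif tag == self.target[0] and dict(attrs).get(self.target[1]) == self.target[2]:
            self.depth = 1    # entered the container

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True  # left the container; ignore the rest of the page

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_body(url, html):
    """Return the body text, or None when no template exists for the site."""
    template = TEMPLATES.get(urlparse(url).hostname)
    if template is None:
        return None
    parser = ContainerText(*template)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For a page from a configured site the function returns the text of the matching container. For any site without an entry in TEMPLATES it returns None, which is exactly the limitation described above: every new site needs a hand-written template before anything can be extracted.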
==========================================================
Text extraction algorithms published on the Internet. You can test which one is better.
Word Network (Beijing Word Network Technology Co., Ltd.)
http://demo.cikuu.com/cgi-bin/cgi-contex
Rabbit Hunting web page text extraction
http://www.lietu.com/extract/
PHP web page text extraction
http://www.woniu.us/get_content_demo/
Web page body extraction analysis (demo)
http://61.128.196.27/txt
I personally think that http://61.128.196.27/txt is the best extractor: it can handle basically any page, and it effectively preserves the original style, pictures, and links.