When crawling the content of a single Web site, regular-expression matching is usually sufficient, but page structures vary so widely between sites that a single uniform regular expression is difficult to write. The author of the general Web page body extraction algorithm based on the row-block distribution function surveys the methods for extracting an article's body from a Web page, proposes a body extraction algorithm based on the row-block distribution, and provides implementations in PHP and Java.
The algorithm rests on two observations: (1) body-area density: once all tags in the HTML are removed, the body area has a higher character density and fewer runs of consecutive blank lines; (2) row-block length: content outside the body area, which sits in separate tags (row blocks), is generally shorter.
The algorithm steps are as follows:
(1) Remove all tags, including style and JS script content, but retain the original line breaks (\n).
(2) Split the page content into lines, define the row block $block_i$ as the text of lines $[i, i + blocksize]$ taken together, and compute the distribution function of row-block length over the line number.
(3) The body lies around the longest row block; extend the range in both directions until the row-block length drops to 0.
(4) If images in the body area also need to be extracted, it is only necessary, when removing tags in step (1), to preserve <im
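The steps in the summary above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the PHP or Java implementation the author provides: the function names strip_tags, block_lengths, and extract_body are hypothetical, the default blocksize of 3 is an assumed value, the range [i, i + blocksize] is read as inclusive, and image handling is left out.

```python
import re


def strip_tags(html):
    """Step 1: remove markup but keep the line breaks of the remaining text."""
    # Drop comments, then <script>/<style> elements together with their content.
    html = re.sub(r"(?is)<!--.*?-->", "", html)
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    # Drop every remaining tag; the text between tags is left untouched.
    return re.sub(r"<[^>]*>", "", html)


def block_lengths(lines, blocksize=3):
    """Step 2: length of row block i = characters in lines i .. i + blocksize."""
    # Count only non-whitespace characters so indentation does not inflate a block.
    sizes = [len(re.sub(r"\s+", "", line)) for line in lines]
    # Assumption: the range is inclusive; blocks near the end are simply shorter.
    return [sum(sizes[i:i + blocksize + 1]) for i in range(len(sizes))]


def extract_body(html, blocksize=3):
    """Step 3: start from the longest row block and grow until the length hits 0."""
    lines = strip_tags(html).split("\n")
    blocks = block_lengths(lines, blocksize)
    if not blocks or max(blocks) == 0:
        return ""
    peak = blocks.index(max(blocks))
    start = peak
    while start > 0 and blocks[start - 1] > 0:
        start -= 1
    end = peak
    while end + 1 < len(blocks) and blocks[end + 1] > 0:
        end += 1
    # Any text counted in block `end` must sit on line `end` itself,
    # because the following block (if there is one) has length 0.
    body = [line.strip() for line in lines[start:end + 1]]
    return "\n".join(line for line in body if line)
```

Calling extract_body(html) on a page's source returns the candidate body text with blank lines dropped. A larger blocksize tolerates more blank lines inside the body but also makes it easier for nearby navigation or footer text to be absorbed into the result.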
1. "Python Tutorial" Web page body and content image extraction algorithm
Introduction: When crawling the content of a single Web site, regular-expression matching is usually sufficient, but page structures vary so widely between sites that a single uniform regular expression is difficult to write. The author of the general Web page body extraction algorithm based on the row-block distribution function surveys the methods for extracting an article's body from a Web page, proposes a body extraction algorithm based on the row-block distribution, and provides implementations in PHP and Java. The main principle of this algorithm is based on two points:
2. PHP Extract page Body content Example _php tutorial
Introduction: An example of extracting the body text of a Web page with PHP. The difficulty lies in identifying and keeping the article portion of the Web page while deleting other useless information, and in doing so in a general way, not like a train
3. Where is the body information of a Web page generally stored _html/css_web-itnose
Summary: Where the body information of a Web page is generally stored
4. PHP extracts the contents of the page text example
Introduction: An example of extracting the body text of a Web page with PHP. The difficulty lies in identifying and keeping the article portion of the Web page while deleting other useless information, and in doing so in a general way, not like a train
5. Deep analysis using Python to crawl the text source of the Web page
Introduction: When you open a Web page, there is usually a lot of navigation, advertising, and other information in addition to the text of the article. This article shows how to extract the body content of an article from a Web page and filter out the other, irrelevant information.
6. JavaScript changes the font size method collection [Original]_javascript Tips
Introduction: Provides a function for switching the body text of a Web page between several font sizes (small, medium, and so on). JS code sets the fontSize property of the div's style.
7. JS Gets the height and width of the DOM (visible areas and sections, etc.) _javascript tips
Description: Covers the width and height of the Web page's visible area, the width and height of the full page body, and the left and right position of the body part of the page; see the article for details.
"Related question and answer recommendation":
Objective-C - Open-source library for extracting Web page body text on iOS
JavaScript - How the Evernote Chrome plugin is implemented