When crawling a single Web site, regular expressions are usually enough to extract the content, but page structure varies so much from site to site that no single regular expression can match them all. The paper "General Web Page Body Extraction Algorithm Based on the Line-Block Distribution Function" summarizes existing methods for extracting the article body from a Web page, proposes a body extraction algorithm based on line-block distribution, and provides PHP and Java implementations. The algorithm rests on two observations: 1, body-area density: once all HTML tags are removed, the body area has a higher character density and fewer runs of consecutive blank lines; 2, line-block length: content outside the body area is generally short within each individual tag (line block). The algorithm steps are as follows:
1, Remove all tags, including the contents of style and JS script elements, but keep the original newline characters (\n).
2, Split the page content into lines and define line block $block_i$ as the concatenated text of the lines in $[i, i + blocksize]$; the length of $block_i$ as a function of the line number $i$ gives the line-block length distribution function.
3, The body lies around the longest line block; starting from that block, extend in both directions until the line-block length drops to 0, and take that range as the body.
4, If you also need the images inside the body area, simply keep the img tags when removing tags in step 1.
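The four steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the paper's PHP or Java implementation: the function names (`strip_tags`, `block_lengths`, `extract_body`), the default `blocksize=3`, the regular expressions, and the choice to measure line length after removing whitespace are all assumptions made for the example.

```python
import re
from html import unescape


def strip_tags(page, keep_images=False):
    """Step 1: remove script/style content, comments and tags; keep newlines."""
    page = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", page)
    page = re.sub(r"(?s)<!--.*?-->", "", page)
    if keep_images:
        # Step 4: remove every tag except <img ...>, which stays in the text.
        page = re.sub(r"(?is)<(?!img\b)[^>]*>", "", page)
    else:
        page = re.sub(r"(?s)<[^>]*>", "", page)
    return unescape(page)


def block_lengths(lines, blocksize=3):
    """Step 2: length of each line block, i.e. lines i .. i + blocksize."""
    # Assumption: whitespace does not count toward the length of a line.
    sizes = [len(re.sub(r"\s+", "", line)) for line in lines]
    # The closed interval [i, i + blocksize] covers blocksize + 1 lines.
    return [sum(sizes[i:i + blocksize + 1]) for i in range(len(sizes))]


def extract_body(page, blocksize=3, keep_images=False):
    """Step 3: grow outward from the longest block until the length hits 0."""
    lines = strip_tags(page, keep_images).split("\n")
    lengths = block_lengths(lines, blocksize)
    if not lengths or max(lengths) == 0:
        return ""
    peak = lengths.index(max(lengths))
    start = peak
    while start > 0 and lengths[start - 1] > 0:
        start -= 1
    end = peak
    while end < len(lengths) - 1 and lengths[end + 1] > 0:
        end += 1
    # Drop the blank lines that remain inside the selected range.
    body = [line.strip() for line in lines[start:end + 1]]
    return "\n".join(line for line in body if line)
```

Calling `extract_body(page)` returns the text of the densest region, and `extract_body(page, keep_images=True)` leaves the `<img>` tags of that region in the result. Because the range only stops where a whole block is empty, navigation or footer text is separated from the body only if at least `blocksize + 1` blank lines lie between them, so `blocksize` may need tuning per site.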