"Python Tutorial" An algorithm for extracting the body text and content images of a Web page

Source: Internet
Author: User
When crawling the content of a single Web site, regular expressions are usually enough. Page structure varies widely from site to site, however, so no single regular expression matches them all. The authors of the general Web page body extraction algorithm based on the line-block distribution function survey existing methods for extracting article text from Web pages, propose a text extraction algorithm based on the distribution of line blocks, and provide implementations in PHP and Java. The algorithm rests on two observations: (1) body area density: once all HTML tags are removed, the body area has a higher character density and fewer runs of blank lines; (2) line-block length: outside the body area, the text inside an individual tag (line block) is generally short. The algorithm steps are as follows:

1, Remove all tags, together with the contents of style and JavaScript blocks, but keep the original newline characters (\n).

2, Split the remaining page content into lines. Define the line block $block_i$ as the combined text of the lines in $[i, i + blocksize]$, and compute the length of each line block as a function of its starting line number $i$. This gives the line-block length distribution function.


3, The body lies around the longest line block. Starting from that block, extend in both directions until the line-block length drops to 0, and take the lines in that range as the body.


4, To also extract the images in the body area, keep the image tags (for example <img>) when removing tags in step 1.
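
The four steps can be sketched in Python as follows. This is a minimal illustration, not the PHP or Java code that accompanies the algorithm: the function names (`strip_tags`, `block_lengths`, `extract_body`), the default `block_size=3`, the regex-based tag stripping, and the choice to treat a line block as the `block_size` lines starting at line i are all assumptions made for the example.

```python
import re


def strip_tags(html, keep_images=False):
    """Step 1: remove comments, scripts, styles and tags, keeping every newline."""

    def blank_out(match):
        # Replace the matched markup with only its newlines so line numbers stay aligned.
        return "\n" * match.group(0).count("\n")

    html = re.sub(
        r"(?is)<!--.*?-->|<script\b.*?</script\s*>|<style\b.*?</style\s*>",
        blank_out,
        html,
    )
    if keep_images:
        # Step 4: protect <img> tags with placeholder characters before stripping.
        html = re.sub(
            r"(?is)<img\b([^>]*)>",
            lambda m: "\x00img" + m.group(1) + "\x01",
            html,
        )
    html = re.sub(r"<[^>]*>", blank_out, html)
    if keep_images:
        html = html.replace("\x00", "<").replace("\x01", ">")
    return html


def block_lengths(lines, block_size=3):
    """Step 2: length of each line block (assumed: block_size lines starting at i)."""
    # Count non-whitespace characters only, so indentation does not look like text.
    line_lens = [len("".join(line.split())) for line in lines]
    count = max(len(lines) - block_size + 1, 1)
    return [sum(line_lens[i:i + block_size]) for i in range(count)]


def extract_body(html, block_size=3, keep_images=False):
    """Step 3: grow outward from the longest block until the block length is 0."""
    lines = strip_tags(html, keep_images).split("\n")
    blocks = block_lengths(lines, block_size)
    peak = max(range(len(blocks)), key=blocks.__getitem__)
    if blocks[peak] == 0:
        return ""  # no text at all
    start = end = peak
    while start > 0 and blocks[start - 1] > 0:
        start -= 1
    while end + 1 < len(blocks) and blocks[end + 1] > 0:
        end += 1
    # Block `end` covers lines end .. end + block_size - 1.
    body = lines[start:end + block_size]
    return "\n".join(line.strip() for line in body if line.strip())
```

Calling `extract_body(html)` returns the text of the densest region, and `extract_body(html, keep_images=True)` leaves `<img>` tags in place as described in step 4. In this sketch a region boundary appears only where at least `block_size` consecutive lines are empty, and retained `<img>` markup counts toward the line length. A production crawler would likely use a real HTML parser rather than regular expressions.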

