Introduction to Dynamic Website Basics Tutorial Recommended

Source: Internet
Author: User
When crawling the content of a single website, regular-expression matching is usually sufficient, but page structure varies so much from site to site that no single regular expression can match them all. The author of the general web page body extraction algorithm based on the block distribution function summarizes methods for extracting the article body from a web page, proposes a body text extraction algorithm based on the row block distribution, and provides PHP and Java implementations.

The algorithm rests on two observations:

1. Body area density: once all HTML tags are removed, the body text area has a higher character density and fewer runs of consecutive blank lines.
2. Row block length: content outside the body area generally sits in individual tags (row blocks) and is short.

The algorithm steps are as follows:

1. Remove all tags, including style and JS script content, but keep the original line breaks (\n).
2. Split the page content into lines, define the row block $block_i$ as the text of lines $[i, i + blocksize]$, and compute the distribution function of row block length over line number.
3. The body lies in the longest row block; extend the range in both directions until the row block length drops to 0.
4. To also extract the images in the body area, it is only necessary, when removing tags in the first step, to preserve <im
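The first three steps above can be sketched in Python as follows. This is a minimal illustration rather than the author's PHP or Java implementation: the function name `extract_body`, the default `blocksize=3`, the regular expressions used for tag removal, and the choice to measure line length without whitespace are all assumptions made for the example. The optional image-preserving step is not covered.

```python
import re


def extract_body(html, blocksize=3):
    """Hypothetical helper: return the densest text region of an HTML page."""
    # Step 1: drop script/style content, comments, and all remaining tags.
    # Line breaks outside the removed spans are kept.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    text = re.sub(r"(?s)<!--.*?-->", "", text)
    text = re.sub(r"(?s)<[^>]*>", "", text)

    # Step 2: split into lines; a line's length ignores whitespace (assumption).
    lines = [line.strip() for line in text.split("\n")]
    lengths = [len(re.sub(r"\s+", "", line)) for line in lines]

    # Row block i covers blocksize consecutive lines starting at line i
    # (an assumed reading of the interval [i, i + blocksize]).
    block_len = [sum(lengths[i:i + blocksize]) for i in range(len(lines))]
    if not block_len or max(block_len) == 0:
        return ""

    # Step 3: start at the longest row block and extend in both directions
    # until the row block length drops to 0.
    peak = block_len.index(max(block_len))
    start = peak
    while start > 0 and block_len[start - 1] > 0:
        start -= 1
    end = peak
    while end + 1 < len(block_len) and block_len[end + 1] > 0:
        end += 1

    return "\n".join(line for line in lines[start:end + 1] if line)
```

A larger `blocksize` tolerates more blank lines inside the body but also makes it easier for nearby navigation or footer text to be pulled in, so the value usually needs tuning per corpus.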

1. "Python Tutorial" Web page body text and in-content image extraction algorithm

Introduction: When crawling the content of a single website, regular-expression matching is usually sufficient, but page structure varies so much from site to site that no single regular expression can match them all. The author of the general web page body extraction algorithm based on the block distribution function summarizes methods for extracting the article body from a web page, proposes a body text extraction algorithm based on the row block distribution, and provides PHP and Java implementations. The main principle of this algorithm is based on two points:

2. PHP example: extracting web page body content _php tutorial

Introduction: A PHP example of extracting the body content of a web page. The difficulty lies in identifying and keeping the article portion of the page while deleting other useless information, and in doing so in a general way, not like a train

3. Where is the body text of a web page generally stored? _html/css_web-itnose

Summary: Where the body text of a web page is generally stored

4. PHP example: extracting the body text of a web page

Introduction: A PHP example of extracting the body content of a web page. The difficulty lies in identifying and keeping the article portion of the page while deleting other useless information, and in doing so in a general way, not like a train

5. In-depth analysis: using Python to crawl the body text of a web page

Introduction: When you open a web page, the article text is usually surrounded by navigation, advertising, and other information. The purpose of this article is to show how to extract the body content of an article from a web page and filter out the other, irrelevant information.

6. A collection of JavaScript methods for changing font size [Original] _javascript Tips

Introduction: Provides a function for switching the web page body text among large, medium, and small font sizes. JS code sets the fontSize property of the div's style.

7. Getting the height and width of the DOM in JS (visible area, body, etc.) _javascript tips

Description: Covers the width and height of the visible area of the web page, the width and height of the full body, and the left and right offsets of the page body. See the article for details.

Related Q&A recommendations:

Objective-C: open-source library for extracting web page body text on iOS

JavaScript: how the Evernote Chrome plugin is implemented
