Python data collection: Web page parsers

Source: Internet
Author: User
1. Python Web page parser

1.1 Web page parser introduction

A Web page parser is a tool that extracts valuable data and new URL links from an HTML Web page.
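To make the two outputs concrete, here is a minimal sketch of such a parser built on the standard library's html.parser. The names LinkAndTitleParser and parse, the choice of the page title as the "valuable data", and the sample page are all assumptions made for illustration, not part of the original article.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Hypothetical helper: collects the page title and all link targets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.new_urls = set()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.new_urls.add(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(page_url, html_text):
    """Return (new_urls, data) for one already-downloaded page."""
    parser = LinkAndTitleParser(page_url)
    parser.feed(html_text)
    parser.close()
    return parser.new_urls, {"url": page_url, "title": parser.title.strip()}


# Made-up sample page, used only to exercise the sketch.
sample = """<html><head><title>Demo page</title></head>
<body><a href="/a.html">A</a> <a href="http://example.com/b.html">B</a></body></html>"""

new_urls, data = parse("http://example.com/index.html", sample)
print(sorted(new_urls))
print(data)
```

In a crawler, the returned set of URLs would typically go back to whatever component schedules downloads, while the data dictionary would be stored or exported.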

The Web page parsing process is shown in the following illustration:

1.2 Python Web page parsers

Four Web page parsers are commonly used in Python: regular expressions (the re module), the standard library's html.parser, and the third-party libraries BeautifulSoup and lxml.
These four parsers fall into two groups: fuzzy matching, represented by re regular expressions, and structured parsing, represented by BeautifulSoup, html.parser, and lxml. Structured parsing takes the DOM tree structure as its reference and extracts information based on the tag structure.
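The difference between the two groups can be sketched on a small page fragment. The fragment, the pattern, and the HrefCollector class below are assumptions chosen for illustration. The regular expression treats the page as plain text, so it also picks up a link that sits inside an HTML comment, while html.parser only reports attributes of tags it actually recognizes.

```python
import re
from html.parser import HTMLParser

# Made-up fragment: two real links and one link hidden in a comment.
html_text = """<div class="item"><a href="/news/1.html">First</a></div>
<!-- <a href="/old/0.html">Removed</a> -->
<div class="item"><a href='/news/2.html'>Second</a></div>"""

# Fuzzy matching: the page is just a string, and a pattern picks out substrings.
href_pattern = re.compile(r"""href=["']([^"']+)["']""")
regex_links = href_pattern.findall(html_text)


# Structured parsing: the parser recognizes tags and attributes, so markup
# inside the comment is not reported as a start tag.
class HrefCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


collector = HrefCollector()
collector.feed(html_text)
collector.close()
structured_links = collector.links

print(regex_links)       # includes the link inside the comment
print(structured_links)  # only links from real <a> tags
```

Note that html.parser reports tags as a stream of events rather than building a tree itself; BeautifulSoup and lxml build the tree for you.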

1.3 DOM tree

The DOM (Document Object Model) represents a document as a tree of tags, as shown in the following illustration:

Structured parsing, as mentioned above, means that the Web page parser loads the entire downloaded HTML document into a Document object and then uses the parent-child structure of its tags to traverse the tree and extract information.
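The idea of loading a page into a Document object and walking its parent-child structure can be sketched with the standard library's xml.dom.minidom. This is only an approximation: minidom is an XML parser, so the sketch assumes a small, well-formed page, and real-world HTML usually needs BeautifulSoup or lxml instead. The sample page and the helper names element_children and outline are hypothetical.

```python
from xml.dom import minidom

# Assumption: a small, well-formed (XHTML-style) page. minidom is an XML
# parser and will reject much real-world HTML.
page = """<html>
  <head><title>Demo</title></head>
  <body>
    <div id="content">
      <p>First paragraph</p>
      <p>Second paragraph</p>
    </div>
  </body>
</html>"""

# The whole page becomes one Document object.
document = minidom.parseString(page)


def element_children(node):
    """Child nodes that are elements (skips whitespace text nodes)."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


def outline(node, depth=0):
    """Walk the tree top-down and return one indented line per element."""
    lines = ["  " * depth + node.tagName]
    for child in element_children(node):
        lines.extend(outline(child, depth + 1))
    return lines


root = document.documentElement  # the <html> element
tree_lines = outline(root)
print("\n".join(tree_lines))

# Move down the structure: html -> body -> div -> p, then read the text.
body = element_children(root)[1]
div = element_children(body)[0]
paragraphs = [p.firstChild.data for p in element_children(div)]
print(paragraphs)
```

The indented outline printed first mirrors the tree-shaped tag structure described above; the second step shows extraction by following parent-child links rather than by matching text.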
