1. Python Web page parser
1.1 Introduction to Web page parsers
A Web page parser is a tool that extracts valuable data or new URL links from an HTML Web page.
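As a rough illustration of both outputs, the sketch below uses the standard-library html.parser module to pull a page title (standing in for the "valuable data") and the link targets (the "new URLs") out of an HTML string. The class name LinkAndTitleParser, the helper parse_page, the sample HTML, and the example.com URLs are all made up for this example; a real crawler would decide for itself which data counts as valuable.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Hypothetical parser: collects the <title> text and every <a href> target."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(base_url, html_text):
    """Return (title, list of absolute link URLs) for one HTML document."""
    parser = LinkAndTitleParser(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), parser.links


# Sample input, assumed for illustration only.
SAMPLE_HTML = """
<html><head><title>Example page</title></head>
<body><a href="/docs/intro.html">Intro</a> <a href="http://example.org/x">X</a></body></html>
"""

title, links = parse_page("http://example.com/index.html", SAMPLE_HTML)
print(title)
print(links)
```

In a crawler, the returned links would typically be handed back to the URL manager, while the extracted data would be stored or processed further.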
The Web page parsing process is shown in the following illustration:
1.2 Python Web page parsers
There are four common Python Web page parsers: regular expressions (the re module), Python's built-in html.parser, and the third-party libraries BeautifulSoup and lxml.
These four parsers fall into two groups: fuzzy matching, represented by regular expressions (re), and structured parsing, represented by html.parser, BeautifulSoup, and lxml. Structured parsing treats the page as a DOM tree and extracts information by following the tag structure.
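The difference between the two groups can be seen in a small comparison. The sketch below extracts href values from the same HTML string twice: once with a regular expression that matches raw text, and once with html.parser, which only reports attributes of real tags. The sample HTML, the pattern, and the class name HrefCollector are assumptions chosen for the example, not a recommended way to write either approach.

```python
import re
from html.parser import HTMLParser

# Sample input, assumed for illustration: one link is commented out.
SAMPLE_HTML = '<div><!-- <a href="old.html">old</a> --><a href="a.html">A</a></div>'

# Fuzzy matching: the pattern sees only characters, not tags or comments.
HREF_PATTERN = re.compile(r'href=["\'](.*?)["\']')
regex_links = HREF_PATTERN.findall(SAMPLE_HTML)


class HrefCollector(HTMLParser):
    """Hypothetical collector: records href values of <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Structured parsing: the comment is recognized as a comment and skipped.
collector = HrefCollector()
collector.feed(SAMPLE_HTML)
collector.close()
structured_links = collector.links

print(regex_links)
print(structured_links)
```

Here the regular expression also picks up the link inside the HTML comment, because it matches text without any notion of document structure, while the structured parser reports only the link that is actually part of the page.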
1.3 DOM tree
The DOM (Document Object Model) tree represents a page as a tree of nested tags, as shown in the following illustration:
Structured parsing, as mentioned above, means that the Web page parser loads the entire HTML document into a Document object and then traverses the parent-child relationships between its tags to extract information.
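To make the idea of a Document object concrete, the sketch below loads a small page with the standard-library xml.dom.minidom module and walks its tag tree. Note the assumption: minidom is an XML parser, so it only accepts well-formed markup like the made-up sample here; real-world HTML usually calls for a tolerant parser such as BeautifulSoup or lxml. The helper name walk is hypothetical.

```python
from xml.dom import minidom

# Well-formed sample markup, assumed for illustration only.
SAMPLE = (
    "<html><head><title>Demo</title></head>"
    "<body><div><p>first</p><p>second</p></div></body></html>"
)

# The whole document is loaded into one Document object.
document = minidom.parseString(SAMPLE)


def walk(node, depth=0, out=None):
    """Collect element tag names, indented by their depth in the DOM tree."""
    if out is None:
        out = []
    if node.nodeType == node.ELEMENT_NODE:
        out.append("  " * depth + node.tagName)
        for child in node.childNodes:
            walk(child, depth + 1, out)
    return out


# Traverse downward from the root element.
tree_lines = walk(document.documentElement)
print("\n".join(tree_lines))

# Extract information by tag name, then move upward to the parent tag.
paragraphs = [p.firstChild.data for p in document.getElementsByTagName("p")]
first_p = document.getElementsByTagName("p")[0]
parent_tag = first_p.parentNode.tagName
print(paragraphs, parent_tag)
```

The same upward and downward navigation is what structured parsers expose, each through its own API.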