Python Development Lightweight Crawler 07

Source: Internet
Author: User

Python Development Lightweight Crawler (IMOOC summary 07 -- page parser: BeautifulSoup)
BeautifulSoup is downloaded and installed with pip: at the command line (cmd), run

    pip install beautifulsoup4

The use of BeautifulSoup divides into three parts. First, from the downloaded HTML page string we create a BeautifulSoup object; at creation time the entire document string is parsed into a DOM tree. Then, based on this DOM tree, we can search for all kinds of nodes, using two methods: find_all and find. The find_all method returns every node that satisfies the requirements, while find returns only the first such node; the parameters of the two methods are identical. Finally, after getting a node, we can access its name, attributes, and text, and correspondingly, when searching we can search by node name, by node attributes, or by node text. So a node's content divides into three things: name, attributes, text.

Let us illustrate. Here is a link on a web page:

    <a href='123.html' class='article_link'>Python</a>

How do we search for such a link? By node name: a. By node attribute: href='123.html', or class='article_link'. By node content: the text Python.

The code for searching and accessing in these three ways is as follows. First, create the BeautifulSoup object:

    from bs4 import BeautifulSoup

    # create a BeautifulSoup object from the downloaded HTML page string
    soup = BeautifulSoup(html_doc,               # the HTML document string
                         'html.parser',          # the HTML parser
                         from_encoding='utf-8')  # the encoding of the HTML document

If the encoding of the page is inconsistent with the encoding of the code, garbled text will appear during parsing; the encoding can be specified here. With this we have created the BS object and obtained the DOM tree.

Second, search for nodes with find_all or find:

    import re

    # method signature: find_all(name, attrs, string)
    # -- node name, node attributes, node text

    # find all nodes whose tag is a
    soup.find_all('a')

    # find all a tags whose link has the form /view/123.htm
    soup.find_all('a', href='/view/123.htm')

    # a very powerful feature of BS: the name, attributes, and text can each
    # be a regular expression, matched against the corresponding content
    soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))

    # find all div nodes whose class is abc and whose text is Python
    soup.find_all('div', class_='abc', string='Python')

To avoid a conflict with Python's class keyword, the class attribute is written with a trailing underscore: class_. Through the two methods find_all and find, we can search all the nodes of the DOM tree.

Finally, the information of a node can be accessed. For example, given the node

    <a href='1.html'>Python</a>

we can do the following:

    # get the tag name of the node we found
    node.name

    # get the href attribute of the node we found;
    # all attributes can be accessed in dictionary form
    node['href']

    # get the link text of the a node we found
    node.get_text()

By creating a BS object, searching the DOM tree, and accessing the content of nodes as above, we can parse and access every node of an entire downloaded page.
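The encoding point above is worth a quick demonstration. The following is a minimal sketch, not from the course; the GBK-encoded byte string is invented for illustration:

    from bs4 import BeautifulSoup

    # invented example: raw page bytes whose real encoding is GBK;
    # decoding them as UTF-8 would produce garbled text
    gbk_bytes = '<html><body><p>Python 爬虫</p></body></html>'.encode('gbk')

    # from_encoding tells the parser how to decode the raw bytes
    soup = BeautifulSoup(gbk_bytes, 'html.parser', from_encoding='gbk')
    print(soup.p.get_text())   # Python 爬虫

Now let's write code to test the various methods of this BS module: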
    
    from bs4 import BeautifulSoup
    import re

    # The original post left the test document empty. The standard example
    # document from the BeautifulSoup documentation is assumed here, since
    # the queries below (Lacie, "ill", p class 'title') are written against it.
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    </body></html>
    """

    soup = BeautifulSoup(html_doc, 'html.parser')

    print('get all the links')
    links = soup.find_all('a')
    for link in links:
        print(link.name, link['href'], link.get_text())

    print('get the link to Lacie')
    link_node = soup.find('a', href='http://example.com/lacie')
    print(link_node.name, link_node['href'], link_node.get_text())

    print('regular expression match')
    link_node = soup.find('a', href=re.compile(r'ill'))
    print(link_node.name, link_node['href'], link_node.get_text())

    print('get the text of the p paragraph')
    p_node = soup.find('p', class_='title')
    print(p_node.name, p_node.get_text())
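Putting the pieces together: in the crawler this series builds, calls like these live in the page-parser module. The sketch below is a minimal illustration rather than the course's actual code; the function name parse_links and the reuse of the /view/\d+\.htm pattern from the examples above are assumptions made here:

    import re
    from urllib.parse import urljoin

    from bs4 import BeautifulSoup

    def parse_links(page_url, html_doc):
        """Collect absolute URLs on the page that look like /view/<digits>.htm."""
        soup = BeautifulSoup(html_doc, 'html.parser')
        # match hrefs with a regular expression, as in the examples above
        links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
        # urljoin resolves each (possibly relative) href against the page's own URL
        return {urljoin(page_url, link['href']) for link in links}

The urljoin step matters in a crawler: pages usually carry relative links such as /view/123.htm, which have to be resolved against the URL of the page they appear on before they can be fetched.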
