Python Development Lightweight Crawler 07

Source: Internet
Author: User

Python Development Lightweight Crawler (IMOOC summary 07 -- page parser: BeautifulSoup)
BeautifulSoup is downloaded and installed with pip: at the command line (cmd), run

    pip install beautifulsoup4

The use of BeautifulSoup divides into three parts. First, from the downloaded HTML page string we create a BeautifulSoup object; at creation time the entire document string is parsed into a DOM tree. Then, based on this DOM tree, we can search for all kinds of nodes, using two methods: find_all and find. The find_all method returns every node that satisfies the requirements, while find returns only the first such node; the parameters of the two methods are identical. Finally, after getting a node, we can access its name, attributes, and text, and correspondingly, when searching we can search by node name, by node attributes, or by node text. So a node's content divides into three things: name, attributes, text.

Let us illustrate. Here is a link on a web page:

    <a href='123.html' class='article_link'>Python</a>

How do we search for such a link? By node name: a. By node attribute: href='123.html', or class='article_link'. By node content: the text Python.

The code for searching and accessing in these three ways is as follows. First, create the BeautifulSoup object:

    from bs4 import BeautifulSoup

    # create a BeautifulSoup object from the downloaded HTML page string
    soup = BeautifulSoup(html_doc,               # the HTML document string
                         'html.parser',          # the HTML parser
                         from_encoding='utf-8')  # the encoding of the HTML document

If the encoding of the page is inconsistent with the encoding of the code, garbled text will appear during parsing; the encoding can be specified here. With this we have created the BS object and obtained the DOM tree.

Second, search for nodes with find_all or find:

    import re

    # method signature: find_all(name, attrs, string)
    # -- node name, node attributes, node text

    # find all nodes whose tag is a
    soup.find_all('a')

    # find all a tags whose link has the form /view/123.htm
    soup.find_all('a', href='/view/123.htm')

    # a very powerful feature of BS: the name, attributes, and text can each
    # be a regular expression, matched against the corresponding content
    soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))

    # find all div nodes whose class is abc and whose text is Python
    soup.find_all('div', class_='abc', string='Python')

To avoid a conflict with Python's class keyword, the class attribute is written with a trailing underscore: class_. Through the two methods find_all and find, we can search all the nodes of the DOM tree.

Finally, the information of a node can be accessed. For example, given the node

    <a href='1.html'>Python</a>

we can do the following:

    # get the tag name of the node we found
    node.name

    # get the href attribute of the node we found;
    # all attributes can be accessed in dictionary form
    node['href']

    # get the link text of the a node we found
    node.get_text()

By creating a BS object, searching the DOM tree, and accessing the content of nodes as above, we can parse and access every node of an entire downloaded page.
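The encoding point above is worth a quick demonstration. The following is a minimal sketch, not from the course; the GBK-encoded byte string is invented for illustration:

    from bs4 import BeautifulSoup

    # invented example: raw page bytes whose real encoding is GBK;
    # decoding them as UTF-8 would produce garbled text
    gbk_bytes = '<html><body><p>Python 爬虫</p></body></html>'.encode('gbk')

    # from_encoding tells the parser how to decode the raw bytes
    soup = BeautifulSoup(gbk_bytes, 'html.parser', from_encoding='gbk')
    print(soup.p.get_text())   # Python 爬虫

Now let's write code to test the various methods of this BS module: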
    
    from bs4 import BeautifulSoup
    import re

    # The original post left the test document empty. The standard example
    # document from the BeautifulSoup documentation is assumed here, since
    # the queries below (Lacie, "ill", p class 'title') are written against it.
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    </body></html>
    """

    soup = BeautifulSoup(html_doc, 'html.parser')

    print('get all the links')
    links = soup.find_all('a')
    for link in links:
        print(link.name, link['href'], link.get_text())

    print('get the link to Lacie')
    link_node = soup.find('a', href='http://example.com/lacie')
    print(link_node.name, link_node['href'], link_node.get_text())

    print('regular expression match')
    link_node = soup.find('a', href=re.compile(r'ill'))
    print(link_node.name, link_node['href'], link_node.get_text())

    print('get the text of the p paragraph')
    p_node = soup.find('p', class_='title')
    print(p_node.name, p_node.get_text())
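Putting the pieces together: in the crawler this series builds, calls like these live in the page-parser module. The sketch below is a minimal illustration rather than the course's actual code; the function name parse_links and the reuse of the /view/\d+\.htm pattern from the examples above are assumptions made here:

    import re
    from urllib.parse import urljoin

    from bs4 import BeautifulSoup

    def parse_links(page_url, html_doc):
        """Collect absolute URLs on the page that look like /view/<digits>.htm."""
        soup = BeautifulSoup(html_doc, 'html.parser')
        # match hrefs with a regular expression, as in the examples above
        links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
        # urljoin resolves each (possibly relative) href against the page's own URL
        return {urljoin(page_url, link['href']) for link in links}

The urljoin step matters in a crawler: pages usually carry relative links such as /view/123.htm, which have to be resolved against the URL of the page they appear on before they can be fetched.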
