Python Crawler Tutorial - 24 - Data Extraction - BeautifulSoup4 (II)

Source: Internet
Author: User


This article describes how to traverse a document object with BeautifulSoup.

Traversing document objects
    • contents: returns a tag's child nodes as a list
    • children: returns a tag's child nodes as an iterator
    • descendants: returns all descendant nodes
    • string: the text content of a tag, without the tag itself; only the content
    • Case code, file 27bs3.py: https://xpwi.github.io/py/py%E7%88%AC%E8%99%AB/py27bs3.py
# BeautifulSoup usage example
# Traversing the document object
from urllib import request
from bs4 import BeautifulSoup

url = 'http://www.baidu.com/'
rsp = request.urlopen(url)
content = rsp.read()

soup = BeautifulSoup(content, 'lxml')
# bs4 decodes automatically
content = soup.prettify()

print("==" * 12)
# Using contents
for node in soup.head.contents:
    if node.name == "meta":
        print(node)
    if node.name == "title":
        print(node.string)
print("==" * 12)
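The case code above only exercises contents. As a minimal sketch of the other three attributes from the list, the following runs on a small made-up HTML snippet (the snippet and the use of the built-in html.parser are assumptions for illustration, so no network access or lxml install is needed):

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, just for demonstration
html = ("<html><head><title>Demo</title></head>"
        "<body><p>Hello <b>world</b></p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# children: an iterator over the direct child nodes
print([node.name for node in soup.body.children])   # ['p']

# descendants: walks every nested node, including text nodes
print(len(list(soup.body.descendants)))

# string: the text content when a tag has exactly one child string
print(soup.title.string)                            # Demo
```

Because children is an iterator rather than a list, it can be consumed only once per call; wrap it in list() if you need to index into it.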
Run results


string is commonly used to print the text content of a tag, without the tag itself; only the content is printed.
Of course, if you find traversal too resource-intensive, or you do not need to traverse at all, you can use searching instead.

Searching for Document objects
    • find_all(name, attrs, recursive, text, **kwargs)
      • find_all() returns its results as a list; e.g. with find_all(name='meta'), if there is more than one meta tag, they are returned as a list
      • name parameter: which tag to search for; what can be passed in is
        • 1. a string
        • 2. a regular expression, compiled with re.compile:
          For example, to print all tags whose names start with "me":
          tags = soup.find_all(re.compile('^me'))
        • 3. a list
      • keyword arguments: can be used to match a tag attribute
      • text: the text value of the tag
    • Case code, file 27bs4.py: https://xpwi.github.io/py/py%E7%88%AC%E8%99%AB/py27bs4.py
# BeautifulSoup usage example
# Searching the document object
from urllib import request
from bs4 import BeautifulSoup
import re

url = 'http://www.baidu.com/'
rsp = request.urlopen(url)
content = rsp.read()

soup = BeautifulSoup(content, 'lxml')
# bs4 decodes automatically
content = soup.prettify()

# Using find_all
# Using the name parameter
print("==" * 12)
tags = soup.find_all(name='link')
for i in tags:
    print(i)

# Using a regular expression
print("==" * 12)
# Using two conditions at the same time
tags = soup.find_all(re.compile('^me'), content='always')
# Printing tags directly here would print a list
for i in tags:
    print(i)
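The three forms the name parameter accepts (string, compiled regular expression, list) can be sketched offline on a small hypothetical snippet (the HTML and tag names here are made up for illustration; html.parser is used so no network or lxml install is needed):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document, just for demonstration
html = ("<html><head><meta charset='utf-8'><title>Demo</title></head>"
        "<body><p>one</p><span>two</span></body></html>")
soup = BeautifulSoup(html, "html.parser")

# 1. a string: matches the exact tag name
print(len(soup.find_all("p")))                              # 1

# 2. a compiled regular expression: tag names starting with "me"
print([t.name for t in soup.find_all(re.compile("^me"))])   # ['meta']

# 3. a list: any tag name in the list matches
print([t.name for t in soup.find_all(["p", "span"])])       # ['p', 'span']
```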
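The keyword-argument and text parameters can be sketched the same way (again on a made-up snippet; note that class is a reserved word in Python, so BeautifulSoup accepts class_ for that attribute):

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, just for demonstration
html = ('<div id="main"><a class="link" href="/a">here</a>'
        '<a class="link" href="/b">there</a></div>')
soup = BeautifulSoup(html, "html.parser")

# keyword argument: match a tag attribute (class_ because class is reserved)
print(len(soup.find_all("a", class_="link")))   # 2

# id keyword narrows the search by the id attribute
print(soup.find_all(id="main")[0].name)         # div

# text matches a tag's string content and returns the matching strings
print(soup.find_all(text="here"))               # ['here']
```

As in the case code, passing several conditions at once (e.g. a name plus an attribute) keeps only tags matching all of them.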
Run results


Because two conditions are used at the same time, only one meta tag is matched.
The next article introduces BeautifulSoup CSS selectors.
Bye

- This note may not be reprinted by any person or organization.
