Python Crawler Tutorial - 24 - Data Extraction - BeautifulSoup4 (ii)
This article describes how to traverse (and search) a document object with BeautifulSoup (BS).
Traversing Document objects
- .contents: a tag's child nodes, returned as a list
- .children: child nodes, returned as an iterator
- .descendants: all descendant nodes
- .string: prints a tag's content as a string, without the tag itself; only the content
- Case code, file py27bs3.py: https://xpwi.github.io/py/py%E7%88%AC%E8%99%AB/py27bs3.py
# BeautifulSoup usage example
# Traversing the document object
from urllib import request
from bs4 import BeautifulSoup

url = 'http://www.baidu.com/'
rsp = request.urlopen(url)
content = rsp.read()

soup = BeautifulSoup(content, 'lxml')
# bs decodes automatically
content = soup.prettify()

print("==" * 12)
# Use .contents
for node in soup.head.contents:
    if node.name == "meta":
        print(node)
    if node.name == "title":
        print(node.string)
print("==" * 12)
Run results
.string is commonly used to print a tag's content as plain text, without the tag itself; only the content is shown.
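The case code above only exercises .contents. Below is a minimal sketch of .children, .descendants, and .string against the same page; which tags are inspected is just an assumption for illustration.

# Minimal sketch (assumption: same baidu.com page as in the case code)
from urllib import request
from bs4 import BeautifulSoup

url = 'http://www.baidu.com/'
soup = BeautifulSoup(request.urlopen(url).read(), 'lxml')

# .children: an iterator over the direct child nodes
for node in soup.head.children:
    print(node.name)          # text nodes print None

# .descendants: recursively yields every descendant node
print(len(list(soup.head.descendants)))

# .string: only the text inside the tag, no tag markup
print(soup.title.string)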
Of course, if you find traversal too resource-intensive, or simply don't need to traverse, you can search instead.
Searching for Document objects
- find_all(name, attrs, recursive, text, **kwargs)
- find_all() returns its results as a list, e.g. find_all(name='meta'); if more than one meta tag matches, all of them are returned in the list
- name parameter: what to search for; the value passed in can be:
- 1. A string
- 2. A regular expression, compiled with re.compile:
For example, to print all tags whose names start with "me":
tags = soup.find_all(re.compile('^me'))
- 3. A list of tag names
- keyword arguments: can be used to filter by a tag attribute
- text: matches the text content of a tag (a short sketch covering the list, keyword, and text forms follows the run results below)
- Case code, file py27bs4.py: https://xpwi.github.io/py/py%E7%88%AC%E8%99%AB/py27bs4.py
# BeautifulSoup usage example
# Searching the document object
from urllib import request
from bs4 import BeautifulSoup
import re

url = 'http://www.baidu.com/'
rsp = request.urlopen(url)
content = rsp.read()

soup = BeautifulSoup(content, 'lxml')
# bs decodes automatically
content = soup.prettify()

# Use find_all
# Use the name parameter
print("==" * 12)
tags = soup.find_all(name='link')
for i in tags:
    print(i)

# Use a regular expression
print("==" * 12)
# Use two conditions at the same time
tags = soup.find_all(re.compile('^me'), content='always')
# Printing tags directly here would print a list
for i in tags:
    print(i)
Run results
Because two conditions are combined, only one meta tag is matched.
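The case code combines a regular expression with a keyword filter. As a complement, here is a minimal sketch of the other find_all forms mentioned above (a list of tag names, a plain keyword filter, and the text parameter); the specific tag names, attribute values, and the 'baidu' text pattern are assumptions chosen for this page, not part of the original case code.

# Minimal sketch of the remaining find_all forms (assumed examples)
from urllib import request
from bs4 import BeautifulSoup
import re

url = 'http://www.baidu.com/'
soup = BeautifulSoup(request.urlopen(url).read(), 'lxml')

# name as a list: match either <meta> or <link> tags
tags = soup.find_all(['meta', 'link'])
print(len(tags))

# keyword argument alone: filter by an attribute value
for tag in soup.find_all(content='always'):
    print(tag)

# text: match text content instead of tags (here, strings containing "baidu")
print(soup.find_all(text=re.compile('baidu'))[:3])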
Next up: BeautifulSoup CSS selectors.
Bye
- This note may not be reprinted by any person or organization.