Python crawler -- 4-3. BeautifulSoup4 (BS4)

For HTML/XML data filtering, BeautifulSoup is another commonly used and simple technology. BeautifulSoup is a very elegant description language dedicated to HTML/XML data analysis; it can parse a markup document such as HTML or XML and neatly filter out the data that matches the specified rules.
During data filtering, the underlying technique is DOM manipulation built on top of the HTML DOM tree: the web page is loaded as a document object, and the target data is obtained from the document object tree model.
BeautifulSoup is simple and easy to use, so it is frequently chosen for projects whose data-filtering performance requirements are not particularly strict. The version currently popular on the market is BeautifulSoup4, usually called BS4.
I. XPath and BeautifulSoup4

XPath and BeautifulSoup are both DOM-based operating models.
The difference lies in how the document nodes are traversed and queried once the Document Object Model (DOM) is loaded: XPath traverses the relevant part of the DOM object tree according to the syntax of its description language, whereas BS4 first loads the entire document tree and then runs matching queries against it, which consumes more resources and delivers lower processing performance than XPath.
So why use BS4? Because it is simple enough!
| Description language | Processing efficiency | Learning difficulty |
| --- | --- | --- |
| Regular expressions | Very high | Hard |
| XPath | High | Moderate |
| BS4 | Lower | Easy |
BS4 itself is a functional module that encapsulates this description language. Through object-oriented operations it wraps the various nodes, tags, attributes, contents, and so on of the document object as attributes of Python objects; during a query you simply call the appropriate function to match and retrieve the data, which is very simple and very flexible.
In general, BS4 converts an HTML document object into a document tree combining the following four object types (a small sketch follows the list):
* Tag: tag object
* NavigableString: character content manipulation object
* BeautifulSoup: document object
* Comment: a special type of NavigableString
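Below is a minimal sketch (not from the original article) that shows all four types at once; the HTML snippet, tag, id, and comment text are invented purely for illustration, and the built-in "html.parser" is used so no extra parser has to be installed:
```
# coding: utf-8
from bs4 import BeautifulSoup

# Hypothetical snippet, invented only to demonstrate the four object types
html = "<p id='name'>Da Mu<!-- a comment --></p>"
soup = BeautifulSoup(html, "html.parser")

print(type(soup))                # <class 'bs4.BeautifulSoup'>           -> BeautifulSoup object
print(type(soup.p))              # <class 'bs4.element.Tag'>             -> Tag object
print(type(soup.p.contents[0]))  # <class 'bs4.element.NavigableString'> -> NavigableString object
print(type(soup.p.contents[1]))  # <class 'bs4.element.Comment'>         -> Comment object
```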


In fact, there is not much theoretical grammar to cover here. Unlike regular expressions and XPath, BS4 has no basic syntax structure of its own; what it encapsulates are objects and operations on their attributes, and that is precisely where the core value of BS4 lies.
Let's get to the practical part.


II. Operating BeautifulSoup4 from Python

Python supports BeautifulSoup through a third-party module, which is installed as follows:
```
$ pip install beautifulsoup4
```

1. Getting started, part one: understanding BeautifulSoup4
```
# coding: utf-8
# import the parsing module from bs4
from bs4 import BeautifulSoup

# Load an HTML page from a file and specify lxml as the HTML parser.
# By default, BS4 automatically picks the highest-priority parser available on the current system.
soup = BeautifulSoup(open("index.html"), "lxml")
# If the crawler already has the page as a string, just hand it to BS4 directly
# soup = BeautifulSoup(spider_content, "lxml")

# Print the BeautifulSoup document object to get the content of the document tree
print(soup)
# Print the type: <class 'bs4.BeautifulSoup'>
print(type(soup))
```
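The commented-out line above deserves a second look: BS4 does not care whether the markup comes from a local file handle or from a crawler, because any HTML string can be passed straight to the constructor. Here is a minimal sketch of that workflow, assuming the third-party requests package is available (the original examples do not use it) and that the lxml parser has been installed separately (pip install lxml); otherwise "html.parser" can be used instead.
```
# coding: utf-8
# Hypothetical example: fetch a page with requests, then hand the string to BS4.
# The URL is only a placeholder; substitute whatever page the crawler targets.
import requests
from bs4 import BeautifulSoup

spider_content = requests.get("http://example.com").text   # raw HTML as a string
soup = BeautifulSoup(spider_content, "lxml")                # parse the string directly
print(soup.title)
```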
2. Getting started, part two: operating on tags, attributes, and content
```
# coding: utf-8

from bs4 import BeautifulSoup

# build the document object
soup = BeautifulSoup(open("index.html"), "lxml")

# Tag operations
# 1. Get tags
print(soup.title)  # <title>Article title</title>
print(soup.p)      # <p>Name: <span id="name">Da Mu</span></p> -- only the first matching tag object is returned
print(soup.span)   # <span id="name">Da Mu</span>

# 2. Get the attributes of a tag
print(soup.p.attrs)     # {}: a dictionary of attributes and values
print(soup.span.attrs)  # {'id': 'name'}: a dictionary of attributes and values
print(soup.span['id'])  # name: gets the value of the specified attribute
soup.span['id'] = "real_name"
print(soup.span['id'])  # real_name: the document can be modified directly in BS4

# 3. Get the contents of a tag
print(soup.head.string)  # Article title: if a tag has only one child, returns the text inside that child
print(soup.p.string)     # None: if a tag has multiple children, returns None
print(soup.span.string)  # Da Mu: returns the contained text directly
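# Supplementary sketch (not in the original article): when .string returns None
# because the tag has several children, get_text() still collects all of the
# nested text into one string.
print(soup.p.get_text())  # Name: Da Mu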
```

3. Getting started, part three: manipulating child nodes
```
# coding: utf-8
# import the BS4 module
from bs4 import BeautifulSoup

# Load the web document and build the document object
soup = BeautifulSoup(open("index.html"), "lxml")

print(dir(soup))

print(soup.contents)         # gets all child nodes of the document object
print(soup.div.contents)     # gets a list of the child nodes of the first matched div
print(soup.div.children)     # a list iterator over the child nodes of the first matched div
# for e1 in soup.div.children:
#     print("-->", e1)
print(soup.div.descendants)  # an iterator over the first matched div's children and all of their descendant nodes, one by one
# for e2 in soup.div.descendants:
#     print("==>", e2)
```
4. Getting started, part four: object-oriented DOM matching
```
# coding: utf-8
# import the BS4 module
from bs4 import BeautifulSoup

# Load the document object
soup = BeautifulSoup(open("./index.html"), "lxml")

# DOM document tree queries
# Core functions ~ compare them with the JavaScript DOM structure to understand how these methods work
# e.g. find_all_previous() / find_all_next() / find_all() / find_previous() / find_next(), etc.
# Take find_all() as an example
# 1. Query by string
res1 = soup.find_all("p")  # query all p tags
print(res1)

# 2. Regular expressions
import re
res2 = soup.find_all(re.compile(r"d+"))  # query all tags whose names contain the letter d
print(res2)

# 3. Lists: match any name in the list
res3 = soup.find_all(["div", "h1"])  # query all div or h1 tags
print(res3)

# 4. Keyword arguments
res4 = soup.find_all(id="name")  # query the tags whose id attribute is "name"
print(res4)

# 5. Content matching
res5 = soup.find_all(text=u"male")  # matches text content directly; the match must be exact
print(res5)
res6 = soup.find_all(text=[u"Article title", u"Da Mu"])  # query all text nodes that exactly match an entry in the list
print(res6)
res7 = soup.find_all(text=re.compile(u"Da"))  # fuzzy matching through a regular expression
print(res7)
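# Supplementary sketch (not in the original article): find() accepts the same
# arguments as find_all() but returns only the first match, or None.
res8 = soup.find("span", id="name")
print(res8)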


```
5. Getting started, part five: CSS selectors
```
# coding: utf-8

# import the BS module
from bs4 import BeautifulSoup

# Load the web page and build the document object
soup = BeautifulSoup(open("index.html"), "lxml")

# 1. CSS tag selector: query tag objects by tag name
res1 = soup.select("span")
print(res1)

# 2. CSS id selector: query tag objects by id
res2 = soup.select("#gender")
print(res2)

# 3. CSS class selector: query tag objects by the class attribute
res3 = soup.select(".intro")
print(res3)

# 4. CSS attribute selectors
res41 = soup.select("span[id]")
print(res41)
res42 = soup.select("span[id='gender']")
print(res42)

# 5. CSS descendant (containment) selector
res5 = soup.select("p span#name")
print(res5)

# 6. Get the tag content
res6 = soup.select("p > span.intro")
print(res6[0].string)
print(res6[0].get_text())
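# Supplementary sketch (not in the original article): select_one() returns only
# the first element matching a CSS selector, or None when nothing matches.
print(soup.select_one("span#name"))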

```
