The regular expressions covered in the last article are, frankly, inconvenient for many people to use: there are a lot of rules to remember, so it is hard to get fluent with them. In this section we introduce BeautifulSoup, a very powerful tool and a staple weapon for crawlers.

BeautifulSoup: "Beautiful Soup, so rich and green"

It is a flexible and convenient page-parsing library that processes documents efficiently and supports a variety of parsers. With it, you can conveniently scrape information from a web page without writing any regular expressions.
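If the libraries are not installed yet, a typical setup (assuming pip is available) is:

pip install beautifulsoup4 lxml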
Quick start
Here is a simple example to get a feel for bs4 and its strengths:
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.p["class"])
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id='link3'))
Parsing this code with BeautifulSoup gives us a BeautifulSoup object; prettify() outputs the document in a standard indented format, and the remaining print calls show the corresponding elements.
At the same time, we can extract all the links and the full text content separately with the following code:
for link in soup.find_all('a'):
    print(link.get('href'))

print(soup.get_text())
Parser
Beautiful Soup supports the HTML parser in the Python standard library and also supports a number of third-party parsers. If no third-party parser is installed, Python's default html.parser is used.
The most common parsers are Python's built-in html.parser, lxml's HTML parser ('lxml'), lxml's XML parser ('xml'), and html5lib.
It is recommended to use lxml as the parser because it is more efficient. For versions before Python 2.7.3, or Python 3 versions before 3.2.2, lxml or html5lib must be installed, because the HTML parser built into the standard library of those Python versions is not stable enough.
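The parser is chosen by the second argument to the BeautifulSoup constructor; a minimal sketch:

from bs4 import BeautifulSoup

# the built-in parser, always available
soup = BeautifulSoup('<p>Hello</p>', 'html.parser')
# the lxml parser, faster, but requires the lxml package to be installed
soup = BeautifulSoup('<p>Hello</p>', 'lxml')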
Basic use
Tag Selector
Building on the quick-start example, add the following code:
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
With soup.<tag name> we can get that tag and its contents.

One thing to note: when selecting a tag this way, if the document contains multiple tags with the same name, only the first match is returned. For example, soup.p returns only the first p tag, even though the document contains several.
Get Name
Through soup.title.name we can get the name of the title tag, which is simply title.
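For example, continuing with the quick-start document:

print(soup.title.name)   # prints: title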
Get attributes
print(soup.p.attrs['name'])
print(soup.p['name'])
Either of the two lines above returns the value of the p tag's name attribute.
Get content
print(soup.p.string)
The result is the content of the first p tag:

The Dormouse's story
Nested selection
We can also select elements by nesting attribute access directly:
print(soup.head.title.string)
Child nodes and descendant nodes
Using contents
This is illustrated by the following example:
HTML ="""""" fromBs4ImportBeautifulsoupsoup= BeautifulSoup (HTML,'lxml')Print(soup.p.contents)
The result is a list holding all the direct children of the first p tag; both text fragments and child tags are stored in this list.
Using children
children returns the same child nodes of the p tag as contents does; the difference is that soup.p.children is an iterator rather than a list, so its values can only be read by looping over it.
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
Both contents and children return only direct children; to get all the descendant nodes, use descendants.
print(soup.p.descendants)

The result of this is also a generator.
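To actually see the descendants, iterate over the generator, for example:

for i, child in enumerate(soup.p.descendants):
    print(i, child)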
Parent and ancestor nodes
soup.a.parent returns the parent node of the first a tag.

All ancestor nodes can be obtained through list(enumerate(soup.a.parents)). The result is a list in which the a tag's parent comes first, then each enclosing node in turn, and finally the whole document; the last element of the list, and the second-to-last, hold the information for the entire document.
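A minimal sketch against the quick-start document:

# the direct parent of the first a tag
print(soup.a.parent)
# every ancestor, from the immediate parent up to the whole document
print(list(enumerate(soup.a.parents)))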
Sibling nodes
soup.a.next_siblings gets all following sibling nodes
soup.a.previous_siblings gets all preceding sibling nodes
soup.a.next_sibling gets the next sibling tag
soup.a.previous_sibling gets the previous sibling tag
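For example:

# generators over all following and all preceding siblings
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))
# the single next and previous sibling
print(soup.a.next_sibling)
print(soup.a.previous_sibling)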
Standard selectors: find_all
find_all(name, attrs, recursive, text, **kwargs)
It can search the document by tag name, attributes, or text content.
Using name
Html=" "<div class= "Panel" > <div class= "panel-heading" > " " fromBs4ImportBeautifulsoupsoup= BeautifulSoup (HTML,'lxml')Print(Soup.find_all ('ul'))Print(Type (Soup.find_all ('ul') [0]))
find_all returns its results as a list. We can also call find_all again on each result to get all the li tags inside every ul:
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
Attrs
Examples are as follows:
Html=" "<div class= "Panel" > <div class= "panel-heading" > " " fromBs4ImportBeautifulsoupsoup= BeautifulSoup (HTML,'lxml')Print(Soup.find_all (attrs={'ID':'list-1'}))Print(Soup.find_all (attrs={'name':'Elements'}))
attrs takes a dictionary of attribute names and values to search by. class needs special treatment, because class is a reserved word in Python: use the keyword argument class_='element', or the dictionary form soup.find_all(attrs={'class': 'element'}). Special attributes such as id and class do not need attrs at all and can be passed directly as keyword arguments, as shown below.
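A short sketch against the same panel/list document:

# class is a reserved word in Python, so Beautiful Soup uses class_
print(soup.find_all(class_='element'))
# common attributes such as id can also be passed directly as keyword arguments
print(soup.find_all(id='list-1'))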
Text
Examples are as follows:
Html=" "<div class= "Panel" > <div class= "panel-heading" > " " fromBs4ImportBeautifulsoupsoup= BeautifulSoup (HTML,'lxml')Print(Soup.find_all (text='Foo'))
The result is a list of all the text fragments that match text='Foo'.
Find
find(name, attrs, recursive, text, **kwargs)
find returns only the first element of the matching results.
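For example, against the panel/list document above:

# find returns the first match only
print(soup.find('ul'))
# and None when nothing matches
print(soup.find('page'))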
Some other similar usage:
find_parents() returns all ancestor nodes; find_parent() returns the immediate parent node.
find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling.
find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling.
find_all_next() returns all matching nodes after a node; find_next() returns the first matching node.
find_all_previous() returns all matching nodes before a node; find_previous() returns the first matching node.
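As a minimal sketch of two of these methods (assuming soup holds the quick-start "Dormouse" document from the beginning of the article):

# the a tag that follows the first a tag
print(soup.a.find_next_sibling('a'))
# the p tag that contains the first a tag
print(soup.a.find_parent('p'))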
CSS Selector
select() lets you select elements by passing a CSS selector directly.

If you are familiar with front-end development, you already know CSS selectors; the usage here is exactly the same.
. selects by class, # selects by id
tag1, tag2 selects all tag1 and all tag2 elements
tag1 tag2 selects all tag2 elements inside tag1
[attr] selects all tags that have the attribute attr
[attr=value], for example [target=_blank], selects all tags with target=_blank
Html=" "<div class= "Panel" > <div class= "panel-heading" > " " fromBs4ImportBeautifulsoupsoup= BeautifulSoup (HTML,'lxml')Print(Soup.select ('. Panel. Panel-heading'))Print(Soup.select ('ul Li'))Print(Soup.select ('#list-2. Element'))Print(Type (Soup.select ('ul') [0]))
Get content
The text content of an element can be obtained with get_text():
Html=" "<div class= "Panel" > <div class= "panel-heading" > " " fromBs4ImportBeautifulsoupsoup= BeautifulSoup (HTML,'lxml') forLiinchSoup.select ('Li'): Print(Li.get_text ())
Get attributes
An attribute can be read either by indexing with the attribute name, [attribute name], or through attrs[attribute name]:
Html=" "<div class= "Panel" > <div class= "panel-heading" > " " fromBs4ImportBeautifulsoupsoup= BeautifulSoup (HTML,'lxml') forUlinchSoup.select ('ul'): Print(ul['ID']) Print(ul.attrs['ID'])
Summary
It is recommended to use the lxml parsing library, falling back to html.parser when necessary.
Tag selection (soup.tag) offers only weak filtering but is fast.
Use find() and find_all() to match a single result or multiple results.
Use select() if you are familiar with CSS selectors.
Remember the common ways of getting attribute values and text content.
Python Crawler from Getting Started to Giving Up (Part 6): Using the BeautifulSoup Library