Crawlers of a Salted Fish (2): the BeautifulSoup library


These notes record how to use the BeautifulSoup library.

BeautifulSoup and the previously mentioned requests library are both very practical Python third-party libraries. By combining the two, a beginner can already crawl small-scale data.

I will write up a small worked example in the next article. Let's talk about the BeautifulSoup library itself first.

Installation guides are easy to find online (pip install beautifulsoup4 works), and the Anaconda distribution already bundles the library in its integrated environment.

Here is the official documentation address: http://beautifulsoup.readthedocs.io/zh_CN/latest/

Import

from bs4 import BeautifulSoup
Html_doc = """
<Html>
<Body>
<P class = "title"> <B> The Dormouse's story </B> </p>

<P class = "story"> Once upon a time there were three little sisters; and their names were
<A href = "http://example.com/elsie" class = "sister" id = "link1"> Elsie </a>,
<A href = "http://example.com/lacie" class = "sister" id = "link2"> Lacie </a> and
<A href = "http://example.com/tillie" class = "sister" id = "link3"> Tillie </a>;
And they lived at the bottom of a well. </p>

<P class = "story">... </p>
"""
Here is the classic HTML demo page from the official documentation; we will use it for the demonstration (yes, I am lazy).
 
html_doc = """

<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
 

Output result:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Next, a word about parsers. The second argument of the constructor selects which parser to use:

soup = BeautifulSoup(html_doc, 'html.parser')

Generally, the built-in html.parser is good enough on Python 3.2.2 and later.
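Besides the built-in html.parser, BeautifulSoup can also work with third-party parsers such as lxml and html5lib if they are installed. A minimal sketch of the choices (the extra parsers are optional dependencies installed separately; they are not part of bs4 itself):

from bs4 import BeautifulSoup

html = "<html><body><p>data</p></body></html>"

# Built-in parser: no extra installation needed
soup = BeautifulSoup(html, "html.parser")

# Optional third-party parsers (each installed separately with pip):
# soup = BeautifulSoup(html, "lxml")      # fast and lenient, needs the lxml package
# soup = BeautifulSoup(html, "html5lib")  # parses the way a browser does, needs html5lib

print(soup.p.string)  # data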

Basic classes of Beautiful Soup

Tag: the most basic unit of information organization; a tag begins with <> and ends with </>.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Any tag that exists in the HTML document can be obtained with soup.<tag>.

When the same <tag> occurs more than once in the HTML document, soup.<tag> returns only the first one.
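For example, continuing with the soup built from html_doc above (a small sketch):

# soup.<tag> gives the first matching tag in the document
print(soup.title)    # <title>The Dormouse's story</title>
print(soup.p)        # <p class="title"><b>The Dormouse's story</b></p>
print(soup.a)        # only the first <a> tag (the Elsie link), even though there are three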

Tag attributes

Name: the name of the tag; for <p>...</p> the name is 'p'. Accessed as <tag>.name.

Every <tag> has its own name, obtained through <tag>.name; it is a string.

Attrs: the attributes of a tag, in dictionary form. Accessed as <tag>.attrs.

A <tag> can have zero or more attributes; .attrs is a dictionary.

NavigableString: the non-attribute string inside a tag, i.e. the text between <>...</>. Accessed as <tag>.string.

A .string can reach through multiple levels of nested tags.

Comment: comments inside a tag are a special type of NavigableString object.
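A small sketch tying these together, still using the soup built from html_doc above (the comment example uses a separate throwaway document):

from bs4 import BeautifulSoup

tag = soup.a                  # the first <a> tag
print(tag.name)               # a
print(tag.attrs)              # {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
print(tag['id'])              # link1
print(tag.string)             # Elsie

# Comments are a special kind of NavigableString
comment_soup = BeautifulSoup("<b><!-- this is a comment --></b>", "html.parser")
comment = comment_soup.b.string
print(type(comment))          # <class 'bs4.element.Comment'>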

 

Traversal of HTML content based on bs4

 

Html_doc = """
<Html>
<Body>
<P class = "title"> <B> The Dormouse's story </B> </p>

<P class = "story"> Once upon a time there were three little sisters; and their names were
<A href = "http://example.com/elsie" class = "sister" id = "link1"> Elsie </a>,
<A href = "http://example.com/lacie" class = "sister" id = "link2"> Lacie </a> and
<A href = "http://example.com/tillie" class = "sister" id = "link3"> Tillie </a>;
And they lived at the bottom of a well. </p>

<P class = "story">... </p>
PS: the same demo document is still used here.

 

 

 

Downward traversal:

.contents - child node list; stores all child nodes of a <tag> in a list

.children - iterator over the child nodes, similar to .contents, used for looping over the children

.descendants - iterator over descendant nodes, covering all nodes below the tag, used for looping (a small sketch follows below)
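A minimal sketch of the three, still using the soup from the demo document:

body = soup.body

# .contents: the direct children as a list (tags plus whitespace strings)
print(body.contents)

# .children: the same direct children, but as an iterator
for child in body.children:
    print(child.name)          # 'p' for the paragraphs, None for bare strings

# .descendants: every node below the tag, at any depth
for node in body.descendants:
    if node.name is not None:  # skip the text nodes
        print(node.name)       # p, b, p, a, a, a, p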

 

Sibling (parallel) traversal:

.next_sibling - returns the next sibling node in HTML text order

.previous_sibling - returns the previous sibling node in HTML text order

.next_siblings - iterator, yields all following sibling nodes in HTML text order

for sibling in soup.a.next_siblings:
    print(sibling)

.previous_siblings - iterator, yields all preceding sibling nodes in HTML text order

The output gets rather long, so try it yourself; I am too lazy to paste a screenshot of the last one ( ̄ _,  ̄)

for sibling in soup.a.previous_siblings:
    print(sibling)
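One thing to keep in mind (a small sketch with the soup above): .next_sibling is often not the next tag but the whitespace or punctuation text sitting between tags, because those text pieces count as siblings too.

first_a = soup.a

print(repr(first_a.next_sibling))          # ',\n' - the text between the first two links
print(first_a.next_sibling.next_sibling)   # the second <a> tag (the Lacie link)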

Upward traversal:

.parent - the parent tag of a node

.parents - iterator over a node's ancestor tags, used for looping upward through the parents
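A small sketch of the two, again with the soup from above:

title = soup.title

print(title.parent.name)      # head

# .parents walks all the way up to the BeautifulSoup object itself
for parent in title.parents:
    print(parent.name)        # head, html, [document]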

 

The key to using the BeautifulSoup library is probably find and its various keyword parameters. The official documentation covers them in great detail, so I will not list the usage of each keyword parameter here (mainly because my roommates are about to go to bed and would kill me if I kept going).

Here is a small find example:
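A sketch of my own on the demo document above: find() returns the first match and find_all() returns a list of all matches. Expected results are shown in the comments.

print(soup.find('a'))                      # the first <a> tag (the Elsie link)

for link in soup.find_all('a'):            # all three links
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

# keyword arguments filter on attributes
print(soup.find_all(id='link2'))           # [the Lacie <a> tag]
print(soup.find_all('p', class_='story'))  # the two "story" paragraphs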


 
