Crawlers of a Salted Fish (2): the BeautifulSoup library


These notes record how to use the BeautifulSoup library.

BeautifulSoup and the previously mentioned requests library are both very practical Python third-party libraries. By combining the two, a beginner can already crawl small-scale data.

I will write up a small worked example in the next article. Let's talk about the BeautifulSoup library itself first.

Installation guides are easy to find online (pip install beautifulsoup4 works), and the Anaconda distribution already bundles the library in its integrated environment.

Here is the official documentation address: http://beautifulsoup.readthedocs.io/zh_CN/latest/

Import

from bs4 import BeautifulSoup
Html_doc = """
<Html>
<Body>
<P class = "title"> <B> The Dormouse's story </B> </p>

<P class = "story"> Once upon a time there were three little sisters; and their names were
<A href = "http://example.com/elsie" class = "sister" id = "link1"> Elsie </a>,
<A href = "http://example.com/lacie" class = "sister" id = "link2"> Lacie </a> and
<A href = "http://example.com/tillie" class = "sister" id = "link3"> Tillie </a>;
And they lived at the bottom of a well. </p>

<P class = "story">... </p>
"""
Here is the classic HTML demo page from the official documentation; we will use it for the demonstration (yes, I am lazy).
 
html_doc = """

<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
 

Output result:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Next, a word about parsers. The second argument of the constructor selects which parser to use:

soup = BeautifulSoup(html_doc, 'html.parser')

Generally, the built-in html.parser is good enough on Python 3.2.2 and later.
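Besides the built-in html.parser, BeautifulSoup can also work with third-party parsers such as lxml and html5lib if they are installed. A minimal sketch of the choices (the extra parsers are optional dependencies installed separately; they are not part of bs4 itself):

from bs4 import BeautifulSoup

html = "<html><body><p>data</p></body></html>"

# Built-in parser: no extra installation needed
soup = BeautifulSoup(html, "html.parser")

# Optional third-party parsers (each installed separately with pip):
# soup = BeautifulSoup(html, "lxml")      # fast and lenient, needs the lxml package
# soup = BeautifulSoup(html, "html5lib")  # parses the way a browser does, needs html5lib

print(soup.p.string)  # data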

Basic classes of Beautiful Soup

Tag: the most basic unit of information organization; a tag begins with <> and ends with </>.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Any tag that exists in the HTML document can be obtained with soup.<tag>.

When the same <tag> occurs more than once in the HTML document, soup.<tag> returns only the first one.
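For example, continuing with the soup built from html_doc above (a small sketch):

# soup.<tag> gives the first matching tag in the document
print(soup.title)    # <title>The Dormouse's story</title>
print(soup.p)        # <p class="title"><b>The Dormouse's story</b></p>
print(soup.a)        # only the first <a> tag (the Elsie link), even though there are three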

Tag attributes

Name: the name of the tag; for <p>...</p> the name is 'p'. Accessed as <tag>.name.

Every <tag> has its own name, obtained through <tag>.name; it is a string.

Attrs: the attributes of a tag, in dictionary form. Accessed as <tag>.attrs.

A <tag> can have zero or more attributes; .attrs is a dictionary.

NavigableString: the non-attribute string inside a tag, i.e. the text between <>...</>. Accessed as <tag>.string.

A .string can reach through multiple levels of nested tags.

Comment: comments inside a tag are a special type of NavigableString object.
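A small sketch tying these together, still using the soup built from html_doc above (the comment example uses a separate throwaway document):

from bs4 import BeautifulSoup

tag = soup.a                  # the first <a> tag
print(tag.name)               # a
print(tag.attrs)              # {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
print(tag['id'])              # link1
print(tag.string)             # Elsie

# Comments are a special kind of NavigableString
comment_soup = BeautifulSoup("<b><!-- this is a comment --></b>", "html.parser")
comment = comment_soup.b.string
print(type(comment))          # <class 'bs4.element.Comment'>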

 

Traversal of HTML content based on bs4

 

Html_doc = """
<Html>
<Body>
<P class = "title"> <B> The Dormouse's story </B> </p>

<P class = "story"> Once upon a time there were three little sisters; and their names were
<A href = "http://example.com/elsie" class = "sister" id = "link1"> Elsie </a>,
<A href = "http://example.com/lacie" class = "sister" id = "link2"> Lacie </a> and
<A href = "http://example.com/tillie" class = "sister" id = "link3"> Tillie </a>;
And they lived at the bottom of a well. </p>

<P class = "story">... </p>
PS: the same demo document is still used here.

 

 

 

Downward traversal:

.contents - child node list; stores all child nodes of a <tag> in a list

.children - iterator over the child nodes, similar to .contents, used for looping over the children

.descendants - iterator over descendant nodes, covering all nodes below the tag, used for looping (a small sketch follows below)
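A minimal sketch of the three, still using the soup from the demo document:

body = soup.body

# .contents: the direct children as a list (tags plus whitespace strings)
print(body.contents)

# .children: the same direct children, but as an iterator
for child in body.children:
    print(child.name)          # 'p' for the paragraphs, None for bare strings

# .descendants: every node below the tag, at any depth
for node in body.descendants:
    if node.name is not None:  # skip the text nodes
        print(node.name)       # p, b, p, a, a, a, p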

 

Sibling (parallel) traversal:

.next_sibling - returns the next sibling node in HTML text order

.previous_sibling - returns the previous sibling node in HTML text order

.next_siblings - iterator, yields all following sibling nodes in HTML text order

for sibling in soup.a.next_siblings:
    print(sibling)

.previous_siblings - iterator, yields all preceding sibling nodes in HTML text order

The output gets rather long, so try it yourself; I am too lazy to paste a screenshot of the last one ( ̄ _,  ̄)

for sibling in soup.a.previous_siblings:
    print(sibling)
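One thing to keep in mind (a small sketch with the soup above): .next_sibling is often not the next tag but the whitespace or punctuation text sitting between tags, because those text pieces count as siblings too.

first_a = soup.a

print(repr(first_a.next_sibling))          # ',\n' - the text between the first two links
print(first_a.next_sibling.next_sibling)   # the second <a> tag (the Lacie link)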

Upward traversal:

.parent - the parent tag of a node

.parents - iterator over a node's ancestor tags, used for looping upward through the parents
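A small sketch of the two, again with the soup from above:

title = soup.title

print(title.parent.name)      # head

# .parents walks all the way up to the BeautifulSoup object itself
for parent in title.parents:
    print(parent.name)        # head, html, [document]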

 

The key to using the BeautifulSoup library is probably find and its various keyword parameters. The official documentation covers them in great detail, so I will not list the usage of each keyword parameter here (mainly because my roommates are about to go to bed and would kill me if I kept going).

Here is a small find example:
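A sketch of my own on the demo document above: find() returns the first match and find_all() returns a list of all matches. Expected results are shown in the comments.

print(soup.find('a'))                      # the first <a> tag (the Elsie link)

for link in soup.find_all('a'):            # all three links
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

# keyword arguments filter on attributes
print(soup.find_all(id='link2'))           # [the Lacie <a> tag]
print(soup.find_all('p', class_='story'))  # the two "story" paragraphs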


 
