BeautifulSoup for the Python web crawler


BeautifulSoup converts an HTML document into a tree structure in which each node is a Python object, which lets us operate on every node. Refer to the following code:
import urllib2
from bs4 import BeautifulSoup

def parse_url():
    try:
        # Request the page and parse the response with BeautifulSoup
        req = urllib2.Request('http://www.xunsee.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml')
        fd = urllib2.urlopen(req)
        html = BeautifulSoup(fd)
    except urllib2.URLError, e:
        print e
The object passed to BeautifulSoup here is the HTML page returned by urlopen. Running this, however, prints a warning:
E:\python2.7.11\python.exe e:/py_prj/test.py
E:\python2.7.11\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line of the file e:/py_prj/test.py. To get rid of this warning, change code that looks like this:
 BeautifulSoup([your markup])
to this:
 BeautifulSoup([your markup], "lxml")
  markup_type=markup_type))

This warning means that no parser was specified for the page passed to BeautifulSoup. There are two parsers to choose from: html.parser and lxml. We will use html.parser here and cover lxml later. Change the code to the following and the warning goes away:

html = BeautifulSoup(fd, "html.parser")

Before parsing the web page, let's look at a few concepts: tags, attributes, and contents.

For example, take the following page structure: <a href="1.shtml">first section</a>. Here a is the tag, href is an attribute, and "first section" is the contents of the tag.
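As a minimal sketch of these three concepts (the fragment and variable names below are made up for illustration):

from bs4 import BeautifulSoup

# A made-up fragment, used only to show tag name, attribute and contents
frag = BeautifulSoup('<a href="1.shtml">first section</a>', 'html.parser')
tag = frag.a
print tag.name      # a              -- the tag
print tag['href']   # 1.shtml        -- the attribute
print tag.string    # first section  -- the contents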

The methods for finding attributes in BeautifulSoup are as follows:

html.meta.encode('gbk')
html.meta.attrs
Combining these with the code above, we can find the first element whose tag is meta and print all of the meta tag's attributes:
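A minimal sketch of such a snippet, assuming html is the BeautifulSoup object built in parse_url() above:

# Assumes html = BeautifulSoup(fd, "html.parser") from the code above
print html.meta.encode('gbk')   # the first meta tag, encoded as GBK
print html.meta.attrs           # its attributes, returned as a dict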

The results are as follows:

E:\python2.7.11\python.exe e:/py_prj/test.py
<meta content="text/html; charset=gbk" http-equiv="Content-Type"/>
{u'content': u'text/html; charset=gb2312', u'http-equiv': u'content-type'}

If you want to get an attribute, you can do this in the following way:

html.meta.attrs['content']   # the output is text/html
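As a small aside (not from the original), a tag also supports dictionary-style access directly, so the following is equivalent:

html.meta['content']        # same as html.meta.attrs['content']
html.meta.get('content')    # returns None instead of raising KeyError if the attribute is missing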

What if we want the contents of the tag, that is, its text? Use string, like this:

html.title.string.encode('gbk')

The string property retrieves the text corresponding to the tag. But the methods above can only find the first matching tag. How do we distinguish multiple tags of the same name on one page, for example a page with several span tags and a tags?
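A minimal sketch of this limitation, on a made-up fragment:

from bs4 import BeautifulSoup

doc = BeautifulSoup('<span>one</span><span>two</span>', 'html.parser')
print doc.span.string   # prints one -- only the first span is returned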

Then you need another way to get them. The following code uses the find_all method to get all elements whose tag is a and print them:

for a in html.find_all('a'):
    print a.encode('gbk')

The results are as follows; because there are too many, only part of them is listed.

If you want the text contents of these nodes, use the get_text() method, as follows:

for a in html.find_all('a'):
    print a.get_text()
If you want all the attributes of these nodes, you can use the following method:

for a in html.find_all('a'):
    print a.attrs
If you want the value of a single attribute: the a.attrs above returns a dictionary, so, for example, to get the value of the class attribute, use the following method:

for a in html.find_all('a'):
    print a.attrs['class']
The find_all method can also take qualifying values for the lookup. For example, to get only the <a href="1.shtml"> tags, use the code below; the first parameter is the tag name and the keyword parameter is the attribute to match:

for a in html.find_all('a', href="1.shtml"):
    print a.encode('gbk')
You can also set multiple parameters in a lookup, such as finding a form tag by its method and target attributes:

for form in html.find_all('form', method="POST", target="_blank"):
    print form.encode('gbk')
Of course, regular expressions can also be used in the search, for example re.compile("a.*") and similar patterns.
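As a minimal sketch (the pattern here is an illustration, not from the original), find_all accepts a compiled regular expression in place of a tag name:

import re

# Match every tag whose name starts with 'a', e.g. a, abbr, address
for tag in html.find_all(re.compile('^a')):
    print tag.name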
You can also limit the number of results: the following expression returns only the first 5 matches.

for a in html.find_all('a', limit=5):
    print a.attrs['class']
The find family also includes find_parents()/find_parent() to locate parent nodes, find_next_siblings()/find_next_sibling() to find the following sibling nodes, and find_previous_siblings()/find_previous_sibling() to find the preceding sibling nodes.
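A minimal sketch of these navigation methods, again on a made-up fragment:

from bs4 import BeautifulSoup

doc = BeautifulSoup('<div><p>one</p><p>two</p><p>three</p></div>', 'html.parser')
second = doc.find_all('p')[1]
print second.find_parent('div')          # the enclosing div
print second.find_next_sibling('p')      # <p>three</p>
print second.find_previous_sibling('p')  # <p>one</p>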
