BeautifulSoup for the Python web crawler


BeautifulSoup converts an HTML document into a tree structure in which each node is a Python object, which lets us operate on every node. Refer to the following code:
import urllib2
from bs4 import BeautifulSoup

def parse_url():
    try:
        # Request the page and parse the response with BeautifulSoup
        req = urllib2.Request('http://www.xunsee.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml')
        fd = urllib2.urlopen(req)
        html = BeautifulSoup(fd)
    except urllib2.URLError, e:
        print e
The object passed to BeautifulSoup here is the HTML page returned by urlopen. Running this, however, prints a warning:
E:\python2.7.11\python.exe e:/py_prj/test.py
E:\python2.7.11\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line of the file e:/py_prj/test.py. To get rid of this warning, change code that looks like this:
 BeautifulSoup([your markup])
to this:
 BeautifulSoup([your markup], "lxml")
  markup_type=markup_type))

This warning means that no parser was specified for the page passed to BeautifulSoup. There are two parsers to choose from: html.parser and lxml. We will use html.parser here and cover lxml later. Change the code to the following and the warning goes away:

html = BeautifulSoup(fd, "html.parser")

Before parsing the web page, let's look at a few concepts: tags, attributes, and contents.

For example, take the following page structure: <a href="1.shtml">first section</a>. Here a is the tag, href is an attribute, and "first section" is the contents of the tag.
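As a minimal sketch of these three concepts (the fragment and variable names below are made up for illustration):

from bs4 import BeautifulSoup

# A made-up fragment, used only to show tag name, attribute and contents
frag = BeautifulSoup('<a href="1.shtml">first section</a>', 'html.parser')
tag = frag.a
print tag.name      # a              -- the tag
print tag['href']   # 1.shtml        -- the attribute
print tag.string    # first section  -- the contents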

The methods for finding attributes in BeautifulSoup are as follows:

html.meta.encode('gbk')
html.meta.attrs
Combining these with the code above, we can find the first element whose tag is meta and print all of the meta tag's attributes:
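A minimal sketch of such a snippet, assuming html is the BeautifulSoup object built in parse_url() above:

# Assumes html = BeautifulSoup(fd, "html.parser") from the code above
print html.meta.encode('gbk')   # the first meta tag, encoded as GBK
print html.meta.attrs           # its attributes, returned as a dict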

The results are as follows:

E:\python2.7.11\python.exe e:/py_prj/test.py
<meta content="text/html; charset=gbk" http-equiv="Content-Type"/>
{u'content': u'text/html; charset=gb2312', u'http-equiv': u'content-type'}

If you want to get an attribute, you can do this in the following way:

html.meta.attrs['content']   # the output is text/html
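As a small aside (not from the original), a tag also supports dictionary-style access directly, so the following is equivalent:

html.meta['content']        # same as html.meta.attrs['content']
html.meta.get('content')    # returns None instead of raising KeyError if the attribute is missing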

What if we want the contents of the tag, that is, its text? Use string, like this:

html.title.string.encode('gbk')

The string property retrieves the text corresponding to the tag. But the methods above can only find the first matching tag. How do we distinguish multiple tags of the same name on one page, for example a page with several span tags and a tags?
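A minimal sketch of this limitation, on a made-up fragment:

from bs4 import BeautifulSoup

doc = BeautifulSoup('<span>one</span><span>two</span>', 'html.parser')
print doc.span.string   # prints one -- only the first span is returned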

Then you need another way to get them. The following code uses the find_all method to get all elements whose tag is a and print them:

for a in html.find_all('a'):
    print a.encode('gbk')

The results are as follows; because there are too many, only part of them is listed.

If you want the text contents of these nodes, use the get_text() method, as follows:

for a in html.find_all('a'):
    print a.get_text()
If you want all the attributes of these nodes, you can use the following method:

for a in html.find_all('a'):
    print a.attrs
If you want the value of a single attribute: the a.attrs above returns a dictionary, so, for example, to get the value of the class attribute, use the following method:

for a in html.find_all('a'):
    print a.attrs['class']
The find_all method can also take qualifying values for the lookup. For example, to get only the <a href="1.shtml"> tags, use the code below; the first parameter is the tag name and the keyword parameter is the attribute to match:

for a in html.find_all('a', href="1.shtml"):
    print a.encode('gbk')
You can also set multiple parameters in a lookup, such as finding a form tag by its method and target attributes:

for form in html.find_all('form', method="POST", target="_blank"):
    print form.encode('gbk')
Of course, regular expressions can also be used in the search, for example re.compile("a.*") and similar patterns.
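As a minimal sketch (the pattern here is an illustration, not from the original), find_all accepts a compiled regular expression in place of a tag name:

import re

# Match every tag whose name starts with 'a', e.g. a, abbr, address
for tag in html.find_all(re.compile('^a')):
    print tag.name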
You can also limit the number of results: the following expression returns only the first 5 matches.

for a in html.find_all('a', limit=5):
    print a.attrs['class']
The find family also includes find_parents()/find_parent() to locate parent nodes, find_next_siblings()/find_next_sibling() to find the following sibling nodes, and find_previous_siblings()/find_previous_sibling() to find the preceding sibling nodes.
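A minimal sketch of these navigation methods, again on a made-up fragment:

from bs4 import BeautifulSoup

doc = BeautifulSoup('<div><p>one</p><p>two</p><p>three</p></div>', 'html.parser')
second = doc.find_all('p')[1]
print second.find_parent('div')          # the enclosing div
print second.find_next_sibling('p')      # <p>three</p>
print second.find_previous_sibling('p')  # <p>one</p>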
