Python crawler (i)

Source: Internet
Author: User

1. A simple web request

urllib is part of the Python standard library (so you don't need to install anything extra to run the examples). It provides functions for requesting data over the network, handling cookies, and even changing request metadata such as headers and the user agent.

urlopen is the function used to fetch remote data over the network: by default it sends a GET request to the given URL.

    from urllib.request import urlopen

    # Send a GET request and print the raw bytes of the response body
    html = urlopen('http://pythonscraping.com/pages/page1.html')
    print(html.read())
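The text above mentions that urllib can also change request metadata such as headers and the user agent. A minimal sketch of how that might look with urllib.request.Request follows. The helper name build_request and the user-agent string example-crawler/0.1 are made up for illustration and are not part of the original article. The request is only built here, not sent.

```python
from urllib.request import Request

def build_request(url, user_agent="example-crawler/0.1"):
    """Build a request carrying a custom User-Agent header (hypothetical helper)."""
    return Request(url, headers={"User-Agent": user_agent})

req = build_request("http://pythonscraping.com/pages/page1.html")

# No request body was supplied, so the method defaults to GET.
print(req.get_method())

# urllib stores header names with only the first letter capitalized.
print(req.get_header("User-agent"))
```

Passing req to urlopen in place of the plain URL string would send the request with that header.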

2. A brief introduction to BeautifulSoup

This module is an HTML/XML parser. It copes well with malformed (non-canonical) markup and builds a parse tree from it.

I Installing BeautifulSoup

Linux: apt-get install python-bs4 (python3-bs4 for Python 3), or pip install beautifulsoup4

II In this module (bs4), the BeautifulSoup object is the one used most often

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
    # Parse the downloaded bytes with Python's built-in HTML parser
    bsObj = BeautifulSoup(html.read(), 'html.parser')
    print(bsObj.h1)

    # output: the first <h1> element of the page
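To see what handling non-canonical tags means in practice without touching the network, the sketch below feeds BeautifulSoup a small hand-written HTML string in which several tags are never closed. The markup is invented for this example. It assumes bs4 is installed and uses Python's built-in html.parser.

```python
from bs4 import BeautifulSoup

# Invented sample markup: the <p>, <body> and <html> tags are never closed.
broken_html = "<html><body><h1>A Title</h1><p>Some text"

soup = BeautifulSoup(broken_html, "html.parser")

# A tag can be reached directly or by walking down the parse tree.
print(soup.h1.get_text())
print(soup.html.body.h1.get_text())

# The unclosed <p> still ends up as a proper node in the tree.
print(soup.p.get_text())
```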

3. Exception Handling

I The requested page is not found on the server, or the server fails while handling the request. An HTTP error is returned, for example "404 Page Not Found" or "500 Internal Server Error". In all of these cases urlopen raises an HTTPError exception.

    from urllib.request import urlopen
    from urllib.error import HTTPError

    try:
        html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
    except HTTPError as e:
        print(e)
        # return null, break, or do some other "Plan B"
    else:
        # program continues
        pass
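The HTTPError object carries the status code and the reason text, which helps when deciding what "Plan B" should be. The following is a rough sketch, not the article's own code: fetch is a hypothetical helper, and its opener parameter exists only so that the function can be exercised without a network connection.

```python
from urllib.request import urlopen
from urllib.error import HTTPError

def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None on an HTTP error (hypothetical helper)."""
    try:
        response = opener(url)
    except HTTPError as e:
        # e.code is the numeric status (e.g. 404), e.reason the accompanying text
        print(f"HTTP error {e.code} ({e.reason}) for {url}")
        return None
    return response.read()

# Offline demonstration: a stand-in opener that always fails with 404.
def always_404(url):
    raise HTTPError(url, 404, "Not Found", None, None)

result = fetch("http://www.pythonscraping.com/exercises/exercise1.html",
               opener=always_404)
print(result)
```

Called without the opener argument, fetch falls back to the real urlopen.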

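II The server cannot be reached at all (the site is down or the domain name is mistyped). In Python 3, urlopen raises URLError in this situation rather than returning None, so a None check such as the following is at most an extra guard:

    if html is None:
        print("URL is not found")
    else:
        # program continues
        pass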
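In current Python 3 an unreachable server surfaces as urllib.error.URLError, and HTTPError is a subclass of it, so the more specific exception has to be caught first. Here is a sketch under that assumption. safe_open and the stand-in opener unreachable are hypothetical names introduced for this example.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def safe_open(url, opener=urlopen):
    """Return (response, None) on success or (None, message) on failure."""
    try:
        return opener(url), None
    except HTTPError as e:
        # The server answered, but with an error status.
        return None, f"HTTP error {e.code}"
    except URLError as e:
        # No usable answer at all (DNS failure, refused connection, ...).
        return None, f"server not reached: {e.reason}"

# Offline demonstration: a stand-in opener that simulates an unreachable host.
def unreachable(url):
    raise URLError("name or service not known")

response, problem = safe_open("http://www.pythonscraping.com/", opener=unreachable)
print(response, problem)
```

Swapping the order of the two except clauses would make the HTTPError branch unreachable, because URLError would match first.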

A pattern that handles these exceptions together:

    from urllib.request import urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup

    def getTitle(url):
        # The request itself may fail with an HTTP error
        try:
            html = urlopen(url)
        except HTTPError as e:
            return None
        # The page may lack the tags we expect
        try:
            bsObj = BeautifulSoup(html.read(), 'html.parser')
            title = bsObj.body.h1
        except AttributeError as e:
            return None
        return title

    title = getTitle("http://www.pythonscraping.com/exercises/exercise1.html")
    if title is None:
        print("Title could not be found")
    else:
        print(title)
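The AttributeError branch in getTitle covers pages that lack the expected tags: BeautifulSoup returns None for a tag that does not exist, and asking None for .h1 then raises AttributeError. The sketch below reproduces this on invented HTML snippets, assuming bs4 is installed. title_or_none is a hypothetical name.

```python
from bs4 import BeautifulSoup

def title_or_none(markup):
    """Return the <h1> inside <body>, or None when the page lacks that structure."""
    soup = BeautifulSoup(markup, "html.parser")
    try:
        # soup.body is None if there is no <body> tag, and None has no .h1
        return soup.body.h1
    except AttributeError:
        return None

with_body = title_or_none("<html><body><h1>Hello</h1></body></html>")
without_body = title_or_none("<p>no body tag here</p>")

print(with_body.get_text())
print(without_body)
```

If <body> exists but contains no <h1>, soup.body.h1 is simply None and no exception is raised, which is why the caller still compares the result against None.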
