"Python Network data Acquisition" Reading notes (i)

Source: Internet
Author: User
Tags: virtual environment

The usual idea when thinking about "web crawlers" (a sketch follows this list):

• Get HTML data from a website domain name

• Parse the data for the target information

• Store the target information

• If necessary, move to another page and repeat the process
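As a rough illustration, here is a minimal sketch of that loop. The seed URL, the h1 target, and the in-memory result list are placeholder assumptions, not the book's code:

# A minimal sketch of the four steps above (illustrative assumptions,
# not the book's code).
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=5):
    to_visit = [seed_url]      # pages waiting to be fetched
    results = []               # stored target information
    visited = 0
    while to_visit and visited < max_pages:
        url = to_visit.pop()
        visited += 1
        html = urlopen(url).read()                   # 1. get the HTML data
        bsObj = BeautifulSoup(html, 'html.parser')
        if bsObj.h1 is not None:                     # 2. parse the target info
            results.append(bsObj.h1.get_text())      # 3. store it
        for link in bsObj.find_all('a', href=True):  # 4. move to another page
            to_visit.append(urljoin(url, link['href']))
    return results

print(crawl("http://pythonscraping.com/pages/page1.html"))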


When a Web browser encounters a tag such as <img src="...">, it sends another request to the server to fetch that resource before rendering the page; urlopen, by contrast, retrieves only the single file you ask for.
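A small sketch of fetching an embedded image yourself; the <img> handling here is an illustrative assumption, not the book's code:

from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# The browser requests embedded resources automatically; with urlopen
# you must issue those extra requests yourself.
page_url = "http://pythonscraping.com/pages/page1.html"
bsObj = BeautifulSoup(urlopen(page_url).read(), 'html.parser')
img = bsObj.find('img')   # first <img> tag, if the page has one
if img is not None:
    img_bytes = urlopen(urljoin(page_url, img['src'])).read()
    print(len(img_bytes), "bytes of image data")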


1. First Look at the urllib Library

urllib is a standard library; in Python 3.x, urllib2 was renamed urllib and divided into submodules: urllib.request, urllib.parse, and urllib.error.
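Each submodule handles one job; a quick illustration (not part of the original notes):

from urllib.request import urlopen   # fetching remote objects
from urllib.parse import urlparse    # splitting and assembling URL strings
from urllib.error import HTTPError   # exceptions raised while fetching

parts = urlparse("http://pythonscraping.com/pages/page1.html")
print(parts.netloc)   # pythonscraping.com
print(parts.path)     # /pages/page1.html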

urlopen is used to open and read a remote object obtained from the network.

Import urlopen, then call html.read() to get the HTML content of the web page.

>>> from urllib.request import urlopen
>>> html = urlopen("http://pythonscraping.com/pages/page1.html")
>>> print(html.read())
b'...'
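Note that urlopen returns a file-like response object whose read() yields raw bytes; decode them if you want a str (a side note added here, with the encoding assumed):

from urllib.request import urlopen

html = urlopen("http://pythonscraping.com/pages/page1.html")
text = html.read().decode('utf-8')   # bytes -> str; UTF-8 is an assumption
print(text[:60])                     # first characters of the page's HTML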


2. Using a Virtual Environment

A virtual environment lets you keep each project's library files separate.

As follows: create a new environment called scrapingenv and activate it; in the new scrapingenv environment you can install and use BeautifulSoup; finally, exit the environment with the deactivate command.


$ virtualenv scrapingenv
$ cd scrapingenv/
$ source bin/activate
(scrapingenv) ryan$ pip install beautifulsoup4
(scrapingenv) ryan$ python
> from bs4 import BeautifulSoup
>
(scrapingenv) ryan$ deactivate
$
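On recent versions of Python 3, the standard-library venv module provides the same workflow (python3 -m venv scrapingenv), without installing virtualenv separately.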

You can use either of the following commands to install the BeautifulSoup library:

pip3 install bs4
pip3 install beautifulsoup4


3. First Look at the BeautifulSoup Library

The HTML content is passed to a BeautifulSoup object (html.parser is a built-in parser), and the h1 tag can then be extracted from it.

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen("http://pythonscraping.com/pages/page1.html")
>>> bsObj = BeautifulSoup(html.read(), 'html.parser')
>>> print(bsObj.h1)

All of the following calls produce the same result (a quick check follows this list):

bsObj.h1

bsObj.html.body.h1

bsObj.body.h1

bsObj.html.h1
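An illustrative check of that equivalence (not from the book):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), 'html.parser')
# All four navigation paths resolve to the same <h1> tag.
print(bsObj.h1 == bsObj.html.body.h1 == bsObj.body.h1 == bsObj.html.h1)  # True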


4. Handling Exceptions

The urlopen function throws an HTTPError exception if the page does not exist on the server (or if there was an error retrieving it).

If the server does not exist (that is, the link cannot be opened or the URL is mistyped), urlopen returns a None object.

Handle it in the following ways:

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # interrupt the program, or execute another scheme
else:
    if html is None:
        print("URL is not found")
    else:
        # program continues
        pass


Calling a tag that does not exist on a BeautifulSoup object returns a None object.

Calling a child tag on that None object raises an AttributeError.

Handle it in the following ways:

try:
    bsObj = BeautifulSoup(html.read(), 'html.parser')
    badContent = bsObj.body.h2
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)



5. Reorganizing the Above Code

# -*- coding: utf-8 -*-
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html.read(), 'html.parser')
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("http://pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

We created a getTitle function that returns the title of the page; if there is a problem getting the page, the function returns a None object.

Inside the getTitle function we check for an HTTPError as before, then wrap the two BeautifulSoup lines inside a try statement. If either of those lines has a problem, an AttributeError can be thrown (if the server does not exist, html is a None object and html.read() will throw an AttributeError).


"Python Network data Acquisition" Reading notes (i)
