The usual steps when thinking about "web crawlers":
- Retrieve the HTML data from a website's domain name
- Parse that data for the target information
- Store the target information
- If necessary, move to another page and repeat the process
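The four steps above can be sketched as a small loop. The `PAGES` dict and `TitleParser` class below are made-up stand-ins for real network fetches, so the sketch runs offline; a real crawler would replace the dict lookup with a `urlopen` call:

```python
from html.parser import HTMLParser
from collections import deque

# Canned page contents standing in for real HTTP responses (illustrative only).
PAGES = {
    "/page1": "<html><head><title>Page 1</title></head><a href='/page2'>next</a></html>",
    "/page2": "<html><head><title>Page 2</title></head></html>",
}

class TitleParser(HTMLParser):
    """Collects the <title> text and any linked hrefs from one page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title = data

def crawl(start):
    queue, seen, titles = deque([start]), set(), {}
    while queue:                      # step 4: repeat for each newly found page
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        parser = TitleParser()
        parser.feed(PAGES[url])       # steps 1-2: fetch the HTML and parse it
        titles[url] = parser.title    # step 3: store the target information
        queue.extend(parser.links)    # collect links to visit next
    return titles

print(crawl("/page1"))  # {'/page1': 'Page 1', '/page2': 'Page 2'}
```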
1. First look at the urllib library
urllib is part of the standard library. In Python 3.x, urllib2 was renamed urllib and split into submodules: urllib.request, urllib.parse, and urllib.error.
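A quick taste of those submodules: urllib.parse works entirely offline, so this snippet runs without a network connection, and the exception classes used later live in urllib.error:

```python
from urllib.parse import urlparse, urljoin
from urllib.error import HTTPError, URLError  # exception classes used for error handling

# Split a URL into its components.
parts = urlparse("http://pythonscraping.com/pages/page1.html")
print(parts.netloc)  # pythonscraping.com
print(parts.path)    # /pages/page1.html

# Resolve a relative link against a base page, as a crawler would.
print(urljoin("http://pythonscraping.com/pages/page1.html", "page2.html"))
```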
urlopen opens and reads a remote object obtained over the network.
Import urlopen, then call html.read() to get the HTML content of the page.
>>> from urllib.request import urlopen
>>> html = urlopen("http://pythonscraping.com/pages/page1.html")
>>> print(html.read())
b'...
2. Using a Virtual Environment
You can keep library files isolated in a virtual environment.
For example: create a new environment called scrapingenv and activate it; inside the new scrapingenv environment you can install and use BeautifulSoup; finally, leave the environment with the deactivate command.
$ virtualenv scrapingenv
$ cd scrapingenv/
$ source bin/activate
(scrapingenv) ryan$ pip install beautifulsoup4
(scrapingenv) ryan$ python
> from bs4 import BeautifulSoup
>
(scrapingenv) ryan$ deactivate
$
You can install the BeautifulSoup library with either of the following commands:
pip3 install bs4
pip3 install beautifulsoup4
3. First look at the BeautifulSoup library
The HTML content is passed into a BeautifulSoup object (html.parser is a built-in parser), and tags such as h1 can then be pulled directly out of that object.
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen("http://pythonscraping.com/pages/page1.html")
>>> bsObj = BeautifulSoup(html.read(), 'html.parser')
>>> print(bsObj.h1)
All of the following function calls produce the same result:
bsObj.h1
bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1
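You can verify this equivalence on a small document without fetching anything; the HTML string here is made up for illustration, but any page with a single h1 behaves the same way:

```python
from bs4 import BeautifulSoup

# A made-up page fragment used in place of a network fetch.
doc = "<html><body><h1>An Interesting Title</h1></body></html>"
bsObj = BeautifulSoup(doc, "html.parser")

# All four navigation paths land on the same tag.
assert bsObj.h1 == bsObj.html.body.h1 == bsObj.body.h1 == bsObj.html.h1
print(bsObj.h1)  # <h1>An Interesting Title</h1>
```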
4. Handling Exceptions
The urlopen function throws an HTTPError exception if the page does not exist on the server (or an error occurred while retrieving it).
If the server itself cannot be found (the link will not open, or the URL is mistyped), urlopen raises a URLError.
Handle these cases as follows:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    html = urlopen("http://pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # interrupt the program, or fall back to another plan
except URLError as e:
    print("The server could not be found")
else:
    # program continues
    pass
Calling a tag that does not exist on a BeautifulSoup object returns a None object.
Calling a child tag on that None object then raises an AttributeError.
Handle both cases as follows:
try:
    bsObj = BeautifulSoup(html.read(), 'html.parser')
    badContent = bsObj.body.h2
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)
5. Re-organize the above code
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html.read(), 'html.parser')
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title


title = getTitle("http://pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)
We created a getTitle function that returns the title of the page, or a None object if any problem occurs while retrieving it.
Inside getTitle, we check for an HTTPError as before, then wrap the two BeautifulSoup lines inside a try statement. If either of those lines hits a problem, an AttributeError can be thrown (for example, if the server does not exist, html would be a None object and html.read() would throw an AttributeError).
Reading notes on "Web Scraping with Python" (Part 1)