Web Scraping with Python, Chapter 1

Source: Internet
Author: User

1. Understanding Urllib

Urllib is a standard Python library that provides rich functionality such as requesting data from a web server and processing cookies. It corresponds to the urllib2 library in Python 2. Unlike urllib2, the Python 3 urllib is divided into several submodules: urllib.request, urllib.parse, urllib.error, etc. For details, see https://docs.python.org/3/library/urllib.html

```python
from urllib.request import urlopen

html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())
```

Note that read() returns a bytes object, so the printed output begins with b'.
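Since the paragraph above mentions the urllib submodules, here is a minimal offline sketch of urllib.parse, the submodule for splitting and building URLs (the query values below are made up for illustration):

```python
from urllib.parse import urlparse, urlencode

# Split a URL into its components
parts = urlparse("http://pythonscraping.com/pages/page1.html")
print(parts.scheme)   # http
print(parts.netloc)   # pythonscraping.com
print(parts.path)     # /pages/page1.html

# Build a query string from a dict (order follows the dict)
query = urlencode({"page": 1, "q": "scraping"})
print(query)          # page=1&q=scraping
```

urlparse is also handy in a crawler for turning relative links into absolute ones and for keeping a crawl restricted to one domain via parts.netloc.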
2. Understanding BeautifulSoup

The BeautifulSoup library is used to parse HTML text and convert it into a BeautifulSoup object.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), "lxml")
print(bsObj.h1)
```
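Beyond attribute access like bsObj.h1, a BeautifulSoup object also supports searching with find and find_all. A small offline sketch (the inline HTML string is made up to stand in for a fetched page, and html.parser is used so no extra install is needed):

```python
from bs4 import BeautifulSoup

# An inline document stands in for the page fetched with urlopen
html = """<html><body>
<h1>An Example Title</h1>
<ul><li>first</li><li>second</li></ul>
</body></html>"""

bsObj = BeautifulSoup(html, "html.parser")
print(bsObj.h1)                  # the whole tag: <h1>An Example Title</h1>
print(bsObj.h1.get_text())       # just the text inside it
print([li.get_text() for li in bsObj.find_all("li")])  # every <li> on the page
```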

The BeautifulSoup constructor takes a parsing library as its second argument. The following table lists several common parsers with their advantages and disadvantages:

| Parser | How to use | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(html, 'html.parser') | Built into Python; fast execution | Poor fault tolerance |
| lxml HTML parser | BeautifulSoup(html, 'lxml') | Fast; high fault tolerance | Needs installing; depends on a C library |
| lxml XML parser | BeautifulSoup(html, ['lxml', 'xml']) | Fast; strong fault tolerance; supports XML | Depends on a C library |
| html5lib parser | BeautifulSoup(html, 'html5lib') | Parses the way a browser does; best fault tolerance | Slow |
3. Writing a Reliable Crawler

When visiting a website we often run into a 404 Page Not Found, or the server may be temporarily down. In those cases the urlopen call throws an exception and the program cannot continue, so we use the urllib.error module to handle the exception.

```python
from urllib.request import urlopen
from urllib.error import URLError

try:
    html = urlopen("https://www.baid.com/")   # the URL is wrong
except URLError as e:
    print(e)
```

Output:

<urlopen error [Errno 111] Connection refused>
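urllib.error actually distinguishes two failure cases, and a robust crawler can handle them separately. A sketch of a helper (the fetch name is my own, not from the original): since HTTPError is a subclass of URLError, it must be caught first.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def fetch(url):
    """Return the page bytes, or None if the request fails. Illustrative helper."""
    try:
        return urlopen(url).read()
    except HTTPError as e:
        # The server answered, but with an error status (404, 500, ...)
        print("server error:", e.code)
    except URLError as e:
        # We never reached a server (bad hostname, refused connection, ...)
        print("connection failed:", e.reason)
    return None
```

With this split, a 404 can be logged and skipped while a refused connection might be worth retrying later.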

Even after obtaining a reliable connection, parsing the HTML with BeautifulSoup can still fail: if the site is redesigned, a tag we rely on may no longer exist, and accessing it throws an exception.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
try:
    bsObj = BeautifulSoup(html.read(), "lxml")
    li = bsObj.ul.li
    print(li)
except AttributeError as e:
    print(e)
```

Output:

'NoneType' object has no attribute 'li'
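The AttributeError arises because a missing tag evaluates to None, and accessing anything on None fails. Instead of catching the exception, we can also check explicitly. A minimal offline sketch (the inline HTML string is made up; html.parser is used so no extra install is needed):

```python
from bs4 import BeautifulSoup

# A page with no <ul> at all, like the case that raised AttributeError
html = "<html><body><h1>A Title</h1></body></html>"
bsObj = BeautifulSoup(html, "html.parser")

# A missing tag comes back as None, so check before going one level deeper
ul = bsObj.find("ul")
if ul is None:
    print("no <ul> on this page")
else:
    print(ul.li)
```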
4. The First Crawler Program
```python
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title == None:
    print("Title could not be found.")
else:
    print(title)
```
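The program above only catches HTTPError, so a connection-level failure (bad hostname, server down) would still crash it. A sketch of the same function extended to catch URLError as well; this variation is mine, not from the original, and it uses html.parser so no extra install is needed:

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    """Like getTitle above, but also survives connection-level failures."""
    try:
        html = urlopen(url)
    except HTTPError:
        return None          # the server replied with an error status
    except URLError:
        return None          # we could not reach the server at all
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError:
        return None          # the page structure was not what we expected
    return title
```

Every way the fetch or the parse can fail now funnels into the same None result, which the caller already knows how to handle.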

