Writing crawlers with urllib in Python 3


What is a crawler?

A crawler is also known as a spider: if the internet is likened to a spider's web, then the spider is a crawler moving across that web. A web crawler finds pages by their addresses, that is, by their URLs. As a simple example, the string we type into the browser's address bar is a URL, for example: https://www.baidu.com

URL stands for Uniform Resource Locator, and its general format is as follows (square brackets [] mark optional parts):

protocol://hostname[:port]/path/[;parameters][?query]#fragment

The URL consists mainly of three parts:

      1. protocol: the first part is the protocol; for example, Baidu uses the HTTPS protocol;
      2. hostname[:port]: the second part is the host name (with an optional port number); most websites use the default HTTP port 80. Baidu's hostname is www.baidu.com, which is the server's address;
      3. path: the third part is the specific address of the resource on the host, such as a directory and file name.

A crawler uses URLs like these to locate and fetch information from the web.
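A quick illustration (added here, not part of the original article): Python's urllib.parse.urlparse splits a URL into exactly these components.

    from urllib.parse import urlparse

    # Split a URL into its components
    parts = urlparse('https://www.baidu.com:443/s?wd=python#top')
    print(parts.scheme)    # 'https'  -- the protocol
    print(parts.hostname)  # 'www.baidu.com'
    print(parts.port)      # 443
    print(parts.path)      # '/s'
    print(parts.query)     # 'wd=python'
    print(parts.fragment)  # 'top'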

The urllib package in Python 3

The modules in the urllib package let Python code open and read URLs.

From the official Python 3 documentation we can see that Python 2's two separate libraries, urllib and urllib2, were merged into the single built-in urllib package in Python 3.

The urllib package provides just four modules:

      1. urllib.request: for opening and reading URLs
      2. urllib.error: contains the exceptions raised by urllib.request
      3. urllib.parse: for parsing URLs and encoding the parameters they require
      4. urllib.robotparser: for parsing robots.txt files
Changes relative to Python 2
  • import urllib2 in Python 2.x  --->  import urllib.request, urllib.error in Python 3.x
  • import urllib in Python 2.x  --->  import urllib.request, urllib.parse, urllib.error in Python 3.x
  • import urlparse in Python 2.x  --->  import urllib.parse in Python 3.x
  • urllib.urlopen in Python 2.x  --->  urllib.request.urlopen in Python 3.x
  • urllib.urlencode in Python 2.x  --->  urllib.parse.urlencode in Python 3.x
  • urllib.quote in Python 2.x  --->  urllib.parse.quote in Python 3.x
  • cookielib.CookieJar in Python 2.x  --->  http.cookiejar.CookieJar in Python 3.x
  • urllib2.Request in Python 2.x  --->  urllib.request.Request in Python 3.x
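A minimal sketch (added for illustration) showing the Python 3 equivalents from the list above in action:

    # Python 3 style: everything now lives under the urllib and http packages
    import urllib.request
    import urllib.parse
    import http.cookiejar

    url = 'http://www.example.com/?' + urllib.parse.urlencode({'q': 'python'})
    request = urllib.request.Request(url)   # was urllib2.Request in Python 2
    jar = http.cookiejar.CookieJar()        # was cookielib.CookieJar in Python 2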
Basic use

urllib.request.urlopen(): opens a URL and returns a response object containing the page's information

response.read(): reads the content of the response

response.getcode(): gets the returned HTTP status code

response.info(): gets the returned metadata, such as the HTTP headers

response.geturl(): gets the URL that was actually accessed

    # Use Python to access the Cnblogs blog and print the page content
    import urllib.request

    response = urllib.request.urlopen('http://www.cnblogs.com/dachenzi')
    data = response.read().decode('utf-8')
    print(data)
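The other helpers listed above can be inspected on the same response object; a minimal sketch (added for illustration):

    import urllib.request

    response = urllib.request.urlopen('http://www.cnblogs.com/dachenzi')
    print(response.getcode())  # HTTP status code, e.g. 200
    print(response.geturl())   # the URL actually fetched (after any redirects)
    print(response.info())     # the response headers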

Download a picture with Python

    import urllib.request

    url = ('http://img.lenovomm.com/s3/img/app/app-img-lestore/'
           '2370-2015-07-16035439-1437033279327.jpg'
           '?iscompress=true&width=320&height=480&quantity=1&rotate=true')
    response = urllib.request.urlopen(url)
    data = response.read()
    # The image is binary data, so open the file in 'b' mode to write it
    with open('img.jpg', 'wb') as f:
        f.write(data)

Note: urlopen() accepts either a URL string or a Request object.

A small exercise

Use Python to build a small translation program: enter Chinese text and get the English translation back, and vice versa.

Step 1: visit the translation page in a browser, watch the Network panel of the developer tools, and find the address to which the form data is submitted.
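The article stops at step 1. As a hedged sketch of where this usually leads, the form data observed in the Network panel can be URL-encoded and POSTed with urllib. The endpoint and field names below are placeholders, not the real API of any translation service:

    import urllib.request
    import urllib.parse

    # Hypothetical endpoint and form fields -- substitute the ones found
    # in the browser's Network panel for the real service
    url = 'http://translate.example.com/api'
    form = {'text': '你好', 'from': 'zh', 'to': 'en'}

    data = urllib.parse.urlencode(form).encode('utf-8')  # a POST body must be bytes
    request = urllib.request.Request(url, data=data)     # passing data makes it a POST
    response = urllib.request.urlopen(request)
    print(response.read().decode('utf-8'))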

Customizing the HTTP Header

The HTTP headers are the information a browser carries along when it makes an HTTP request. Why customize them? Here is an example: countless crawlers are active on the web every day, and if sites imposed no restrictions their traffic would be enormous, so most sites apply at least basic anti-crawler checks. The most common check is the User-Agent (the browser's identification string), because a browser and Python code identify themselves differently when visiting a site.

My Chrome browser identifies itself as: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36

Python code identifies itself as something like: Python-urllib/3.6 (here Python 3.6.3).

Since the headers can be customized, parameters such as User-Agent can be modified.

Modifying the User-Agent

To modify the request's User-Agent, create a customized Request object and pass that object to urlopen() for the access:

    import urllib.request

    url = ('http://img.lenovomm.com/s3/img/app/app-img-lestore/'
           '2370-2015-07-16035439-1437033279327.jpg'
           '?iscompress=true&width=320&height=480&quantity=1&rotate=true')
    head = {}
    head['User-Agent'] = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/63.0.3239.84 Safari/537.36')
    # Create a Request object and set its headers
    request = urllib.request.Request(url, headers=head)
    response = urllib.request.urlopen(request)
    data = response.read()
    with open('img.jpg', 'wb') as f:
        f.write(data)
1  "mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Acoobrowser;. NET CLR 1.1.4322;. NET CLR 2.0.50727)",2     "mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1;. NET CLR 2.0.50727; Media Center PC 5.0;. NET CLR 3.0.04506)",3     "mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; Aolbuild 4337.35; Windows NT 5.1;. NET CLR 1.1.4322;. NET CLR 2.0.50727)",4     "mozilla/5.0 (Windows; U MSIE 9.0; Windows NT 9.0; En -US)",5     "mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; trident/5.0;. NET CLR 3.5.30729;. NET CLR 3.0.30729;. NET CLR 2.0.50727; Media Center PC 6.0)",6     "mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; trident/4.0; WOW64; trident/4.0; SLCC2;. NET CLR 2.0.50727;. NET CLR 3.5.30729;. NET CLR 3.0.30729;. NET CLR 1.0.3705;. NET CLR 1.1.4322)",7     "mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2;. NET CLR 1.1.4322;. NET CLR 2.0.50727; infopath.2;. NET CLR 3.0.04506.30)",8     "mozilla/5.0 (Windows; U Windows NT 5.1; ZH-CN) applewebkit/523.15 (khtml, like Gecko, safari/419.3) arora/0.3 (change:287 c9dfb30)",9     "mozilla/5.0 (X11; U Linux; En-US) applewebkit/527+ (khtml, like Gecko, safari/419.3) arora/0.6",Ten     "mozilla/5.0 (Windows; U Windows NT 5.1; En-us; Rv:1.8.1.2pre) gecko/20070215 k-ninja/2.1.1", One     "mozilla/5.0 (Windows; U Windows NT 5.1; ZH-CN; rv:1.9) gecko/20080705 firefox/3.0 kapiko/3.0", A     "mozilla/5.0 (X11; Linux i686; U;) gecko/20070322 kazehakase/0.4.5", -     "mozilla/5.0 (X11; U Linux i686; En-us; rv:1.9.0.8) Gecko fedora/1.9.0.8-1.fc10 kazehakase/0.5.6", -     "mozilla/5.0 (Windows NT 6.1; WOW64) applewebkit/535.11 (khtml, like Gecko) chrome/17.0.963.56 safari/535.11", the     "mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) applewebkit/535.20 (khtml, like Gecko) chrome/19.0.1036.7 safari/535.20", -     "opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U FR) presto/2.9.168 version/11.52",
More User-Agent strings can be found online.

Another way to add headers

In addition to defining the headers in a dictionary in your code, you can use the add_header() method of the Request object to add them:

    import urllib.request

    url = ('http://img.lenovomm.com/s3/img/app/app-img-lestore/'
           '2370-2015-07-16035439-1437033279327.jpg'
           '?iscompress=true&width=320&height=480&quantity=1&rotate=true')
    # Create a Request object, then set the headers with add_header()
    request = urllib.request.Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) '
                                     'AppleWebKit/537.36 (KHTML, like Gecko) '
                                     'Chrome/63.0.3239.84 Safari/537.36')
    response = urllib.request.urlopen(request)
    data = response.read()
    with open('img.jpg', 'wb') as f:
        f.write(data)
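The urllib.error module listed earlier never appears in these examples; as an added sketch, this is how failed requests are typically caught:

    import urllib.request
    import urllib.error

    try:
        response = urllib.request.urlopen('http://www.cnblogs.com/no-such-page')
        print(response.getcode())
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status (404, 403, ...)
        print('HTTP error:', e.code, e.reason)
    except urllib.error.URLError as e:
        # The server could not be reached at all (DNS failure, refused connection, ...)
        print('URL error:', e.reason)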
