What is a crawler?
A web crawler is also known as a spider: if the internet is likened to a spider's web, then the crawler is a spider moving across that web. A crawler locates pages by their web addresses, that is, by their URLs. As a simple example, the string we type into the browser's address bar is a URL, for example: https://www.baidu.com
URL stands for Uniform Resource Locator, and its general format is as follows (square brackets [] mark optional parts):
protocol://hostname[:port]/path/[;parameters][?query]#fragment
The URL consists mainly of three parts:

- protocol: the first part is the protocol; for example, Baidu uses the HTTPS protocol;
- hostname[:port]: the second part is the host name (the port number is an optional parameter; most websites use the default port 80). Baidu's hostname is www.baidu.com, which is the server's address;
- path: the third part is the specific address of the resource on the host, such as a directory and file name.
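As a quick check, the parts above can be pulled out of a URL with the standard urllib.parse module (the example URL here is made up for illustration):

```python
from urllib.parse import urlsplit

# split an example URL into protocol, host/port, path, query and fragment
parts = urlsplit('https://www.baidu.com:443/s/index.html?wd=python#top')
print(parts.scheme)    # the protocol, e.g. 'https'
print(parts.hostname)  # the host name, e.g. 'www.baidu.com'
print(parts.port)      # the optional port, e.g. 443
print(parts.path)      # the resource path, e.g. '/s/index.html'
```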
A crawler uses URLs to fetch information from web pages.
The urllib Package in Python 3
The modules provided by the urllib package let Python code access URLs.
From the official Python 3 documentation we can see that Python 2's two libraries, urllib and urllib2, were merged into the built-in urllib package in Python 3.
The urllib package provides four modules:
- urllib.request: for opening and reading URLs
- urllib.error: contains the exceptions raised by urllib.request
- urllib.parse: for parsing URLs and building the parameters they need
- urllib.robotparser: for parsing robots.txt files
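For instance, urllib.robotparser can decide whether a crawler is allowed to fetch a given path. A minimal sketch (the robots.txt lines and domain below are made up; in real use the file would be fetched with set_url() and read()):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# normally: rp.set_url('https://example.com/robots.txt'); rp.read()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])
print(rp.can_fetch('*', 'https://example.com/private/page'))  # False
print(rp.can_fetch('*', 'https://example.com/index.html'))    # True
```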
Changes relative to Python 2
- In Python 2.x: import urllib2 ---> in Python 3.x: import urllib.request, urllib.error
- In Python 2.x: import urllib ---> in Python 3.x: import urllib.request, urllib.parse, urllib.error
- In Python 2.x: import urlparse ---> in Python 3.x: import urllib.parse
- In Python 2.x: urllib.urlopen ---> in Python 3.x: urllib.request.urlopen
- In Python 2.x: urllib.urlencode ---> in Python 3.x: urllib.parse.urlencode
- In Python 2.x: urllib.quote ---> in Python 3.x: urllib.parse.quote
- In Python 2.x: cookielib.CookieJar ---> in Python 3.x: http.cookiejar.CookieJar
- In Python 2.x: urllib2.Request ---> in Python 3.x: urllib.request.Request
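As a small illustration of the relocated helpers, urlencode and quote now live in urllib.parse:

```python
from urllib.parse import quote, urlencode

# urlencode builds a query string from a dict (spaces become '+')
query = urlencode({'wd': 'python 3', 'pn': 10})
print(query)  # wd=python+3&pn=10

# quote percent-encodes a single URL component ('/' is left alone by default)
print(quote('a b/c'))  # a%20b/c
```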
Basic use
urllib.request.urlopen(): opens a URL and returns an object containing information about the web page
response.read(): gets the body of the returned object
response.getcode(): gets the HTTP status code of the response
response.info(): gets the response metadata, such as the HTTP headers
response.geturl(): gets the URL that was actually accessed
```python
# Use Python to access the cnblogs site and print the page content
import urllib.request

response = urllib.request.urlopen('http://www.cnblogs.com/dachenzi')
data = response.read().decode('utf-8')
print(data)
```
Download a picture with Python
```python
import urllib.request

url = ('http://img.lenovomm.com/s3/img/app/app-img-lestore/'
       '2370-2015-07-16035439-1437033279327.jpg'
       '?iscompress=true&width=320&height=480&quantity=1&rotate=true')
response = urllib.request.urlopen(url)
data = response.read()
with open('img.jpg', 'wb') as f:  # the image is binary data, so open the file in 'b' mode to write it
    f.write(data)
```
Note: here urlopen can accept either a str or a Request object.
A Small Exercise
Use Python to write a small translation program: enter Chinese text and get back the English translation, and vice versa.
# 1. Visit the translation site in a browser, inspect the Network panel, and determine the address to which the form data is submitted
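A sketch of the submission step with urllib (the endpoint URL and form field names below are placeholders, not a real API; find the real ones in the browser's Network panel for the translation site you use):

```python
import urllib.parse
import urllib.request

# placeholder endpoint and form fields -- replace with the values
# observed in the browser's Network panel
url = 'http://example.com/translate'
form = {'doctype': 'json', 'i': '你好'}
data = urllib.parse.urlencode(form).encode('utf-8')

request = urllib.request.Request(url, data=data)  # supplying data makes this a POST
# response = urllib.request.urlopen(request)
# print(response.read().decode('utf-8'))
```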
Customizing the HTTP Header
The HTTP headers are the header information the browser carries when it makes an HTTP request. So what is a custom HTTP request header for? To give an example: a great many crawlers are active on the web every day, and if a site imposed no restrictions its traffic would be enormous, so most sites apply at least basic anti-crawler restrictions. Checking the User-Agent (the browser's identification string) is the most common method, because a browser and plain Python code identify themselves differently when visiting a site.
My Chrome browser identifies itself as: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36
The Python code identifies itself simply by its Python version (3.6.3 here).
Since headers can be customized, parameters such as User-Agent can be modified.
Modifying the User-Agent
To modify the request's User-Agent, create a custom Request object and then pass that object to urlopen.
```python
import urllib.request

url = ('http://img.lenovomm.com/s3/img/app/app-img-lestore/'
       '2370-2015-07-16035439-1437033279327.jpg'
       '?iscompress=true&width=320&height=480&quantity=1&rotate=true')
head = {}
head['User-Agent'] = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/63.0.3239.84 Safari/537.36')
request = urllib.request.Request(url, headers=head)  # create a Request object and set its headers
response = urllib.request.urlopen(request)
data = response.read()
with open('img.jpg', 'wb') as f:  # the image is binary data, so open the file in 'b' mode to write it
    f.write(data)
```
Some common User-Agent strings:

```
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
```
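One common trick with a pool like the one above is to pick a User-Agent at random for each request, so a site cannot block the crawler by matching a single fixed string. A minimal sketch (the pool here is abbreviated):

```python
import random
import urllib.request

# an abbreviated pool of User-Agent strings
user_agents = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20',
    'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52',
]

request = urllib.request.Request('http://www.cnblogs.com/dachenzi')
request.add_header('User-Agent', random.choice(user_agents))  # pick one at random
# response = urllib.request.urlopen(request)
```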
Another way to add headers
In addition to defining headers in your code through a dictionary, you can also use the add_header() method of the Request object to add them.
```python
import urllib.request

url = ('http://img.lenovomm.com/s3/img/app/app-img-lestore/'
       '2370-2015-07-16035439-1437033279327.jpg'
       '?iscompress=true&width=320&height=480&quantity=1&rotate=true')
request = urllib.request.Request(url)  # create a Request object
# set the header with add_header() instead of a headers dict
request.add_header('User-Agent',
                   'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/63.0.3239.84 Safari/537.36')
response = urllib.request.urlopen(request)
data = response.read()
with open('img.jpg', 'wb') as f:
    f.write(data)
```