2017.07.24 Python web crawler urllib2 Modify Header

Source: Internet
Author: User
Tags: python, web crawler

1. Modifying headers with urllib2:

(1) Some sites do not like being accessed by programs (non-human visitors) and check the "identity card" of whoever connects. By default, urllib2 presents its own version string, Python-urllib/x.y, as its "ID number" for this check. This ID may confuse the site, or the request may simply not work.

(2) A Python program can pose as a browser when visiting a site. A website identifies the browser by the User-Agent value that the browser sends. To spoof it, create a Request object with urllib2, give it a dictionary containing the header data, and set the User-Agent in that dictionary. In general, setting the User-Agent to that of Internet Explorer is the safest choice.

Note:

The user agent, abbreviated UA, is a special header string that lets the server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plugins, and so on.

Some websites inspect the UA in order to send different pages to different operating systems and browsers. This can cause some pages to display incorrectly in a particular browser, but disguising the UA bypasses the detection.

  The standard format of a browser UA string is:

   browser identification (operating system identification; encryption level identification; browser language) rendering engine identification, version information
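As a rough illustration of the format above, the sketch below splits a UA string into the leading browser token, the parenthesized system details, and the trailing engine and version text. The helper name `split_user_agent` and the regular expression are assumptions made for this example; real UA strings vary widely, so this is not a general-purpose parser.

```python
import re

# Hypothetical pattern: "<browser token> (<details separated by ';'>) <rest>"
UA_PATTERN = re.compile(r'^(?P<browser>\S+)\s*\((?P<system>[^)]*)\)\s*(?P<rest>.*)$')

def split_user_agent(ua):
    """Return (browser, [system details], rest), or None if the shape does not match."""
    match = UA_PATTERN.match(ua.strip())
    if match is None:
        return None
    system = [part.strip() for part in match.group('system').split(';') if part.strip()]
    return match.group('browser'), system, match.group('rest').strip()

ua = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) "
      "AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50")
browser, system, rest = split_user_agent(ua)
print(browser)  # browser identification
print(system)   # operating system, encryption level, browser language
print(rest)     # rendering engine and version information
```

Strings that do not follow this layout, such as the short UC browser identifiers, make the helper return None instead of guessing.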

  

(3) Put all the common User-Agent strings in a useragents.py file, stored in dictionaries, so that they can later be imported as a module:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Each value has the form "User-Agent:<UA string>"; callers split it on the first colon.

pcUserAgent = {
    "Safari 5.1–Mac": "User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Safari 5.1–Windows": "User-Agent:Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "IE 9.0": "User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "IE 8.0": "User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "IE 7.0": "User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "IE 6.0": "User-Agent:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Firefox 4.0.1–Mac": "User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Firefox 4.0.1–Windows": "User-Agent:Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera 11.11–Mac": "User-Agent:Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera 11.11–Windows": "User-Agent:Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Chrome 17.0–Mac": "User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Maxthon": "User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Tencent TT": "User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "The World 2.x": "User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "The World 3.x": "User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "Sogou 1.x": "User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    # The key of this entry was lost in the source; "360SE" is taken from the UA string itself.
    "360SE": "User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Avant": "User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
    "Green Browser": "User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"
}

mobileUserAgent = {
    "iOS 4.33–iPhone": "User-Agent:Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "iOS 4.33–iPod Touch": "User-Agent:Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "iOS 4.33–iPad": "User-Agent:Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Android N1": "User-Agent:Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Android QQ": "User-Agent:MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Android Opera": "User-Agent:Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
    "Android Pad Moto Xoom": "User-Agent:Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "BlackBerry": "User-Agent:Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
    "WebOS HP Touchpad": "User-Agent:Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
    "Nokia N97": "User-Agent:Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
    "Windows Phone Mango": "User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
    "UC": "User-Agent:UCWEB7.0.2.37/28/999",
    "UC standard": "User-Agent:NOKIA5700/UCWEB7.0.2.37/28/999",
    "UCOpenwave": "User-Agent:Openwave/UCWEB7.0.2.37/28/999",
    "UC Opera": "User-Agent:Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999"
}
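Each dictionary value stores the header name and the UA string together, separated by a colon, so it has to be split before it can be used as a header. Here is a minimal sketch, assuming a two-entry excerpt of the dictionaries above. The `SAMPLE` dictionary and the `to_header` helper are hypothetical names, and the Firefox key is written with a plain hyphen to keep the example ASCII-only. It splits on the first colon only, because some UA strings contain colons of their own, such as `rv:2.0.1`.

```python
# Excerpt of the dictionaries above, inlined so the sketch is self-contained.
SAMPLE = {
    "IE 9.0": "User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Firefox 4.0.1-Windows": "User-Agent:Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
}

def to_header(entry):
    """Split 'Name:value' on the first colon into a (name, value) pair."""
    name, value = entry.split(':', 1)
    return name.strip(), value.strip()

name, value = to_header(SAMPLE["Firefox 4.0.1-Windows"])
print(name)
print(value)

# A plain split(':') cuts the value at the colon inside 'rv:2.0.1'.
naive = SAMPLE["Firefox 4.0.1-Windows"].split(':')[1]
print(naive)
```

For entries without an inner colon, such as "IE 9.0", both approaches give the same result, which is why the problem is easy to miss.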

(4) Write testurllib2modifyheader.py to test modifying the header with urllib2:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib2
# useragents is a custom module located in the current directory
import useragents


class Urllib2ModifyHeader(object):
    def __init__(self):
        # User-Agent of IE 9.0 on a PC
        piua = useragents.pcUserAgent.get('IE 9.0')
        # User-Agent of the UC browser on a mobile device
        muua = useragents.mobileUserAgent.get('UC standard')
        # The test site is Youdao Translate
        self.url = 'http://fanyi.youdao.com'

        self.useUserAgent(piua, 1)
        self.useUserAgent(muua, 2)

    def useUserAgent(self, userAgent, name):
        request = urllib2.Request(self.url)
        # Split "User-Agent:<value>" on the first colon only; the value may contain colons
        headerName, headerValue = userAgent.split(':', 1)
        request.add_header(headerName, headerValue.strip())
        response = urllib2.urlopen(request)
        fileName = str(name) + '.html'
        # Append the UA string used, followed by the page that the server returned
        with open(fileName, 'a') as fp:
            fp.write("%s\n\n" % userAgent)
            fp.write(response.read())


if __name__ == '__main__':
    umh = Urllib2ModifyHeader()

Running the script produces 1.html and 2.html.

Code Explanation:

(1) urllib2.Request: urllib2.urlopen accepts either a URL or a Request object. Passing a Request object makes it possible to set headers for the URL.

  class urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])

The Request class is an abstraction of a URL request.

The first three of its five parameters are described below:

url: a string containing a valid URL.

data: a string specifying additional data to send to the server, or None if no data needs to be sent. Currently, HTTP requests are the only ones that use data. When the request contains the data parameter, the HTTP request is a POST rather than a GET.

The data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or a sequence of 2-tuples and returns a string in this format. In plain terms, this is how you send data to a URL, which usually points to a CGI script or some other web application.

For example, when a form is filled in online, the browser POSTs the contents of the form. The contents need to be encoded in this standard format and then passed to the Request object as the data parameter.
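A minimal sketch of the form-posting case described above. The URL and form fields are placeholders, and no request is actually sent. The compatibility imports are an assumption added so that the same sketch runs under Python 2 (urllib2) and Python 3, where urllib2 became urllib.request and urlencode moved to urllib.parse.

```python
try:
    import urllib2
    from urllib import urlencode
except ImportError:  # Python 3
    import urllib.request as urllib2
    from urllib.parse import urlencode

url = 'http://example.com/cgi-bin/form'      # placeholder URL
form = {'name': 'alice', 'lang': 'python'}   # placeholder form fields

# Encode the form as application/x-www-form-urlencoded (bytes under Python 3).
data = urlencode(form).encode('ascii')

without_data = urllib2.Request(url)
with_data = urllib2.Request(url, data)

print(without_data.get_method())  # method used when no data is given
print(with_data.get_method())     # method used when data is given

# urllib2.urlopen(with_data) would send the request; it is omitted here.
```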

headers: a dictionary. The header dictionary can be passed directly to the Request as a parameter, or each key and value can be added individually by calling the add_header() method.
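The two ways of setting headers can be sketched as follows. This is an illustration only: nothing is sent over the network, and the compatibility import is an assumption so that the sketch also runs under Python 3, where urllib2 became urllib.request.

```python
try:
    import urllib2
except ImportError:  # Python 3
    import urllib.request as urllib2

ua = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
url = 'http://fanyi.youdao.com'

# Option 1: pass a headers dictionary to the constructor.
req1 = urllib2.Request(url, headers={'User-Agent': ua})

# Option 2: add the header afterwards with add_header().
req2 = urllib2.Request(url)
req2.add_header('User-Agent', ua)

# Request stores header names capitalized, so the key becomes 'User-agent'.
print(req1.get_header('User-agent'))
print(req1.get_header('User-agent') == req2.get_header('User-agent'))
```

Both requests end up with the same header, so the choice between the two forms is mostly a matter of convenience.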

The User-Agent header, which identifies the browser, is often the one that gets spoofed, because some HTTP services only accept requests that come from common browsers rather than scripts, or return different versions of a page to different browsers.

For example, the Mozilla Firefox browser identifies itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11". By default, urllib2 identifies itself as Python-urllib/x.y, where x and y are the major and minor version numbers of the Python release. In Python 2.6, for example, the default User-Agent string of urllib2 is "Python-urllib/2.6".
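One way to see the default identity without contacting any server is to inspect the opener that urllib2 builds. The sketch below relies on build_opener() and the opener's addheaders attribute, which the text above does not mention, and again uses a compatibility import so that it runs under Python 3 as well.

```python
try:
    import urllib2
except ImportError:  # Python 3
    import urllib.request as urllib2

# A freshly built opener carries the default User-agent header.
opener = urllib2.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers['User-agent'])

# Replacing addheaders changes the identity for every request made through this opener.
firefox_ua = 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'
opener.addheaders = [('User-agent', firefox_ua)]
print(dict(opener.addheaders)['User-agent'])
```

Setting the header on the opener is an alternative to setting it on each Request; which one fits better depends on whether every request should share the same identity.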

For more on urllib2, see this blog post: http://blog.csdn.net/howeblue/article/details/47426265
