Reference: http://www.cnblogs.com/xin-xin/p/4297852.html
I. Introduction
A crawler is a web spider: if we compare the Internet to a large web, then the crawler is the spider that moves across it. Whenever it encounters a resource, it fetches it.
II. The process
When we browse the web, we see all kinds of pages. What actually happens is this: we enter a URL, DNS resolves it to the IP of the corresponding server host, and the browser sends a request to that server. The server processes the request and sends HTML, JavaScript, and other content back to the browser for display.
A crawler works much the same way, except that we fetch the raw HTML ourselves and then use regular expressions to pick out the parts we want.
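As a small illustration of the "fetch HTML, then extract with a regular expression" step (Python 3, with a hard-coded snippet standing in for a downloaded page, so it runs offline):

```python
import re

# A hard-coded snippet standing in for HTML fetched from a server
html = '<a href="/news">News</a> <a href="/about">About</a>'

# Pull every href attribute value out of the anchor tags
links = re.findall(r'href="([^"]+)"', html)
print(links)  # ['/news', '/about']
```

In a real crawler, `html` would be the string returned by reading the HTTP response, and the pattern would be tailored to the target page.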
III. Using the urllib2 library
1. Fetching a page's HTML:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2

response = urllib2.urlopen('http://www.baidu.com')
html = response.read()
print html
2. Constructing a Request
For example, the above code can be rewritten like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2

request = urllib2.Request('http://www.baidu.com')
response = urllib2.urlopen(request)
html = response.read()
print html
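For readers on Python 3, where urllib2 was merged into urllib.request, the same Request object can be built and inspected without touching the network (the network call itself is shown as a comment):

```python
from urllib.request import Request

# Build the request object first; with network access it would then
# be passed to urllib.request.urlopen(request)
request = Request('http://www.baidu.com')
print(request.full_url)      # the URL stored on the request
print(request.get_method())  # GET, since no data is attached
```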
3. Sending GET and POST data
POST:
Note: this only demonstrates the method. The site also checks headers, cookies, a verification code, and so on, so this code does not actually log in.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib
import urllib2

values = {"username": "xxxxxx", "password": "xxxxxx"}
data = urllib.urlencode(values)
url = "http://www.xiyounet.org/checkout/"
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()
GET:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib
import urllib2

values = {"username": "xxxxxx", "password": "xxxxxx"}
data = urllib.urlencode(values)
url = "http://www.xiyounet.org/checkout/"
geturl = url + "?" + data
request = urllib2.Request(geturl)
response = urllib2.urlopen(request)
print response.read()
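The encoding step itself can be checked without any network access. In Python 3, `urllib.urlencode` moved to `urllib.parse.urlencode`; building the GET URL from the article's example looks like this (dict insertion order is preserved in Python 3.7+):

```python
from urllib.parse import urlencode

values = {"username": "xxxxxx", "password": "xxxxxx"}
data = urlencode(values)  # 'username=xxxxxx&password=xxxxxx'

# Append the encoded pairs to the URL for a GET request
geturl = "http://www.xiyounet.org/checkout/" + "?" + data
print(geturl)
```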
4. Set headers
Since most websites will not let you log in as plainly as above, we need to learn about headers in order to simulate a browser more fully.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = "http://www.xiyounet.org/checkout/"
user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"
referer = "http://www.xiyounet.org/checkout/"
values = {"username": "xxxxx", "password": "xxxxx"}
headers = {'User-Agent': user_agent, 'Referer': referer}
data = urllib.urlencode(values)
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
print response.read()
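A Python 3 sketch of the same idea, using the article's URL, that can be verified offline: headers are attached when the Request is built, and attaching a data payload is what turns the request into a POST.

```python
from urllib.request import Request

url = "http://www.xiyounet.org/checkout/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36",
    "Referer": "http://www.xiyounet.org/checkout/",
}
# Attaching data makes this a POST; urlopen(req) would then send it
req = Request(url, data=b"username=xxxxxx", headers=headers)
print(req.get_method())               # POST, because data is present
print(req.get_header("User-agent"))   # header keys are stored capitalized
```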
5. Using cookies
⑴ A cookie is data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track sessions. When crawling, if we hit a site that does not allow access without logging in, we can obtain the cookie, simulate the login, and then crawl.
Two important concepts in urllib2:
- Openers: we all know the urlopen() function; it is in fact urllib2's default opener. We can also create our own opener.
- Handlers: an opener uses handlers to deal with specific aspects of a request, such as cookies, redirects, and authentication.
Reference: http://www.jb51.net/article/46495.htm
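The opener/handler relationship can be seen without any network access; in Python 3 the same `build_opener` function exists in urllib.request and returns an OpenerDirector that chains the handlers together:

```python
import urllib.request

# build_opener chains handlers into an OpenerDirector; urlopen itself
# uses a default opener built this same way
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
print(type(opener).__name__)  # OpenerDirector
```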
⑵ The cookielib module: its main job is to provide objects for storing cookies, so that, used together with urllib2 to access Internet resources, we can capture cookies with the module's CookieJar class:
It is used together with urllib2 to simulate login. The main classes are: CookieJar, FileCookieJar, MozillaCookieJar, LWPCookieJar.
#!/usr/bin/python
# coding: utf-8
import urllib2
import cookielib

# Declare a CookieJar instance to hold the cookies
cookie = cookielib.CookieJar()
# Use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib2.build_opener(handler)
# The open method can also take a Request, as before
response = opener.open("http://www.xiyounet.org/checkout/")
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
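The Python 3 equivalent uses http.cookiejar instead of cookielib. The sketch below assembles the jar and opener exactly as above but leaves the network call as a comment, so it runs offline (the jar simply stays empty):

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

cookie_jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(cookie_jar))
# With network access, opener.open("http://www.xiyounet.org/checkout/")
# would fill the jar; each cookie then exposes .name and .value
print(len(cookie_jar))  # 0 until a response sets cookies
```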
⑶ Saving cookies to a file
#!/usr/bin/python
# coding: utf-8
import urllib2
import cookielib

filename = 'cookie.txt'
# Declare a MozillaCookieJar instance to hold the cookies and write them to the file
cookie = cookielib.MozillaCookieJar(filename)
# Use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib2.build_opener(handler)
response = opener.open("http://www.xiyounet.org/checkout/")
# The two parameters of the save method:
# ignore_discard: save cookies even if they are marked to be discarded
# ignore_expires: save cookies even if they have expired, overwriting the file if it exists
cookie.save(ignore_discard=True, ignore_expires=True)
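The save step itself can be tried without a server. In Python 3, MozillaCookieJar lives in http.cookiejar; saving an (empty) jar still writes a valid Netscape-format cookie.txt:

```python
import os
from http.cookiejar import MozillaCookieJar

jar = MozillaCookieJar('cookie.txt')
# ignore_discard also saves session cookies; ignore_expires also saves expired ones
jar.save(ignore_discard=True, ignore_expires=True)
print(os.path.exists('cookie.txt'))  # True
```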
⑷ Reading cookies from the file:
#!/usr/bin/python
# coding: utf-8
import cookielib
import urllib2

# Create a MozillaCookieJar instance
cookie = cookielib.MozillaCookieJar()
# Read the cookie contents from the file into the jar
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
# Create the request
req = urllib2.Request("http://www.xiyounet.org/checkout/")
# Use urllib2's build_opener method to create an opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open(req)
print response.read()
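A Python 3 round trip of the save/load pair, runnable offline (the file written in the first step gives the load something to read; the jar is empty, so its length stays 0):

```python
from http.cookiejar import MozillaCookieJar

# Write an (empty) cookie file first so the load has something to read
MozillaCookieJar('cookie.txt').save(ignore_discard=True, ignore_expires=True)

jar = MozillaCookieJar()
jar.load('cookie.txt', ignore_discard=True, ignore_expires=True)
print(len(jar))  # 0 -- nothing was stored, but the file parsed cleanly
```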
⑸ Practical example: logging in to a registration system
The server may have some permission checks set up; this request returns a 400.
#!/usr/bin/python
# coding: utf-8
import cookielib
import urllib
import urllib2

url = "http://www.xiyounet.org/checkout/index.php"
passdata = urllib.urlencode({'username': 'SONGXL', 'password': 'Songxl123456'})
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0",
    "Referer": "http://www.xiyounet.org/checkout/",
    "Host": "www.xiyounet.org"
}
# File that holds the cookie: cookie.txt in the same directory
filename = 'cookie.txt'
# Declare a MozillaCookieJar instance to hold the cookies and then write them to the file
cookie = cookielib.MozillaCookieJar(filename)
# Use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib2.build_opener(handler)
req = urllib2.Request(url, passdata, headers)
result = opener.open(req)
print result.read()