Python Web Crawler: Getting Started Notes

Source: Internet
Author: User
Tags: urlencode, python, web crawler

Reference: http://www.cnblogs.com/xin-xin/p/4297852.html

I. Introduction

A web crawler is like a spider: if you compare the Internet to a big web, then the crawler is the spider that walks across it. Whenever it encounters a resource, it grabs it.

II. The Process

When we browse the Web we see all kinds of pages. What actually happens is this: we enter a URL, DNS resolves it to the corresponding IP address, the corresponding server host is found, a request is sent to that server, and the server responds with HTML, JS, and other resources, which the browser then parses and displays.
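
To make that sequence concrete, here is a minimal sketch (not from the original notes) that performs the same steps by hand with Python 2's standard socket and httplib modules, using www.baidu.com purely as a placeholder target:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import socket
import httplib

# step 1: DNS resolves the host name to an IP address
ip = socket.gethostbyname('www.baidu.com')
print 'resolved to', ip

# steps 2-3: connect to the server host and send an HTTP GET request
conn = httplib.HTTPConnection('www.baidu.com', 80)
conn.request('GET', '/')

# step 4: the server sends back the status line, headers, and HTML body
response = conn.getresponse()
print response.status, response.reason
html = response.read()
conn.close()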

A crawler follows much the same process, except that we fetch the HTML ourselves and then use regular expressions to decide what to extract from it.
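
As a minimal sketch of that idea (the target URL and the href pattern are illustrative assumptions, not a robust link extractor):

#!/usr/bin/python
# -*- coding: utf-8 -*-

import re
import urllib2

# fetch the raw HTML of the page
html = urllib2.urlopen('http://www.baidu.com').read()

# a deliberately simple pattern: absolute URLs inside href="..." attributes
for link in re.findall(r'href="(http[^"]+)"', html):
    print link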

III. Using the urllib and urllib2 Libraries

1. Fetching the HTML of a page:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import urllib2

response = urllib2.urlopen('http://www.baidu.com')
html = response.read()
print html
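
As a quick aside (not in the original notes), the object returned by urlopen() carries more than the body; it also exposes the status code, the final URL, and the response headers:

# continuing from the response object above
print response.getcode()   # the HTTP status code, e.g. 200
print response.geturl()    # the final URL, after any redirects
print response.info()      # the response headers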

2. Constructing a Request object

For example, the above code can be rewritten like this:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import urllib2

request = urllib2.Request('http://www.baidu.com')
response = urllib2.urlopen(request)
html = response.read()
print html

3. Sending GET and POST data

POST:

Note: this only demonstrates the method; because the site also checks headers, cookies, and other authentication, this code by itself will not log in.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import urllib
import urllib2

values = {"username": "xxxxxx", "password": "xxxxxx"}
data = urllib.urlencode(values)
url = "http://www.xiyounet.org/checkout/"
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()

GET:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import urllib
import urllib2

values = {"username": "xxxxxx", "password": "xxxxxx"}
data = urllib.urlencode(values)
url = "http://www.xiyounet.org/checkout/"
geturl = url + "?" + data
request = urllib2.Request(geturl)
response = urllib2.urlopen(request)
print response.read()
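
In both variants it is urllib.urlencode that turns the dict into the query-string form the server expects; a quick interpreter check with the same placeholder values shows what it produces:

import urllib

values = {"username": "xxxxxx", "password": "xxxxxx"}
# prints something like: username=xxxxxx&password=xxxxxx
# (the pair order is not guaranteed, because Python 2 dicts are unordered)
print urllib.urlencode(values)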

4. Setting headers

Since most websites cannot be logged into as simply as above, to simulate a browser more fully we need to learn how to set request headers.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import urllib
import urllib2

url = "http://www.xiyounet.org/checkout/"
user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"
referer = "http://www.xiyounet.org/checkout/"
values = {"username": "xxxxx", "password": "xxxxx"}
headers = {'User-Agent': user_agent, 'Referer': referer}
data = urllib.urlencode(values)
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
print response.read()
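
As an aside, the headers do not have to be passed to the constructor; a Request object also has an add_header() method, so an equivalent sketch of the snippet above is:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import urllib
import urllib2

user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"
data = urllib.urlencode({"username": "xxxxx", "password": "xxxxx"})
request = urllib2.Request("http://www.xiyounet.org/checkout/", data)

# attach the headers one at a time instead of passing a dict
request.add_header('User-Agent', user_agent)
request.add_header('Referer', 'http://www.xiyounet.org/checkout/')

print urllib2.urlopen(request).read()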

5. Using cookies

⑴ A cookie is data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track the session. When crawling, if we run into a site that refuses to serve pages without a login, we can simulate the login, obtain its cookies, and then crawl with them.

Two important concepts in urllib2:

    • Openers: we all know the urlopen() function; it is really just urllib2's default opener. We can also build an opener of our own with exactly the behavior we want (see the sketch after this list).
    • Handlers: an opener is assembled from handlers, each of which takes care of one aspect of a request, such as plain HTTP, proxies, or cookies.
Reference: http://www.jb51.net/article/46495.htm
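
As a minimal sketch of how the two concepts fit together (the proxy address is a made-up assumption): build_opener() assembles an opener from whichever handlers you need, and install_opener() makes it the default used by urlopen():

#!/usr/bin/python
# -*- coding: utf-8 -*-

import urllib2

# a handler that prints the raw HTTP traffic, handy for debugging
debug_handler = urllib2.HTTPHandler(debuglevel=1)
# a handler that would route requests through a local proxy (hypothetical address)
proxy_handler = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8080'})

# build an opener from the debug handler only; proxy_handler could be added the same way
opener = urllib2.build_opener(debug_handler)
# from now on plain urllib2.urlopen() goes through this opener
urllib2.install_opener(opener)
print urllib2.urlopen('http://www.baidu.com').getcode()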

⑵ The cookielib module: its main job is to provide objects that store cookies, working together with urllib2 to access Internet resources. We can use the module's CookieJar class to capture cookies:

It is used together with urllib2 to simulate login. The main classes are CookieJar (cookies held in memory only), FileCookieJar (adds saving to and loading from a file), and the two concrete subclasses MozillaCookieJar and LWPCookieJar, which differ only in the file format they use.

#!/usr/bin/python
# coding: utf-8

import urllib2
import cookielib

# declare a CookieJar object instance to hold the cookies
cookie = cookielib.CookieJar()
# use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# build an opener from the handler
opener = urllib2.build_opener(handler)
# later requests go through the opener
response = opener.open("http://www.xiyounet.org/checkout/")
for item in cookie:
    print 'name = ' + item.name
    print 'value = ' + item.value

⑶ Saving cookies to a file


#!/usr/bin/python
# coding: utf-8

import urllib2
import cookielib

filename = 'cookie.txt'
# declare a MozillaCookieJar object instance to hold the cookies and later write them to the file
cookie = cookielib.MozillaCookieJar(filename)
# use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# build an opener from the handler
opener = urllib2.build_opener(handler)
# later requests go through the opener
response = opener.open("http://www.xiyounet.org/checkout/")
# the two parameters of the save method:
# ignore_discard: save the cookies even if they are marked to be discarded
# ignore_expires: save the cookies even if they have already expired, overwriting the file if it exists
cookie.save(ignore_discard=True, ignore_expires=True)
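
A side note not in the original: MozillaCookieJar writes the file in the Netscape/Mozilla cookie.txt format (the file begins with a "# Netscape HTTP Cookie File" header line), which is where the class gets its name; LWPCookieJar would save the same cookies in the libwww-perl format instead.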

⑷ Reading the cookies back from the file:


#!/usr/bin/python
# coding: utf-8

import cookielib
import urllib2

# create a MozillaCookieJar instance object
cookie = cookielib.MozillaCookieJar()
# read the cookie content from the file into the variable
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
# create the Request
req = urllib2.Request("http://www.xiyounet.org/checkout/")
# use urllib2's build_opener method to create an opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open(req)
print response.read()

⑸ In practice: logging into a registration system

Perhaps because of some permission check the server has set, this returns a 400 error. One thing worth checking: the Host header below includes the scheme, but an HTTP Host header should contain only the host name (www.xiyounet.org, not http://www.xiyounet.org), and some servers reject such malformed requests with 400 Bad Request.

#!/usr/bin/python
# coding: utf-8

import cookielib
import urllib2
import urllib

url = "http://www.xiyounet.org/checkout/index.php"
passdata = urllib.urlencode({'username': 'songxl', 'password': 'songxl123456'})
cookiedata = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0",
    "Referer": "http://www.xiyounet.org/checkout/",
    "Host": "http://www.xiyounet.org"  # suspect: a Host header should be just "www.xiyounet.org"
}
# the file that holds the cookies: cookie.txt in the same directory
filename = 'cookie.txt'
# declare a MozillaCookieJar object instance to hold the cookies, then write the file
cookie = cookielib.MozillaCookieJar(filename)
# use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# build an opener from the handler
opener = urllib2.build_opener(handler)
req = urllib2.Request(url.encode('utf-8'), passdata, cookiedata)
result = opener.open(req)
print result.read()
