Reference: http://www.cnblogs.com/xin-xin/p/4297852.html
I. Introduction
A crawler is a web spider: if we compare the Internet to a large web, then the crawler is the spider that moves across it. Whenever it encounters a resource, it fetches it.
II. The process
When we browse the web, we see all kinds of pages. What actually happens is this: we enter a URL, DNS resolves it to the IP of the corresponding server host, and the browser sends a request to that server. The server processes the request and sends HTML, JavaScript, and other content back to the browser for display.
A crawler works much the same way, except that we fetch the raw HTML ourselves and then use regular expressions to pick out the parts we want.
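As a small illustration of the "fetch HTML, then extract with a regular expression" step (Python 3, with a hard-coded snippet standing in for a downloaded page, so it runs offline):

```python
import re

# A hard-coded snippet standing in for HTML fetched from a server
html = '<a href="/news">News</a> <a href="/about">About</a>'

# Pull every href attribute value out of the anchor tags
links = re.findall(r'href="([^"]+)"', html)
print(links)  # ['/news', '/about']
```

In a real crawler, `html` would be the string returned by reading the HTTP response, and the pattern would be tailored to the target page.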
III. Using the urllib2 library
1. Fetching a page's HTML:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2

response = urllib2.urlopen('http://www.baidu.com')
html = response.read()
print html
2. Constructing a Request
For example, the above code can be rewritten like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2

request = urllib2.Request('http://www.baidu.com')
response = urllib2.urlopen(request)
html = response.read()
print html
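For readers on Python 3, where urllib2 was merged into urllib.request, the same Request object can be built and inspected without touching the network (the network call itself is shown as a comment):

```python
from urllib.request import Request

# Build the request object first; with network access it would then
# be passed to urllib.request.urlopen(request)
request = Request('http://www.baidu.com')
print(request.full_url)      # the URL stored on the request
print(request.get_method())  # GET, since no data is attached
```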
3. Sending GET and POST data
POST:
Note: this only demonstrates the method. The site also checks headers, cookies, a verification code, and so on, so this code does not actually log in.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib
import urllib2

values = {"username": "xxxxxx", "password": "xxxxxx"}
data = urllib.urlencode(values)
url = "http://www.xiyounet.org/checkout/"
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()
GET:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib
import urllib2

values = {"username": "xxxxxx", "password": "xxxxxx"}
data = urllib.urlencode(values)
url = "http://www.xiyounet.org/checkout/"
geturl = url + "?" + data
request = urllib2.Request(geturl)
response = urllib2.urlopen(request)
print response.read()
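The encoding step itself can be checked without any network access. In Python 3, `urllib.urlencode` moved to `urllib.parse.urlencode`; building the GET URL from the article's example looks like this (dict insertion order is preserved in Python 3.7+):

```python
from urllib.parse import urlencode

values = {"username": "xxxxxx", "password": "xxxxxx"}
data = urlencode(values)  # 'username=xxxxxx&password=xxxxxx'

# Append the encoded pairs to the URL for a GET request
geturl = "http://www.xiyounet.org/checkout/" + "?" + data
print(geturl)
```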
4. Set headers
Since most websites will not let you log in as plainly as above, we need to learn about headers in order to simulate a browser more fully.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = "http://www.xiyounet.org/checkout/"
user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"
referer = "http://www.xiyounet.org/checkout/"
values = {"username": "xxxxx", "password": "xxxxx"}
headers = {'User-Agent': user_agent, 'Referer': referer}
data = urllib.urlencode(values)
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
print response.read()
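A Python 3 sketch of the same idea, using the article's URL, that can be verified offline: headers are attached when the Request is built, and attaching a data payload is what turns the request into a POST.

```python
from urllib.request import Request

url = "http://www.xiyounet.org/checkout/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36",
    "Referer": "http://www.xiyounet.org/checkout/",
}
# Attaching data makes this a POST; urlopen(req) would then send it
req = Request(url, data=b"username=xxxxxx", headers=headers)
print(req.get_method())               # POST, because data is present
print(req.get_header("User-agent"))   # header keys are stored capitalized
```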
5. Using cookies
⑴ A cookie is data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track sessions. When crawling, if we hit a site that does not allow access without logging in, we can obtain the cookie, simulate the login, and then crawl.
Two important concepts in urllib2:
- Openers: we all know the urlopen() function; it is in fact urllib2's default opener. We can also create our own opener.
- Handlers: an opener uses handlers to deal with specific aspects of a request, such as cookies, redirects, and authentication.
Reference: http://www.jb51.net/article/46495.htm
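The opener/handler relationship can be seen without any network access; in Python 3 the same `build_opener` function exists in urllib.request and returns an OpenerDirector that chains the handlers together:

```python
import urllib.request

# build_opener chains handlers into an OpenerDirector; urlopen itself
# uses a default opener built this same way
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
print(type(opener).__name__)  # OpenerDirector
```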
⑵ The cookielib module: its main job is to provide objects for storing cookies, so that, used together with urllib2 to access Internet resources, we can capture cookies with the module's CookieJar class:
It is used together with urllib2 to simulate login. The main classes are: CookieJar, FileCookieJar, MozillaCookieJar, LWPCookieJar.
#!/usr/bin/python
# coding: utf-8
import urllib2
import cookielib

# Declare a CookieJar instance to hold the cookies
cookie = cookielib.CookieJar()
# Use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib2.build_opener(handler)
# The open method can also take a Request, as before
response = opener.open("http://www.xiyounet.org/checkout/")
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
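The Python 3 equivalent uses http.cookiejar instead of cookielib. The sketch below assembles the jar and opener exactly as above but leaves the network call as a comment, so it runs offline (the jar simply stays empty):

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

cookie_jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(cookie_jar))
# With network access, opener.open("http://www.xiyounet.org/checkout/")
# would fill the jar; each cookie then exposes .name and .value
print(len(cookie_jar))  # 0 until a response sets cookies
```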
⑶ Saving cookies to a file
#!/usr/bin/python
# coding: utf-8
import urllib2
import cookielib

filename = 'cookie.txt'
# Declare a MozillaCookieJar instance to hold the cookies and write them to the file
cookie = cookielib.MozillaCookieJar(filename)
# Use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib2.build_opener(handler)
response = opener.open("http://www.xiyounet.org/checkout/")
# The two parameters of the save method:
# ignore_discard: save cookies even if they are marked to be discarded
# ignore_expires: save cookies even if they have expired, overwriting the file if it exists
cookie.save(ignore_discard=True, ignore_expires=True)
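The save step itself can be tried without a server. In Python 3, MozillaCookieJar lives in http.cookiejar; saving an (empty) jar still writes a valid Netscape-format cookie.txt:

```python
import os
from http.cookiejar import MozillaCookieJar

jar = MozillaCookieJar('cookie.txt')
# ignore_discard also saves session cookies; ignore_expires also saves expired ones
jar.save(ignore_discard=True, ignore_expires=True)
print(os.path.exists('cookie.txt'))  # True
```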
⑷ Reading cookies from the file:
#!/usr/bin/python
# coding: utf-8
import cookielib
import urllib2

# Create a MozillaCookieJar instance
cookie = cookielib.MozillaCookieJar()
# Read the cookie contents from the file into the jar
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
# Create the request
req = urllib2.Request("http://www.xiyounet.org/checkout/")
# Use urllib2's build_opener method to create an opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open(req)
print response.read()
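A Python 3 round trip of the save/load pair, runnable offline (the file written in the first step gives the load something to read; the jar is empty, so its length stays 0):

```python
from http.cookiejar import MozillaCookieJar

# Write an (empty) cookie file first so the load has something to read
MozillaCookieJar('cookie.txt').save(ignore_discard=True, ignore_expires=True)

jar = MozillaCookieJar()
jar.load('cookie.txt', ignore_discard=True, ignore_expires=True)
print(len(jar))  # 0 -- nothing was stored, but the file parsed cleanly
```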
⑸ Practical example: logging in to a registration system
The server may have some permission checks set up; this request returns a 400.
#!/usr/bin/python
# coding: utf-8
import cookielib
import urllib
import urllib2

url = "http://www.xiyounet.org/checkout/index.php"
passdata = urllib.urlencode({'username': 'SONGXL', 'password': 'Songxl123456'})
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0",
    "Referer": "http://www.xiyounet.org/checkout/",
    "Host": "www.xiyounet.org"
}
# File that holds the cookie: cookie.txt in the same directory
filename = 'cookie.txt'
# Declare a MozillaCookieJar instance to hold the cookies and then write them to the file
cookie = cookielib.MozillaCookieJar(filename)
# Use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib2.build_opener(handler)
req = urllib2.Request(url, passdata, headers)
result = opener.open(req)
print result.read()