Since I started working with Python, it has become the programming language I use most often at work (http://www.maiziedu.com/course/python/). I have found that Python's most common scenarios are rapid Web development, crawlers, and automated operations: writing simple websites, automatic posting scripts, email send-and-receive scripts, and simple verification code recognition scripts.
Crawler development also involves a lot of reusable steps, so I summarize them here; they may save some effort in the future.
1. Basic web page crawling
GET method
import urllib2

url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()
POST method
import urllib
import urllib2

url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()
2. Using proxy IP
During crawler development you will often run into the situation where your IP gets blocked; when that happens you need a proxy IP.
The urllib2 package has a ProxyHandler class that lets you set a proxy for accessing web pages, as in the following code snippet:
import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()
3. Cookie processing
Cookies are data (usually encrypted) that some websites store on the user's local terminal in order to identify users and track sessions. Python provides the cookielib module to handle cookies; its main purpose is to provide objects that store cookies so that the urllib2 module can use them to access Internet resources.
Code snippet:
import urllib2, cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
The key is CookieJar(), which manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The entire cookie store is kept in memory, so the cookies are lost once the CookieJar instance is garbage collected; none of this requires manual handling.
Manually add Cookies
cookie = "PHPSESSID=91RURFQM2329BOPNOSFU4FVMU7; KMSIGN=55D2C12C9B1E3; Kmuid=b6ejc1xswpq9o756axnbag="
request.add_header("Cookie", cookie)
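For completeness, a minimal sketch of the manual approach (the URL and the cookie string below are just placeholders, not a real site):

import urllib2

# placeholder target URL and cookie string; substitute real values
url = 'http://XXXX'
cookie = "PHPSESSID=xxxx; KMSIGN=xxxx; KMUID=xxxx"

request = urllib2.Request(url)
request.add_header("Cookie", cookie)  # attach the cookie header by hand
response = urllib2.urlopen(request)
print response.read()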
4. Disguised as a browser
Some websites dislike crawler visits and refuse all requests from crawlers, so accessing such a site directly with urllib2 often results in HTTP Error 403: Forbidden.
Pay special attention to some of the headers, which the server side checks:
1. User-Agent: some servers or proxies check this value to decide whether the request was initiated by a browser.
2. Content-Type: when a REST interface is used, the server checks this value to decide how the content in the HTTP body should be parsed.
This can be handled by modifying the headers of the HTTP request, as follows:
import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib2.Request(
    url='http://my.oschina.net/jhao104/blog?catalog=3463517',
    headers=headers
)
print urllib2.urlopen(request).read()
5. Page parsing
For page parsing, regular expressions are of course the most powerful tool. They differ for different sites and different users, so there is no need to explain too much; here are two rather good URLs (a tiny example follows them):
Getting started with regular expressions: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
Regular Expression online testing: http://tool.oschina.net/regex/
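As a tiny illustration (the target page and the pattern here are just examples, and the pattern is deliberately naive), pulling links out of a page with the standard re module could look like this:

import re
import urllib2

html = urllib2.urlopen('http://www.baidu.com').read()
# naive pattern for absolute links in href attributes; adjust it per site
links = re.findall(r'href="(http[^"]+)"', html)
for link in links:
    print link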
Next come the parsing libraries. Two are commonly used, lxml and BeautifulSoup; here are two fairly good sites that introduce them:
lxml: http://my.oschina.net/jhao104/blog/639448
beautifulsoup: http://cuiqingcai.com/1319.html
My evaluation of the two: both are HTML/XML processing libraries. BeautifulSoup is implemented in pure Python and is less efficient, but its features are practical, for example it can fetch the source code of an HTML node via a search result; lxml is implemented in C, is efficient, and supports XPath.
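As a rough sketch of how the two feel in use (the HTML snippet and the selectors are made up, and it assumes the bs4 package for BeautifulSoup):

from bs4 import BeautifulSoup
from lxml import etree

html = "<html><body><div class='title'>Hello</div></body></html>"

# BeautifulSoup: pure Python, slower, but the API is convenient
soup = BeautifulSoup(html, "html.parser")
print soup.find("div", class_="title").text

# lxml: C implementation, fast, with XPath support
tree = etree.HTML(html)
print tree.xpath("//div[@class='title']/text()")[0]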
6. Verification code processing
For some simple verification codes, simple recognition is possible. I have only done a little simple verification code recognition myself. Some anti-human verification codes, such as those on 12306, can be handled through a manual captcha-solving platform, which of course costs money.
7. Gzip compression
Have you ever come across web pages that remain garbled no matter how you transcode them? Haha, that means you don't know that many web services can send compressed data, which can cut the amount of data transferred over the network by 60%. This is especially true for XML web services, because XML data can be compressed at a very high ratio.
However, a server generally will not send you compressed data unless you tell it that you can handle compressed data.
So you need to modify the code like this:
import urllib2, httplib

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)
This is the key: create a Request object and add an Accept-encoding header to tell the server that you can accept gzip-compressed data.
Then decompress the data:
import StringIO
import gzip

compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()
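Not every server honours the header, so a small defensive sketch (reusing f from the request above) might only decompress when the response really is gzip-encoded:

import StringIO, gzip

data = f.read()
# f.info() exposes the response headers; gunzip only if the server used gzip
if f.info().get('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
print data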
8. Multi-threaded concurrent crawling
A single thread is too slow, so you need multiple threads. Here is a simple thread-pool template. The program merely prints 1-10, but you can see that it does so concurrently.
Although Python's multithreading is rather feeble, for network-heavy work such as crawling it can still improve efficiency to some extent.
from threading import Thread
from Queue import Queue
from time import sleep

# Q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of tasks
Q = Queue()
NUM = 2
JOBS = 10

# the concrete processing function, responsible for handling a single task
def do_somthing_using(arguments):
    print arguments

# this is the worker, which keeps fetching data from the queue and processing it
def working():
    while True:
        arguments = Q.get()
        do_somthing_using(arguments)
        sleep(1)
        Q.task_done()

# fork NUM threads to wait on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# put JOBS into the queue
for i in range(JOBS):
    Q.put(i)

# wait for all JOBS to be done
Q.join()
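To turn the template into an actual crawler, the worker would fetch pages instead of printing numbers. A minimal sketch (the URL list here is hypothetical):

import urllib2
from threading import Thread
from Queue import Queue

# hypothetical list of pages to crawl
urls = ['http://www.baidu.com'] * 10
q = Queue()
NUM = 4

# worker: keep taking URLs off the queue and fetching them
def fetch():
    while True:
        url = q.get()
        try:
            print url, len(urllib2.urlopen(url, timeout=10).read())
        except Exception, e:
            print url, 'failed:', e
        q.task_done()

for i in range(NUM):
    t = Thread(target=fetch)
    t.setDaemon(True)
    t.start()

for url in urls:
    q.put(url)
q.join()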
These are some of the more practical Python crawler tips.