A brief introduction to some common Python crawler techniques
1. Basic web page fetching

GET method

import urllib2

url = "http://example.com"  # placeholder URL
response = urllib2.urlopen(url)
print response.read()
POST method

import urllib
import urllib2

url = "http://example.com"  # placeholder URL
form = {'name': 'abc', 'password': '000000'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()
2. Using a proxy IP

During crawler development your IP address often gets blocked, so a proxy IP is needed.

The urllib2 package contains the ProxyHandler class, with which you can set a proxy for accessing web pages, as in the following code snippet:
import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})  # placeholder proxy address
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://example.com')  # placeholder URL
print response.read()
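If you do not want every request in the process to go through the proxy, you can skip install_opener and call the opener directly instead of installing it as the global default. A minimal sketch along the same lines (the proxy address and URL are placeholders):

import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})  # placeholder proxy address
opener = urllib2.build_opener(proxy)
# use this opener directly; other urllib2.urlopen() calls stay unaffected
response = opener.open('http://example.com')
print response.read()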
3. Cookie handling

Cookies are data (usually encrypted) that websites store on a user's local machine in order to identify the user and track sessions. Python provides the cookielib module for handling cookies; its main job is to provide objects that can store cookies, so that it can be used together with the urllib2 module to access Internet resources.
Code snippet:
import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
The key is CookieJar(), which manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The whole cookie store lives in memory, so the cookies are lost once the CookieJar instance is garbage collected; none of this needs to be managed by hand.
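To see the session tracking in action, the same opener can be reused: cookies set by the first response are sent back automatically on later requests. A minimal sketch, in which the login URL and form fields are hypothetical:

import urllib, urllib2, cookielib

cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# first request: any Set-Cookie headers from the server are captured in cookie_jar
login_data = urllib.urlencode({'name': 'abc', 'password': '000000'})
opener.open('http://example.com/login', login_data)

# second request: the stored cookies are attached automatically
print opener.open('http://example.com/profile').read()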
4. Disguising as a browser

Some websites dislike crawler visits and reject all such requests, so directly accessing them with urllib2 often results in HTTP Error 403: Forbidden.

Pay special attention to certain headers, which the server checks:

1. User-Agent: some servers or proxies check this value to determine whether the request was initiated by a browser;
2. Content-Type: when calling a REST interface, the server checks this value to decide how to parse the content in the HTTP body (a sketch for this case follows the example below).

You can deal with this by modifying the headers of the HTTP request, as in the following snippet:
import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib2.Request(
    url='http://example.com',  # placeholder URL
    headers=headers
)
print urllib2.urlopen(request).read()
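For the Content-Type case mentioned in point 2, the header is set the same way when posting to a REST-style interface. A minimal sketch, assuming a hypothetical endpoint that accepts JSON:

import json
import urllib2

data = json.dumps({'name': 'abc'})  # request body as JSON
request = urllib2.Request(
    url='http://example.com/api',  # hypothetical REST endpoint
    data=data,
    headers={'Content-Type': 'application/json',
             'User-Agent': 'Mozilla/5.0'}
)
print urllib2.urlopen(request).read()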
5. Page parsing

For page parsing, the most powerful tool is of course the regular expression, which differs from site to site and from user to user, so it needs no detailed explanation here.

Next come the parsing libraries. Two are commonly used, lxml and BeautifulSoup; two good sites that introduce their usage:

lxml: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
BeautifulSoup: http://cuiqingcai.com/1319.html

My assessment of these two: both are HTML/XML processing libraries. BeautifulSoup is implemented in pure Python and is slower, but its features are practical; for example, you can get the source code of an HTML node by searching the parsed result. lxml is written in C, is highly efficient, and supports XPath.
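As a quick illustration of the two libraries, a minimal sketch that parses the same small HTML snippet both ways, assuming BeautifulSoup 4 (bs4) and lxml are installed:

from bs4 import BeautifulSoup
from lxml import html

page = "<html><body><div class='item'><a href='/1'>first</a></div></body></html>"

# BeautifulSoup: search by tag, then read attributes and text
soup = BeautifulSoup(page, 'lxml')
for link in soup.find_all('a'):
    print link.get('href'), link.get_text()

# lxml: run an XPath expression against the parsed tree
tree = html.fromstring(page)
print tree.xpath("//div[@class='item']/a/@href")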
6. CAPTCHA handling

Some simple CAPTCHAs can be recognized programmatically, and I have only done some simple CAPTCHA recognition myself. Some anti-human CAPTCHAs, however, such as those on 12306, can be handled through a human captcha-solving platform. Of course, that is a paid service.
7. gzip compression

Have you ever run into web pages that stay garbled no matter what encoding conversion you try? Ha, that means you did not know that many web services can compress the data they send, which can reduce the amount of data transmitted over the network by more than 60%. This is especially true for XML web services, because XML data compresses very well.

However, the server will not send you compressed data unless you tell it that you can handle compressed data.

You need to modify the code as follows:
import urllib2

request = urllib2.Request('http://example.com')  # placeholder URL
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)
This is the key part: create the Request object and add an Accept-encoding header to tell the server you can accept gzip-compressed data.
Then extract the data:
import StringIO
import gzip
compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()
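The server is free to ignore the Accept-encoding header, so it is safer to check the response's Content-Encoding before decompressing. A small sketch of that check (the URL is again a placeholder):

import StringIO
import gzip
import urllib2

request = urllib2.Request('http://example.com')
request.add_header('Accept-encoding', 'gzip')
f = urllib2.urlopen(request)

data = f.read()
# only run the data through gzip if the server actually compressed it
if f.info().get('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
print data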
8. Multi-threaded concurrent fetching

If a single thread is too slow, you need multiple threads. Here is a simple thread-pool template; the program does nothing more than print the numbers 0-9, but you can see that they are handled concurrently.

Although Python's multithreading is of limited value (the GIL prevents true CPU parallelism), it can still improve efficiency to a certain degree for crawlers, which spend most of their time waiting on the network.
from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent worker threads
# JOBS is the number of tasks
q = Queue()
NUM = 2
JOBS = 10

# the handler function: processes a single task
def do_somthing_using(arguments):
    print arguments

# the worker: keeps pulling tasks from the queue and processing them
def working():
    while True:
        arguments = q.get()
        do_somthing_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads waiting on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# queue up the JOBS
for i in range(JOBS):
    q.put(i)

# wait for all JOBS to finish
q.join()
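The same template carries over directly to fetching: replace the placeholder task with a download and put URLs on the queue instead of numbers. A minimal sketch with hypothetical URLs:

import urllib2
from threading import Thread
from Queue import Queue

q = Queue()
NUM = 4  # number of worker threads

# worker: take a URL off the queue, download it, report its size
def fetch():
    while True:
        url = q.get()
        try:
            print url, len(urllib2.urlopen(url).read())
        except urllib2.URLError as e:
            print url, e
        q.task_done()

for i in range(NUM):
    t = Thread(target=fetch)
    t.setDaemon(True)
    t.start()

for url in ['http://example.com/page1', 'http://example.com/page2']:
    q.put(url)

q.join()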