Some commonly used crawler techniques are summarized in the following points:
1. Basic web page crawling

GET method

import urllib2

url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()
POST method

import urllib
import urllib2

url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()
2. Using a proxy IP

When developing a crawler you will often find your IP blocked; in that case you need to use a proxy IP.
The urllib2 package provides the ProxyHandler class, which lets you set a proxy for accessing web pages, as in the following code snippet:
import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()
3. Cookie Processing
Cookies are data (usually encrypted) that certain websites store on the user's local machine in order to identify the user and perform session tracking. Python provides the cookielib module for handling cookies; its main role is to provide objects that can store cookies, so that it can be used together with the urllib2 module to access Internet resources.
Code snippet:
import urllib2, cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
The key is CookieJar(), which manages HTTP cookie values: it stores the cookies generated by HTTP requests and adds them to outgoing HTTP requests. The whole cookie store lives in memory, and the cookies are lost once the CookieJar instance is garbage collected; none of this needs to be handled manually.
Adding cookies manually
The code is as follows:
cookie = "PHPSESSID=91RURFQM2329BOPNOSFU4FVMU7; KMSIGN=55D2C12C9B1E3; Kmuid=b6ejc1xswpq9o756axnbag= "
Request.add_header ("Cookie", cookie)
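On its own, the snippet above assumes a request object already exists. A self-contained sketch (the URL and cookie string are placeholders) might look like this:

import urllib2

# Placeholder URL and cookie string; substitute the values captured from your own session
url = 'http://XXXX'
cookie = "PHPSESSID=91RURFQM2329BOPNOSFU4FVMU7; KMSIGN=55D2C12C9B1E3; KMUID=b6ejc1xswpq9o756axnbag="

request = urllib2.Request(url)
request.add_header("Cookie", cookie)  # attach the cookie manually instead of relying on cookielib
print urllib2.urlopen(request).read()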
4. Disguising as a browser
Some websites dislike being visited by crawlers and simply refuse their requests, so accessing them directly with urllib2 often results in HTTP Error 403: Forbidden.
Pay special attention to certain headers, since the server side checks them:
1). User-Agent: some servers or proxies check this value to decide whether the request was initiated by a browser.
2). Content-Type: when calling a REST interface, the server checks this value to decide how the content in the HTTP body should be parsed.
This can be done by modifying the headers in the HTTP request, as follows:
import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib2.Request(
    url='http://my.oschina.net/jhao104/blog?catalog=3463517',
    headers=headers
)
print urllib2.urlopen(request).read()
5. Page parsing
For page parsing, the most powerful tool is of course the regular expression; the patterns differ from site to site and from user to user, so there is not much to explain here, but a small sketch follows.
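As an illustration only (the target page and pattern below are assumptions, not something prescribed by this article), a regex-based extraction might look like this:

import re
import urllib2

# Illustrative only: pull all absolute link targets out of a page with a simple pattern
html = urllib2.urlopen('http://www.baidu.com').read()
links = re.findall(r'href="(http[^"]+)"', html)
for link in links:
    print link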
6. Verification code processing
For some simple verification codes, simple recognition is possible; I have only ever done some simple verification-code recognition myself. Some anti-human verification codes, such as 12306's, can be handled manually through a coding platform, which of course is a paid service. A rough sketch of the simple case follows.
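One common approach to the simple case (an assumption on my part, not something from this article) is to clean up the image and feed it to an OCR engine such as Tesseract through the pytesseract wrapper; the file name and threshold below are placeholders:

# A rough sketch of simple verification-code recognition, assuming Pillow and pytesseract are installed
import pytesseract
from PIL import Image

image = Image.open('captcha.png')                      # assumed local captcha image
image = image.convert('L')                             # convert to greyscale
image = image.point(lambda x: 0 if x < 140 else 255)   # crude binarization; the threshold is a guess
print pytesseract.image_to_string(image)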
7. Gzip compression
Have you ever run into web pages that remain garbled no matter how you transcode them? Haha, that means you do not know that many web services can send compressed data, which can cut the amount of data transmitted over the network by more than 60%. This is especially true for XML web services, since XML data can achieve a very high compression ratio.
However, a server generally will not send you compressed data unless you tell it that you can handle compressed data.
So you need to modify the code like this:
import urllib2, httplib

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)
This is the key: create a Request object and add an Accept-encoding header to tell the server that you can accept gzip-compressed data.
Then decompress the data:
import StringIO
import gzip

compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()
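A server is free to ignore the Accept-encoding header, so a slightly more defensive variant (my own addition, not part of the original snippet) checks the response's Content-Encoding before decompressing:

import StringIO
import gzip
import urllib2

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
f = urllib2.build_opener().open(request)

# Only decompress when the server actually returned gzip-encoded data
if f.info().get('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO.StringIO(f.read())).read()
else:
    data = f.read()
print data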
8. Multi-threaded concurrent crawling
A single thread is too slow, so multi-threading is needed. Here is a simple thread-pool template; the program merely prints 1-10, but you can see that it runs concurrently.
Although Python's multithreading is rather feeble, for a network-bound task like crawling it can still improve efficiency to some extent.
from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of tasks
q = Queue()
NUM = 2
JOBS = 10

# The actual handler function, responsible for processing a single task
def do_something_using(arguments):
    print arguments

# The worker thread, which keeps fetching tasks from the queue and processing them
def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads to wait on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# Put the JOBS into the queue
for i in range(JOBS):
    q.put(i)

# Wait for all JOBS to finish
q.join()
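To turn the template into an actual crawler, the handler only has to fetch a URL instead of printing a number. A sketch under the assumption that the target URLs (placeholders below) are known in advance:

import urllib2
from threading import Thread
from Queue import Queue

q = Queue()
NUM = 4  # number of worker threads (an arbitrary choice)
# Placeholder URL list; replace with the pages you actually want to crawl
urls = ['http://www.baidu.com/s?wd=%d' % i for i in range(10)]

# Fetch one page and report its size; errors are printed instead of raised
def fetch(url):
    try:
        print url, len(urllib2.urlopen(url, timeout=10).read())
    except Exception, e:
        print url, 'failed:', e

# Worker thread: keep pulling URLs off the queue and fetching them
def working():
    while True:
        url = q.get()
        fetch(url)
        q.task_done()

for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

for url in urls:
    q.put(url)
q.join()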
That is the whole content of this article; I hope it has been helpful for your study.