Some tips for using Python crawlers

Source: Internet
Author: User
Tags: epoch time

1. The most basic page fetching

import urllib2
content = urllib2.urlopen('http://XXXX').read()

-
2. Using a proxy server
This is useful in some cases, for example when your IP has been blocked or when the number of accesses per IP is limited.

import urllib2
proxy_support = urllib2.ProxyHandler({'http': 'http://XX.XX.XX.XX:XXXX'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

-
3. Situations that require logging in
Logging in is more troublesome, so I'll split the problem into parts:
-
3.1 Handling cookies

import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

And yes, if you want to use a proxy and cookies at the same time, just add proxy_support to the opener as well:

opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)

-
3.2 Handling forms
How do you fill in the form required for login? First, use a tool to capture the content you need to fill in.
For example, I usually use Firefox with the HttpFox plugin to see exactly what packets I sent.
Let me give an example, taking VERYCD as the case: find your own POST request and the POST form fields:


-
You can see that VERYCD needs username, password, continueURI, fk and login_submit. The fk value is generated randomly (actually not all that random; it looks like it is derived from the epoch time by a simple encoding) and has to be obtained from the web page. In other words, you must first visit the page and use a regular expression or similar tool to extract fk from the returned data. continueURI, as the name implies, can be anything; login_submit is a fixed value, which you can see in the page source. username and password are obvious.
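As a rough sketch of that extraction step (the login-page URL and the hidden-field markup below are assumptions for illustration, not taken from VERYCD's actual page), you could fetch the login page first and pull fk out with a regular expression:

import re
import urllib2

# fetch the login page first (placeholder URL)
login_page = urllib2.urlopen('http://www.verycd.com/signin/').read()

# assume fk sits in a hidden input field named "fk"; adjust the pattern
# to whatever the real page source actually looks like
match = re.search(r'name="fk"\s+value="([^"]*)"', login_page)
if match is None:
    raise ValueError('fk not found on the login page')
fk = match.group(1)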
-
Well, with the data to fill in, we can now generate postdata:

import urllib
postdata = urllib.urlencode({
    'username': 'XXXXX',
    'password': 'XXXXX',
    'continueURI': 'http://www.verycd.com/',
    'fk': fk,
    'login_submit': 'Login'
})

-
Then build the HTTP request and send it:

req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata
)
result = urllib2.urlopen(req).read()

-
3.3 Masquerading as a browser
Some websites dislike visits from crawlers and simply refuse their requests.
In that case we need to disguise ourselves as a browser, which can be done by modifying the headers of the HTTP request:
#...
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata,
    headers = headers
)
#...

-
3.4 Defeating "anti-hotlinking"
Some sites have so-called anti-hotlinking protection. It is actually very simple: the server checks whether the Referer header of your request points to its own site. So, just as in 3.3, we only need to set the Referer in the headers to that site. Take the well-known cnbeta as an example:

#...
headers = {
    'Referer': 'http://www.cnbeta.com/articles'
}
#...

headers is a dict, so you can put any header you want into it to do some disguising. For example, some clever sites like to snoop on people's privacy: when someone visits through a proxy, they read X-Forwarded-For from the headers to find the visitor's real IP. If you don't like that, just change X-Forwarded-For to anything you fancy.
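For instance, a minimal sketch (the IP address below is just a placeholder you can set to anything):

#...
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
    'Referer': 'http://www.cnbeta.com/articles',
    'X-Forwarded-For': '8.8.8.8'    # placeholder; put whatever IP you like here
}
#...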
-
3.5 The ultimate trick
Sometimes, even after doing 3.1-3.4, access is still refused. Then there is nothing for it but to faithfully copy every header you see in HttpFox into your request; that usually does the trick.
If even that fails, only the ultimate trick is left: use Selenium to drive the browser directly. Whatever the browser can do, it can do. Similar tools include PAMIE, Watir and so on.
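A minimal sketch of the Selenium route, assuming the selenium package and a local Firefox installation (the URL is a placeholder):

from selenium import webdriver

# drive a real browser, so the site sees exactly what a normal visitor sends
browser = webdriver.Firefox()
browser.get('http://www.verycd.com/')

# whatever the browser ends up rendering, JavaScript included
content = browser.page_source
browser.quit()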
-
4. Multi-threaded concurrent crawling
A single thread is too slow, so you need multiple threads. Here is a simple thread-pool template.
This program simply prints the numbers 0-9, but you can see that they are handled concurrently.

from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is how many tasks there are
q = Queue()
NUM = 2
JOBS = 10

# the actual processing function, responsible for handling a single task
def do_somthing_using(arguments):
    print arguments

# the worker: keeps fetching tasks from the queue and processing them
def working():
    while True:
        arguments = q.get()
        do_somthing_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads waiting on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# queue up the JOBS
for i in range(JOBS):
    q.put(i)

# wait for all JOBS to finish
q.join()
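To turn the template into an actual crawler, only do_somthing_using needs to be replaced; a minimal sketch (the URL pattern is just a placeholder) might look like this:

import urllib2

def do_somthing_using(arguments):
    # each task number maps to one page to fetch; the URL pattern is a placeholder
    url = 'http://XXXX/page/%d' % arguments
    content = urllib2.urlopen(url).read()
    print '%s fetched, %d bytes' % (url, len(content))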

5. Handling CAPTCHAs
What do you do when you hit a CAPTCHA? There are two cases to deal with:
-
1. Google-style CAPTCHAs: forget it, there is nothing you can do.
-
2. Simple CAPTCHAs: a limited character set, only simple translation or rotation plus some noise, without distortion. These can still be handled. The general idea is to rotate the characters back, remove the noise, segment the image into individual characters, then use a feature-extraction method (such as PCA) to reduce dimensionality and build a feature library, and finally compare each CAPTCHA character against that library (a rough sketch of the preprocessing steps follows after this list). The details are fairly involved and won't fit in one blog post, so I won't expand on them here; please consult the relevant textbooks.
-
3. In fact, some CAPTCHAs are still very weak; I won't name names, but I have used the method in point 2 to recognize them with very high accuracy, so point 2 really is feasible.
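A minimal sketch of the binarize-and-segment steps from point 2, assuming the PIL/Pillow library and a file named captcha.png (the threshold and the column-projection split are purely illustrative, not a robust recognizer):

from PIL import Image

# load the CAPTCHA and binarize it: anything darker than the threshold counts as ink
img = Image.open('captcha.png').convert('L')            # greyscale
binary = img.point(lambda p: 0 if p < 128 else 255)

# segment by column projection: a column with no ink is a gap between characters
width, height = binary.size
pixels = binary.load()
ink = [any(pixels[x, y] == 0 for y in range(height)) for x in range(width)]

chars, start = [], None
for x, has_ink in enumerate(ink + [False]):
    if has_ink and start is None:
        start = x
    elif not has_ink and start is not None:
        chars.append(binary.crop((start, 0, x, height)))
        start = None

# each element of chars is now a single-character image, ready for feature
# extraction (e.g. PCA) and comparison against a feature library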
-
6. Summary
Basically, every situation I have run into so far has been solved successfully with the methods above. I'm not sure whether there are other cases I have missed, so this article ends here; if I encounter other situations later, I will add the corresponding methods.

