1. The most basic page fetch

```python
import urllib2
content = urllib2.urlopen('http://XXXX').read()
```

2. Using a proxy server

This is useful in some situations, for example when your IP has been blocked, or when the number of requests per IP is limited.

```python
import urllib2
proxy_support = urllib2.ProxyHandler({'http': 'http://XX.XX.XX.XX:XXXX'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
```

3. Sites that require login

Situations that require signing in are more troublesome, so I split the problem up:

3.1 Handling cookies

```python
import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
```

Yes, and if you want to use a proxy and cookies at the same time, just pass both handlers when building the opener:

```python
opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
```

3.2 Handling forms

Logging in requires filling out a form. How do you fill it out? First use a tool to capture the content you would normally send; I generally use Firefox with the HttpFox plug-in to see exactly which packets I sent. Taking VERYCD as an example: find the POST request and its form fields. You can see that VERYCD requires username, password, continueURI, fk and login_submit. Of these, fk is generated randomly (actually not very randomly; it looks like it is derived from the epoch time by some simple encoding) and has to be fetched from a web page, which means you must visit a page first and use a tool such as a regular expression to extract the fk field from the returned data. continueURI, as the name implies, can be anything; login_submit is fixed, as the page source shows. And username and password are self-evident.
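As a sketch of that fk-extraction step: assuming the login page embeds the token in a hidden input (the markup and the value below are hypothetical, the real page may differ), a regular expression can pull it out of the fetched HTML:

```python
import re

# hypothetical response HTML standing in for the real login page
html = '<input type="hidden" name="fk" value="1298864300"/>'

# capture the value attribute of the field named "fk"
m = re.search(r'name="fk"\s+value="([^"]+)"', html)
fk = m.group(1) if m else None
```

With a real page you would run the same search over the string returned by `urlopen(...).read()`.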
OK, now that we have the data to fill in, we generate the postdata:

```python
import urllib
postdata = urllib.urlencode({
    'username': 'XXXXX',
    'password': 'XXXXX',
    'continueURI': 'http://www.verycd.com/',
    'fk': fk,
    'login_submit': 'Login'
})
```

Then build the HTTP request and send it:

```python
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata
)
result = urllib2.urlopen(req).read()
```

3.3 Disguising as a browser

Some sites dislike crawler visits and reject all such requests. In that case we need to disguise ourselves as a browser, which can be done by modifying the headers in the HTTP packet:

```python
# ...
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata,
    headers = headers
)
# ...
```

3.4 Defeating "anti-hotlinking"

Some sites have so-called anti-hotlinking settings. These are actually very simple: the server checks whether the Referer in the headers of the request you sent points to its own site. So we only need to do as in 3.3 and set the Referer header to that site. Taking the notoriously shady cnbeta as an example:

```python
# ...
headers = {'Referer': 'http://www.cnbeta.com/articles'}
# ...
```

headers is a dict data structure, and you can put in whatever headers you want as camouflage. For example, some clever sites always like to peep at people's privacy: when someone visits through a proxy, the server reads X-Forwarded-For from the headers to see the visitor's real IP. Fine, then just change X-Forwarded-For directly; you can set it to anything amusing to tease the server, hehe.

3.5 The ultimate trick

Sometimes even if you do 3.1–3.4, access is still blocked. Then there is nothing for it but to honestly copy every header you see in HttpFox into your request; that usually does the job.
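For that copy-every-header step, a Python 3 sketch may be clearer (in Python 3 the urllib2 module was merged into urllib.request; the header values below are illustrative, not real captures):

```python
import urllib.request

# the full header set captured from the browser, copied into one dict
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
    'Referer': 'http://www.cnbeta.com/articles',
    'X-Forwarded-For': '1.2.3.4',
}
req = urllib.request.Request('http://www.example.com/', headers=headers)

# Request normalizes header names with str.capitalize(),
# so they are read back as e.g. 'User-agent'
ua = req.get_header('User-agent')
```

Building the Request does not touch the network, so you can inspect the headers before calling `urlopen(req)`.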
If even that fails, there is only one ultimate trick left: use Selenium to drive a real browser directly. Whatever the browser can do, it can do. There are similar tools such as PAMIE and Watir.

4. Multi-threaded concurrent crawling

A single thread is too slow, so you need multiple threads. Here is a simple thread-pool template. The program just prints ten numbers, but you can see that they are processed in parallel.

```python
from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of tasks
q = Queue()
NUM = 2
JOBS = 10

# the handler that processes a single task
def do_somthing_using(arguments):
    print arguments

# a worker: keeps pulling tasks from the queue and processing them
def working():
    while True:
        arguments = q.get()
        do_somthing_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads that wait on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# put the JOBS into the queue
for i in range(JOBS):
    q.put(i)

# wait for all JOBS to finish
q.join()
```
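In Python 3, the same fan-out over a fixed pool of workers can be written with the standard-library `concurrent.futures` instead of a hand-rolled queue-and-worker loop; a minimal sketch with a placeholder task standing in for a page fetch:

```python
from concurrent.futures import ThreadPoolExecutor

# placeholder task; in a real crawler this would fetch and parse one URL
def do_something_using(argument):
    return argument * 2

# 2 workers draining 10 jobs, mirroring NUM = 2 and JOBS = 10 above
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(do_something_using, range(10)))

# map() yields results in input order even though execution is concurrent
```

Leaving the `with` block waits for all jobs to finish, which replaces the explicit `q.join()`.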
Reprinted from: http://blog.csdn.net/sding/archive/2011/02/28/6214207.aspx
Basic patterns for writing a crawler in Python