This article describes the basic syntax for writing Python web crawlers.

What is a crawler?

A crawler, that is, a web crawler, can be thought of as a spider crawling across the Internet. The Internet is like a large web, and the crawler is a spider making its way along it; whenever it encounters a resource, it grabs it. What it grabs is entirely up to you.

For example, when it crawls a webpage and finds a path in it, which is really just a hyperlink, it can follow that link to another page and fetch its data. In this way the whole connected web is within the spider's reach, and crawling through it is only a matter of time.
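As a minimal sketch of that idea (the URL is a placeholder and the regular expression is only a crude illustration, not a real HTML parser):

import re
import urllib2

# fetch a page, pull out the absolute hyperlinks it contains, and follow a few of them
page = urllib2.urlopen('http://XXXX').read()
links = re.findall(r'href="(http[^"]+)"', page)   # crude link extraction
for link in links[:5]:                            # follow the first few links found
    print link
    content = urllib2.urlopen(link).read()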

1. The most basic page fetch

import urllib2
content = urllib2.urlopen('http://XXXX').read()

2. Use a Proxy Server

This is useful in some situations, for example when your IP address has been blocked, or when the number of requests allowed per IP is limited.

import urllib2
proxy_support = urllib2.ProxyHandler({'http': 'http://XX.XX.XX.XX:XXXX'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

3. When login is required

This case is more troublesome, so let's break the problem down:

3.1 Cookie Processing

import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

If you want to use both a proxy and cookies, add proxy_support and change the opener to:

opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
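Put together, a minimal sketch (proxy address and URL are placeholders) looks like this:

import urllib2, cookielib

proxy_support = urllib2.ProxyHandler({'http': 'http://XX.XX.XX.XX:XXXX'})
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)   # every urlopen from now on goes through the proxy and keeps cookies
content = urllib2.urlopen('http://XXXX').read()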

3.2 Form Processing

Need to fill in a form to log in? First, use a tool to intercept the form contents that have to be submitted.
For example, I usually use Firefox with the HttpFox plug-in to see which packets are actually sent.
Here is an example, using verycd: first find the POST request and its form fields:

You can see that for verycd you need to submit the username, password, continueURI, fk, and login_submit fields. Among them, fk is randomly generated (actually not truly random; it looks as if it is produced from the epoch time by some simple code), and it has to be obtained from the webpage itself. That is, you must first request the page and use a regular expression or a similar tool to extract the fk field from the returned data. continueURI, as the name suggests, can be written as anything, while login_submit is a fixed value, as the page source shows. Then there are username and password, which speak for themselves.
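As a rough sketch of that first step (the regular expression below assumes fk appears as a hidden form field named "fk"; adjust it to whatever the real page source shows):

import re
import urllib2

# request the login page first, then extract the fk value from the returned HTML
login_page = urllib2.urlopen('http://secure.verycd.com/signin/*/http://www.verycd.com/').read()
match = re.search(r'name="fk"\s+value="([^"]+)"', login_page)   # assumed form markup
fk = match.group(1) if match else ''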

Okay. Now that we have the data to fill in, we need to generate the POST data:

import urllib
postdata = urllib.urlencode({
    'username': 'XXXXX',
    'password': 'XXXXX',
    'continueURI': 'http://www.verycd.com/',
    'fk': fk,                   # the fk value extracted from the page beforehand
    'login_submit': 'XXXXX'     # the fixed value taken from the page source
})

Then build the HTTP request and send it:

req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata
)
result = urllib2.urlopen(req).read()

3.3 Disguising as a Browser

Some websites dislike being visited by crawlers and simply reject their requests.

In this case we need to pretend to be a browser, which can be done by modifying the headers of the HTTP request:

#...
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata,
    headers = headers
)
#...

3.4 Anti-Leeching

Some sites have so-called anti-leeching (hotlink protection) settings. In fact it is very simple: the server checks whether the Referer in your request header points to its own site. So, just as in 3.3, we only need to change the Referer in headers to that website. Take cnbeta, famous for its "black screen", as an example:

#...
headers = {'Referer': 'http://www.cnbeta.com/articles'}
#...

headers is a dict, so you can put in any header you like as a disguise. For example, some clever websites always like to pry into people's privacy: if someone visits them through a proxy, they read X-Forwarded-For from the header to find the visitor's real IP address. In that case we can simply change X-Forwarded-For to whatever amusing value we like.
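For example (the forwarded address below is just a made-up value):

#...
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
    'Referer': 'http://www.cnbeta.com/articles',
    'X-Forwarded-For': '8.8.8.8',   # whatever address you want the site to "see"
}
#...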

3.5 The Ultimate Trick

Sometimes, even after doing everything in 3.1-3.4, access is still refused. In that case there is nothing clever left to do: honestly copy every header you see in HttpFox into your request, and that usually does the trick.
If even that fails, you can only resort to the ultimate trick: use selenium to drive a real browser to visit the site. Whatever a browser can do, this can do. Similar tools include pamie, watir, and so on.
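A minimal selenium sketch (assuming Firefox and its matching driver are installed; the URL is a placeholder):

from selenium import webdriver

driver = webdriver.Firefox()            # a real browser, so cookies, headers and JavaScript all behave normally
driver.get('http://www.example.com')    # placeholder URL
html = driver.page_source               # the fully rendered page
driver.quit()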

4. Multi-threaded concurrent fetching

1. Using Thread and Queue:

from threading import Thread
from Queue import Queue
import urllib2
import time

q = Queue()
NUM = 10     # number of worker threads
JOBS = 50    # number of pages to fetch

def do_somthing_using(p):
    response = urllib2.urlopen('http://www.cnblogs.com')
    result = response.read()
    #print p

def working():
    while True:
        arguments = q.get()
        do_somthing_using(arguments)
        q.task_done()

for i in xrange(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

start = time.time()
for i in xrange(JOBS):
    q.put(i)
q.join()
print "MultiThreading:"
print time.time() - start

2. Using a thread pool and map:

from multiprocessing.dummy import Pool as ThreadPool
import urllib2
import time

start = time.time()
url = "http://www.cnblogs.com"
urls = [url] * 50
pool = ThreadPool(4)
results = pool.map(urllib2.urlopen, urls)
pool.close()
pool.join()
print "Map:"
print time.time() - start

That covers the basics of writing Python crawlers. It will be refined further in the future, so please stay tuned. Thank you!

