Some Common Python Crawling Techniques



Crawler development involves many reusable patterns. This article summarizes a few techniques worth keeping on hand for future projects.

1. Basic webpage fetching

Get Method

import urllib2

url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()

Post method

import urllib
import urllib2

url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()
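For comparison, in Python 3 the same POST setup uses urllib.parse and urllib.request. The sketch below only constructs the request without sending it; http://abcde.com is the article's placeholder URL.

```python
import urllib.parse
import urllib.request

# http://abcde.com is the article's placeholder URL; nothing is sent here.
url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
# POST bodies must be bytes in Python 3, hence the .encode()
form_data = urllib.parse.urlencode(form).encode('ascii')

request = urllib.request.Request(url, data=form_data)
# A Request constructed with a data payload defaults to the POST method
print(request.get_method())  # POST
```

Calling `urllib.request.urlopen(request)` would then send the form, exactly as `urllib2.urlopen` does in the Python 2 snippet above.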

2. Use the proxy IP Address

During crawler development, your IP address often gets blocked, so proxy IP addresses are needed.

The urllib2 package contains the ProxyHandler class. With this class, you can set a proxy to access the webpage, as shown in the following code snippet:

import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()
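The Python 3 equivalent lives in urllib.request. A sketch that only wires up the proxy without fetching anything; 127.0.0.1:8087 is the article's example proxy address, not a real proxy.

```python
import urllib.request

# 127.0.0.1:8087 is the article's example proxy address; substitute a real one.
proxy = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8087'})
opener = urllib.request.build_opener(proxy)

# install_opener makes the proxy apply to every subsequent urlopen() call
urllib.request.install_opener(opener)

# The handler is now part of the opener's handler chain
print(proxy in opener.handlers)  # True
```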

3. Cookies

Cookies are data (usually encrypted) that websites store on a user's local machine to identify the user and track sessions. Python provides the cookielib module for handling cookies; its main job is to provide objects that can store cookies, for use together with the urllib2 module when accessing Internet resources.

Code snippet:

import urllib2, cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

The key is CookieJar(), which manages HTTP cookie values: it stores the cookies generated by HTTP requests and adds them to outgoing requests. The whole jar lives in memory, so the cookies are lost once the CookieJar instance is garbage collected; none of this requires any manual handling.

Manually adding a cookie:

cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="
request.add_header("Cookie", cookie)
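In Python 3 the same manual header looks like this. The session values are made up for illustration, example.com stands in for the target site, and no request is actually sent.

```python
import urllib.request

# Made-up session cookie values, mirroring the article's example
cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3"
request = urllib.request.Request("http://example.com")
request.add_header("Cookie", cookie)

# The header is now attached to the outgoing request
print(request.get_header("Cookie"))
```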

4. Disguising as a browser

Some websites dislike crawler visits and reject all requests from crawlers, so accessing such sites directly with urllib2 often results in HTTP Error 403: Forbidden.

Pay special attention to certain headers, which the server will check:

1) User-Agent: some servers or proxies check this value to decide whether the request was initiated by a real browser.
2) Content-Type: when calling a REST interface, the server checks this value to decide how to parse the content in the HTTP body.

In this case, you can modify the headers on the HTTP request. The code snippet is as follows:

import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib2.Request(
    url = 'http://my.oschina.net/jhao104/blog?catalog=3463517',
    headers = headers
)
print urllib2.urlopen(request).read()
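A Python 3 sketch of the same idea, constructing the request without sending it; example.com stands in for the real target URL.

```python
import urllib.request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/115.0'
}
# example.com stands in for the real target URL; nothing is fetched here
request = urllib.request.Request('http://example.com', headers=headers)

# urllib normalizes header names with str.capitalize(), hence 'User-agent'
print(request.get_header('User-agent'))
```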

5. Page Parsing

The most powerful tool for page parsing is the regular expression, and since the patterns differ for every user and every website, there is no need to go into much detail here. Two good resources:

Regular Expression entry: http://www.bkjia.com/article/79618.htm

Regular Expression online testing: http://tool.oschina.net/regex/

Next come the parsing libraries; the two commonly used are lxml and BeautifulSoup. Two sites with good introductions to each:

Lxml: http://my.oschina.net/jhao104/blog/639448

BeautifulSoup: http://cuiqingcai.com/1319.html

My take on these two libraries: both are HTML/XML processing libraries. BeautifulSoup is implemented in pure Python and is inefficient, but its features are practical; for example, you can retrieve the source code of an HTML node via a search result. lxml is implemented in C, is highly efficient, and supports XPath.
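As a small illustration of regex-based link extraction, here is a sketch over an invented HTML fragment (a real page would be messier, which is exactly why lxml or BeautifulSoup is often the better choice).

```python
import re

# A tiny invented HTML fragment standing in for a fetched page
html = '<a href="/page1">First</a> and <a href="/page2">Second</a>'

# Capture each link's href attribute and visible text
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)
print(links)  # [('/page1', 'First'), ('/page2', 'Second')]
```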

6. Verification Code Processing

Some simple verification codes can be recognized easily, and I have only done simple CAPTCHA recognition myself. Some inhuman CAPTCHAs, such as those on 12306, can be handled through human CAPTCHA-solving platforms, though of course that costs money.

7. gzip Compression

Have you ever run into web pages that stay garbled no matter how you transcode them? Haha, that means you didn't know that many web servers can send compressed data, which can reduce the amount of data transmitted over the network by more than 60%. This is especially true for XML web services, since XML compresses extremely well.

However, the server does not send compressed data for you unless you tell the server that you can process the compressed data.

You need to modify the Code as follows:

import urllib2

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)

This is the key: create a Request object and add an Accept-encoding header to tell the server you can accept gzip-compressed data.

Then extract the data:

import StringIO
import gzip

compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()
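The same round trip in Python 3, where StringIO becomes io.BytesIO. To keep the sketch self-contained it compresses a sample payload locally instead of fetching one from a server.

```python
import gzip
import io

# Simulate a gzip-compressed HTTP body by compressing sample bytes locally
original = b'<html>hello crawler</html>'
compressed = gzip.compress(original)

# Decompress the way a crawler would: wrap the payload in a file-like object
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as gz:
    restored = gz.read()

print(restored == original)  # True
```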

8. multi-thread concurrent capturing

If a single thread is too slow, you need multiple threads. Here is a simple thread pool template. The program simply prints the numbers 0 through 9, but you can see that they are handled concurrently.

Although Python's multithreading is notoriously limited, for network-heavy workloads like crawling it can still improve efficiency to some extent.

from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of jobs
q = Queue()
NUM = 2
JOBS = 10

# The handler function, responsible for processing a single task
def do_something_using(arguments):
    print arguments

# The worker loop: gets a task from the queue and processes it
def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM worker threads
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# put JOBS into the queue
for i in range(JOBS):
    q.put(i)

# wait for all JOBS to complete
q.join()
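The same pattern in Python 3, where the module is queue rather than Queue. This sketch collects results into a list instead of printing, so the outcome can be checked; the sleep is dropped to keep it fast.

```python
from queue import Queue
from threading import Thread

NUM = 2    # total number of concurrent worker threads
JOBS = 10  # number of tasks to process

q = Queue()
results = []

# Worker loop: take a task from the queue, record it, mark it done
def working():
    while True:
        n = q.get()
        results.append(n)  # list.append is thread-safe in CPython
        q.task_done()

# Start NUM daemon workers so they die with the main thread
for _ in range(NUM):
    Thread(target=working, daemon=True).start()

# Queue up the jobs
for i in range(JOBS):
    q.put(i)

# Block until every queued job has been marked done
q.join()
print(sorted(results))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```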

That is all the content of this article. I hope it is helpful for your learning.

