Summary of common Python crawler skills

Source: Internet
Author: User
Tags: http, cookie


I have been using Python for a little over a year. The scenarios where Python is applied most are rapid web development, crawling, and automated operations and maintenance: I have written simple websites, automatic posting scripts, email sending and receiving scripts, and simple verification code recognition scripts.

Crawler development also involves many reusable routines, so I am summarizing some of them here; they may save some work in the future.

1. Basic webpage fetching

GET method

import urllib2

url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()

POST method

import urllib
import urllib2

url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()

2. Using a proxy IP address

During crawler development, your IP address often gets blocked, and a proxy IP address becomes necessary.

The urllib2 package contains the ProxyHandler class, which lets you set a proxy for accessing web pages, as shown in the following code snippet:

import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()
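If a single proxy also ends up blocked, a common extension is to rotate through a pool of proxies. A minimal sketch, assuming you have a list of working proxy addresses (the addresses below are placeholders):

import random
import urllib2

# Placeholder proxy addresses; replace with real, working proxies
proxies = ['127.0.0.1:8087', '127.0.0.1:8088']

# Pick one proxy at random for this request
proxy = urllib2.ProxyHandler({'http': random.choice(proxies)})
opener = urllib2.build_opener(proxy)
response = opener.open('http://www.baidu.com')
print response.read()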

3. Cookies

Cookies are data (usually encrypted) that websites store on a user's local machine in order to identify the user and track sessions. Python provides the cookielib module to handle cookies; its main job is to provide objects that can store cookies, so that it can be used together with the urllib2 module to access Internet resources.

Code snippet:

import urllib2, cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

The key is CookieJar(), which manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The whole cookie store lives in memory, so the cookies are lost once the CookieJar instance is garbage collected; none of this requires any manual handling.
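If the cookies need to survive between runs, cookielib also offers file-backed jars. A minimal sketch using MozillaCookieJar; the file name cookies.txt is just an example:

import urllib2, cookielib

# File-backed jar; cookies are written in Mozilla's cookies.txt format
cookie_jar = cookielib.MozillaCookieJar('cookies.txt')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
urllib2.install_opener(opener)

content = urllib2.urlopen('http://XXXX').read()

# Save the cookies received in this session so a later run can reuse them
cookie_jar.save(ignore_discard=True, ignore_expires=True)
# In a later run: cookie_jar.load('cookies.txt', ignore_discard=True, ignore_expires=True)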

Manually adding a cookie

cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="request.add_header("Cookie", cookie)

4. Disguising as a browser

Some websites are hostile to crawler visits and reject all requests from crawlers, so HTTP Error 403: Forbidden often occurs when you access such a website directly with urllib2.

Pay special attention to certain headers, because the server side checks them:

1. User-Agent: some servers or proxies check this value to determine whether the request was initiated by a real browser.

2. Content-Type: when calling a REST interface, the server checks this value to determine how the content in the HTTP body should be parsed.

In such cases, you can modify the headers of the HTTP request, as in the following code snippet:

import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib2.Request(
    url = 'http://my.oschina.net/jhao104/blog?catalog=3463517',
    headers = headers
)
print urllib2.urlopen(request).read()

5. Page Parsing

For page parsing, the most powerful tool is of course the regular expression, which differs from site to site and from user to user, so there is no need to go into much detail here. Two useful resources:

Regular expression introduction: http://www.bkjia.com/article/18526.htm

Online regular expression testing: http://tools.jb51.net/regex/javascript

Next come the parsing libraries. Two are commonly used: lxml and BeautifulSoup. Two good tutorial sites for them:

Lxml: http://www.bkjia.com/article/67125.htm

BeautifulSoup: http://www.bkjia.com/article/43572.htm

My assessment of these two libraries: both are HTML/XML processing libraries. BeautifulSoup is implemented in pure Python and is less efficient, but its features are practical; for example, you can fetch the source code of an HTML node through a search on the results. lxml is a C implementation that is highly efficient and supports XPath.
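As a quick illustration of how the two are used, here is a minimal sketch, assuming BeautifulSoup 4 (bs4) and lxml are installed; the target URL and the extracted elements are only examples:

import urllib2
from bs4 import BeautifulSoup
from lxml import etree

html = urllib2.urlopen('http://www.baidu.com').read()

# BeautifulSoup: pure Python, convenient searching of the parse tree
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    print a.get('href'), a.get_text()

# lxml: C implementation, fast, supports XPath
tree = etree.HTML(html)
print tree.xpath('//title/text()')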

6. Verification Code Processing

Some simple verification codes can be recognized without much trouble; I have only done some simple verification code recognition myself. Some anti-human CAPTCHAs, such as those on 12306, can be handled through human-powered CAPTCHA solving platforms, which of course are paid services.
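For the simple cases, the usual approach is basic image cleanup followed by OCR. A rough sketch, assuming Pillow and pytesseract (plus the tesseract binary) are installed; the file name and the threshold value are arbitrary examples, and real CAPTCHAs generally need more preprocessing:

import pytesseract
from PIL import Image

# 'captcha.png' is a hypothetical CAPTCHA image saved to disk
img = Image.open('captcha.png')
img = img.convert('L')                            # convert to grayscale
img = img.point(lambda x: 0 if x < 140 else 255)  # crude binarization; 140 is an arbitrary threshold
print pytesseract.image_to_string(img)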

7. gzip Compression

Have you ever run into web pages that stay garbled no matter how you transcode them? If so, you may not know that many web servers can send compressed data, which can cut the amount of data transmitted over the network by more than 60%. This is especially true for XML web services, since XML data compresses very well.

However, the server will not send you compressed data unless you tell it that you can handle compressed data.

You need to modify the Code as follows:

import urllib2, httplib

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)

This is the key: create a Request object and add an Accept-encoding header to tell the server that you can accept gzip-compressed data.

Then extract the data:

import StringIO
import gzip

compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()
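Note that the server may ignore the Accept-encoding header and return uncompressed data, so it is safer to check the response's Content-Encoding before decompressing. A small self-contained variant of the two snippets above, assuming the same example URL:

import StringIO
import gzip
import urllib2

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
f = urllib2.urlopen(request)

data = f.read()
# Only decompress when the server actually answered with gzip
if f.info().get('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
print data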

8. Multi-threaded concurrent fetching

When a single thread is too slow, multiple threads are needed. Here is a simple thread pool template. The program does nothing more than print the job numbers, but you can see that they are handled concurrently.

Although Python's multithreading is rather half-baked, for network-bound work like frequent crawling it can still improve efficiency to a certain extent.

from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of jobs
q = Queue()
NUM = 2
JOBS = 10

# the processing function, responsible for handling a single task
def do_somthing_using(arguments):
    print arguments

# the worker: takes data from the queue and processes it
def working():
    while True:
        arguments = q.get()
        do_somthing_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads waiting on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# put the JOBS into the queue
for i in range(JOBS):
    q.put(i)

# wait for all JOBS to complete
q.join()
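To turn the template into an actual crawler, the processing function just needs to fetch a page instead of printing a number. A minimal sketch, with a hypothetical URL list and no error handling:

import urllib2
from threading import Thread
from Queue import Queue

q = Queue()
NUM = 4  # number of worker threads

# fetch one page and print the URL with the size of the response
def fetch(url):
    print url, len(urllib2.urlopen(url).read())

def working():
    while True:
        url = q.get()
        try:
            fetch(url)
        finally:
            q.task_done()

for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# hypothetical URL list
for url in ['http://www.baidu.com', 'http://www.baidu.com/robots.txt']:
    q.put(url)

q.join()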

