A Brief Introduction to Some Common Crawler Techniques in Python

Source: Internet
Author: User
Tags: http, cookie


1. Basic web page capture

  

GET method:

import urllib2

url = "http://example.com"  # target URL (placeholder)
response = urllib2.urlopen(url)
print response.read()

POST method:

import urllib
import urllib2

url = "http://example.com"  # target URL (placeholder)
form = {'name': 'abc', 'password': '000000'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()

2. Using a proxy IP address

During crawler development, your IP address often gets blocked; when that happens, a proxy IP is required.

 

The urllib2 package contains the ProxyHandler class. With this class, you can set a proxy to access the webpage, as shown in the following code snippet:

import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})  # proxy address is a placeholder
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://example.com')  # target URL (placeholder)
print response.read()

 

3. Cookies

Cookies are data (usually encrypted) that websites store on a user's local machine in order to identify the user and track sessions. Python provides the cookielib module to handle cookies; its main job is to provide objects that can store cookies, for use together with the urllib2 module when accessing Internet resources.

 

Code snippet:

import urllib2, cookielib
cookie_support= urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

The key is CookieJar(), which manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The entire cookie store lives in memory: once the CookieJar instance is garbage collected, the cookies are lost, and no separate cleanup is required.
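If the cookies need to survive between runs, cookielib also provides file-backed jars. Below is a minimal sketch using MozillaCookieJar; the file name cookies.txt is a placeholder of mine, not something from the original article:

import urllib2, cookielib

# a file-backed cookie jar; cookies.txt is an arbitrary placeholder file name
cookie_jar = cookielib.MozillaCookieJar('cookies.txt')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
# write the cookies received during this session out to disk
cookie_jar.save(ignore_discard=True, ignore_expires=True)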

4. Disguising as a browser

  

Some websites dislike crawler visits and reject all requests from crawlers, so HTTP Error 403: Forbidden often occurs when you access such a site directly with urllib2.

Pay special attention to certain headers; the server checks the following in particular:

 

1. User-Agent: some servers or proxies check this value to determine whether the request was initiated by a real browser;

2. Content-Type: when calling a REST interface, the server checks this value to decide how to parse the content in the HTTP body (a sketch follows the snippet below).

 

In this case, you can work around it by modifying the headers of the HTTP request. The code snippet is as follows:

import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib2.Request(
    url='http://example.com',  # target URL (placeholder)
    headers=headers
)
print urllib2.urlopen(request).read()
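For the Content-Type case, a minimal sketch of calling a REST interface with a JSON body might look like this; the endpoint and payload are placeholder values made up for illustration:

import json
import urllib2

url = 'http://XXXX/api'  # placeholder REST endpoint
payload = json.dumps({'name': 'abc'})
# setting Content-Type tells the server how to parse the HTTP body
request = urllib2.Request(url, payload, {'Content-Type': 'application/json'})
print urllib2.urlopen(request).read()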

5. Page parsing

For page parsing, regular expressions are certainly the most powerful tool; the exact patterns differ from site to site and from user to user, so there is no need to go into detail here.
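As a minimal sketch, pulling all href links out of a page with a regular expression could look like this; the pattern is deliberately loose and is my own illustration, not something from the original article:

import re
import urllib2

html = urllib2.urlopen('http://XXXX').read()  # placeholder URL
# a simple pattern that captures the value of every href attribute
links = re.findall(r'href=["\'](.*?)["\']', html)
print links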

Next come parsing libraries. Two are commonly used: lxml and BeautifulSoup. Here are two sites with good introductions to each:

Lxml: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html

BeautifulSoup: http://cuiqingcai.com/1319.html

My verdict on these two libraries: both are HTML/XML processing libraries. BeautifulSoup is implemented in pure Python and is inefficient, but its features are practical; for example, you can obtain the source code of an HTML node by searching the results. lxml is a C implementation, is highly efficient, and supports XPath.
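A minimal sketch of the two libraries side by side; the tag and URL are placeholders chosen for illustration:

import urllib2
from bs4 import BeautifulSoup
from lxml import etree

html = urllib2.urlopen('http://XXXX').read()  # placeholder URL

# BeautifulSoup: pure-Python parsing, find every <a> node
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    print a.get_text()

# lxml: C implementation with XPath support
tree = etree.HTML(html)
print tree.xpath('//a/text()')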

6. Verification code processing

Some simple verification codes can be recognized with straightforward techniques; I have only done simple verification-code recognition myself. Some inhuman verification codes, such as 12306's, can be handled manually through a CAPTCHA-solving platform, which of course charges a fee.
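For the simple cases, OCR is usually enough. Below is a minimal sketch using the pytesseract library; the library choice and the file name captcha.png are my own assumptions, since the original article does not name a specific tool:

from PIL import Image
import pytesseract

# run OCR over a locally saved verification-code image (placeholder file name)
captcha_image = Image.open('captcha.png')
print pytesseract.image_to_string(captcha_image)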

 

7. gzip Compression

Have you ever run into web pages that remain garbled no matter how you transcode them? Haha, that means you do not know that many web services can send compressed data, which can cut the amount of data transmitted over the network by more than 60%. This is especially true for XML web services, since XML data compresses extremely well.

 

However, the server does not send compressed data for you unless you tell the server that you can process the compressed data.

 

You need to modify the code as follows:

import urllib2

request = urllib2.Request('http://example.com')  # target URL (placeholder)
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)

This is the key: create a Request object and add an Accept-encoding header to tell the server that you can accept gzip-compressed data.

Then extract the data:

import StringIO
import gzip
compresseddata = f.read() 
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream) 
print gzipper.read()

8. Multi-threaded concurrent crawling

If a single thread is too slow, you need multiple threads. Here is a simple thread-pool template. The program simply prints the numbers 0-9, but you can see that they are handled concurrently.

 

Although Python's multithreading is often dismissed as half-baked, for network-heavy work such as crawling it can still improve efficiency to some extent.

from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of jobs
q = Queue()
NUM = 2
JOBS = 10

# the handler that processes a single task
def do_something_using(arguments):
    print arguments

# the worker loop: keep pulling tasks from the queue and processing them
def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads waiting on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# queue up the JOBS
for i in range(JOBS):
    q.put(i)

# wait until all JOBS are completed
q.join()

 
