Python Beginner Web Crawler: The Essentials Edition



Learning Python web crawling can be divided into three major parts: crawl, analyze, store.
In addition, the commonly used crawler framework Scrapy is introduced at the end.
First of all, please see the related summary: Ningo's Small Station - Web Crawler

Crawl

In this step, you need to be clear about what the content to fetch actually is: HTML source, a JSON-formatted string, and so on.

1. The most basic crawl

This is most often a GET request, where data is fetched directly from the server.
First of all, Python's urllib and urllib2 modules can basically satisfy general page crawling. In addition, requests is a very useful third-party package; similar ones include httplib2 and so on.

Requests:
    import requests
    response = requests.get(url)
    content = requests.get(url).content          # string
    print "Response headers:", response.headers  # dict
    print "Content:", content

Urllib2:
    import urllib2
    response = urllib2.urlopen(url)
    content = urllib2.urlopen(url).read()        # string
    print "Response headers:", response.headers  # not a dict
    print "Content:", content

Httplib2:
    import httplib2
    http = httplib2.Http()
    response_headers, content = http.request(url, 'GET')
    print "Response headers:", response_headers  # dict
    print "Content:", content

In addition, for URLs with query fields, the requested data is usually appended to the URL for GET requests: a ? separates the URL from the data, and multiple parameters are joined with &.

data = {'data1': 'XXXXX', 'data2': 'XXXXX'}  # dict type

Requests: data is a dict or json
    import requests
    response = requests.get(url=url, params=data)  # send the GET request

Urllib2: data is a string
    import urllib, urllib2
    data = urllib.urlencode(data)  # encoding step: convert from dict to string
    full_url = url + '?' + data
    response = urllib2.urlopen(full_url)  # send the GET request
2. Handling of anti-crawler mechanisms

2.1 Simulated login
This is the case of a POST request: the form data is sent to the server, and the cookie returned by the server is stored locally.

data = {'data1': 'XXXXX', 'data2': 'XXXXX'}  # dict type

Requests: data is a dict or json
    import requests
    response = requests.post(url=url, data=data)  # send the POST request; can be used for username/password login

Urllib2: data is a string
    import urllib, urllib2
    data = urllib.urlencode(data)  # encoding step: convert from dict to string
    req = urllib2.Request(url=url, data=data)
    response = urllib2.urlopen(req)  # send the POST request; can be used for username/password login

2.2 Using cookies to log in
When you log in with a cookie, the server treats you as a logged-in user and returns logged-in content to you. Therefore, for sites that require a CAPTCHA, you can use the cookie obtained from a CAPTCHA-assisted login to get around the CAPTCHA afterwards.

import requests

requests_session = requests.session()  # create a session object with requests.session(), which keeps track of cookies
response = requests_session.post(url=url_login, data=data)  # POST request via requests_session; can be used for username/password login. This step is required: it obtains the response cookie!

If there is a CAPTCHA, then using response = requests_session.post(url=url_login, data=data) alone will not work; the approach should be as follows:

response_captcha = requests_session.get(url=url_login, cookies=cookies)
response1 = requests.get(url_login)            # not logged in
response2 = requests_session.get(url_login)    # logged in, because the response cookie was obtained earlier!
response3 = requests_session.get(url_results)  # logged in, because the response cookie was obtained earlier! (url_results: another page that requires login)

Related reference: Web crawler - CAPTCHA login
Reference project: crawling a website

2.3 Disguising as a browser, or countering "anti-hotlinking"

headers = {'User-Agent': 'XXXXX'}  # disguise as a browser; for sites that reject crawlers
headers = {'Referer': 'XXXXX'}     # counter "anti-hotlinking"; for sites that use anti-hotlinking
headers = {'User-Agent': 'XXXXX', 'Referer': 'XXXXX'}

Requests:
    response = requests.get(url=url, headers=headers)

Urllib2:
    import urllib, urllib2
    req = urllib2.Request(url=url, headers=headers)
    response = urllib2.urlopen(req)
3. Using proxies

Application: for sites that limit your IP address; it can also get around the case where you are forced to enter a CAPTCHA to log in because of "too frequent clicks".

proxies = {'http': 'http://XX.XX.XX.XX:XXXX'}

Requests:
    import requests
    response = requests.get(url=url, proxies=proxies)

Urllib2:
    import urllib2
    proxy_support = urllib2.ProxyHandler(proxies)
    opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
    urllib2.install_opener(opener)  # install the opener; all later urlopen() calls will use the installed opener object
    response = urllib2.urlopen(url)
4. Reconnecting after disconnection

def multi_session(session, *arg):
    while True:
        retryTimes = 20
        while retryTimes > 0:
            try:
                return session.post(*arg)
            except:
                print '.',
                retryTimes -= 1

Or

def multi_open(opener, *arg):
    while True:
        retryTimes = 20
        while retryTimes > 0:
            try:
                return opener.open(*arg)
            except:
                print '.',
                retryTimes -= 1

This way we can use multi_session or multi_open to keep the crawler's session or opener alive across disconnections.
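
A minimal usage sketch, assuming url and data have been defined as in the earlier login example:

    import requests

    requests_session = requests.session()
    # keep retrying the POST through the same session until it succeeds
    response = multi_session(requests_session, url, data)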

5. Multi-Process Crawl

Here is an experimental comparison of multi-process crawling of the Wall Street News site: Python multi-process crawling and Java multi-process crawling.
Related reference: a comparison of multi-process and multi-threaded computation in Python and Java
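
As a minimal sketch of the idea (not the code of the referenced projects), a multiprocessing.Pool can fetch a list of placeholder URLs in parallel:

    from multiprocessing import Pool

    import requests

    def fetch(url):
        # download one page and return its HTML source
        return requests.get(url).content

    if __name__ == '__main__':
        url_list = ['http://example.com/page/%d' % i for i in range(1, 11)]  # placeholder URLs
        pool = Pool(processes=4)           # 4 worker processes
        pages = pool.map(fetch, url_list)  # fetch the pages in parallel
        pool.close()
        pool.join()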

6. Handling of AJAX requests

For "load More" cases, use Ajax to transfer a lot of data. It works by loading the source code of the webpage from the URL of the webpage and executing the JavaScript program in the browser. These programs will load more content, "populate" into the Web page. This is why if you go directly to the URL of the page itself, you will not find the actual content of the page.
Here, if you use Google Chrome to analyze the "Request" link (method: Right → review element →network→ Empty, click "Load More", the corresponding get link appears to find the type of text/html, click, view get parameters or copy request URL), the loop process.
* If there is a page before "request", analyze and deduce the 1th page according to the URL of the previous step. And so on, grab the data that grabs the AJAX address.
* Regular matching of the returned JSON format data (str). In JSON format data, the Unicode_escape encoding from the ' \uxxxx ' form is converted to the Unicode encoding of U ' \uxxxx '.
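
A minimal sketch of this workflow, assuming the AJAX endpoint found in Chrome's Network panel looks like the placeholder ajax_url below, returns JSON, and contains a "title" field:

    import re

    import requests

    # placeholder AJAX address copied from Chrome's Network panel; %d is the page number
    ajax_url = 'http://example.com/ajax/load_more?page=%d'

    for page in range(1, 4):
        text = requests.get(ajax_url % page).content  # JSON-formatted string (str)
        # regex-match the field we want out of the JSON string
        titles = re.findall(r'"title":"(.*?)"', text)
        for t in titles:
            # convert '\uxxxx' escape sequences into real Unicode characters
            print t.decode('unicode_escape')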
Reference project: Python multi-process crawl

7. CAPTCHA recognition

For websites with CAPTCHAs, we have three approaches:

    1. Use a proxy to update the IP.
    2. Use cookies to log in.
    3. CAPTCHA recognition.

Using a proxy and logging in with cookies have already been covered above; the following discusses CAPTCHA recognition.
You can use the open-source Tesseract-OCR system to download and recognize CAPTCHA images, and then pass the recognized characters to the crawler system for simulated login. If recognition is unsuccessful, you can refresh the CAPTCHA and recognize it again until it succeeds.
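
A minimal sketch using the pytesseract wrapper around Tesseract-OCR; the URLs and the form field names are placeholders, and real CAPTCHA images usually need some cleanup (binarization, noise removal) before recognition works well:

    import pytesseract
    import requests
    from PIL import Image
    from StringIO import StringIO

    requests_session = requests.session()

    url_login = 'http://example.com/login'          # placeholder login URL
    captcha_url = 'http://example.com/captcha.jpg'  # placeholder CAPTCHA URL

    # download the CAPTCHA image inside the session so its cookie is kept
    image = Image.open(StringIO(requests_session.get(captcha_url).content))

    captcha_text = pytesseract.image_to_string(image).strip()  # recognize the characters

    # submit the login form together with the recognized CAPTCHA; field names are placeholders
    data = {'username': 'XXXXX', 'password': 'XXXXX', 'captcha': captcha_text}
    response = requests_session.post(url=url_login, data=data)
    # if the login fails, refresh the CAPTCHA and repeat until it succeeds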
Reference project: CAPTCHA1

Analysis

After crawling comes analysis of the crawled content: you extract the content you need from it.
Common analysis tools include regular expressions, BeautifulSoup, lxml, and so on.
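
For example, a minimal parsing sketch using BeautifulSoup and a regular expression; the HTML string and the tag names are just placeholders:

    import re

    from bs4 import BeautifulSoup

    html = '<div class="title"><a href="/item/1">First item</a></div>'  # placeholder HTML

    # BeautifulSoup: navigate the parsed tree by tag and attribute
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a'):
        print a['href'], a.get_text()

    # regular expression: extract the same link directly from the raw string
    print re.findall(r'href="(.*?)"', html)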

Store

After analyzing what we need, the next step is to store it.
We can choose to save to a text file or to a MySQL or MongoDB database.
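
A minimal sketch of both options, assuming a MongoDB instance is running locally and the items list is placeholder data:

    import pymongo

    items = [{'title': 'First item', 'url': 'http://example.com/item/1'}]  # placeholder data

    # option 1: append to a plain text file
    with open('result.txt', 'a') as f:
        for item in items:
            f.write('%s\t%s\n' % (item['title'], item['url']))

    # option 2: insert into a local MongoDB collection
    client = pymongo.MongoClient('localhost', 27017)
    client['crawler']['items'].insert_many(items)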

Scrapy

Scrapy is a Twisted-based, open-source Python crawler framework that is widely used in industry.
For related content, see "Building a web crawler based on Scrapy".
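
A minimal Scrapy spider sketch; the start URL and the CSS selectors are placeholders:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = 'example'                      # spider name used by "scrapy crawl example"
        start_urls = ['http://example.com/']  # placeholder start page

        def parse(self, response):
            # extract the text and href of every link on the page
            for a in response.css('a'):
                yield {
                    'text': a.css('::text').extract_first(),
                    'href': a.css('::attr(href)').extract_first(),
                }

Saved as example_spider.py, it can be run with: scrapy runspider example_spider.py -o items.json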

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.
