Python Starter Web Crawler Essentials Edition


Reproduced from Ning Ge's site, which sums this up well.

Learning web crawling in Python breaks down into three major parts: crawl, analyze, store.

In addition, Scrapy, one of the most commonly used crawler frameworks, is introduced in detail at the end.

First, here is a summary of related articles covering the basic concepts and techniques needed to get started with web crawlers: Ning Ge's small station - web crawler.

What happens behind the scenes between entering a URL in the browser and the page coming back? For example, if you enter http://www.lining0806.com/, you will see Ning Ge's home page.

In a nutshell, the following four steps occur in this process:

    • Find the IP address that corresponds to the domain name.
    • Send the request to the server at that IP.
    • The server responds to the request and sends back the page content.
    • The browser parses the web page content.

Simply put, a web crawler implements this browser function: given a URL, it returns the data directly to the user, without the need to manually operate a browser to get it.
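
As a rough illustration of these four steps in code, here is a minimal Python 3 sketch, assuming the requests package is installed; the domain is the example from above, and socket.gethostbyname is used only to make the DNS step visible.

import socket
import requests

domain = 'www.lining0806.com'
ip = socket.gethostbyname(domain)              # step 1: resolve the domain name to an IP
response = requests.get('http://' + domain)    # steps 2-3: send the request, the server sends back the page
html = response.text                           # step 4: the raw HTML a browser would parse
print(ip)
print(html[:200])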

Crawl

In this step, be clear about what you want to fetch: the HTML source, a JSON-formatted string, and so on.

1. The most basic crawl

In most cases, crawling means sending GET requests to fetch data directly from the target server.

First, Python's built-in urllib and urllib2 modules can basically handle general page crawling. In addition, requests is a very useful third-party package; similar packages include httplib2 and so on.

Requests:
    import requests
    response = requests.get(url)
    content = requests.get(url).content
    print "response headers:", response.headers
    print "content:", content
Urllib2:
    import urllib2
    response = urllib2.urlopen(url)
    content = urllib2.urlopen(url).read()
    print "response headers:", response.headers
    print "content:", content
Httplib2:
    import httplib2
    http = httplib2.Http()
    response_headers, content = http.request(url, 'GET')
    print "response headers:", response_headers
    print "content:", content

In addition, for GET requests with query fields, the request data is usually appended to the URL after a '?', which separates the URL from the data, and multiple parameters are joined with '&'.

data = {'data1': 'XXXXX', 'data2': 'XXXXX'}

Requests: data is a dict or json
    import requests
    response = requests.get(url=url, params=data)
Urllib2: data must be a string
    import urllib, urllib2
    data = urllib.urlencode(data)
    full_url = url + '?' + data
    response = urllib2.urlopen(full_url)

Related reference: NetEase News rankings crawl review

Reference project: the most basic web crawler: crawl the NetEase News rankings

2. Handling login

2.1 Logging in with a form

This is a POST request: the form data is sent to the server first, and the server's returned cookie is then stored locally.

data = {'data1': 'XXXXX', 'data2': 'XXXXX'}

Requests: data is a dict or json
    import requests
    response = requests.post(url=url, data=data)
Urllib2: data must be a string
    import urllib, urllib2
    data = urllib.urlencode(data)
    req = urllib2.Request(url=url, data=data)
    response = urllib2.urlopen(req)

2.2 Using cookies to log in

When you log in with a cookie, the server treats you as a logged-in user and returns logged-in content. Therefore, for sites that require a CAPTCHA, you can log in once with the CAPTCHA and then reuse the resulting cookie to get around it.

import requests

requests_session = requests.session()
response = requests_session.post(url=url_login, data=data)

If there is a CAPTCHA, a plain response = requests_session.post(url=url_login, data=data) will not work at this point; the approach should be as follows:

response_captcha = requests_session.get(url=url_login, cookies=cookies)
response1 = requests.get(url_login)             # not logged in
response2 = requests_session.get(url_login)     # logged in, because the response cookie was obtained earlier
response3 = requests_session.get(url_results)   # logged in, because the response cookie was obtained earlier


Related reference: web crawler CAPTCHA login

Reference project: web crawler login with username, password, and CAPTCHA: crawl the Zhihu site

3. Handling of Anti-crawler mechanisms

3.1 Using proxies

Applies to: sites that limit requests by IP address; it can also get around the case where "clicking too frequently" forces you to enter a CAPTCHA to log in.

The best way to do this is to maintain a proxy IP pool. There are many free proxy IPs on the internet, of varying quality, and usable ones can be found by screening. For "frequent clicks", we can also avoid being banned by limiting the frequency at which the crawler accesses the site.

proxies = {'http': 'http://XX.XX.XX.XX:XXXX'}

Requests:
    import requests
    response = requests.get(url=url, proxies=proxies)
Urllib2:
    import urllib2
    proxy_support = urllib2.ProxyHandler(proxies)
    opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
    urllib2.install_opener(opener)  # install the opener; later calls to urlopen() will use it
    response = urllib2.urlopen(url)
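
To illustrate the proxy-pool idea mentioned above, here is a minimal Python 3 sketch, assuming requests is installed; the proxy addresses, the pool itself, and the helper function are placeholders, not part of the original article.

import random
import requests

# Hypothetical pool of proxies collected from free proxy lists and screened beforehand.
proxy_pool = [
    'http://XX.XX.XX.XX:XXXX',
    'http://YY.YY.YY.YY:YYYY',
]

def get_with_random_proxy(url):
    # Pick a proxy at random for each request; in real use, drop proxies that keep failing.
    proxy = random.choice(proxy_pool)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)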

3.2 Time Settings

Applies to: sites that limit request frequency.

Both requests and urllib2 requests can be spaced out using the sleep() function from the time module:

import time
time.sleep(1)
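
A slightly less mechanical variant (a sketch, not from the original article) is to sleep a random interval between requests; url_list here is an assumed, pre-built list of URLs.

import random
import time
import requests

url_list = ['http://www.lining0806.com/']   # placeholder list of URLs to crawl
for url in url_list:
    response = requests.get(url)
    print(response.status_code)
    # wait 1-3 seconds between requests to limit the crawl frequency
    time.sleep(1 + 2 * random.random())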

3.3 Disguising as a browser, or countering "anti-hotlinking"

Some websites check whether you are really visiting with a browser or whether a machine is accessing them automatically. In that case, add a User-Agent header to show that you are a browser. Some sites also check whether you send Referer information and whether the Referer is legitimate; in that case, add a Referer header as well.

headers = {'User-Agent': 'XXXXX'}  # disguise as a browser; for sites that reject crawlers
headers = {'Referer': 'XXXXX'}
headers = {'User-Agent': 'XXXXX', 'Referer': 'XXXXX'}

Requests:
    response = requests.get(url=url, headers=headers)
Urllib2:
    import urllib, urllib2
    req = urllib2.Request(url=url, headers=headers)
    response = urllib2.urlopen(req)

4. Reconnecting after disconnection

Not much to say.

def multi_session(session, *arg):
    retryTimes = 20
    while retryTimes > 0:
        try:
            return session.post(*arg)
        except:
            print '.',
            retryTimes -= 1

Or

def multi_open(opener, *arg):
    retryTimes = 20
    while retryTimes > 0:
        try:
            return opener.open(*arg)
        except:
            print '.',
            retryTimes -= 1

This way we can use multi_session or multi_open to keep hold of the crawler's session or opener; a brief usage sketch follows.
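
A minimal usage sketch, reusing the url_login and data variables from the login example above; the 20-retry limit comes from multi_session itself.

import requests

requests_session = requests.session()
response = multi_session(requests_session, url_login, data)   # retries the POST up to 20 times
if response is not None:
    print(response.status_code)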

5. Multi-process Crawl

Here is an experimental comparison of parallel crawls of Wall Street News: Python multi-process fetching versus Java single-threaded and multi-threaded fetching.

Related references: Comparison of multi-process multithreaded computing methods for Python and Java
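
The article only links to that comparison; purely as an illustration, a minimal multi-process fetch with Python's multiprocessing module might look like the sketch below (the URL list and pool size are placeholders, and requests is assumed to be installed).

from multiprocessing import Pool
import requests

def fetch(url):
    # Each worker process fetches one page and returns its size in bytes.
    try:
        return len(requests.get(url, timeout=10).content)
    except requests.RequestException:
        return 0

if __name__ == '__main__':
    url_list = ['http://www.lining0806.com/'] * 4   # placeholder URL list
    pool = Pool(processes=4)
    print(pool.map(fetch, url_list))
    pool.close()
    pool.join()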

6. Handling of AJAX requests

For "load more" cases, use Ajax to transfer a lot of Data.

It works like this: the page source is loaded from the page's URL, and JavaScript programs are then executed in the browser. These programs load additional content and "populate" it into the page. This is why, if you fetch the page's own URL directly, you will not find the page's actual content.

Here, you can use Google Chrome to analyze the "load more" request (method: right-click → Inspect element → Network → clear; click "load more" and the corresponding GET link appears; find the one of type text/html, click it, and view the GET parameters or copy the Request URL), and then loop through the process.

    • If there is a "page" field in the request before "load more", analyze it and deduce page 1 from the URL found in the previous step. And so on: fetch the data from the Ajax address page by page (see the sketch after this list).
    • Do a regex match on the returned JSON-format data (a str). In the JSON data, convert the unicode_escape-encoded form '\uxxxx' into the Unicode encoding u'\uxxxx'.
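
Purely as an illustration (the endpoint URL and the "page" parameter name are hypothetical stand-ins for whatever Chrome's Network panel reveals), paging through such an Ajax interface might look like:

import requests

ajax_url = 'http://example.com/api/list'    # hypothetical Ajax endpoint found via the Network panel
page = 1
while True:
    resp = requests.get(ajax_url, params={'page': page})
    items = resp.json().get('data', [])     # assumes the endpoint returns JSON with a 'data' list
    if not items:
        break
    for item in items:
        print(item)
    page += 1
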
7. The automated testing tool Selenium

Selenium is an automated testing tool. It can drive the browser, performing operations such as filling in characters, clicking the mouse, getting elements, and switching pages. In short, whatever a browser can do, Selenium can do.

The reference project below uses Selenium to dynamically crawl fare information for a given list of cities.

Reference project: web crawler using Selenium with a proxy to log in: crawl the Qunar site
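
That project is not reproduced here, but a minimal Selenium sketch, assuming the selenium package and a matching chromedriver are installed (the URL and the CSS selector are placeholders), looks like:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                  # requires chromedriver on PATH
driver.get('http://www.example.com/')        # placeholder URL
box = driver.find_element(By.CSS_SELECTOR, 'input[name="q"]')   # placeholder selector
box.send_keys('ticket prices')
box.submit()
print(driver.page_source[:200])
driver.quit()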

8. CAPTCHA recognition

For websites with CAPTCHAs, we have three options:

    • Use a proxy to change the IP.
    • Log in with cookies.
    • CAPTCHA recognition.

Using proxies and logging in with cookies were covered earlier; here we discuss CAPTCHA recognition.

You can use the open-source Tesseract-OCR system to download and recognize CAPTCHA images, then pass the recognized characters to the crawler system to simulate the login. Of course, you can also upload the CAPTCHA image to a human-powered captcha-solving platform for recognition. If recognition fails, simply refresh the CAPTCHA and recognize it again until it succeeds.
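
A minimal recognition sketch, assuming the pytesseract and Pillow packages plus the Tesseract-OCR binary are installed; the CAPTCHA image URL is a placeholder.

from io import BytesIO

import pytesseract
import requests
from PIL import Image

captcha_url = 'http://example.com/captcha.png'   # hypothetical CAPTCHA image URL
image = Image.open(BytesIO(requests.get(captcha_url).content))
captcha_text = pytesseract.image_to_string(image).strip()
print(captcha_text)   # submit this with the login form; fetch a fresh image and retry if it fails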

Reference project: CAPTCHA recognition project, first edition: CAPTCHA1

There are two issues to be aware of when crawling:

    • How to monitor a series of websites for updates, that is, how to do incremental crawling?
    • How to implement distributed crawling for massive data?

Analysis

After crawling comes analyzing the crawled content: extracting the content you need from it.

Common analysis tools include regular expressions, BeautifulSoup, lxml, and so on.
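
For example, here is a small Python 3 sketch using BeautifulSoup (the bs4 package) and a regular expression, assuming the HTML was fetched as in the crawl section; the extraction targets are arbitrary examples.

import re
import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.lining0806.com/').text
soup = BeautifulSoup(html, 'html.parser')
titles = [a.get_text(strip=True) for a in soup.find_all('a')]   # extract all link texts
dates = re.findall(r'\d{4}-\d{2}-\d{2}', html)                  # regex example: pull out dates
print(titles[:5])
print(dates[:5])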

Store

After analyzing what we need, the next step is to store it.

We can choose to save to a text file or to a MySQL or MongoDB database.
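
As an illustration of both options, here is a minimal Python 3 sketch; the record fields are placeholders, and the MongoDB part assumes the pymongo package and a MongoDB server running locally on the default port.

import json
import pymongo

record = {'url': 'http://www.lining0806.com/', 'title': 'example'}   # placeholder record

# Option 1: append to a text file, one JSON object per line
with open('results.txt', 'a') as f:
    f.write(json.dumps(record) + '\n')

# Option 2: insert into a MongoDB collection
client = pymongo.MongoClient('localhost', 27017)
client['crawler']['pages'].insert_one(record)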

There are two things to keep in mind about storage:

    • How to deduplicate web pages (avoid storing the same page twice)?
    • In what form should the content be stored?

Scrapy

Scrapy is a Twisted-based, open-source Python crawler framework that is widely used in industry.

For related content, see "Building a web crawler based on Scrapy"; that article also introduces the code of a search-crawling project as a learning reference.

Reference project: crawl search results recursively using Scrapy or requests
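
The referenced project is the place to look for a full example; purely as an illustration, a minimal Scrapy spider (the start URL and CSS selector are placeholders) looks like the sketch below, which can be run with scrapy runspider spider.py -o titles.json.

import scrapy

class TitleSpider(scrapy.Spider):
    name = 'titles'
    start_urls = ['http://www.lining0806.com/']    # placeholder start URL

    def parse(self, response):
        # Yield one item per link text; adjust the CSS selector for the real target site.
        for title in response.css('a::text').extract():
            yield {'title': title.strip()}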
