Python Web Crawler for Beginners: Essentials Edition
Learning to write a web crawler in Python breaks down into three major parts: crawl, analyze, and store.
In addition, the commonly used crawler framework Scrapy is introduced at the end.
First, see the related reference: Ningo's Small Station – Web Crawler.
Crawl
In this step, be clear about what you are crawling: is it HTML source, a JSON-formatted string, or something else?
1. The most basic crawl
This is typically a GET request, where the data is fetched directly from the server.
First of all, Python's built-in urllib and urllib2 modules can handle most ordinary page fetches. In addition, requests is a very handy third-party package, and there are similar ones such as httplib2.
```python
# Requests
import requests
response = requests.get(url)
content = requests.get(url).content  # string
print "response headers:", response.headers  # dict
print "content:", content

# Urllib2
import urllib2
response = urllib2.urlopen(url)
content = urllib2.urlopen(url).read()  # string
print "response headers:", response.headers  # not a dict
print "content:", content

# Httplib2
import httplib2
http = httplib2.Http()
response_headers, content = http.request(url, 'GET')
print "response headers:", response_headers  # dict
print "content:", content
```
In addition, for GET requests with query fields, the request data is usually appended to the URL: a ? separates the URL from the data, and multiple parameters are joined with &.
```python
data = {'data1': 'XXXXX', 'data2': 'XXXXX'}  # dict type

# Requests: data can be a dict (or json)
import requests
response = requests.get(url=url, params=data)  # send the GET request

# Urllib2: data must be a string
import urllib, urllib2
data = urllib.urlencode(data)  # encode: convert the dict to a string
full_url = url + '?' + data
response = urllib2.urlopen(full_url)  # send the GET request
```
2. Handling anti-crawler mechanisms
2.1 Simulated login
This is a POST request: the form data is sent to the server first, and the cookie the server returns is then stored locally.
```python
data = {'data1': 'XXXXX', 'data2': 'XXXXX'}  # dict type

# Requests: data can be a dict (or json)
import requests
response = requests.post(url=url, data=data)  # send the POST request; usable for username/password login

# Urllib2: data must be a string
import urllib, urllib2
data = urllib.urlencode(data)  # encode: convert the dict to a string
req = urllib2.Request(url=url, data=data)
response = urllib2.urlopen(req)  # send the POST request; usable for username/password login
```
2.2 Using cookies to log in
When you log in with a cookie, the server treats you as an already logged-in user and returns logged-in content. Therefore, sites that require a CAPTCHA can be handled with a cookie obtained from a CAPTCHA-assisted login.
```python
import requests
requests_session = requests.session()  # create a session object with requests.session(), which keeps track of cookies
response = requests_session.post(url=url_login, data=data)  # POST via requests_session; usable for username/password login. This step is required to obtain the response cookie!
```
If a CAPTCHA is involved, simply calling response = requests_session.post(url=url_login, data=data) will not work. Instead, do the following:
```python
response_captcha = requests_session.get(url=url_login, cookies=cookies)
response1 = requests.get(url_login)            # not logged in
response2 = requests_session.get(url_login)    # logged in, because the response cookie was obtained earlier
response3 = requests_session.get(url_results)  # logged in, because the response cookie was obtained earlier
```
Related reference: Web crawler – CAPTCHA login
Reference project: crawling the website
2.3 Disguising as a browser, or countering "anti-hotlinking"
```python
headers = {'User-Agent': 'XXXXX'}  # disguise as a browser; for sites that reject crawlers
headers = {'Referer': 'XXXXX'}     # counter "anti-hotlinking"; for sites that check the Referer
headers = {'User-Agent': 'XXXXX', 'Referer': 'XXXXX'}

# Requests
response = requests.get(url=url, headers=headers)

# Urllib2
import urllib, urllib2
req = urllib2.Request(url=url, headers=headers)
response = urllib2.urlopen(req)
```
3. Using proxies
When to use: when access is limited by IP address; a proxy can also work around CAPTCHAs triggered by "frequent clicks".
```python
proxies = {'http': 'http://XX.XX.XX.XX:XXXX'}

# Requests
import requests
response = requests.get(url=url, proxies=proxies)

# Urllib2
import urllib2
proxy_support = urllib2.ProxyHandler(proxies)
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)  # install the opener; subsequent urlopen() calls use the installed opener
response = urllib2.urlopen(url)
```
4. Reconnecting after a dropped connection
```python
def multi_session(session, *arg):
    while True:
        retryTimes = 20
        while retryTimes > 0:
            try:
                return session.post(*arg)
            except:
                print '.',
                retryTimes -= 1
```
Or
```python
def multi_open(opener, *arg):
    while True:
        retryTimes = 20
        while retryTimes > 0:
            try:
                return opener.open(*arg)
            except:
                print '.',
                retryTimes -= 1
```
This way, we can use multi_session or multi_open to keep hold of the session or opener used by the crawler.
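For instance, assuming a session created as in section 2.2 (the URL and form data here are made-up placeholders), the wrapper is simply dropped in where a plain post() call would go:

```python
# Hypothetical usage of the multi_session wrapper defined above; the URL and data are placeholders.
import requests

requests_session = requests.session()
url_login = 'http://example.com/login'
data = {'user': 'XXXXX', 'password': 'XXXXX'}
response = multi_session(requests_session, url_login, data)  # retried POST instead of requests_session.post(...)
```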
5. Multi-process crawling
Here is an experimental comparison of multi-process crawling of the Wallstreetcn news site: Python multi-process crawling versus Java multi-process crawling.
Related reference: a comparison of multi-process and multithreaded computation in Python and Java.
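The benchmark code itself is not reproduced here, but a minimal sketch of multi-process fetching with Python's multiprocessing.Pool (the URL list and worker count are made-up placeholders) looks like this:

```python
# Minimal multi-process fetch sketch; url_list and the pool size are placeholders.
import requests
from multiprocessing import Pool

def fetch(url):
    try:
        return url, requests.get(url, timeout=10).content  # (url, page content)
    except Exception:
        return url, None  # mark failed fetches

if __name__ == '__main__':
    url_list = ['http://example.com/page/%d' % i for i in range(1, 11)]
    pool = Pool(processes=4)              # 4 worker processes fetch in parallel
    results = pool.map(fetch, url_list)   # blocks until all pages are fetched
    pool.close()
    pool.join()
    print(len(results))
```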
6. Handling AJAX requests
For "load more" situations, pages use AJAX to transfer large amounts of data. They work by loading the page's source code from its URL and then running JavaScript in the browser; those scripts load additional content and "fill" it into the page. This is why fetching the page URL directly does not give you the page's actual content.
Here, use Google Chrome to analyze the link behind each such "request" (method: right-click → Inspect → Network → clear, click "load more", find the GET request whose Type is text/html, click it, and view the GET parameters or copy the Request URL), then repeat the process in a loop; a minimal sketch follows the list below.
* If there is a page visible before the "request", analyze the URL from the previous step to deduce the URL of page 1. Proceed in the same way, crawling the data from each AJAX address.
* Apply regular expressions to the returned JSON-format data (a str). In this JSON data, strings in the '\uxxxx' unicode_escape form need to be converted to u'\uxxxx' Unicode strings.
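Here is the minimal sketch of that loop; the AJAX endpoint, the 'page' parameter name, and the page range are assumptions, not taken from a real site:

```python
# Sketch: loop over an AJAX GET endpoint found in Chrome's Network panel.
# ajax_url, the 'page' parameter and the range are placeholders.
import json
import requests

ajax_url = 'http://example.com/ajax/list'
for page in range(1, 6):                              # deduce the real page range from the first request
    response = requests.get(ajax_url, params={'page': page})
    items = json.loads(response.content)              # json.loads also decodes '\uxxxx' escapes into unicode
    for item in items:
        print(item)
```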
Reference project: Python multi-process crawl
7. CAPTCHA recognition
For sites that use a CAPTCHA, we have three options:
- Use a proxy to update the IP.
- Use cookies to log in.
- CAPTCHA recognition.
Using a proxy and logging in with cookies were covered above; what follows is CAPTCHA recognition.
The open-source Tesseract-OCR system can be used to download and recognize CAPTCHA images, passing the recognized characters to the crawler to perform a simulated login. If the attempt fails, refresh the CAPTCHA and recognize it again until the login succeeds.
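As a rough sketch of that flow, assuming the pytesseract wrapper around Tesseract-OCR and made-up CAPTCHA/login URLs, form fields, and success check:

```python
# Sketch: download the CAPTCHA image, recognize it with Tesseract-OCR via pytesseract,
# attempt a simulated login, and retry with a fresh image until it succeeds.
# All URLs, form fields and the success marker are placeholders.
from io import BytesIO

import requests
import pytesseract
from PIL import Image

session = requests.session()
captcha_url = 'http://example.com/captcha.jpg'
url_login = 'http://example.com/login'

while True:
    image = Image.open(BytesIO(session.get(captcha_url).content))
    captcha_text = pytesseract.image_to_string(image).strip()
    response = session.post(url_login, data={'user': 'XXXXX',
                                             'password': 'XXXXX',
                                             'captcha': captcha_text})
    if 'welcome' in response.text:   # assumed marker of a successful login
        break                        # otherwise loop: fetch and recognize a new CAPTCHA
```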
Reference project: CAPTCHA1
Analysis
After crawling comes analysis of the crawled content: extracting whatever you need from it.
Common analysis tools include regular expressions, BeautifulSoup, lxml, and so on.
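For example, a minimal BeautifulSoup sketch that pulls the title and all links out of a crawled page; the inline HTML string stands in for the content fetched in the crawl step:

```python
# Sketch: parse crawled HTML with BeautifulSoup; 'content' stands in for the fetched page.
from bs4 import BeautifulSoup

content = '<html><head><title>XXXXX</title></head><body><a href="http://example.com">link</a></body></html>'
soup = BeautifulSoup(content, 'html.parser')
print(soup.title.string)             # the page title
for link in soup.find_all('a'):      # every <a> tag in the page
    print(link.get('href'))          # its href attribute
```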
Store
After analyzing what we need, the next step is to store it.
We can choose to save to a text file or to a MySQL or MongoDB database.
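As a small illustration, here is a sketch that writes parsed items to a text file and, assuming pymongo and a local MongoDB instance, to a MongoDB collection; the database and collection names and the sample item are placeholders:

```python
# Sketch: store parsed items to a text file and to MongoDB via pymongo.
import json
import pymongo

items = [{'title': 'XXXXX', 'url': 'http://example.com'}]

# Plain text file, one JSON document per line
with open('items.txt', 'w') as f:
    for item in items:
        f.write(json.dumps(item) + '\n')

# MongoDB (assumes a mongod running on localhost)
client = pymongo.MongoClient('localhost', 27017)
collection = client['crawler_db']['items']
collection.insert_many(items)
```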
Scrapy
Scrapy is an open-source Python crawler framework based on Twisted that is widely used in industry.
For related content, see Building a Web Crawler with Scrapy.
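A minimal spider sketch, with the start URL and CSS selector as placeholders rather than a real target site:

```python
# Sketch of a minimal Scrapy spider; the start URL and selector are placeholders.
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/list']

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():  # every link on the page
            yield {'url': response.urljoin(href)}              # emit one item per link
```

Saved as example_spider.py, it can be run with scrapy runspider example_spider.py -o items.json.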
Copyright notice: this is an original article by the author and may not be reproduced without permission.