Understanding Web Crawlers
No matter why you want to build a web crawler, the first thing to do is understand what it is.
Before learning about web crawlers, remember the following four key points:
1. Capture
Python's urllib may not be what you end up using, but you should try it if you never have. There are better alternatives, such as the third-party requests library, which is more user-friendly and mature; a Python developer who does not know these libraries has learned the language in vain. At its most basic, capturing a webpage simply means pulling it back.
If you go deeper, you will find yourself dealing with all sorts of webpage requirements: pages that need authentication, different file formats and encodings, and assorted odd issues such as URL normalization, avoiding repeated crawls, cookie handling, multi-threaded and multi-process crawling, multi-node crawling, crawl scheduling, and resource compression.
So the first step is just to pull the webpage back; from there you will gradually discover all kinds of problems to optimize.
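As a minimal sketch of this "pull the page back" step, assuming the third-party requests library mentioned above is installed (the URL below is just a placeholder, not from the article), fetching a page can look like this:

# -*- coding: utf-8 -*-
# Minimal fetch sketch using the requests library (pip install requests).
# The URL is a placeholder chosen for illustration only.
import requests

url = 'http://example.com/'
resp = requests.get(url, timeout=10)       # pull the webpage back
resp.encoding = resp.apparent_encoding     # guard against the encoding issues mentioned above
print 'Status :', resp.status_code
print 'Length :', len(resp.text)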
2. Storage
What you capture is usually stored according to some policy rather than analyzed directly. Personally, I think a better architecture separates capture from analysis so the two are more loosely coupled: if one stage has a problem, it is isolated from problems that might occur in the other, which makes troubleshooting, updating, and releasing each stage easier.
How to store the data, whether in a file system, a SQL or NoSQL database, or an in-memory database, is the focus of this stage. You can start by saving to the file system, naming files according to certain rules.
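As a rough sketch of the "save to the file system with a naming rule" idea, the snippet below stores raw HTML in a pages directory, named by the md5 of the URL; both the directory and the naming rule are my own assumptions for illustration:

# -*- coding: utf-8 -*-
# Sketch: store raw pages on the file system, named by the md5 hash of the URL.
import hashlib
import os

def save_page(url, html, directory='pages'):
    if not os.path.exists(directory):
        os.makedirs(directory)
    name = hashlib.md5(url).hexdigest() + '.html'   # naming rule: md5 of the URL
    path = os.path.join(directory, name)
    with open(path, 'wb') as f:
        f.write(html)                               # keep the raw page for later analysis
    return path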
3. Analysis
Analysis means processing the text of a webpage, whether to extract its links or its body text; in any case, you will need to analyze the links. Use whatever method you consider fastest and best, such as regular expressions, then apply the analysis results to the other stages. :)
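For example, a quick regular-expression pass to pull the links out of a page might look like the sketch below; the pattern is a deliberate simplification for illustration, not a robust HTML parser:

# -*- coding: utf-8 -*-
# Sketch: extract absolute href links from raw HTML with a regular expression.
import re

LINK_RE = re.compile(r'href=["\'](http[^"\']+)["\']', re.IGNORECASE)

def extract_links(html):
    # The matched links can be fed back into the capture stage.
    return LINK_RE.findall(html)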
4. Display
If you have done all this work but have no output to show at all, how will anyone see its value? So finding a good way to display the results is also key to showing off your muscles.
Whether you write a crawler for a website or analyze some data, do not forget this stage; it is what presents your results to others.
Web Crawler Definition
Web crawler, or Web Spider, is a vivid name.
If the Internet is compared to a spider's web, then a Spider is the spider crawling around on that web.
Web crawlers look for webpages by their link addresses.
Starting from one page of a website (usually the homepage), the crawler reads the content of that page, finds the other link addresses it contains, and then uses those links to find the next pages. This repeats until every page of the website has been crawled. If the entire Internet is regarded as one website, a web spider can use this principle to capture all the webpages on the Internet.
Seen this way, a web crawler is a crawling program, a program that captures webpages.
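Expressed as code, that principle is essentially a queue-driven loop. The sketch below is only an illustration; fetch and extract_links stand in for the capture and analysis steps described earlier and are not functions from the article:

# -*- coding: utf-8 -*-
# Sketch of the crawl loop described above: start from a seed page, pull it back,
# extract its links, and repeat until no unvisited pages remain.
def crawl(seed_url, fetch, extract_links):
    to_visit = [seed_url]
    visited = set()
    while to_visit:
        url = to_visit.pop(0)            # take the next page in breadth-first order
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)                # capture step
        if html is None:
            continue
        for link in extract_links(html): # analysis step
            if link not in visited:
                to_visit.append(link)
    return visited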
The basic operation of a web crawler is to capture webpages. So how do we get exactly the pages we want?
Start with the URL.
First, obtain the real URL of the webpage. A simple piece of code is as follows:
from urllib2 import Request, urlopen, URLError, HTTPError
# Import from the urllib2 module so Request can be used directly;
# with "from ... import ..." there is no need to write urllib2.Request

old_url = 'http://rrurl.cn/b1UZuP'   # the address displayed on the webpage
req = Request(old_url)
response = urlopen(req)
print 'Old url :' + old_url
print 'Real url :' + response.geturl()
When you run this code, the following error occurs: HTTPError: 403, which means the website refuses access by web crawlers. Common HTTP status codes are listed below.
HTTP status codes are generally divided into 5 classes, beginning with the digits 1 through 5 and consisting of three-digit integers:
200: the request succeeded. Handling method: obtain the response content and process it.
201: the request is complete and has resulted in the creation of a new resource. The URI of the newly created resource can be obtained from the response.
202: the request has been accepted, but processing is not yet complete. Handling method: block and wait.
204: the server has fulfilled the request but has no new information to return. If the client is a user agent, it does not need to update its document view. Handling method: discard.
300: this status code is not used directly by HTTP/1.0 applications; it only serves as the default explanation for 3XX responses. Multiple copies of the requested resource are available. Handling method: process further if the program can handle it; otherwise discard.
301: the requested resource has been assigned a permanent URL, which can be used to access the resource in the future. Handling method: redirect to the assigned URL.
302: the requested resource is temporarily located at a different URL. Handling method: redirect to the temporary URL.
304: the requested resource has not been updated. Handling method: discard.
400: illegal request. Handling method: discard.
401: unauthorized. Handling method: discard.
403: forbidden. Handling method: discard.
404: not found. Handling method: discard.
5XX: status codes beginning with "5" indicate that the server has encountered an error and cannot continue processing the request. Handling method: discard.
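One simple way to apply the handling methods above with urllib2 is to catch HTTPError and inspect its code attribute. The following is only a sketch under that assumption, not code from the article:

# Sketch: handle HTTP errors when fetching with urllib2 (Python 2).
import urllib2

def fetch(url):
    try:
        return urllib2.urlopen(url).read()
    except urllib2.HTTPError as e:
        # 4XX and 5XX responses raise HTTPError; per the table above, discard them.
        print 'Discarding %s (HTTP %d)' % (url, e.code)
        return None
    except urllib2.URLError as e:
        print 'Failed to reach %s: %s' % (url, e.reason)
        return None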
So what should we do about the 403 above? It is actually very simple: have the crawler disguise its visit to the website as that of a normal browser. The code is as follows:
#---------------------------------------
#   Program: twxs Crawler
#   Version: 0.1
#   Author: playful little Gods
#   Date:
#   Programming Language: Python 2.7
#   Function: outputs the real url of the site
#---------------------------------------
import urllib
import urllib2
# Import the urllib and urllib2 modules; using from ... import ... is not recommended here

old_url = 'http://www.zhubajie.com/wzkf/th1.html'
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
# Set the initial values of old_url and user_agent.
# User-Agent: some servers or proxies use this value to decide whether the request
# comes from a browser, so setting it here disguises the crawler as a browser.

values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}
# Initialize the form data and request headers

data = urllib.urlencode(values)
req = urllib2.Request(old_url, data, headers=headers)
# The client sends the request to the server
response = urllib2.urlopen(req)
# The server responds to the client's request
print 'Old url :' + old_url
print 'Real url :' + response.geturl()