Understanding Web Crawlers
No matter why you want to build a web crawler, the first thing to do is understand what it is.
Before learning about web crawlers, remember the following four key points:
1. Capture
Python's urllib may not be what you end up using, but you should try it if you never have. There are better alternatives, such as the third-party requests library, which is more user-friendly and mature; a Python developer who does not know these libraries has learned the language in vain. At its most basic, capturing a webpage simply means pulling it back.
If you go deeper, you will find yourself dealing with all sorts of webpage requirements: pages that need authentication, different file formats and encodings, and assorted odd issues such as URL normalization, avoiding repeated crawls, cookie handling, multi-threaded and multi-process crawling, multi-node crawling, crawl scheduling, and resource compression.
So the first step is just to pull the webpage back; from there you will gradually discover all kinds of problems to optimize.
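As a minimal sketch of this "pull the page back" step, assuming the third-party requests library mentioned above is installed (the URL below is just a placeholder, not from the article), fetching a page can look like this:

# -*- coding: utf-8 -*-
# Minimal fetch sketch using the requests library (pip install requests).
# The URL is a placeholder chosen for illustration only.
import requests

url = 'http://example.com/'
resp = requests.get(url, timeout=10)       # pull the webpage back
resp.encoding = resp.apparent_encoding     # guard against the encoding issues mentioned above
print 'Status :', resp.status_code
print 'Length :', len(resp.text)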
2. Storage
What you capture is usually stored according to some policy rather than analyzed directly. Personally, I think a better architecture separates capture from analysis so the two are more loosely coupled: if one stage has a problem, it is isolated from problems that might occur in the other, which makes troubleshooting, updating, and releasing each stage easier.
How to store the data, whether in a file system, a SQL or NoSQL database, or an in-memory database, is the focus of this stage. You can start by saving to the file system, naming files according to certain rules.
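As a rough sketch of the "save to the file system with a naming rule" idea, the snippet below stores raw HTML in a pages directory, named by the md5 of the URL; both the directory and the naming rule are my own assumptions for illustration:

# -*- coding: utf-8 -*-
# Sketch: store raw pages on the file system, named by the md5 hash of the URL.
import hashlib
import os

def save_page(url, html, directory='pages'):
    if not os.path.exists(directory):
        os.makedirs(directory)
    name = hashlib.md5(url).hexdigest() + '.html'   # naming rule: md5 of the URL
    path = os.path.join(directory, name)
    with open(path, 'wb') as f:
        f.write(html)                               # keep the raw page for later analysis
    return path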
3. Analysis
Analysis means processing the text of a webpage, whether to extract its links or its body text; in any case, you will need to analyze the links. Use whatever method you consider fastest and best, such as regular expressions, then apply the analysis results to the other stages. :)
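For example, a quick regular-expression pass to pull the links out of a page might look like the sketch below; the pattern is a deliberate simplification for illustration, not a robust HTML parser:

# -*- coding: utf-8 -*-
# Sketch: extract absolute href links from raw HTML with a regular expression.
import re

LINK_RE = re.compile(r'href=["\'](http[^"\']+)["\']', re.IGNORECASE)

def extract_links(html):
    # The matched links can be fed back into the capture stage.
    return LINK_RE.findall(html)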
4. Display
If you have done all this work but have no output to show at all, how will anyone see its value? So finding a good way to display the results is also key to showing off your muscles.
Whether you write a crawler for a website or analyze some data, do not forget this stage; it is what presents your results to others.
Web Crawler Definition
Web crawler, or Web Spider, is a vivid name.
If the Internet is compared to a spider's web, then a Spider is the spider crawling around on that web.
Web crawlers look for webpages by their link addresses.
Starting from one page of a website (usually the homepage), the crawler reads the content of that page, finds the other link addresses it contains, and then uses those links to find the next pages. This repeats until every page of the website has been crawled. If the entire Internet is regarded as one website, a web spider can use this principle to capture all the webpages on the Internet.
Seen this way, a web crawler is a crawling program, a program that captures webpages.
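Expressed as code, that principle is essentially a queue-driven loop. The sketch below is only an illustration; fetch and extract_links stand in for the capture and analysis steps described earlier and are not functions from the article:

# -*- coding: utf-8 -*-
# Sketch of the crawl loop described above: start from a seed page, pull it back,
# extract its links, and repeat until no unvisited pages remain.
def crawl(seed_url, fetch, extract_links):
    to_visit = [seed_url]
    visited = set()
    while to_visit:
        url = to_visit.pop(0)            # take the next page in breadth-first order
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)                # capture step
        if html is None:
            continue
        for link in extract_links(html): # analysis step
            if link not in visited:
                to_visit.append(link)
    return visited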
The basic operation of a web crawler is to capture webpages. So how do we get exactly the pages we want?
Start with the URL.
First, obtain the real URL of the webpage. A simple piece of code is as follows:
from urllib2 import Request, urlopen, URLError, HTTPError
# Import from the urllib2 module so Request can be used directly;
# with "from ... import ..." there is no need to write urllib2.Request

old_url = 'http://rrurl.cn/b1UZuP'   # the address displayed on the webpage
req = Request(old_url)
response = urlopen(req)
print 'Old url :' + old_url
print 'Real url :' + response.geturl()
When you run this code, the following error occurs: HTTPError: 403, which means the website refuses access by web crawlers. Common HTTP status codes are listed below.
HTTP status codes are generally divided into 5 classes, beginning with the digits 1 through 5 and consisting of three-digit integers:
200: the request succeeded. Handling method: obtain the response content and process it.
201: the request is complete and has resulted in the creation of a new resource. The URI of the newly created resource can be obtained from the response.
202: the request has been accepted, but processing is not yet complete. Handling method: block and wait.
204: the server has fulfilled the request but has no new information to return. If the client is a user agent, it does not need to update its document view. Handling method: discard.
300: this status code is not used directly by HTTP/1.0 applications; it only serves as the default explanation for 3XX responses. Multiple copies of the requested resource are available. Handling method: process further if the program can handle it; otherwise discard.
301: the requested resource has been assigned a permanent URL, which can be used to access the resource in the future. Handling method: redirect to the assigned URL.
302: the requested resource is temporarily located at a different URL. Handling method: redirect to the temporary URL.
304: the requested resource has not been updated. Handling method: discard.
400: illegal request. Handling method: discard.
401: unauthorized. Handling method: discard.
403: forbidden. Handling method: discard.
404: not found. Handling method: discard.
5XX: status codes beginning with "5" indicate that the server has encountered an error and cannot continue processing the request. Handling method: discard.
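One simple way to apply the handling methods above with urllib2 is to catch HTTPError and inspect its code attribute. The following is only a sketch under that assumption, not code from the article:

# Sketch: handle HTTP errors when fetching with urllib2 (Python 2).
import urllib2

def fetch(url):
    try:
        return urllib2.urlopen(url).read()
    except urllib2.HTTPError as e:
        # 4XX and 5XX responses raise HTTPError; per the table above, discard them.
        print 'Discarding %s (HTTP %d)' % (url, e.code)
        return None
    except urllib2.URLError as e:
        print 'Failed to reach %s: %s' % (url, e.reason)
        return None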
So what should we do about the 403 above? It is actually very simple: have the crawler disguise its visit to the website as that of a normal browser. The code is as follows:
#---------------------------------------
#   Program: twxs Crawler
#   Version: 0.1
#   Author: playful little Gods
#   Date:
#   Programming Language: Python 2.7
#   Function: outputs the real url of the site
#---------------------------------------
import urllib
import urllib2
# Import the urllib and urllib2 modules; using from ... import ... is not recommended here

old_url = 'http://www.zhubajie.com/wzkf/th1.html'
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
# Set the initial values of old_url and user_agent.
# User-Agent: some servers or proxies use this value to decide whether the request
# comes from a browser, so setting it here disguises the crawler as a browser.

values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}
# Initialize the form data and request headers

data = urllib.urlencode(values)
req = urllib2.Request(old_url, data, headers=headers)
# The client sends the request to the server
response = urllib2.urlopen(req)
# The server responds to the client's request
print 'Old url :' + old_url
print 'Real url :' + response.geturl()