Python web crawler (i): A preliminary understanding of web crawlers


Whatever your reason for wanting to write a web crawler, the first thing to do is to understand it.


Before diving into web crawlers, keep the following four points in mind; they are the foundation of every crawler:


1. Crawl


Python's urllib is not something you are forced to use, but it is worth learning if you have not already. Better alternatives exist, such as requests and other third-party libraries that are more user-friendly and mature; a Python programmer who does not know these libraries is missing out. The most basic form of crawling is simply pulling a page back.
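As a minimal sketch of that basic step, here is what pulling a page back with the third-party requests library might look like (the URL is only an example chosen for illustration, not one from the original article):

import requests
# A minimal sketch of pulling a page back with the third-party requests library.
response = requests.get('http://example.com', timeout=10)  # fetch the page
print(response.status_code)   # HTTP status code, e.g. 200
print(response.text[:200])    # first 200 characters of the page body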


As you go deeper, you will face all kinds of requirements: authentication, different file formats, encoding handling, all sorts of odd URL normalization, avoiding repeated crawls, cookie handling, multi-threaded and multi-process crawling, multi-node crawling, crawl scheduling, resource compression, and so on.


So the first step is simply to pull the page back; over time you will run into all of these problems and can optimize as needed.


2. Storage


Pages that are fetched are generally saved using some strategy rather than analyzed directly. Personally, I think a better architecture separates analysis from crawling: the coupling is looser, a problem in one stage stays isolated from problems in another, and troubleshooting and releasing updates become easier.


So the focus of this stage is how to store: the file system, an SQL or NoSQL database, or an in-memory database. You can start with the file system and name the files according to some rule.
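As one hedged sketch of such a rule, you might save each fetched page under a file named after the MD5 hash of its URL; the save_page helper below is only illustrative, not part of the original article:

import hashlib
import os

# A sketch of saving a fetched page to disk, naming the file by the MD5 hash of its URL.
# html is expected as a byte string (for example response.content from requests).
def save_page(url, html, directory='pages'):
    if not os.path.exists(directory):
        os.makedirs(directory)                 # create the storage directory on first use
    name = hashlib.md5(url.encode('utf-8')).hexdigest() + '.html'
    path = os.path.join(directory, name)
    with open(path, 'wb') as f:
        f.write(html)                          # store the raw page for later analysis
    return path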


3. Analysis


Analysis means parsing the text of web pages to extract links, extract the body text, or whatever your needs dictate; the one thing you must do is analyze links. Use whatever method is quickest and works best, such as regular expressions, and then feed the extracted links back into further crawling. :)
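A quick hedged sketch of that idea, using a regular expression to pull href values out of a page (a real crawler would usually prefer an HTML parser, since regexes are fragile on messy markup):

import re

# Extract the values of href="..." attributes from a page with a regular expression.
def extract_links(html):
    return re.findall(r'href=["\'](.*?)["\']', html)

print(extract_links('<a href="http://example.com/page1">one</a>'))
# ['http://example.com/page1']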


4. Display


If you do a pile of work but produce no output at all, how do you demonstrate its value? So finding good presentation components to show off the results is also critical.
If you want to build a site from your crawler, or analyze the data to produce something, do not forget this step; presenting the results well makes them far more convincing to others.


The definition of a web crawler


Web crawler, or spider, is a very vivid name.


The Internet is likened to a spider's web, and the spider is a program crawling back and forth across that web.


A web spider finds web pages by following their URLs.


Starting from one page of a site (usually the homepage), it reads the content of that page, finds the other links it contains, and then uses those links to reach the next pages, continuing the cycle until every page of the site has been crawled. If you treat the entire Internet as one big site, a web spider can use this principle to crawl every page on the Internet.
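That cycle can be sketched as a simple queue-driven loop; fetch_page and extract_links below are assumed helpers (for instance, the requests and regular-expression sketches above), not functions from the original article:

# A sketch of the crawl cycle: start from one page, collect its links, follow them.
# fetch_page and extract_links are assumed helpers, e.g. the sketches shown earlier.
def crawl(start_url, max_pages=100):
    to_visit = [start_url]                    # queue of URLs still to fetch
    visited = set()                           # URLs already crawled
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue
        html = fetch_page(url)                # pull the page back
        visited.add(url)
        for link in extract_links(html):      # find the other links in the page
            if link not in visited:
                to_visit.append(link)
    return visited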


Seen this way, a web crawler is exactly that: a crawling program, a program that fetches web pages.


The basic operation of a web crawler is fetching web pages. So how do you get the page you want?


Let's start with the URL.


First, get the page's real URL. Simple code for this is shown below:


from urllib2 import Request, urlopen, URLError, HTTPError
# Import from the urllib2 module so that Request can be used directly
# instead of urllib2.Request (the from ... import ... form)
old_url = 'http://rrurl.cn/b1UZuP'   # the address shown on the page
req = Request(old_url)
response = urlopen(req)
print 'Old URL: ' + old_url
print 'Real URL: ' + response.geturl()


Running this code reports an HTTPError 403, which means the site refused the crawler access. The HTTP status codes are listed below:


HTTP status codes are three-digit integers, usually divided into five classes by their leading digit (1 through 5):


------------------------------------------------------------------------------------------------
200: The request succeeded. Handling: read the response content and process it
201: The request completed and a new resource was created; the URI of the new resource is available in the response entity. Handling: a crawler will rarely encounter this
202: The request was accepted, but processing is not yet complete. Handling: block and wait
204: The server fulfilled the request but has no new information to return; a user agent need not update its document view. Handling: discard
300: Not used directly by HTTP/1.0 applications, only as the generic interpretation of 3xx responses; multiple versions of the requested resource are available. Handling: process further if the program can, otherwise discard
301: The requested resource has been assigned a permanent URL and should be accessed through that URL in the future. Handling: redirect to the assigned URL
302: The requested resource temporarily resides at a different URL. Handling: redirect to the temporary URL
304: The requested resource has not been modified. Handling: discard
400: Bad request. Handling: discard
401: Unauthorized. Handling: discard
403: Forbidden. Handling: discard
404: Not found. Handling: discard
5xx: Status codes beginning with "5" indicate that the server encountered an error and cannot continue processing the request. Handling: discard
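As a hedged sketch of acting on these codes with the urllib2 module used below, a crawler can catch HTTPError, look at its code attribute, and decide whether to keep or discard the response:

import urllib2

# Fetch a URL and act on the status code: keep 2xx content, report and discard errors.
def fetch(url):
    try:
        return urllib2.urlopen(url).read()                         # 2xx: keep the content
    except urllib2.HTTPError as e:
        print('Discarding %s, status code: %d' % (url, e.code))    # 4xx/5xx: discard
        return None
    except urllib2.URLError as e:
        print('Failed to reach %s: %s' % (url, e.reason))          # could not reach the server
        return None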


What do we do then? It is actually very simple: have the crawler disguise itself as a normal browser visit and the problem is solved. The code is as follows:


# ---------------------------------------
# Program: Twxs crawler
# Version: 0.1
# Author: Playful Little God
# Date: 2015-07-29
# Language: Python 2.7
# Function: print the real URL of a site
# ---------------------------------------
import urllib
import urllib2
# Import the urllib and urllib2 modules; the from ... import ... form is not recommended here
old_url = 'http://www.zhubajie.com/wzkf/th1.html'
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
# Set the initial values old_url and user_agent
# User-Agent: some servers or proxies use this value to decide whether the request
# comes from a browser, so it is set here to disguise the crawler as a browser
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}
# Initialization
data = urllib.urlencode(values)                       # encode the form data
req = urllib2.Request(old_url, data, headers=headers)
# The client sends the request to the server
response = urllib2.urlopen(req)
# The server responds to the client's request
print 'Old URL: ' + old_url
print 'Real URL: ' + response.geturl()
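For comparison, the same User-Agent disguise can be written with the requests library mentioned earlier; a minimal sketch, assuming requests is installed:

import requests

# The same disguise with requests: pass a browser-like User-Agent header.
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
response = requests.get('http://www.zhubajie.com/wzkf/th1.html', headers=headers, timeout=10)
print('Real URL: ' + response.url)   # the final URL after any redirects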


Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
