[Python] Spiders

Source: Internet
Author: User
Tags: response code, domain, server

Summary: This article covers the exercises and background knowledge involved in Python-based crawlers, including the HTTP protocol, cookies, etc.


Tool: Fiddler

Python does not use a proxy by default, so Fiddler cannot intercept its network traffic. If you want to use Fiddler to analyze a Python program's network access, you need to set up the proxy in the Python program itself; see the references below for threads on configuring a proxy for Python programs.


Reference:

[1] Fiddler: how to capture HTTP access in Python 3

[2] Fiddler doesn't capture Python HTTP requests & Proxy with urllib2

[3] Fiddler Tutorial


HTTP protocol

HTTP is the abbreviation of Hypertext Transfer Protocol, the protocol used to transfer hypertext from WWW servers to the local browser.

Request Response Model for HTTP

This limits the use of the HTTP protocol: the server cannot push a message to the client unless the client initiates a request.

The HTTP protocol is a stateless protocol: there is no correspondence between one request and the previous request from the same client.

Work flow

An HTTP operation is called a transaction, and its working process can be divided into four steps:

1) First, the client and the server establish a connection. The HTTP work begins as soon as you click a hyperlink.

2) After the connection is established, the client sends a request to the server, consisting of a Uniform Resource Identifier (URI) and protocol version number, followed by MIME information including request modifiers, client information, and possibly content.

3) When the server receives the request, it gives the corresponding response in the form of a status line, including the protocol version number and a success or error code, followed by MIME information including server information, entity information, and possibly content.

4) The client receives the information returned by the server, the browser displays it on the user's screen, and then the client disconnects from the server.


If an error occurs in any of the steps above, the error information is returned to the client and displayed. For the user, these processes are carried out by HTTP itself; the user just clicks with the mouse and waits for the information to be displayed.
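The four steps above can be sketched with a raw socket. This is a Python 3 illustration, not part of the original article's code; to stay self-contained and runnable offline, it talks to a local http.server instance instead of a real web server, and the /hello.txt path is only a placeholder.

```python
# A sketch of the four-step HTTP transaction using a raw socket (Python 3).
# The local http.server instance stands in for a real web server; the
# steps are identical against any host.
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = b"Hello World!\r\n"
        self.send_response(200)                        # status line
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                         # entity body

    def log_message(self, *args):                      # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), HelloHandler)    # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# 1) The client establishes a connection with the server.
sock = socket.create_connection((host, port), timeout=5)

# 2) It sends a request: request line, headers, then a blank line (CRLF).
sock.sendall((f"GET /hello.txt HTTP/1.1\r\n"
              f"Host: {host}:{port}\r\n"
              "Connection: close\r\n"
              "\r\n").encode("ascii"))

# 3) It receives the response: status line, headers, entity body.
response = b""
while chunk := sock.recv(4096):
    response += chunk

# 4) The client disconnects from the server.
sock.close()
server.shutdown()

status_line = response.split(b"\r\n", 1)[0].decode("ascii")
print(status_line)   # HTTP/1.1 200 OK
```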


Data

The following example illustrates a typical message exchange for a GET request on the URI "http://www.example.com/hello.txt":


Client Request:

GET /hello.txt HTTP/1.1
User-Agent: curl/7.16.3 libcurl/7.16.3 OpenSSL/0.9.7l zlib/1.2.3
Host: www.example.com
Accept-Language: en, mi


Server Response:

HTTP/1.1 200 OK
Date: Mon, 27 Jul 2009 12:28:53 GMT
Server: Apache
Last-Modified: Wed, 22 Jul 2009 19:15:56 GMT
ETag: "34aa387-d-1568eb00"
Accept-Ranges: bytes
Content-Length: 51
Vary: Accept-Encoding
Content-Type: text/plain


Hello World! My payload includes a trailing CRLF.


Request Header

The request header allows the client to pass additional information about the request, and about the client itself, to the server.

Common Request Headers

Accept

The Accept request header field is used to specify which types of information the client accepts. E.g. Accept: image/gif indicates that the client wants to accept resources in GIF image format; Accept: text/html indicates that the client wants to accept HTML text.

Accept-Charset

The Accept-Charset request header field is used to specify the character sets accepted by the client, e.g. Accept-Charset: iso-8859-1,gb2312. If this field is not set in the request message, the default is to accept any character set.

Accept-Encoding

The Accept-Encoding request header field is similar to Accept, but it is used to specify acceptable content encodings, e.g. Accept-Encoding: gzip, deflate. If this field is not set in the request message, the server assumes that the client accepts all content encodings.

Accept-Language

The Accept-Language request header field is similar to Accept, but it is used to specify a natural language, e.g. Accept-Language: zh-cn. If this field is not set in the request message, the server assumes that the client accepts every language.

Authorization

The Authorization request header field is primarily used to prove that the client has permission to view a resource. When a browser accesses a page and receives a 401 (Unauthorized) response code from the server, it can send a request containing an Authorization header field, asking the server to validate it.

Host (this header field is required when a request is sent)

The Host request header field specifies the Internet host and port number of the requested resource, which is usually extracted from the HTTP URL. For example, if we enter http://www.guet.edu.cn/index.html in the browser, the request message sent by the browser contains the Host request header field, as follows:

Host: www.guet.edu.cn

The default port number 80 is used here; if a port number is specified, it becomes Host: www.guet.edu.cn:port.

User-Agent

When we visit online forums, we often see welcome messages that list the name and version of our operating system and of the browser we are using, which surprises many people; in fact, the server application obtains this information from the User-Agent request header field. The User-Agent request header field allows the client to tell the server its operating system, browser, and other properties. This header field is not required, however: if we write a client ourselves and do not send the User-Agent request header field, the server will not be able to obtain this information.


An example of a request header:

GET /form.html HTTP/1.1 (CRLF)
Accept: image/gif,image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/* (CRLF)
Accept-Language: zh-cn (CRLF)
Accept-Encoding: gzip, deflate (CRLF)
If-Modified-Since: Wed, 05 Jan 11:21:25 GMT (CRLF)
If-None-Match: W/"80b1a4c018f3c41:8317" (CRLF)
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0) (CRLF)
Host: www.guet.edu.cn (CRLF)
Connection: Keep-Alive (CRLF)
(CRLF)


Response header

The response header allows the server to pass additional response information that cannot be placed in the status line, as well as information about the server and about further access to the resource identified by the Request-URI.

Common Response Headers

Location

The Location response header field is used to redirect the recipient to a new location. It is commonly used when a domain name changes.

Server

The Server response header field contains the software information the server used to process the request. It corresponds to the User-Agent request header field. Below is an example of the Server response header field:

Server: Apache-Coyote/1.1

WWW-Authenticate

The WWW-Authenticate response header field must be included in 401 (Unauthorized) response messages. When the client receives the 401 response, it resends the request with an Authorization header field so that the server can validate it.

E.g. WWW-Authenticate: Basic realm="Basic Auth test!" // You can see that the server uses a Basic authentication mechanism for the requested resource.
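For the Basic scheme, the Authorization value the client sends back is just the base64 encoding of "user:password". A minimal sketch of building that header after receiving such a challenge (the credentials are made up):

```python
# Sketch: constructing the Authorization header a client resends after a
# 401 challenge with WWW-Authenticate: Basic. Credentials are made up.
import base64

username, password = "user", "secret"
token = base64.b64encode(f"{username}:{password}".encode("ascii")).decode("ascii")
authorization = f"Basic {token}"
print(authorization)   # Basic dXNlcjpzZWNyZXQ=
```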


https://www.w3.org/Protocols/#Specs


Crawler Programming

Simple crawler (Python 2):

import urllib

url = "http://www.healforce.com/cn/index.php?ac=article&at=read&did=444"
webpage = urllib.urlopen(url)
data = webpage.read()


The code to set up the proxy is as follows (to make packet analysis with Fiddler easy):


import urllib2

url = "http://www.healforce.com/cn/index.php?ac=article&at=read&did=444"
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8888'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
webpage = urllib2.urlopen(url)
data = webpage.read()
print(data)
print(type(webpage))
print(webpage.geturl())
print(webpage.info())
print(webpage.getcode())
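The same proxy setup carries over to Python 3, where urllib2's pieces moved into urllib.request. This is a sketch: actually fetching through 127.0.0.1:8888 requires Fiddler to be listening, so urlopen is not called here.

```python
# The same Fiddler proxy setup in Python 3: urllib2's ProxyHandler,
# build_opener and install_opener now live in urllib.request.
import urllib.request

proxy = urllib.request.ProxyHandler({"http": "127.0.0.1:8888"})  # Fiddler's default port
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)

# From now on, urllib.request.urlopen(url) routes plain-HTTP traffic
# through the proxy. (Not called here: it requires Fiddler to be running.)
```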


>>> print(type(webpage))
<type 'instance'>
>>> print(webpage.geturl())
http://www.healforce.com/cn/index.php?ac=article&at=read&did=444
>>> print(webpage.info())
Date: Thu, 10:38:48 GMT
Server: Apache/2.4.10 (Win32) OpenSSL/1.0.1h
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
>>> print(webpage.getcode())
200


Using Fiddler to analyze the crawled data:

1. 200 indicates that the access succeeded

2. The address that was visited

3. The request header generated by Python

4. The HTML returned in the response, which is the same as what print(data) outputs


