Summary: This article covers exercises and background knowledge for Python-based crawlers, including the HTTP protocol, cookies, etc.
Tool: Fiddler
Python does not use a proxy by default, so Fiddler cannot intercept its network traffic. To analyze a Python program's network access with Fiddler, you need to set up the proxy in the Python program itself; see the references below on setting a proxy for Python programs.
Reference:
[1] Fiddler: how to capture HTTP access in Python 3
[2] Fiddler doesn't capture Python HTTP requests & proxy with urllib2
[3] Fiddler Tutorial
HTTP protocol
HTTP is an abbreviation of HyperText Transfer Protocol, the protocol used to transfer hypertext from a WWW server to the local browser.
Request/response model of HTTP
This limits the use of HTTP: when the client has not initiated a request, the server cannot push a message to the client.
HTTP is a stateless protocol: there is no correspondence between a request and the previous request from the same client.
Workflow
An HTTP operation is called a transaction, and its working process can be divided into four steps:
1) First the client and the server establish a connection. As soon as you click a hyperlink, the HTTP work begins.
2) After the connection is established, the client sends a request to the server, giving the Uniform Resource Identifier (URI) and the protocol version number, followed by MIME-style information including request modifiers, client information, and possibly a body.
3) When the server receives the request, it responds with a status line containing the protocol version number and a success or error code, followed by MIME-style information including server information, entity metadata, and possibly a body.
4) The client receives the information returned by the server, the browser displays it on the user's screen, and the client then disconnects from the server.
If an error occurs in any of these steps, information describing the error is returned to the client and displayed. For the user, all of these steps are carried out by HTTP itself; the user just clicks with the mouse and waits for the information to appear.
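The four steps above can be sketched end to end with Python 3's standard library. The tiny local server below is a stand-in (not from the article) so that the whole transaction runs offline:

```python
import http.client
import http.server
import threading

# A minimal local server as a stand-in for a real website,
# so the whole transaction runs offline.
class HelloHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"Hello world!"
        self.send_response(200)                        # status line
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                         # entity body

    def log_message(self, *args):                      # silence request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Step 1: establish a connection; step 2: send the request.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/hello.txt")

# Step 3: the server answers with a status line, headers, and content.
resp = conn.getresponse()
print(resp.status, resp.reason)            # 200 OK
print(resp.getheader("Content-Type"))      # text/plain
data = resp.read()

# Step 4: the client disconnects.
conn.close()
server.shutdown()
server.server_close()
```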
Data
The following example illustrates a typical message exchange for a GET request on the URI "http://www.example.com/hello.txt":
Client request:
GET /hello.txt HTTP/1.1
User-Agent: curl/7.16.3 libcurl/7.16.3 OpenSSL/0.9.7l zlib/1.2.3
Host: www.example.com
Accept-Language: en, mi
Server response:
HTTP/1.1 200 OK
Date: Mon, 27 Jul 2009 12:28:53 GMT
Server: Apache
Last-Modified: Wed, 22 Jul 2009 19:15:56 GMT
ETag: "34aa387-d-1568eb00"
Accept-Ranges: bytes
Content-Length: 51
Vary: Accept-Encoding
Content-Type: text/plain
Hello World! My payload includes a trailing CRLF.
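A response like the one above can be taken apart with plain string handling: each header line ends in CRLF, and a blank line separates the headers from the body. A minimal sketch:

```python
# The raw response text reconstructed as a Python string; header lines
# end in CRLF and a blank line separates headers from the body.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Length: 51\r\n"
    "Content-Type: text/plain\r\n"
    "\r\n"
    "Hello World! My payload includes a trailing CRLF.\r\n"
)

head, _, body = raw.partition("\r\n\r\n")          # split headers from body
status_line, *header_lines = head.split("\r\n")
headers = dict(line.split(": ", 1) for line in header_lines)

print(status_line)              # HTTP/1.1 200 OK
print(headers["Content-Type"])  # text/plain
print(len(body))                # 51, matching Content-Length
```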
Request Header
The request headers allow the client to pass additional information about the request, and about the client itself, to the server.
Common Request Headers
Accept
The Accept request header field is used to specify which types of information the client accepts. e.g. Accept: image/gif indicates that the client wants to receive resources in GIF image format; Accept: text/html indicates that the client wants to receive HTML text.
Accept-Charset
The Accept-Charset request header field is used to specify the character set accepted by the client. e.g. Accept-Charset: iso-8859-1,gb2312. If this field is not set in the request message, the default is that any character set is accepted.
Accept-Encoding
The Accept-Encoding request header field is similar to Accept, but it is used to specify acceptable content encodings. e.g. Accept-Encoding: gzip, deflate. If this field is not set in the request message, the server assumes the client accepts all content encodings.
Accept-Language
The Accept-Language request header field is similar to Accept, but it is used to specify a natural language. e.g. Accept-Language: zh-cn. If this header field is not set in the request message, the server assumes the client accepts any language.
Authorization
The Authorization request header field is mainly used to prove that the client has permission to view a resource. When the browser receives a 401 (Unauthorized) response code from the server while accessing a page, it can send a new request containing the Authorization header field, asking the server to validate it.
Host (this header field is required in every request)
The Host request header field specifies the Internet host and port number of the requested resource, usually extracted from the HTTP URL, e.g.:
We enter in the browser: http://www.guet.edu.cn/index.html
The request message sent by the browser then contains the Host request header field, as follows:
Host: www.guet.edu.cn
The default port number 80 is used here; if a port number is specified, it becomes Host: www.guet.edu.cn:port (with the specified port number).
User-Agent
When we visit online forums, we often see welcome messages listing the name and version of our operating system and the name and version of our browser, which surprises many people; in fact, the server application obtains this information from the User-Agent request header field. The User-Agent request header field lets the client tell the server its operating system, browser, and other properties. This header field is not required, however: if we write a client ourselves that does not send a User-Agent request header field, the server simply cannot learn this information about us.
An example of a request header:
GET /form.html HTTP/1.1 (CRLF)
Accept: image/gif, image/x-xbitmap, image/jpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */* (CRLF)
Accept-Language: zh-cn (CRLF)
Accept-Encoding: gzip, deflate (CRLF)
If-Modified-Since: Wed, 05 Jan 11:21:25 GMT (CRLF)
If-None-Match: W/"80b1a4c018f3c41:8317" (CRLF)
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0) (CRLF)
Host: www.guet.edu.cn (CRLF)
Connection: Keep-Alive (CRLF)
(CRLF)
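Headers like those in the example above can also be attached from a program. A sketch using Python 3's urllib.request (the URL is the one from the Host example; nothing is actually sent, the request object is only constructed):

```python
import urllib.request

# Build a request with explicit headers (values taken from the
# request header example); nothing is sent over the network here.
req = urllib.request.Request(
    "http://www.guet.edu.cn/index.html",
    headers={
        "Accept-Language": "zh-cn",
        "Accept-Encoding": "gzip, deflate",
        "User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)",
    },
)

# The Host header is derived from the URL itself.
print(req.host)                       # www.guet.edu.cn
print(req.get_header("User-agent"))   # the User-Agent string above
```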
Response header
The response headers allow the server to pass additional response information that cannot be placed in the status line, along with information about the server and about further access to the resource identified by the Request-URI.
Common response Headers
Location
The Location response header field is used to redirect the recipient to a new location. It is commonly used when a domain name changes.
Server
The Server response header field contains software information about the server that handled the request. It is the counterpart of the User-Agent request header field. Below is an example of the Server response header field:
Server: Apache-Coyote/1.1
WWW-Authenticate
The WWW-Authenticate response header field must be included in a 401 (Unauthorized) response message. When the client receives the 401 response and resends the request with an Authorization header field for the server to validate, the server's response contains this header field.
e.g. WWW-Authenticate: Basic realm="Basic Auth Test!" // you can see that the server uses a Basic authentication mechanism for the requested resource.
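For the Basic mechanism, the Authorization header the client sends back is simply "Basic" plus the Base64 encoding of username:password. A sketch with made-up credentials:

```python
import base64

# Made-up credentials; real ones would come from the user.
user, password = "alice", "secret"

# The header value is "Basic " + base64("user:password").
token = base64.b64encode(f"{user}:{password}".encode("ascii")).decode("ascii")
auth_header = "Authorization: Basic " + token
print(auth_header)
```

Note that Base64 is an encoding, not encryption: the server (or anyone who sees the header) can decode the credentials, which is why Basic auth should only be used over HTTPS.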
https://www.w3.org/Protocols/#Specs
Crawler Programming
A simple crawler (Python 2):
import urllib
url = "http://www.healforce.com/cn/index.php?ac=article&at=read&did=444"
webpage = urllib.urlopen(url)
data = webpage.read()
The code to set up the proxy is as follows (making it easy for Fiddler to capture and analyze the packets):
import urllib2
url = "http://www.healforce.com/cn/index.php?ac=article&at=read&did=444"
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8888'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
webpage = urllib2.urlopen(url)
data = webpage.read()
print(data)
print(type(webpage))
print(webpage.geturl())
print(webpage.info())
print(webpage.getcode())
>>> print(type(webpage))
<type 'instance'>
>>> print(webpage.geturl())
http://www.healforce.com/cn/index.php?ac=article&at=read&did=444
>>> print(webpage.info())
Date: Thu, 10:38:48 GMT
Server: Apache/2.4.10 (Win32) OpenSSL/1.0.1h
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
>>> print(webpage.getcode())
200
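The code above uses Python 2's urllib2; in Python 3 the same proxy setup lives in urllib.request. A sketch that builds and installs the opener without fetching anything (so Fiddler need not be running):

```python
import urllib.request

# Route HTTP traffic through Fiddler's default proxy address.
proxy = urllib.request.ProxyHandler({"http": "127.0.0.1:8888"})
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)

# From here on, urllib.request.urlopen(url) would go through
# 127.0.0.1:8888, just like the Python 2 version above.
print(any(isinstance(h, urllib.request.ProxyHandler) for h in opener.handlers))  # True
```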
Using Fiddler to analyze the captured data:
1. 200 indicates a successful request
2. The address that was visited
3. The request headers generated by Python
4. The HTML returned in the response, identical to what print(data) shows
[Python] Spiders