Python crawler preparation
1. HTTP programming knowledge
- Working mode of the client and server in HTTP
The client and the server establish a reliable TCP connection (in HTTP/1.1 this connection is persistent by default, and it is closed when it times out).
The client communicates with the server through a socket: it sends a request and receives the response.
HTTP is stateless: each request is independent of the others, and the protocol itself does not make the client or the server record the client's earlier behavior.
The client adds headers to the HTTP request to describe the request and to tell the server which response formats it can accept.
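The request/response cycle described above can be sketched with a raw socket. This is a minimal illustration in Python 3, not part of the original notes: `HelloHandler`, `fetch_raw`, and `demo` are hypothetical names, and the server is a throwaway one started on 127.0.0.1 so that the example does not depend on any real site.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default per-request log line


def fetch_raw(host, port, path="/"):
    """Send one GET over a plain TCP socket and return the raw response bytes."""
    request = (
        "GET {} HTTP/1.1\r\n"
        "Host: {}:{}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).format(path, host, port).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)


def demo():
    """Start a local server, fetch "/" from it, and return (status line, body)."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: pick a free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        raw = fetch_raw("127.0.0.1", port)
    finally:
        server.shutdown()
        server.server_close()
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n")[0].decode("ascii")
    return status_line, body
```

Calling `demo()` returns the status line and the body of the response. The request is only text: a request line, a few headers, and a blank line. Each call opens a fresh connection, and nothing from an earlier request is carried over.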
- Common request methods include GET and POST.
GET: the client requests a file (a resource) from the server.
POST: the client sends data for the server to process.
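The difference between the two methods can be seen without sending anything over the network. The sketch below uses Python 3's `urllib.request` (the successor of `urllib2`); the parameter names in `params` are illustrative assumptions, and the URL is only a placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder URL and illustrative form fields; nothing is sent.
BASE_URL = "http://www.someserver.com/cgi-bin/register.cgi"
params = {"name": "why", "language": "python"}
query = urlencode(params)

# GET: the parameters travel in the URL, and the request has no body.
get_req = Request(BASE_URL + "?" + query)

# POST: the same parameters travel in the request body as bytes.
post_req = Request(BASE_URL, data=query.encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

`Request` picks the method from whether `data` is present: no data means GET, data means POST.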
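class urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])
url: should be a string.
data: a string encoded with urllib.urlencode(). When data is supplied, the HTTP request is sent as a POST instead of a GET.
headers: a dictionary of request headers. It is commonly used to spoof the User-Agent so that access from a script looks like access from a browser.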
Sample code:
    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name': 'why', 'location': 'sdu', 'language': 'python'}
    headers = {'User-Agent': user_agent}

    # Encode the form fields; passing data makes this a POST request
    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)

    # Send the request and read the response body
    response = urllib2.urlopen(req)
    the_page = response.read()
Reference blog: http://blog.csdn.net/pleasecallmewhy/article/details/8923067
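The sample above targets Python 2, where `urllib2` exists. As a rough sketch of the same steps on Python 3, assuming the same placeholder URL and form values, the module names change to `urllib.parse` and `urllib.request`, and the encoded data must be bytes. The helper `fetch` is a hypothetical name and is defined but not called here, because the placeholder host is not expected to answer.

```python
import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'  # placeholder host
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'why', 'location': 'sdu', 'language': 'python'}
headers = {'User-Agent': user_agent}

# In Python 3 urlencode() returns str, but the request body must be bytes.
data = urllib.parse.urlencode(values).encode('ascii')
req = urllib.request.Request(url, data, headers)


def fetch(request, timeout=10):
    """Hypothetical helper: send the request and return the response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Inspect the request without sending it.
print(req.get_method())              # method chosen from the presence of data
print(req.get_header('User-agent'))  # header names are stored capitalized
```

Replacing the URL with a reachable one and calling `fetch(req)` would perform the actual POST.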
3. Save the following code as an HTML file and open it in each browser to obtain that browser's version information.
User_agent of Sogou Browser
User_agent of Baidu Browser
User_agent of Google Chrome