Python learning notes: crawler-related network knowledge

Source: Internet
Author: User

1 About URLs

URL (Uniform Resource Locator): a standard way of fully describing the address of a web page or other resource on the Internet.

URLs are the entry point for a crawler, so they are very important.

Basic format:

scheme://host[:port#]/path/.../[?query-string][#anchor]

scheme: protocol (for example HTTP, HTTPS, FTP)

host: IP address or domain name of the server

port#: server port (optional; if omitted, the protocol's default port is used, for example 80 for HTTP)

path: path of the resource on the server

query-string: data sent to the HTTP server as parameters

anchor: anchor (jumps to the specified position within the page)

Example:

http://www.baidu.com

http://item.jd.com/11963485.html#product-detail

ftp://192.168.1.118:8081/index
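The format above can be checked against the three examples with Python's standard library. This is a minimal sketch using `urllib.parse.urlsplit`; the helper name `describe_url` and the dictionary keys are chosen here for illustration and are not part of the original notes.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the components named in the basic format."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,        # None when the URL does not state a port
        "path": parts.path,
        "query": parts.query,
        "anchor": parts.fragment,  # urllib calls the anchor a "fragment"
    }


for example in (
    "http://www.baidu.com",
    "http://item.jd.com/11963485.html#product-detail",
    "ftp://192.168.1.118:8081/index",
):
    print(describe_url(example))
```

Note that `port` is `None` for the first two URLs: the port is simply absent from the text, and the client falls back to the protocol default.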

2 HTTP protocol, HTTPS protocol

2.1 HTTP protocol

HTTP (Hypertext Transfer Protocol) is a protocol for publishing and receiving HTML pages. It is an application-layer protocol that is connectionless (in the basic model, each connection handles only one request) and stateless (each request is handled independently of the others).

2.2 HTTPS protocol

HTTPS (Hypertext Transfer Protocol over Secure Socket Layer) is simply the secure version of HTTP: it adds an SSL layer beneath HTTP, so HTTPS = HTTP + SSL. SSL (Secure Sockets Layer) is a security protocol used mainly on the web; it encrypts the network connection between the application layer and the transport layer, securing data transfer over the Internet.

Note:

The default HTTP port is 80; the default HTTPS port is 443.
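The two default ports are also available as constants in Python's standard library. The sketch below shows one possible way a crawler could work out which port a URL will actually use; the helper `effective_port` and the `DEFAULT_PORTS` table are illustrative names, not something from the original notes.

```python
from http.client import HTTP_PORT, HTTPS_PORT
from urllib.parse import urlsplit

# Default ports taken from the standard library constants.
DEFAULT_PORTS = {"http": HTTP_PORT, "https": HTTPS_PORT}


def effective_port(url):
    """Return the explicit port of a URL, or the default for its scheme."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    # None is returned for schemes not covered by this small table.
    return DEFAULT_PORTS.get(parts.scheme)


print(effective_port("http://www.baidu.com"))
print(effective_port("https://www.baidu.com"))
print(effective_port("http://www.baidu.com:8080/"))
```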

3 HTTP request

Two HTTP request methods are most commonly used:

(1) GET: retrieves information from the server. Parameters are sent in the URL, so they are visible and less secure, and the amount of data is limited in practice by URL length limits;

(2) POST: sends data to the server. The data travels in the request body rather than in the URL, so it is less exposed (although only HTTPS actually encrypts it), and its size is theoretically unlimited.
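The difference in where the data travels can be seen without sending anything over the network. The following sketch builds one request of each kind with `urllib.request.Request`; the endpoint `http://www.example.com/search` and the parameter names are hypothetical and only serve the illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://www.example.com/search"  # hypothetical endpoint
params = {"q": "python crawler", "page": 1}

# GET: the parameters become the query string of the URL.
get_req = Request(BASE_URL + "?" + urlencode(params))

# POST: the same parameters are encoded into the request body instead.
# Passing data= makes urllib choose the POST method.
post_req = Request(BASE_URL, data=urlencode(params).encode("ascii"))

print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

To actually send either request, it would be passed to `urllib.request.urlopen`, which is left out here so the example runs offline.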

Other HTTP request methods include HEAD, PUT, DELETE, OPTIONS, TRACE, CONNECT, and PATCH.

4 User-Agent

The HTTP User-Agent (UA) header is part of the request headers. It is a special string that tells the website you visit your browser type and version, operating system and version, browser engine, and other identifying information. Using this identifier, a site can serve different layouts to give users a better experience, or collect statistics. For example, Google looks different on a mobile phone than on a computer because Google decides what to serve based on the visitor's UA.

The UA can be spoofed, which a crawler can use to disguise itself as a browser.
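As a rough sketch of how a crawler might present a browser-like UA, the code below sets the header on a `urllib.request.Request`. The UA string `BROWSER_UA` is an assumed sample value for a desktop Firefox, and `browser_like_request` is a name made up for this example. Without such a header, urllib identifies itself as `Python-urllib/<version>`, which sites can easily recognize.

```python
from urllib.request import Request

# Assumed sample UA string; substitute the UA of the browser you want to mimic.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) "
    "Gecko/20100101 Firefox/59.0"
)


def browser_like_request(url, user_agent=BROWSER_UA):
    """Build a request that carries a browser-style User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


req = browser_like_request("http://www.baidu.com")
# urllib stores header names in capitalized form, hence "User-agent".
print(req.get_header("User-agent"))
```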

The standard format of a browser's UA string is: browser identifier (operating system identifier; encryption level identifier; browser language) rendering engine identifier version information. In practice, each browser differs.

Note: for compatibility and promotion reasons, many browsers share the same browser identifier, so that identifier does not reveal the actual browser; the actual browser and version information is found at the tail of the UA string.

User-Agent: Mozilla/5.0 (Windows NT 10.0; ...) Gecko/20100101 Firefox/59.0
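The remark about the tail of the UA string can be illustrated with a small regular expression that collects the `name/version` tokens and keeps the last one. This is only a heuristic sketch: the helper `last_product` and the sample string are assumptions for the example, and some browsers append further tokens after their own name, so the last token is not always the browser itself.

```python
import re

# Matches "name/version" tokens such as "Gecko/20100101" or "Firefox/59.0".
PRODUCT_TOKEN = re.compile(r"([A-Za-z][\w.-]*)/(\S+)")


def last_product(ua):
    """Return (name, version) of the last product token, or None."""
    tokens = PRODUCT_TOKEN.findall(ua)
    return tokens[-1] if tokens else None


# Assumed sample UA string for a desktop Firefox.
sample = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) "
    "Gecko/20100101 Firefox/59.0"
)
print(last_product(sample))
```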

Network packets can be captured with tools such as Fiddler and Wireshark.

Related reading: a detailed introduction to packet capture with Wireshark.

5 Status codes of the HTTP response

Status codes fall broadly into the following classes, according to the type of result:

5.1 1XX (informational)

This class of status code indicates that the request was received and processing continues.

    • 100, the client should continue with its request.
    • 101, the server is switching protocols as the client requested.

5.2 2XX (success)

This class of status code indicates that the request was successfully received, understood, and accepted.

    • 200, the request completed successfully and the requested resource was sent to the client.
    • 201, the request succeeded and a new resource was created; its URL is returned.
    • 202, the request was accepted for processing, but processing is not yet complete.
    • 203, the returned information does not come from the origin server and may have been modified along the way.
    • 204, the request was received and processed, but the response has no content.
    • 205, the server completed the request; the client should reset the document it is currently viewing.
    • 206, the server fulfilled a partial GET request from the client.

5.3 3XX (redirection)

This class of status code indicates that further action is needed to complete the request.

    • 300, the requested resource is available at multiple locations.
    • 301, the page has moved permanently to another URL.
    • 302, the requested page is temporarily redirected to a new address.
    • 303, the client should fetch the result from another URL, using GET.
    • 304, the requested page has not been modified since the last request.
    • 305, the requested resource must be accessed through the proxy specified by the server.
    • 306, used in an earlier version of HTTP and no longer in use.
    • 307, the requested resource is temporarily located at another URL; the client should repeat the request there with the same method.
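For a crawler, a redirect means following the address the server gives in the Location response header, which may be relative to the original URL. A minimal sketch of that step, assuming the hypothetical helper `redirect_target` and covering only the redirect codes from the list above that carry a new address:

```python
from urllib.parse import urljoin

# Redirect codes from the list above that point the client to a new URL.
REDIRECT_CODES = {301, 302, 303, 307}


def redirect_target(request_url, status, location):
    """Return the absolute URL to follow, or None if no redirect applies."""
    if status in REDIRECT_CODES and location:
        # urljoin resolves a relative Location against the request URL.
        return urljoin(request_url, location)
    return None


print(redirect_target("http://item.jd.com/11963485.html", 301, "/new/11963485.html"))
print(redirect_target("http://www.baidu.com/a/b", 302, "https://www.baidu.com/"))
print(redirect_target("http://www.baidu.com/", 200, "/ignored"))
```

Libraries such as `urllib.request` follow redirects automatically by default, so a helper like this is mainly useful when a crawler wants to record or limit redirects itself.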

5.4 4XX (client error)

This class of status code indicates that the request contains a syntax error or cannot be fulfilled.

    • 400, the client request has a syntax error.
    • 401, the request is unauthorized; authentication is required.
    • 402, payment required; reserved for future use.
    • 403, access is forbidden: the server received the request but refuses to serve it.
    • 404, the server can be reached, but the requested resource does not exist.
    • 405, the method given in the request line is not allowed for this resource.
    • 406, the resource cannot be provided in a form that matches the Accept headers the client sent.
    • 407, similar to 401, but the client must first obtain authorization from the proxy server.
    • 408, the client did not complete the request within the time the server was prepared to wait.
    • 409, the request cannot be completed because it conflicts with the current state of the resource.
    • 410, the resource is no longer available on the server.
    • 411, the server refuses the request because it lacks a Content-Length header.
    • 412, one or more preconditions given in the request header fields were not met.
    • 413, the request body is larger than the size the server allows.
    • 414, the request URL is longer than the length the server allows.
    • 415, the format of the request data is not supported by the requested resource.
    • 416, the request contains a Range header whose range does not fall within the requested resource.
    • 417, the server cannot meet the expectation given in the Expect request header field.

5.5 5XX (server error)

This class of status code indicates a server or gateway error.

    • 500, internal server error.
    • 501, the server does not support the requested feature.
    • 502, gateway error.
    • 503, the service is unavailable.
    • 504, the gateway timed out.
    • 505, the HTTP version is not supported.
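Because the first digit determines the class, a crawler can classify any status code with integer division. The sketch below does that and looks up the standard reason phrase through `http.HTTPStatus`; the function name `describe_status` and the `CLASSES` table are illustrative choices rather than part of the original notes.

```python
from http import HTTPStatus

# Class names keyed by the first digit of the status code.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return (class name, reason phrase) for an HTTP status code."""
    category = CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes the standard library does not define.
        phrase = "Unknown"
    return category, phrase


for code in (200, 301, 404, 503):
    print(code, describe_status(code))
```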

6 Response headers

Response headers describe the response and contain many fields. Commonly used fields are:

    • Location, used to redirect the request.
    • Server, basic information about the server.
    • Content-Encoding, the compression format the server used when sending the data.
    • Content-Language, the language of the data sent.
    • Content-Type, the type of the data sent.
    • Content-Length, the size of the data sent.
    • Set-Cookie, sends a cookie to the client.
    • Last-Modified, the date and time the resource was last modified.
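To show how these fields look to a program, the sketch below parses a hand-written header block with `http.client.parse_headers`. The header values in `RAW_HEADERS` are invented for the example; in a real crawler the same kind of object is available as the `headers` attribute of the response returned by `urllib.request.urlopen`.

```python
import io
from http.client import parse_headers

# Invented header block, in the on-the-wire format (CRLF line endings,
# terminated by an empty line).
RAW_HEADERS = (
    b"Server: nginx\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: 1024\r\n"
    b"Set-Cookie: sid=abc123; Path=/\r\n"
    b"Set-Cookie: lang=en\r\n"
    b"\r\n"
)

headers = parse_headers(io.BytesIO(RAW_HEADERS))

print(headers["content-type"])          # field names are case-insensitive
print(headers.get_content_type())       # media type without parameters
print(headers.get_content_charset())    # charset parameter, if present
print(int(headers["Content-Length"]))   # size of the body in bytes
print(headers.get_all("Set-Cookie"))    # a field may appear more than once
```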
