Sesame HTTP: Basic Principles of Proxies

Source: Internet
Author: User

Anyone who writes crawlers runs into this situation sooner or later. At first the crawler runs normally and the data comes in as expected. Then, in the time it takes to drink a cup of tea, errors start to appear, for example 403 Forbidden. Opening the page in a browser may show a message such as "Your IP address is accessing this site too frequently." This happens because the website has adopted anti-crawler measures. For example, the server counts the number of requests each IP address makes per unit of time. If the count exceeds a threshold, the server refuses service and returns an error message. This is known as IP blocking.
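
The per-IP counting described above can be sketched as a sliding-window counter. This is only an illustration of the idea, not how any particular website implements it; the class name `RequestRateLimiter`, its parameters, and the sample IP addresses are assumptions made for this example.

```python
import time
from collections import defaultdict, deque


class RequestRateLimiter:
    """Counts requests per IP address inside a sliding time window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        """Return True if the request is served, False if it is rejected."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the threshold: a real server might answer 403
        hits.append(now)
        return True
```

Because the counter is keyed by IP address, requests that arrive from a different address start from a fresh count, which is exactly what makes a proxy useful to a crawler.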

Since the server counts the requests made by each IP address per unit of time, could we avoid being blocked by somehow disguising our IP address, so that the server cannot tell that the requests come from our local machine?

One effective method is to use a proxy, which will be described in detail later. Before that, it helps to understand the basic principle of a proxy: how does it disguise an IP address?

1. Basic Principles

A proxy here means a proxy server. Its function is to fetch network information on behalf of a network user; it acts as a relay station for network traffic. When we request a website normally, the request goes to the Web server, and the Web server returns the response to us. If a proxy server is set up, it forms a bridge between the local machine and the Web server. The local machine no longer sends the request to the Web server directly. Instead, it sends the request to the proxy server, which forwards it to the Web server. The proxy server then passes the response returned by the Web server back to the local machine. We can still access the webpage normally, but the IP address the Web server sees is no longer that of our local machine. The IP address has been successfully disguised. This is the basic principle of a proxy.
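
As a minimal sketch of this flow from the client side, Python's standard `urllib.request` module can route requests through a proxy with `ProxyHandler`. The proxy address used below is a placeholder assumed for illustration; a real proxy must be listening there before a request will succeed.

```python
import urllib.request

# Hypothetical proxy address, used only for illustration.
PROXY_URL = "http://127.0.0.1:8080"


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, proxy_url=PROXY_URL, timeout=10):
    """Fetch url through the proxy; the target server sees the proxy's IP."""
    opener = build_proxy_opener(proxy_url)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Calling `fetch_via_proxy("http://example.com/")` would send the request to the proxy first, which then contacts the Web server on our behalf.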

2. Role of proxy

So what can a proxy be used for? A few examples follow.

  • Bypass access restrictions on your own IP address, to reach sites that are normally inaccessible.
  • Access the internal resources of an organization or group. For example, a free proxy server inside the education network's address range can be used for FTP uploads and downloads and for querying and sharing the resources open to that network.
  • Increase access speed: a proxy server usually maintains a large disk buffer. Information passing through it is also saved to the buffer. When another user requests the same information, it is served directly from the buffer, which improves access speed.
  • Hide the real IP address: Internet users can hide their IP addresses this way to protect themselves from attacks. For crawlers, we use proxies to hide our own IP address and keep it from being blocked.
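
The buffering behaviour in the "increase access speed" item can be sketched in a few lines. This is a toy in-memory model, not a real caching proxy (it ignores expiry, cache-control headers, and disk storage); the class `CachingProxy` and the `fetch` callable are assumptions introduced for this example.

```python
class CachingProxy:
    """Serves repeated URLs from a local buffer instead of re-fetching them."""

    def __init__(self, fetch):
        self._fetch = fetch        # hypothetical callable: url -> response body
        self._cache = {}           # url -> buffered response body
        self.origin_requests = 0   # how many times the origin was contacted

    def get(self, url):
        if url not in self._cache:
            # First request for this URL: go to the origin and buffer the result.
            self.origin_requests += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]
```

The second and later requests for the same URL never reach the origin server, which is where the speed-up comes from.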

3. Crawler proxies

Crawlers fetch pages very quickly, so during a crawl the same IP address may hit a site too frequently. The site may then demand a CAPTCHA or a login, or block the IP address outright, which makes crawling very inconvenient.

Using a proxy hides the real IP address, so the server believes the requests come from the proxy server itself. By switching proxies continually during the crawl, we avoid being blocked and can crawl effectively.
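
One simple way to keep changing the proxy is to cycle through a list in round-robin order. The sketch below assumes a hypothetical `ProxyRotator` class and placeholder proxy addresses. It returns a mapping in the shape accepted by `urllib.request.ProxyHandler`, but it does not send any requests itself.

```python
import itertools


class ProxyRotator:
    """Hands out proxies from a fixed list in round-robin order."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)

    def next_proxies(self):
        """Return a scheme -> proxy URL mapping for the next request."""
        proxy = next(self._cycle)
        return {"http": proxy, "https": proxy}
```

A crawler would call `next_proxies()` before each request, or after each block, so that consecutive requests reach the target site from different IP addresses.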

4. Proxy Classification

Proxies can be classified either by protocol or by degree of anonymity.

(1) By protocol

By protocol, proxies fall into the following categories.

  • FTP proxy: mainly used to access FTP servers. It generally supports upload, download, and caching; typical ports are 21 and 2121.
  • HTTP proxy: mainly used to access webpages. It generally provides content filtering and caching; typical ports are 80, 8080, and 3128.
  • SSL/TLS proxy: mainly used to access encrypted websites. It generally provides SSL or TLS encryption (commonly 128-bit or 256-bit encryption strength); the typical port is 443.
  • RTSP proxy: mainly used to access Real streaming media servers. It generally provides caching; the typical port is 554.
  • Telnet proxy: mainly used for Telnet remote control (attackers often use it to hide their identity during intrusions). The typical port is 23.
  • POP3/SMTP proxy: mainly used to send and receive mail over POP3/SMTP. It generally provides caching; the typical ports are 110 and 25.
  • SOCKS proxy: simply relays data packets without caring about the specific protocol or usage, so it is usually considerably faster. It generally provides caching; the typical port is 1080. SOCKS comes in two versions, SOCKS4 and SOCKS5. SOCKS4 supports only TCP, while SOCKS5 supports both TCP and UDP, as well as several authentication mechanisms and server-side domain name resolution. In short, SOCKS5 can do everything SOCKS4 can, but not the other way around.
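
The typical ports listed above can be turned into a rough lookup for guessing what kind of proxy a "host:port" entry is likely to be. This is only a heuristic sketch: a port number is a convention, not a guarantee, and a real check would have to speak the protocol. The function name `guess_proxy_type` and the sample addresses are assumptions made for this example, and IPv6 literals are not handled.

```python
# Typical ports taken from the list above; real proxies may use any port.
TYPICAL_PORTS = {
    21: "FTP", 2121: "FTP",
    80: "HTTP", 8080: "HTTP", 3128: "HTTP",
    443: "SSL/TLS",
    554: "RTSP",
    23: "Telnet",
    110: "POP3", 25: "SMTP",
    1080: "SOCKS",
}


def guess_proxy_type(address):
    """Guess the proxy type of a 'host:port' string from its port number."""
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {address!r}")
    return TYPICAL_PORTS.get(int(port), "unknown")
```

For example, an entry ending in `:1080` would be guessed as SOCKS, while an unlisted port is reported as `"unknown"`.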

(2) By degree of anonymity

By degree of anonymity, proxies fall into the following categories.

  • Highly anonymous proxy: forwards the data packets unchanged. To the server it looks like an ordinary client, and the IP address it records is that of the proxy server.
  • Ordinary anonymous proxy: makes some changes to the data packets. The server may detect that a proxy is in use and may even be able to trace the client's real IP address. Such a proxy usually adds HTTP headers that appear on the server side as HTTP_VIA and HTTP_X_FORWARDED_FOR (the Via and X-Forwarded-For request headers).
  • Transparent proxy: not only changes the data packets but also tells the server the client's real IP address. Besides speeding up browsing through caching, this kind of proxy can improve security through content filtering. The most common example is a hardware firewall on an intranet.
  • Spy proxy: a proxy server set up by an organization or individual to record the data users transmit, for research, monitoring, or similar purposes.
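
The first three anonymity levels can be told apart, roughly, by what the target server sees in the request headers. The sketch below applies one common heuristic and is an assumption for illustration, not a definitive test: a proxy can add, omit, or forge these headers, and a spy proxy cannot be detected this way at all. The function name `classify_anonymity` is hypothetical.

```python
def classify_anonymity(headers, real_ip):
    """Classify a proxy from the request headers seen by the target server.

    headers: mapping of header name -> value as received by the server.
    real_ip: the client's real IP address, known to whoever runs the test.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    via = normalized.get("via", "")
    forwarded = normalized.get("x-forwarded-for", "")
    forwarded_ips = [part.strip() for part in forwarded.split(",") if part.strip()]

    if real_ip in forwarded_ips:
        return "transparent"      # the real IP address is exposed
    if via or forwarded_ips:
        return "anonymous"        # proxy use is visible, the real IP is not
    return "high anonymity"       # looks like an ordinary client
```

To use it, request a page you control through the proxy, capture the headers that arrive, and pass them in together with your own public IP address.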

5. Common proxy settings

  • Free proxies on the Internet: it is best to use highly anonymous proxies. Few free proxies actually work, so filter out the usable ones before use, or go further and maintain a proxy pool.
  • Paid proxy services: many proxy providers on the Internet charge for access. Their quality is much better than that of free proxies.
  • ADSL dial-up: each redial yields a new IP address. Stability is high, which makes this a relatively effective solution.
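
Filtering the usable proxies before a crawl can be sketched as below. The filtering loop is independent of how liveness is checked. `check_with_urllib` is one possible checker built on the standard library, and its test URL and timeout are assumptions. Both function names are hypothetical, and a real proxy pool would also re-check proxies periodically and check them concurrently.

```python
import http.client
import urllib.request


def check_with_urllib(proxy, test_url="http://example.com/", timeout=5):
    """Return True if test_url can be fetched through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(test_url, timeout=timeout) as response:
        return response.status == 200


def filter_available(proxies, is_alive=check_with_urllib):
    """Return the proxies for which is_alive(proxy) succeeds."""
    available = []
    for proxy in proxies:
        try:
            if is_alive(proxy):
                available.append(proxy)
        except (OSError, http.client.HTTPException):
            # Timeouts, refused connections, and bad replies mean "unusable".
            continue
    return available
```

Passing the checker in as a parameter keeps the filter easy to test and makes it straightforward to swap in a stricter check, for example one that also verifies the proxy's anonymity level.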
