I. A First Look at the PID Controller
In winter, people like to warm themselves by a fire. A common scene: four people sit around a mahjong table with a basin of burning charcoal under it. Someone feels the fire is too small and adds some charcoal; it is still too small, so more is added. After a while the fire feels too big and everyone's feet are nearly roasted, so some charcoal is taken out... and so on, until the fire in the basin is just right. This seemingly simple scene involves the four main processes of a PID control system: setting the target, measuring, comparing, and executing. Restated in those terms: the temperature we find most comfortable by the fire is the set target; each person is a sensor who perceives the temperature, if only roughly (measurement); we judge whether the perceived temperature is comfortable, too high, or too low (comparison); and once we know whether the fire is too big or too small, we add charcoal if it is too small or take some out if it is too big (execution). The last three processes repeat (as shown in Figure 1) until everyone feels a comfortable temperature.
Figure 1 PID Control Process Loop
II. Speed Control of Web Crawlers
Why control a crawler's speed at all, instead of letting it run as fast as a high-speed train? There are two reasons: 1. It is a courtesy to the target website. If the crawler is too fast, it puts a heavy load on the server. 2. It is in my own interest. If the crawler is too fast, the server may block access, a large amount of valid data is lost, and the data may even have to be crawled again. So how can we control the crawling speed so that it is not too fast?
The usual approach is to set a generous fixed wait between page fetches to cap the maximum access frequency. This keeps the server from being overloaded and keeps the crawler from being blocked for accessing too frequently. However, it also leads to low network utilization and slow crawling, which is often intolerable for large crawling tasks.
Figure 2 Comparison of web page capture time when the network is smooth and when the network is poor
Figure 2 is a simplified, idealized model that illustrates the problem. Assume a website allows a maximum access frequency of 6 pages/minute, so the minimum interval between requests is 10 s. (This value has to be found through repeated trials; the website administrator will not tell you.) When the network is smooth, each page takes 0.5 s to read, so to avoid being blocked you must wait at least 9.5 s before fetching the next page. That 9.5 s wait is fixed: it stays just as long when the network is poor. When the network is poor and a page takes 9.5 s to read, the 9.5 s wait is still added, so each page costs 19 s, almost twice as long as when the network is smooth. Ideally, when the network is poor you would only need to wait 0.5 s to reach the same capture speed as when the network is smooth, so limiting the maximum speed this way is very inefficient. In addition, the relationship between the wait time and the capture frequency is vague. Does a wait of 1 s mean 100 pages/minute? Does a wait of 10 s mean 10 pages/minute? It is hard to say, especially in a complex network environment.
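The arithmetic of this idealized model can be checked with a few lines of Python. This is only a sketch of the numbers in the paragraph above; the function names are made up for illustration.

```python
def time_per_page_fixed_wait(read_time_s: float, fixed_wait_s: float) -> float:
    """Seconds spent per page: read the page, then sleep for a fixed time."""
    return read_time_s + fixed_wait_s


def pages_per_minute(seconds_per_page: float) -> float:
    """Convert a per-page cost in seconds into a capture frequency."""
    return 60.0 / seconds_per_page


# Fixed wait chosen so that a 0.5 s read just meets the 10 s minimum interval.
FIXED_WAIT_S = 9.5

for label, read_time_s in [("smooth", 0.5), ("poor", 9.5)]:
    total = time_per_page_fixed_wait(read_time_s, FIXED_WAIT_S)
    print(f"{label}: {total:.1f} s/page, {pages_per_minute(total):.2f} pages/minute")
```

With a smooth network the crawler sits exactly at the 6 pages/minute limit; with a poor network the same fixed wait drops it to roughly half that rate.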
A natural improvement is to make the wait time dynamic: set it to the minimum interval minus the page read time, so that the average time per page equals the minimum interval whether the network is smooth or poor. This may work for a single-threaded crawler visiting a small website. But when a multi-threaded, distributed crawler visits a large website, the overall crawl rate is determined by many parallel tasks, and various exceptions (invalid pages, connection timeouts) complicate the calculation further, so this method becomes clumsy. Taking all these factors into account, what we need is a fuzzy method that controls the crawling speed without precise calculation and that is expressed directly as a frequency (pages/minute). The PID control algorithm is one such method.
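For the single-threaded case, the dynamic wait might look like the sketch below. The 10 s minimum interval comes from the example above; `fetch` is a hypothetical download function supplied by the caller, and the `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, Iterable

MIN_INTERVAL_S = 10.0  # assumed minimum spacing between requests (6 pages/minute)


def dynamic_wait(read_time_s: float, min_interval_s: float = MIN_INTERVAL_S) -> float:
    """Wait only for the part of the minimum interval the read did not use."""
    return max(0.0, min_interval_s - read_time_s)


def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], object],          # hypothetical: downloads one page
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Single-threaded loop: fetch a page, then sleep for the remaining time."""
    for url in urls:
        start = clock()
        fetch(url)
        sleep(dynamic_wait(clock() - start))
```

A read of 0.5 s gives a 9.5 s wait and a read of 9.5 s gives a 0.5 s wait, so each page costs 10 s either way. The sketch does not address the multi-threaded case, which is exactly where this approach becomes awkward.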
III. Controlling Crawling Speed with PID
The idea behind using a PID controller to control crawling speed is simple: if the crawler is too fast, increase the delay; if it is too slow, reduce the delay. Do you see the similarity with the fire scene at the beginning of this article? By analogy with the fire, the PID scheme I have in mind for controlling crawler speed is:
1) Initialization: set the initial delay T_0 and the proportional coefficient K_P (a typical value is -0.05).
2) Target setting: set the target crawler speed S, for example 40 pages/minute.
3) Measurement: count the number of pages N the crawler fetched in the last minute, which might be 32 or 100.
4) Comparison: compare N with S.
5) Execution: if N is greater than S, the crawler is too fast, so increase the delay; if N is less than S, it is too slow, so reduce the delay.
This scheme is expressed by the following formula:
T_k = T_{k-1} + K_P * (S - N)    (3.1)
Here k = 1, 2, 3, ... and T_k is the delay set at the k-th step. Do not be intimidated by the expression; it simply restates the execution process in step 5). Because K_P is negative: if the crawler is too fast, S - N is less than 0 and K_P * (S - N) is positive, so the delay increases (T_k is greater than T_{k-1}); if it is too slow, S - N is greater than 0 and K_P * (S - N) is negative, so the delay decreases (T_k is less than T_{k-1}).
Suppose the initial delay T_0 is 1.0 s, the proportional coefficient K_P is -0.05, and the target speed S is 40 pages/minute. If the crawler fetches N = 100 pages in a minute, formula 3.1 gives T_1 = T_0 + K_P * (S - N) = 1.0 + (-0.05) * (40 - 100) = 4.0 s. If the next measurement is N = 30, then T_2 = 4.0 + (-0.05) * (40 - 30) = 3.5 s.
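The same two steps can be reproduced in Python. This is a minimal sketch of formula 3.1 only; the function name `next_delay` is made up for illustration.

```python
def next_delay(prev_delay_s: float, target: float, measured: float,
               kp: float = -0.05) -> float:
    """Formula 3.1: new delay = previous delay + kp * (target - measured)."""
    return prev_delay_s + kp * (target - measured)


t0 = 1.0                                       # initial delay in seconds
t1 = next_delay(t0, target=40, measured=100)   # too fast, so the delay grows
t2 = next_delay(t1, target=40, measured=30)    # too slow, so the delay shrinks
print(round(t1, 3), round(t2, 3))              # 4.0 3.5
```

Note that the sign of the coefficient does the work: with a negative `kp`, a crawler running above the target always gets a longer delay.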
Figure 3 PID Control Curve
Figure 3 shows the PID control curve from one test, and the data below is the detailed record of that experiment. In each row, the first column (e.g. [1:31:00]) is the current time, the second column (e.g. 80) is the number of pages N fetched in the previous minute, and the third column (e.g. 3799) is the delay in milliseconds computed from formula 3.1 (the actual computation also includes an integral term, described later). The crawler speed eventually stabilizes at 40 pages/minute, with small fluctuations allowed.
[1:31:00] 80 3799
[1:32:00] 32 4039
[1:33:00] 30 3980
[1:34:00] 30 3720
[1:35:00] 32 3400
[1:36:00] 36 3200
[1:37:00] 36 2920
[1:38:00] 42 2980
[1:39:00] 40 2980
[1:40:00] 40 2980
[1:41:00] 40 2980
[1:42:00] 40 2980
[1:43:00] 40 2980
[1:44:00] 40 2980
[1:45:00] 40 2980
[1:46:00] 40 2980
[1:47:00] 40 2980
[1:48:00] 40 2980
[1:49:00] 39 2910
[1:50:00] 40 2910
[1:51:00] 41 2980
[1:52:00] 39 2930
[1:53:00] 41 3000
[1:54:00] 40 3000
[1:55:00] 39 2930
With the PID controller, isn't it convenient to make the crawler fetch whatever number of pages per minute you choose, within what the network environment permits?
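To get a feel for how such a loop settles, here is a toy simulation of the proportional-only formula 3.1. It is not the experiment recorded above: it assumes an idealized crawler whose speed is exactly 60 / (delay + read time) pages per minute, with an assumed read time of 0.5 s and an assumed starting delay of 4.0 s.

```python
def simulate(target: float = 40.0, kp: float = -0.05, delay_s: float = 4.0,
             read_time_s: float = 0.5, minutes: int = 20):
    """Run the proportional controller against a toy crawler model.

    Returns one (measured pages/minute, delay in seconds) pair per minute.
    """
    history = []
    for _ in range(minutes):
        # Toy model (assumption): every page costs delay + read time seconds.
        measured = 60.0 / (delay_s + read_time_s)
        history.append((measured, delay_s))
        # Formula 3.1, with the delay kept non-negative.
        delay_s = max(0.0, delay_s + kp * (target - measured))
    return history


for minute, (measured, delay_s) in enumerate(simulate()[:8], start=1):
    print(f"minute {minute}: {measured:5.1f} pages/minute, delay {delay_s:.3f} s")
```

Under these assumptions the delay approaches 1.0 s and the speed approaches the 40 pages/minute target. A real crawler is noisier, which is one reason the recorded run fluctuates by a page or two.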
IV. The Complete PID Controller
Formula 3.1 is a simplified PID controller that uses only the proportional term (P); adding an integral term (I) and a derivative term (D) makes it a complete PID controller. PID stands for proportional, integral, and derivative. Each term has its own coefficient: K_P, K_I, and K_D. With the integral and derivative terms added, formula 3.1 extends to formula 4.1, where sum is the integral of the error (S - N) (in practice, the sum of all error values so far) and diff is the derivative of the error (per unit time, the difference between two consecutive error values).
T_k = T_{k-1} + K_P * (S - N) + K_I * sum + K_D * diff    (4.1)
The coefficients K_P, K_I, and K_D depend on the specific application, and their best values must be determined by experiment. If K_I = 0 and K_D = 0, formula 4.1 reduces to formula 3.1.
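Formula 4.1 might be sketched in Python as follows. The class name and attributes are illustrative. Two details are assumptions the text does not settle: the running sum here includes the current error, and the difference term is taken as zero on the first step because there is no earlier error to compare with.

```python
class PIDDelayController:
    """Sketch of formula 4.1; bookkeeping details are assumptions."""

    def __init__(self, target: float, kp: float, ki: float = 0.0,
                 kd: float = 0.0, delay_s: float = 1.0) -> None:
        self.target = target
        self.kp, self.ki, self.kd = kp, ki, kd
        self.delay_s = delay_s
        self.error_sum = 0.0     # "sum" in formula 4.1
        self.prev_error = None   # no earlier error before the first step

    def update(self, measured: float) -> float:
        """Take one measurement (pages/minute) and return the new delay."""
        error = self.target - measured
        self.error_sum += error  # assumption: sum includes the current error
        diff = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        self.delay_s += (self.kp * error
                         + self.ki * self.error_sum
                         + self.kd * diff)
        return self.delay_s


# With ki = kd = 0 this reduces to formula 3.1.
controller = PIDDelayController(target=40, kp=-0.05, delay_s=1.0)
print(round(controller.update(100), 3), round(controller.update(30), 3))  # 4.0 3.5
```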
V. Summary
The experimental data in Section III was obtained with K_P = -0.05, K_I = -1, 0.01, K_D = 0, together with some handling of practical details, such as limiting the maximum and minimum delay and limiting the maximum magnitude of the accumulated error (the integral term). For a deeper understanding of PID controllers, such as the convergence and robustness of the PID algorithm, please consult the relevant literature; this article focuses on how to apply a PID controller.
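The two practical details mentioned above, bounding the delay and bounding the accumulated error, could be handled along the lines of the sketch below. Every numeric value here (the `ki` value, the delay bounds, the bound on the sum) is an illustrative assumption, not a setting taken from the experiment.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed range [low, high]."""
    return max(low, min(high, value))


def pi_step(delay_s: float, error_sum: float, target: float, measured: float,
            kp: float = -0.05, ki: float = -0.01,      # ki is illustrative
            min_delay_s: float = 0.1, max_delay_s: float = 10.0,
            max_abs_sum: float = 50.0):
    """One proportional-integral update with both limits applied.

    Returns the new (delay in seconds, accumulated error).
    """
    error = target - measured
    # Bound the accumulated error so one long bad stretch cannot dominate.
    error_sum = clamp(error_sum + error, -max_abs_sum, max_abs_sum)
    # Bound the delay so the crawler neither stalls nor hammers the server.
    delay_s = clamp(delay_s + kp * error + ki * error_sum,
                    min_delay_s, max_delay_s)
    return delay_s, error_sum


delay_s, error_sum = pi_step(1.0, 0.0, target=40, measured=100)
print(round(delay_s, 3), error_sum)  # 4.5 -50.0
```

Without the bound on the sum, a single minute at 100 pages would contribute -60 to the accumulated error; with it, the contribution is capped at -50.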