In the previous blog post, I introduced a Linux web page capture example. In that example, a proxy server was used to capture Google Play web pages hosted abroad.
Proxy Purpose
In fact, capturing foreign web pages is only one reason to need an IP proxy; proxies are used in many other scenarios:
- Access foreign websites through a proxy, bypassing sites filtered out by a country's firewall
- Use a CERNET proxy server to access the internal website resources of a university or research institute
- Set up a caching proxy so that repeated requests are answered from the proxy server's cache instead of being fetched again, which speeds up access
- When launching an attack, a hacker can chain multiple proxies to hide the IP address of the local machine and avoid being tracked (of course, the defenders keep raising the bar too, and eventually the trail can be traced)
Principle of proxy
The principle of a proxy service is that the data requested by the local browser is not sent directly to the website server (web server); instead, the request is relayed through an intermediate proxy, which fetches the page on its behalf.
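A quick illustration with curl (the proxy address 1.2.3.4:8080 below is made up for the example):

curl -o direct.html http://www.example.com/                      # direct: local machine -> web server
curl -x 1.2.3.4:8080 -o proxied.html http://www.example.com/     # proxied: local machine -> proxy -> web server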
IP proxy filtering system
Problem Analysis
- Because it is impossible to traverse and test all 2^32 IP addresses in the world every day to see which ones are usable, the first task is to find a source of candidate proxy IPs.
- Having tentatively identified a candidate proxy IP source, how do I determine whether each of its IPs is actually usable?
- In what format is the candidate proxy IP source saved? Is text preprocessing required?
- After a proxy IP has been selected and confirmed usable, the download may still suddenly fail partway through. How do I continue capturing the remaining web pages?
- If a new usable proxy IP is selected to capture the remaining pages, it must also be written back into the 12-country capture scripts for the next run. How is that done?
- As mentioned in the previous blog post, a proxy IP is needed to download the game ranking pages and game pages. What if that proxy IP suddenly becomes invalid?
- If a proxy IP has not expired but crawls pages very slowly, too slowly to finish the day's capture tasks for its country within 24 hours, what then? Should a faster one be filtered out to replace it?
- What if no usable proxy IP remains after the whole proxy source has been filtered? Filter again, once or several times, or look for a new proxy source?
When analyzing and solving a practical problem, you run into all sorts of issues; some are hard even to anticipate when the solution is first designed (such as a proxy IP crawling pages too slowly). My experience: practice beats pure theory!
Solution Design
Overall approach: find and narrow down proxy IP sources -> check whether each proxy IP is usable -> record the usable IP and capture web pages -> re-filter when the proxy IP fails -> continue capturing web pages -> done
1. IP Proxy Source
There are two selection principles: usable and free. After some research and searching, the IP proxies from two websites proved relatively reliable: freeproxylists.net and xroxy.com
Considering the number of countries covered, the number of proxies listed, proxy availability, and the text format of the listings, the former was chosen as the main proxy source and the latter as a supplement; later practical tests showed that this choice basically met the requirements.
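Collecting a country's candidate list can itself be scripted. A minimal sketch, with a hypothetical URL (the real page path and query parameter on freeproxylists.net may differ, so check the site first):

curl -o us_proxy.html "http://www.freeproxylists.net/?c=US"   # hypothetical country parameter
# the IP/port table is then extracted from us_proxy.html into a plain text file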
2. Text preprocessing
The proxy list obtained from freeproxylists.net includes IP address, port, type, anonymity, country, and so on, but we only need IP + port, so the raw proxy source text must be preprocessed first.
Whitespace processing commands:
sed -e "s/\s\{2,\}/:/g" $file_input > $file_split
sed -i "s/ /:/g" $file_split
Commands to assemble the proxy address (IP:port):
proxy_ip=$(echo $line | cut -f 1 -d ":")
proxy_port=$(echo $line | cut -f 2 -d ":")
proxy=$proxy_ip":"$proxy_port
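To see what these commands do, here is the chain applied to one made-up preprocessed line:

line="1.2.3.4:8080:HTTP:Anonymous:United States"   # example input, not a real proxy
proxy_ip=$(echo $line | cut -f 1 -d ":")    # -> 1.2.3.4
proxy_port=$(echo $line | cut -f 2 -d ":")  # -> 8080
proxy=$proxy_ip":"$proxy_port               # -> 1.2.3.4:8080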
3. Check IP proxy
After text preprocessing, the proxy IPs are in the standard IP:port format. Next comes the screening test: determining which proxies are usable and which are not (some entries in the source are dead or download far too slowly, and must be filtered out).
Command to check whether a proxy IP is usable by capturing a web page with curl:
cmd="curl -y 60 -Y 1 -m 300 -x $proxy -o $file_html$index $url_html"
$cmd
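For reference, the curl options used here (these are standard curl flags):

# -y 60 -Y 1 : abort if the transfer speed stays below 1 byte/s for 60 seconds
# -m 300     : give up entirely after 300 seconds
# -x $proxy  : send the request through the proxy (IP:port)
# -o FILE    : write the downloaded page to FILE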
4. Save IP proxy
Check whether a proxy IP is usable; if it is, save it.
To decide whether a proxy IP is usable, check whether the web page downloaded in step 3 ($file_html$index) was produced. The specific command is as follows:
if [ -e ./$file_html$index ]; then
    echo $proxy > $2
    break;
fi
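Note that [ -e ] only tests that the file exists; curl may leave an empty file behind on a partially failed transfer. A slightly stricter variant (my suggestion, not in the original script) uses -s, which also requires the file to be non-empty:

if [ -s ./$file_html$index ]; then   # -s: file exists AND its size is greater than zero
    echo $proxy > $2
    break
fi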
5. IP proxy crawling web pages
Use the proxy IP saved in step 4 to capture the ranking pages and game pages of the 12 countries. The specific command is as follows:
proxy_cmd="curl -y 60 -Y 1 -m 300 -x $proxy -o $proxy_html $proxy_http"
$proxy_cmd
6. IP proxy faults
IP proxy faults come in several forms; a few were already listed in the problem analysis above. In detail:
A. The proxy IP suddenly becomes invalid while a web page is being crawled.
B. The proxy IP is not invalid, but it crawls pages so slowly that the day's capture cannot finish within 24 hours, so the daily game ranking report cannot be generated.
C. All proxy IPs are invalid; even after one or more round-robin passes over the source, the day's capture task cannot be completed.
D. The whole network route is congested, so pages crawl slowly or not at all through any proxy, and every proxy IP may be misjudged as failed. How do we recover and correct this?
7. Re-check IP proxy
When the IP proxy failures of step 6 occur during web page capture, designing a reasonable and efficient proxy re-filtering and recovery mechanism is the core of the whole IP proxy screening system.
The round-robin re-filtering process for fault recovery is described by points A through E below.
Pay attention to the following points during the process:
A. Check the previous day's proxy IP first. Yesterday's proxy completed all of its web page capture tasks, so the probability that it is still usable is relatively high, and it gets priority today. If it is unusable, select another.
B. If the previous proxy IP is unusable today, re-traverse the proxy IP source. As soon as a usable proxy is detected, leave the loop, update the saved proxy, and record its position in the source so that the next traversal can start from there.
C. If the proxy newly selected in step B suddenly fails or is too slow, continue filtering from the source position recorded in step B. If a usable proxy is found, keep capturing pages; if not, traverse the whole source again.
D. If a full re-traversal of the source finds no usable proxy, keep traversing the source repeatedly until either a proxy becomes usable or midnight arrives (meaning no usable proxy was found all day today).
E. If every proxy in step D is invalid and nothing usable can be found all day, the day's web page capture cannot be completed. Before the master control script restarts in the early morning of the next day, the looping background process from step D must be killed first; otherwise today's and tomorrow's background crawlers would run at the same time (two asynchronous background capture processes), producing stale or wrong ranking data and wasting bandwidth. For killing the day's zombie background capture processes, see the kill_curl.sh script in my previous blog post (Linux web page capture example -> automated master control script -> kill_curl.sh); the principle is kill -9 on the process IDs. The key script code is as follows:
while [ ! -z "$(ps -ef | grep curl | grep -v grep | cut -c 9-15)" ]
do
    ps -ef | grep curl | grep -v grep | cut -c 15-20 | xargs kill -9
    ps -ef | grep curl | grep -v grep | cut -c 9-15 | xargs kill -9
done
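On systems that provide pkill (most modern Linux distributions), the same cleanup can be written more robustly in one line; this is an alternative, not the original script's method:

pkill -9 curl    # kill every running curl process by name, no column cutting needed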
8. Complete web page capture
Through the IP proxy filtering system above, usable free proxy IPs for the 12 countries are filtered out, completing the daily task of capturing the ranking pages and game pages of the 12 countries.
After that, the game attribute information is extracted from the pages and processed to generate daily reports, scheduled mail delivery, and trend chart queries. For details, see my previous blog post: Linux web page capture example.
Script implementation
The basic flow of IP proxy filtering is fairly simple. The data formats and implementation steps are as follows:
First, collect candidate proxy IP sources from the freeproxylists.net website (taking the United States as an example).
Then clear the whitespace; for the specific commands, see [Solution Design] -> [2. Text preprocessing].
Then test whether each preprocessed proxy IP is usable; see [Solution Design] -> [3. Check IP proxy] above. The data format at each of these three stages is illustrated below.
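Illustrative data at each stage (the addresses are made up, and the real listing layout on freeproxylists.net may differ):

# 1) raw listing saved from freeproxylists.net (columns separated by runs of spaces):
#    1.2.3.4     8080    HTTP    Anonymous    United States
#    5.6.7.8     3128    HTTP    Transparent  United States
# 2) after text preprocessing ($file_split):
#    1.2.3.4:8080:HTTP:Anonymous:United States
#    5.6.7.8:3128:HTTP:Transparent:United States
# 3) after detection ($file_output) -- the first proxy that worked:
#    1.2.3.4:8080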
The following describes how the text preprocessing and proxy filtering are implemented in shell scripts.
1. Text preprocessing
# file processing
log='Top800proxy.log'
dtime=$(date +%Y-%m-%d__%H:%M:%S)

function select_proxy()
{
    if [ ! -d $dir_split ]; then
        mkdir $dir_split
    fi
    if [ ! -d $dir_output ]; then
        mkdir $dir_output
    fi
    if [ ! -e $log ]; then
        touch $log
    fi
    echo "================== Top800proxy $dtime ==================" >> $log

    for file in `ls $dir_input`; do
        echo $file >> $log
        file_input=$dir_input$file
        echo $file_input >> $log
        file_split=$dir_split$file"_split"
        echo $file_split >> $log
        rm -rf $file_split
        touch $file_split
        sed -e "s/\s\{2,\}/:/g" $file_input > $file_split
        sed -i "s/ /:/g" $file_split
        file_output=$dir_output$file"_out"
        echo $file_output >> $log
        proxy_output "$file_split" "$file_output"
        echo '' >> $log
    done
    echo '' >> $log
}
Script Function Description:
The if statements check for and create the folders $dir_split and $dir_output used to store results while processing the IP source; the former holds the text after preprocessing, the latter holds the usable proxy IPs found by detection.
The sed -e statement replaces runs of two or more whitespace characters in the input text with a single ':' character.
The sed -i statement then converts the remaining single spaces in the text into ':' characters.
The intermediate conversion results are saved in the folder $dir_split.
The final lines build $file_output and pass "$file_split" and "$file_output" as file parameters to the proxy IP detection function (proxy_output), which filters out the usable proxies.
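A hedged usage sketch: the directory variables and test URL below are assumptions (the original post defines them elsewhere in the full script), shown only so the function can be tried standalone:

dir_input='./proxy_src/'     # raw lists saved from freeproxylists.net
dir_split='./proxy_split/'   # preprocessed IP:port text goes here
dir_output='./proxy_out/'    # detected usable proxies go here
url_html='http://www.example.com/'   # any stable page works as a test target
select_proxy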
2. Proxy IP filtering
index=1
file_html=$dir_output"html_"
cmd=''

function proxy_output()
{
    rm -rf $2
    touch $2
    rm -rf $file_html*
    index=1
    while read line
    do
        proxy_ip=$(echo $line | cut -f 1 -d ":")
        proxy_port=$(echo $line | cut -f 2 -d ":")
        proxy=$proxy_ip":"$proxy_port
        echo $proxy >> $log
        cmd="curl -y 60 -Y 1 -m 300 -x $proxy -o $file_html$index $url_html"
        echo $cmd >> $log
        $cmd
        if [ -e ./$file_html$index ]; then
            echo $proxy >> $2
            break;
        fi
        index=`expr $index + 1`
    done < $1
    rm -rf $file_html*
}
Script Function Description:
The first three lines of the proxy IP filtering function proxy_output clear the results of the previous run; this is initialization.
The while loop traverses the preprocessed "$file_split" passed in as parameter $1 and checks each proxy IP for usability. The steps are as follows:
A. First assemble the proxy address in IP:port format: cut splits each text line on ':', extracts the first field (IP) and the second field (port), and the two are concatenated into IP:port.
B. Build the web page capture command cmd with curl, then run $cmd to download the page.
C. After the download command runs, check whether a downloaded page file was produced; this determines whether the assembled proxy ($proxy) is usable. If usable, save the proxy to "$file_output" (parameter $2) and exit the traversal (break).
D. If the current proxy is unusable, read the next line and continue the check.
Example of capturing web pages with the proxy IP:
Using the free proxy IPs filtered out by the proxy system above, the following script segment captures a game ranking web page:
index=0
while [ $index -le $TOP_NUM ]
do
    url=$url_start$index$url_end
    url_cmd='curl -y 60 -Y 1 -m 300 -x '$proxy' -o '$url_output$index' '$url
    echo $url_cmd
    date=$(date "+%Y-%m-%d___%H-%M-%S")
    echo $index >> $log
    echo $url"___________________$date" >> $log
    $url_cmd

    # wait for the downloaded file, with a timeout
    seconds=0
    while [ ! -f $url_output$index ]
    do
        sleep 1
        echo $url_output$index"________________no exist" >> $log
        $url_cmd
        seconds=`expr $seconds + 1`
        echo "seconds____________"$seconds >> $log
        if [ $seconds -ge 5 ]; then
            select_proxy
            url_cmd='curl -y 60 -Y 1 -m 300 -x '$proxy' -o '$url_output$index' '$url
            seconds=0
        fi
    done
    index=`expr $index + 24`
done
Script Function Description:
The shell snippet above captures the web pages; its core line is select_proxy.
Its role was described above: when the current proxy IP suddenly fails, crawls a page too slowly, or all proxy IPs are dead so the day's capture task cannot finish, select_proxy re-filters the proxy IPs and restores web page capture. It is the core piece of the recovery code.
Its design follows [Solution Design] -> [7. Re-check IP proxy] above, and its implementation principle can be inferred from the [2. Proxy IP filtering] script; the full source code is not posted here.
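For readers who want a starting point, here is a minimal sketch of the re-check loop described in [7. Re-check IP proxy]. The names $last_proxy_file, $proxy_src, and $url_html are assumptions, not from the original script, and the real implementation is more involved:

function recheck_proxy()
{
    # A. try yesterday's proxy first -- it finished a full day's capture,
    # so the odds that it still works are relatively high
    proxy=$(cat $last_proxy_file)
    rm -f test.html
    curl -y 60 -Y 1 -m 300 -x $proxy -o test.html $url_html
    if [ -s test.html ]; then rm -f test.html; return 0; fi

    # B./C./D. otherwise traverse the whole preprocessed source, repeatedly,
    # until some proxy works or the day runs out
    while true
    do
        while read line
        do
            proxy=$(echo $line | cut -f 1-2 -d ":")   # keep only IP:port
            rm -f test.html
            curl -y 60 -Y 1 -m 300 -x $proxy -o test.html $url_html
            if [ -s test.html ]; then
                rm -f test.html
                echo $proxy > $last_proxy_file   # save for the next run (step B)
                return 0
            fi
        done < $proxy_src
        # E. give up at midnight so the next day's run starts clean
        if [ $(date +%H) -eq 0 ]; then return 1; fi
    done
}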