Development notes on a C++-based proxy list spider


1. First, download the HTML file of a web page to the local machine. Because the C++ program runs as a standalone EXE, I did not consider the relevant MFC classes; a search of the internet showed that libcurl handles HTML downloads well. Used this way, only a small fraction of libcurl's features is exercised, but since working example code already exists online, it can be adapted directly to download a page's HTML.

#include <stdio.h>
#include <io.h>
#include <curl/curl.h>

// Callback through which libcurl delivers received data, effectively a
// recv loop. The stream argument is user-defined; here it carries the
// path of the file to save into.
static size_t write_callback(void *ptr, size_t size, size_t nmemb, void *stream)
{
    size_t len = size * nmemb;
    FILE *fp = NULL;
    if (access((char *)stream, 0) == -1)
        fp = fopen((char *)stream, "wb");   // file absent: create it
    else
        fp = fopen((char *)stream, "ab");   // file present: append
    if (fp) {
        fwrite(ptr, size, nmemb, fp);
        fclose(fp);
    }
    return len;
}

int GetUrl(const char *url, const char *savepath)
{
    CURL *curl;
    CURLcode res = CURLE_FAILED_INIT;
    curl_global_init(CURL_GLOBAL_ALL);
    curl = curl_easy_init();
    if (curl) {
        curl_easy_setopt(curl, CURLOPT_URL, url);
        // Register the callback that receives the body
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_callback);
        // Passed through to the callback's stream argument
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, savepath);
        res = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
    }
    curl_global_cleanup();
    return (res == CURLE_OK) ? 1 : 0;
}

int main()
{
    if (GetUrl("http://www.google.com/search?q=proxy+list&start=20", "2.xml"))
        printf("OK");
    return 0;
}

The URL passed in here follows Google's query format: the search keyword is the string after q=, and the result offset is the number after start=. Compile and run, and the corresponding Google results page is downloaded.
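The q=/start= scheme described above can be captured in a small helper. BuildGoogleUrl is a hypothetical function, not part of the original code; it assumes Google's ten-results-per-page convention so that page N maps to start=N*10.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: build a Google search URL from an
// already-URL-encoded keyword and a zero-based page index.
// Assumes ten results per page, so start = page * 10.
std::string BuildGoogleUrl(const std::string &keyword, int page)
{
    std::ostringstream oss;
    oss << "http://www.google.com/search?q=" << keyword
        << "&start=" << page * 10;
    return oss.str();
}
```

With keyword "proxy+list" and page 2, this produces the same URL the main() above hard-codes.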

2. Parse the HTML page:

There are two stages. The first is to parse, out of the Google results, the URLs of pages that carry proxy lists; the second is to visit each such URL and extract the proxies. Two problems arise here. First, I cannot know in advance whether the home page behind a URL returned by Google actually carries a proxy list. Second, a proxy list usually spans several pages, and since every site structures its pagination differently, a generic page-flip cannot be implemented for all of them. The compromise is to take only the proxy list found on each home page and parse that locally.

Anyone who has studied Google's page source will know its structure, and a suitable XML or HTML parsing library could be used here. C++ has comparatively few such libraries. I first tried tinyxml, but found that writing its parse output straight back to a file lost part of the information, so tinyxml was abandoned. An HTML parsing library such as htmlcxx should be able to do the job. However, because of a peculiarity of Google's page source, every URL returned for the keyword sits between two fixed strings, so the simpler route is to read the downloaded HTML file line by line and extract the URLs using those markers. The source code follows.

#include <iostream>
#include <iomanip>
#include <fstream>
#include <string>
using namespace std;

int main()
{
    char buf[1027] = {0};
    string str;
    bool find_flag = false;
    const string str_google_begin("

3. After a URL carrying a proxy list has been resolved, use the regex library shipped with Boost to match every string of the IP:port form and extract it. Building Boost is somewhat involved; consult the Boost documentation for both the build process and the regex operations.

The HTML page behind each URL is first downloaded to a local cache file; the cached file is then read back and every string matching IP:port is selected from it.

void analyse_proxy(const char *path)
{
    // Match strings of the form IP:port
    const char *szReg = "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3}):(\\d{1,5})";
    char buf[1027] = {0};
    string str;
    FILE *fp = fopen(path, "r");
    while (fgets(buf, 1027, fp) != NULL) {
        str = buf;
        const char *szStr = str.c_str();
        // Use an iterator to walk every IP:port match in the line
        boost::regex reg(szReg);
        boost::cregex_iterator itrBegin = boost::make_regex_iterator(szStr, reg);
        boost::cregex_iterator itrEnd;
        for (boost::cregex_iterator itr = itrBegin; itr != itrEnd; ++itr) {
            // Print the matched substring
            cout << *itr << endl;
        }
    }
    fclose(fp);
    // Reopen for writing to truncate the cached file after processing
    fp = fopen(path, "w");
    fclose(fp);
}

4. With the first three small modules done, integrate them into the larger module: the caller supplies a search keyword, invokes the EXE, and collects the module's return value. Then write a module that verifies whether each proxy server is usable; it is called by the third module, returns true or false for each server, and the status is appended to the output string. Server verification is only an idea at this point and will be completed later.

This project exposes the C++ code through a C# interface, so the network classes in C# are used directly for verification: a web request is configured to use the candidate as its proxy server, with Baidu as the test site. This approach makes verification slow, so a maximum wait time must be set; a proxy abroad may be usable yet so slow that the timeout misreports it as dead. Alternatively, some verification services could check proxy availability, but they involve POST operations and are more complicated, so that route is not taken. The following code verifies whether a proxy server is usable.

public static bool CheckProxy(string ip, int port)
{
    HttpWebRequest objHttpRequest;
    HttpWebResponse objResponse;
    WebProxy objProxy;
    // Test page; the original used Baidu here (the URL is lost in the source)
    objHttpRequest = (HttpWebRequest)WebRequest.Create("");
    objHttpRequest.Timeout = 1000;               // maximum wait time in ms
    objHttpRequest.AllowAutoRedirect = true;
    objHttpRequest.ContentType = "application/x-www-form-urlencoded";
    objProxy = new WebProxy(ip, port);
    objProxy.BypassProxyOnLocal = true;
    objHttpRequest.Proxy = objProxy;
    try
    {
        objResponse = (HttpWebResponse)objHttpRequest.GetResponse();
    }
    catch (Exception)
    {
        return false;   // unavailable or timed out
    }
    return true;        // available
}

5. Start multiple worker processes through threads.
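Step 5 is only a heading in the original. One possible shape, assuming C++11 threads and a stand-in CheckProxy (the real check lives on the C# side above), is to fan the proxy list out across one worker thread per candidate:

```cpp
#include <string>
#include <thread>
#include <vector>

// Stand-in for the availability check; the original delegates this to
// the C# CheckProxy. Placeholder logic only.
bool CheckProxy(const std::string &hostport)
{
    return !hostport.empty();
}

// Run one check per thread and collect the results.
// std::vector<char> is used instead of std::vector<bool> because the
// workers write to distinct elements concurrently, which the packed
// bool specialization does not allow.
std::vector<char> CheckAll(const std::vector<std::string> &proxies)
{
    std::vector<char> ok(proxies.size(), 0);
    std::vector<std::thread> workers;
    for (size_t i = 0; i < proxies.size(); ++i) {
        workers.push_back(std::thread([&ok, &proxies, i]() {
            ok[i] = CheckProxy(proxies[i]) ? 1 : 0;
        }));
    }
    for (size_t i = 0; i < workers.size(); ++i)
        workers[i].join();
    return ok;
}
```

For a long proxy list, a fixed-size thread pool pulling from a shared queue would scale better than one thread per proxy, but the one-per-proxy version keeps the sketch short.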
