Methods for Capturing Paginated Web Page Data
I believe every webmaster of a personal website has had the experience of capturing data from other sites. Broadly, there are two ways to capture another website's data:
1. Use a third-party tool, the most famous of which is the Locomotive collector. We will not introduce it here.
2. Write your own capture program. This requires the webmaster to have some development ability.
At first, I tried to use third-party tools to capture the data I needed. The popular tools on the Internet either did not meet my requirements or were too complicated, and I could not work out how to use them quickly, so I decided to write the programs myself. Now I can usually finish the capture program for a website in about half a day (development time only, not counting the time the capture itself takes).
After capturing data for a while, I have run into many difficulties. One of the most common is capturing paginated data, because pagination comes in many forms. This article describes how to capture paginated data in three of those forms. I have seen plenty of code for this on the Internet, but it always had one problem or another. The code in the methods below runs correctly and I am still using it. It is written in C#; I think the principles are roughly the same in other languages.
Now to the main topic:
Method 1: The URL contains the paging information. This form is the simplest. Capturing it with a third-party tool is also very simple and needs basically no code, but as someone who would rather spend half a day writing code than learning a third-party tool, I did it myself.
In this form, you generate the URL of each data page in a loop, request each URL through HttpWebRequest, and get back the HTML text of the corresponding page. The next task is to parse that string and save the required content to the local database. For the capture code, refer to the following:
// Requires: using System.IO; using System.Net; using System.Text;
public string GetResponseString(string url)
{
    string _strResponse = "";
    HttpWebRequest _webRequest = (HttpWebRequest)WebRequest.Create(url);
    // Present the request as an ordinary browser.
    _webRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)";
    _webRequest.Method = "GET";
    WebResponse _webResponse = _webRequest.GetResponse();
    // The target pages are encoded as gb2312; change this to match the site being captured.
    StreamReader _responseStream = new StreamReader(_webResponse.GetResponseStream(), Encoding.GetEncoding("gb2312"));
    _strResponse = _responseStream.ReadToEnd();
    _responseStream.Close();
    _webResponse.Close();
    return _strResponse;
}
The code above can return the HTML content string of the corresponding page. The rest of the work is to get the information you are concerned about from this string.
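The loop that generates the page URLs can be sketched roughly as follows. This is only a minimal sketch under assumptions: the `PagedUrls` class, the `page` query parameter and the `{0}` placeholder are made up for illustration, and a real site may put the page number somewhere else in the URL.

```csharp
using System;
using System.Collections.Generic;

public static class PagedUrls
{
    // Builds one URL per page by substituting the page number into a template.
    // The template format ("{0}" marks the page number) is an assumption.
    public static List<string> Build(string urlTemplate, int firstPage, int lastPage)
    {
        var urls = new List<string>();
        for (int page = firstPage; page <= lastPage; page++)
        {
            urls.Add(string.Format(urlTemplate, page));
        }
        return urls;
    }
}
```

Each URL returned by `PagedUrls.Build("http://www.xxxx.cn/messagelist.aspx?page={0}", 1, 100)` could then be passed to GetResponseString to fetch that page.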
Method 2: This form is common on websites developed with ASP.NET. The paging control submits the paging information to the server-side code through POST. When you click a page number, the URL does not change, but the page number and the page content do. When you move the cursor over a page number, the status bar displays code such as javascript:__doPostBack("gridview", "page1"). This form is not very difficult, because the page numbers still follow a rule.
As we know, a page request is usually submitted in one of two ways, GET or POST. How the two work is not the focus of this article, so I will not elaborate on it here.
To capture such pages, pay attention to several important elements of an ASP.NET page.
1. __VIEWSTATE, which is specific to .NET and is something .NET developers both love and hate. When you open a page of a website and find this field followed by a long string of seemingly random characters, the website was almost certainly written with ASP.NET.
2. The __doPostBack method. This is a JavaScript method generated automatically by the ASP.NET page. It takes two parameters, which are posted as __EVENTTARGET and __EVENTARGUMENT. To find the values for these two parameters, look at the page-number links, because when you click to turn the page, the page number information is passed in these two parameters.
3. __EVENTVALIDATION, which should also be specific to ASP.NET.
You don't have to worry too much about what these three things do. You just need to remember to submit them when you write your own code to capture the page.
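To pull the two __doPostBack arguments out of a page-number link, a regular expression is usually enough. The sketch below assumes the link looks like the status-bar text described earlier, with the quotes possibly HTML-encoded in the page source; the `PostBackLink` class and its `TryParse` method are hypothetical names used only for illustration.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class PostBackLink
{
    // Matches __doPostBack('target','argument') with single or double quotes.
    private static readonly Regex Pattern = new Regex(
        @"__doPostBack\(\s*['""](?<target>[^'""]*)['""]\s*,\s*['""](?<argument>[^'""]*)['""]\s*\)");

    // Returns true and fills both values when the href contains a __doPostBack call.
    public static bool TryParse(string href, out string target, out string argument)
    {
        // Quotes inside an href attribute are often encoded as &#39; so decode first.
        Match m = Pattern.Match(WebUtility.HtmlDecode(href ?? ""));
        target = m.Success ? m.Groups["target"].Value : "";
        argument = m.Success ? m.Groups["argument"].Value : "";
        return m.Success;
    }
}
```

The two values it returns are what would be posted as __EVENTTARGET and __EVENTARGUMENT.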
As in the first method, you build the __doPostBack parameters in a loop; only the parameter that carries the page number needs to change. Note that before each POST request for the next page, you must first obtain the __VIEWSTATE and __EVENTVALIDATION values of the current page. So fetch the first page of the paginated data with the first method, extract its __VIEWSTATE and __EVENTVALIDATION values, and then process the following pages in a loop: after capturing each page, record its __VIEWSTATE and __EVENTVALIDATION values so they can be submitted in the POST for the next page.
The reference code is as follows:
// Requires: using System; using System.Text;
for (int i = 0; i < 1000; i++)
{
    System.Net.WebClient webClientObj = new System.Net.WebClient();
    System.Collections.Specialized.NameValueCollection postVars = new System.Collections.Specialized.NameValueCollection();
    postVars.Add("__VIEWSTATE", "the __VIEWSTATE value obtained in advance from the previous page");
    postVars.Add("__EVENTVALIDATION", "the __EVENTVALIDATION value obtained in advance from the previous page");
    postVars.Add("__EVENTTARGET", "the first parameter of the __doPostBack method");
    postVars.Add("__EVENTARGUMENT", "the second parameter of the __doPostBack method");
    webClientObj.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
    try
    {
        byte[] byte1 = webClientObj.UploadValues("http://www.xxxx.cn/messagelist.aspx", "POST", postVars);
        string responseStr = Encoding.UTF8.GetString(byte1); // Obtain the HTML text of the current page
        GetPostValue(responseStr); // Obtain the __VIEWSTATE and related values of the current page
        SaveMessage(responseStr); // Save the content you care about to the database
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }
}
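The GetPostValue helper called in the loop is not shown above. One possible way to do the extraction it needs is sketched below. The `HiddenFields.Extract` name is hypothetical, and the sketch assumes the hidden fields are rendered in the usual ASP.NET form, with the name attribute before the value attribute inside the same input tag.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HiddenFields
{
    // Returns the value of the hidden <input> with the given name, or "" if it is absent.
    public static string Extract(string html, string fieldName)
    {
        // Assumes markup like:
        // <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="..." />
        string pattern = "<input[^>]*\\bname=\"" + Regex.Escape(fieldName)
                       + "\"[^>]*\\bvalue=\"(?<value>[^\"]*)\"";
        Match m = Regex.Match(html ?? "", pattern, RegexOptions.IgnoreCase);
        return m.Success ? WebUtility.HtmlDecode(m.Groups["value"].Value) : "";
    }
}
```

Calling `HiddenFields.Extract(responseStr, "__VIEWSTATE")` and `HiddenFields.Extract(responseStr, "__EVENTVALIDATION")` after each page gives the values to post for the next page.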
Method 3: This form is the most troublesome and unpleasant. While you turn pages, no paging information can be found anywhere. It cost me a lot of effort. In the end, I used code to simulate turning the pages by hand, an approach that should be able to handle paginated data in any form. The principle is to simulate a manual click on the page-turning link in code, turn one page at a time, and capture each page as it is turned.