Page crawling (to be supplemented)

Source: Internet
Author: User

I believe every webmaster of a personal website has, at some point, captured data from other sites. At present there are only two ways to do this:

1. Use a third-party tool. The best known is the Locomotive collector, which we will not cover here.

2. Write the program yourself. This requires the webmaster to have some development ability.

At first I tried third-party tools to capture the data I needed. The popular ones did not meet my requirements and were too complicated; I could not work out how to use them quickly, so I decided to write my own. Now I can basically finish the capture program for one website in about half a day (development time only, not counting the time spent capturing the data).

While capturing data I ran into many difficulties. The most common one is capturing paged data, because paging comes in many forms. Below I describe how to capture paged data in three of those forms. I have seen many articles like this online, but whenever I took other people's code it always had some problem. The code below runs correctly and is what I currently use. It is written in C#; I think the principles are roughly the same in other languages.

Now to the main topic:

Method 1: The URL contains the paging information. This is the simplest form, and capturing it with a third-party tool is also easy, with basically no code required. But as someone who would rather spend half a day writing code than learn a third-party tool, I still wrote it myself.

For this form, generate the URL of each page in a loop, request each URL through HttpWebRequest, and get back the HTML text of that page. The next task is to parse the string and save the required content to a local database. The capture code looks like this:

using System.IO;
using System.Net;
using System.Text;

public string GetResponseString(string url)
{
    string _StrResponse = "";
    HttpWebRequest _WebRequest = (HttpWebRequest)WebRequest.Create(url);
    _WebRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.5.30729)";
    _WebRequest.Method = "GET";
    // The using blocks close the response and the reader even if reading fails.
    using (WebResponse _WebResponse = _WebRequest.GetResponse())
    // gb2312 is the encoding of the captured pages; change it to match the target site.
    using (StreamReader _ResponseStream = new StreamReader(_WebResponse.GetResponseStream(), Encoding.GetEncoding("gb2312")))
    {
        _StrResponse = _ResponseStream.ReadToEnd();
    }
    return _StrResponse;
}

The code above returns the HTML content of the requested page as a string. The remaining work is to extract the information you care about from that string.
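The two steps around the request, generating the page URLs in a loop and then pulling content out of the returned string, might look roughly like the sketch below. The URL template with its "{0}" placeholder, the class name PagedUrlSketch, and the choice of extracting link texts are all assumptions made for illustration; the original article does not show this part.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PagedUrlSketch
{
    // Builds one URL per page by substituting the page number into a template.
    // The "{0}" placeholder convention is an assumption for this sketch.
    public static List<string> BuildPageUrls(string urlTemplate, int firstPage, int lastPage)
    {
        var urls = new List<string>();
        for (int page = firstPage; page <= lastPage; page++)
        {
            urls.Add(string.Format(urlTemplate, page));
        }
        return urls;
    }

    // Returns the inner text of every <a> element found in an HTML string.
    // A regular expression is enough for simple, regular markup; irregular
    // pages may need a real HTML parser instead.
    public static List<string> ExtractLinkTexts(string html)
    {
        var texts = new List<string>();
        var options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        foreach (Match m in Regex.Matches(html, @"<a\b[^>]*>(.*?)</a>", options))
        {
            texts.Add(m.Groups[1].Value.Trim());
        }
        return texts;
    }
}
```

Each URL from BuildPageUrls would be passed to GetResponseString, and the returned string to an extraction routine such as ExtractLinkTexts before saving the results.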


Method 2: This form is common on websites developed with ASP.NET. The paging control submits the paging information to the server-side code through a POST request. When you click a page number, the URL does not change, but the page number and the page content do. When you hover over a page number, the status bar shows something like javascript:__doPostBack("gridview", "page1"). This form is not very difficult either, because the page numbers still follow a pattern.

As we know, an HTTP request is usually submitted with one of two methods, GET or POST. How they work is not the focus of this article, so I will not elaborate.

To capture such pages, pay attention to several important elements of the page.

1. __VIEWSTATE. This is specific to ASP.NET, and it is something .NET developers both love and hate. If you open a page and find this field followed by a long run of seemingly random characters, the site was almost certainly written with ASP.NET.

2. The __doPostBack method. This is a JavaScript method generated automatically by the page. It takes two parameters, __EVENTTARGET and __EVENTARGUMENT. To find the values for these two parameters, look at the page-number links, because the page number information is passed through them when you click to turn the page.

3. __EVENTVALIDATION. This should also be specific to ASP.NET.

You don't have to worry too much about what these three things do. You just need to pay attention to submitting these three elements when writing your own code to capture the page.
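As a rough sketch of how these three elements can be read out of a captured page: the hidden fields are ordinary input elements, and the __doPostBack arguments sit in the href of each page-number link. The class and method names below are hypothetical, and the pattern assumes the name attribute appears before the value attribute, as it usually does in ASP.NET output.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PostBackSketch
{
    // Returns the value of a hidden <input> such as __VIEWSTATE or
    // __EVENTVALIDATION, or null if the field is not present.
    // Assumption: the name attribute precedes the value attribute.
    public static string GetHiddenField(string html, string fieldName)
    {
        string pattern = "<input[^>]*\\bname=\"" + Regex.Escape(fieldName)
                       + "\"[^>]*\\bvalue=\"([^\"]*)\"";
        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Splits a link such as javascript:__doPostBack('gridview','page1')
    // into { eventTarget, eventArgument }, or returns null if it does not match.
    public static string[] ParsePostBackLink(string href)
    {
        Match m = Regex.Match(
            href,
            @"__doPostBack\(\s*['""]([^'""]*)['""]\s*,\s*['""]([^'""]*)['""]\s*\)");
        return m.Success ? new[] { m.Groups[1].Value, m.Groups[2].Value } : null;
    }
}
```

The first value returned by ParsePostBackLink is what gets posted as __EVENTTARGET and the second as __EVENTARGUMENT.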

As in the first method, build the __doPostBack parameters in a loop; only the parameter that carries the page number needs to change. Note that each POST request for the next page must include the __VIEWSTATE and __EVENTVALIDATION values of the current page. So fetch the first page with the first method, extract its __VIEWSTATE and __EVENTVALIDATION values, and then process the following pages in a loop. After capturing each page, record its __VIEWSTATE and __EVENTVALIDATION values for use in the POST request for the next page.

The reference code is as follows:

for (int i = 0; i < 1000; i++)
{
    try
    {
        System.Net.WebClient WebClientObj = new System.Net.WebClient();
        System.Collections.Specialized.NameValueCollection PostVars = new System.Collections.Specialized.NameValueCollection();
        PostVars.Add("__VIEWSTATE", "the value obtained in advance goes here");
        PostVars.Add("__EVENTVALIDATION", "the value obtained in advance goes here");
        PostVars.Add("__EVENTTARGET", "the corresponding __doPostBack parameter goes here");
        PostVars.Add("__EVENTARGUMENT", "the corresponding __doPostBack parameter goes here");
        WebClientObj.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
        // The first argument is the address of the paged list; it is left empty here.
        byte[] byte1 = WebClientObj.UploadValues("", "POST", PostVars);
        string ResponseStr = System.Text.Encoding.UTF8.GetString(byte1); // HTML text of the current page
        GetPostValue(ResponseStr); // extract __VIEWSTATE and the other values needed for the next page
        SaveMessage(ResponseStr); // save the content you care about to the database
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }
}
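The part that is easy to get wrong is carrying the hidden values from one page to the next. The sketch below isolates that logic, under several assumptions: the class and method names are hypothetical, the actual POST is passed in as a delegate so the loop can be exercised without a network, and the event argument is assumed to be "page" followed by the page number, as in the __doPostBack example earlier. A real site may use a different argument format.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Text.RegularExpressions;

public static class PostBackPagerSketch
{
    // Reads a hidden field value; returns "" when the field is absent.
    private static string Hidden(string html, string name)
    {
        Match m = Regex.Match(
            html,
            "<input[^>]*\\bname=\"" + Regex.Escape(name) + "\"[^>]*\\bvalue=\"([^\"]*)\"",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : "";
    }

    // Starts from the HTML of page 1 and requests pages 2..lastPage.
    // "post" sends the form fields and returns the HTML of the resulting page.
    public static List<string> CrawlPages(
        string firstPageHtml, string eventTarget, int lastPage,
        Func<NameValueCollection, string> post)
    {
        var pages = new List<string> { firstPageHtml };
        string html = firstPageHtml;
        for (int page = 2; page <= lastPage; page++)
        {
            var postVars = new NameValueCollection();
            // Hidden values always come from the page captured just before.
            postVars.Add("__VIEWSTATE", Hidden(html, "__VIEWSTATE"));
            postVars.Add("__EVENTVALIDATION", Hidden(html, "__EVENTVALIDATION"));
            postVars.Add("__EVENTTARGET", eventTarget);
            postVars.Add("__EVENTARGUMENT", "page" + page); // assumed argument format
            html = post(postVars);
            pages.Add(html);
        }
        return pages;
    }
}
```

In real use, the delegate could wrap WebClient.UploadValues and decode the returned bytes, as in the reference code above.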


Method 3: This is the most troublesome and unpleasant form. No paging information can be found anywhere while turning pages. It cost me a lot of effort, and in the end I used code to simulate manual page turning. This approach should be able to handle any form of paged data. The principle is to have code simulate a manual click on the page-turning link, turn one page at a time, and capture each page as it is shown.
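The article does not show code for this method or say which component performs the click, so the following is only a sketch of the surrounding loop. It assumes a hypothetical IPageDriver interface that hides whatever actually drives the page (an embedded browser control or a browser automation tool, for instance), and it includes an in-memory implementation purely so the loop can be tried without a browser.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical abstraction over the component that displays and clicks the page.
public interface IPageDriver
{
    string GetHtml();   // HTML of the page currently displayed
    bool ClickNext();   // clicks the page-turning link; false if there is none
}

// Stand-in driver backed by a fixed list of pages, for trying out the loop.
public sealed class InMemoryDriver : IPageDriver
{
    private readonly string[] _pages;
    private int _index;

    public InMemoryDriver(params string[] pages) { _pages = pages; }

    public string GetHtml() { return _pages[_index]; }

    public bool ClickNext()
    {
        if (_index + 1 >= _pages.Length) return false;
        _index++;
        return true;
    }
}

public static class ClickPagerSketch
{
    // Captures the current page, clicks "next", and repeats until there is
    // no next link, the content stops changing, or maxPages is reached.
    public static List<string> CaptureAll(IPageDriver driver, int maxPages)
    {
        var pages = new List<string>();
        string previous = null;
        for (int i = 0; i < maxPages; i++)
        {
            string html = driver.GetHtml();
            if (html == previous) break; // click had no effect: treat as the last page
            pages.Add(html);
            previous = html;
            if (!driver.ClickNext()) break;
        }
        return pages;
    }
}
```

The maxPages limit and the unchanged-content check are design choices of this sketch, added so that a link that never disappears cannot cause an endless loop.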

