C#: analyzing captured webpage data

Source: Internet
Author: User
Tags: key, string

First, download the entire page into a byte[] (data travels over the network as bytes), then convert it to a string so that it is easier to work with. For example:
private static string GetPageData(string url)
{
    if (url == null || url.Trim() == "")
        return null;
    WebClient wc = new WebClient(); // create the client
    wc.Credentials = CredentialCache.DefaultCredentials;
    byte[] pageData = wc.DownloadData(url);
    return Encoding.Default.GetString(pageData); // or Encoding.ASCII.GetString
}
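
As a hedged variant of the method above: WebClient implements IDisposable, and Encoding.Default depends on the machine, so one might dispose the client and let the caller name the encoding. The class name PageFetcher and the method Fetch are hypothetical and are not part of the original code.

```csharp
using System;
using System.Net;
using System.Text;

public static class PageFetcher
{
    // Hypothetical helper: downloads the raw bytes and decodes them with the
    // encoding the caller names, instead of relying on Encoding.Default.
    public static string Fetch(string url, Encoding encoding)
    {
        if (string.IsNullOrWhiteSpace(url))
            return null;

        // WebClient implements IDisposable, so release it when done.
        using (WebClient wc = new WebClient())
        {
            wc.Credentials = CredentialCache.DefaultCredentials;
            byte[] pageData = wc.DownloadData(url);
            return encoding.GetString(pageData);
        }
    }
}
```

A call such as PageFetcher.Fetch(url, Encoding.UTF8) would then decode the page as UTF-8 regardless of the local default.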

The above is only the simplest way to get the page data; it ignores the page's character encoding. To avoid garbled text, the encoding has to be handled. The code is as follows:

// url is the address of the page to fetch; charset is the encoding of the target page.
// If null or "" is passed, the encoding declared in the page is detected automatically.
private string GetPageData(string url, string charset)
{
    string strWebData = string.Empty;
    if (url != null && url.Trim() != "")
    {
        // Create a WebClient instance
        WebClient myWebClient = new WebClient();
        // Note:
        // Some pages may fail to download for reasons such as cookies or encoding problems.
        // Each case needs its own analysis, for example adding a cookie to the headers:
        // myWebClient.Headers.Add("Cookie", cookie);
        // Other overloads may be needed; write them as required.
        // Get or set the network credentials used to authenticate requests to Internet resources.
        myWebClient.Credentials = CredentialCache.DefaultCredentials;
        // If the server requires a user name and password:
        // NetworkCredential myCred = new NetworkCredential(strUser, strPassword);
        // myWebClient.Credentials = myCred;
        // Download data from the resource and return a byte array.
        byte[] myDataBuffer = myWebClient.DownloadData(url);
        strWebData = Encoding.Default.GetString(myDataBuffer);
        // Read the character encoding declared in the page
        Match charSetMatch = Regex.Match(strWebData, "<meta([^<]*)charset=([^<]*)\"", RegexOptions.IgnoreCase | RegexOptions.Multiline);
        string webCharSet = charSetMatch.Groups[2].Value;
        if (charset == null || charset == "")
        {
            // If the page declares no encoding, fall back to a default.
            if (webCharSet == null || webCharSet == "")
            {
                charset = "UTF-8";
            }
            else
            {
                charset = webCharSet;
            }
        }
        // Decode again only when the target encoding differs from the one already used.
        if (charset != null && charset != "" && Encoding.GetEncoding(charset) != Encoding.Default)
        {
            strWebData = Encoding.GetEncoding(charset).GetString(myDataBuffer);
        }
    }
    return strWebData;
}
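
The charset detection step can be tried offline. The following is a minimal sketch, not the author's code: the class CharsetSniffer and its methods are hypothetical, the pattern is a slightly stricter variant of the one above (it also accepts unquoted values and the short `<meta charset=...>` form), and the UTF-8 fallback for unknown names is an assumption.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CharsetSniffer
{
    // Matches charset=... inside a <meta> tag, with or without quotes.
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*charset\s*=\s*[""']?([A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase);

    // Returns the declared charset name, or null when none is declared.
    public static string Detect(string html)
    {
        Match m = MetaCharset.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Decodes the bytes with the declared charset, falling back to UTF-8
    // when nothing is declared or the runtime does not know the name.
    public static string Decode(byte[] data)
    {
        // Latin-1 maps every byte to one char, so the ASCII <meta> tag
        // survives this first pass whatever the real encoding is.
        string probe = Encoding.GetEncoding("ISO-8859-1").GetString(data);
        string name = Detect(probe) ?? "utf-8";
        try
        {
            return Encoding.GetEncoding(name).GetString(data);
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8.GetString(data);
        }
    }
}
```

Probing with a single-byte encoding first avoids depending on Encoding.Default, which varies between machines.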

Once the data is available as a string, the page can be parsed (which really comes down to string operations and regular expressions):
// Parse the page and find the links
// This needs extending; some forms of links are not recognized.
string strRef = @"(href|src|action)[ ]*=[ ]*[""'][^""'#>]+[""']";
MatchCollection matches = new Regex(strRef).Matches(strResponse);
strStatus += "found: " + matches.Count + " links\r\n";

The example above finds the links in a page. strRef holds the regular expression pattern, and matches holds the collection of matches. new Regex(strRef).Matches(strResponse) builds the regular expression and returns every substring of strResponse that fits the pattern; the matches variable can then be queried for details.
Only basic link forms are recognized here: links inside scripts and links whose values are not quoted are not supported. Extending the pattern to cover them is fairly simple.
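
One possible extension, sketched under assumptions: the pattern below also accepts unquoted attribute values and uses named groups so that only the link value is returned rather than the whole attribute. LinkExtractor and Extract are hypothetical names; links built inside scripts are still not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Group "q" captures a quoted value, group "u" an unquoted one.
    private static readonly Regex LinkAttr = new Regex(
        @"\b(?:href|src|action)\s*=\s*(?:[""'](?<q>[^""'#>]+)[""']|(?<u>[^\s""'#>]+))",
        RegexOptions.IgnoreCase);

    // Returns the value of every href, src, or action attribute found.
    public static List<string> Extract(string html)
    {
        var links = new List<string>();
        foreach (Match m in LinkAttr.Matches(html))
        {
            links.Add(m.Groups["q"].Success ? m.Groups["q"].Value : m.Groups["u"].Value);
        }
        return links;
    }
}
```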

Other common parsing tasks include the following:
// Get the title
Match titleMatch = Regex.Match(strResponse, "<title>([^<]*)</title>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
title = titleMatch.Groups[1].Value;

// Get the description
Match desc = Regex.Match(strResponse, "<meta name=\"description\" content=\"([^<]*)\">", RegexOptions.IgnoreCase | RegexOptions.Multiline);
strDesc = desc.Groups[1].Value;

// Get the page size (in characters)
size = strResponse.Length;

// Remove HTML tags
private string StripHtml(string strHtml)
{
    Regex objRegExp = new Regex("<(.|\n)+?>");
    string strOutput = objRegExp.Replace(strHtml, "");
    // Escape any angle brackets that are left over
    strOutput = strOutput.Replace("<", "&lt;");
    strOutput = strOutput.Replace(">", "&gt;");
    return strOutput;
}
Some edge cases may leave tags behind, so it is advisable to run the conversion twice in a row. Removing tags can also leave long runs of whitespace, which get in the way of later string operations, so add the following statements:
// Collapse every run of whitespace into a single space
Regex r = new Regex(@"\s+");
wordsOnly = r.Replace(strResponse, " ");
wordsOnly = wordsOnly.Trim(); // Trim returns a new string, so assign the result
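
The title extraction, tag stripping, and whitespace collapsing steps can be checked together on a small in-memory page. This is a minimal sketch with hypothetical names (HtmlText, Title, WordsOnly); unlike StripHtml above, it replaces each tag with a space so that adjacent words do not run together.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Returns the <title> text, or an empty string when there is none.
    public static string Title(string html)
    {
        Match m = Regex.Match(html, @"<title>([^<]*)</title>", RegexOptions.IgnoreCase);
        return m.Groups[1].Value.Trim();
    }

    // Replaces each tag with a space, then collapses whitespace runs.
    public static string WordsOnly(string html)
    {
        string noTags = Regex.Replace(html, @"<[^>]+>", " ");
        return Regex.Replace(noTags, @"\s+", " ").Trim();
    }
}
```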
The code is not polished, but it should be understandable. Remember to add using System.Net;, using System.Text; and using System.Text.RegularExpressions; at the top of the file.
A practical application: as a test, I fetched the weather forecast published by the meteorological department for a certain area.

private static string GetPageData(string url) // download the page at url as a string
{
    if (url == null || url.Trim() == "")
        return null;
    WebClient wc = new WebClient();
    wc.Credentials = CredentialCache.DefaultCredentials;
    byte[] pageData = wc.DownloadData(url);
    return Encoding.Default.GetString(pageData); // or Encoding.ASCII.GetString
}

protected void Button1_Click(object sender, EventArgs e)
{
    string strResponse = GetPageData(TextBox1.Text);
    string strRef = @"(href|src|action)[ ]*=[ ]*[""'][^""'#>]+[""']";
    MatchCollection matches = new Regex(strRef).Matches(strResponse); // substrings of strResponse that match
    TextBox2.Text = "found: " + matches.Count + " links\r\n";
    // The above is the learning example, and it tested successfully. The application follows.
    string strRef2 = @"marquee"; // the key string, found by reading the page's HTML source
    // Search for strRef2 starting at character 2800, skip 120 characters past the
    // match, and take the next 135 characters. The offsets are specific to this page.
    TextBox3.Text = strResponse.Substring(strResponse.IndexOf(strRef2, 2800) + 120, 135);
}
Test: entering the page address in TextBox1.Text succeeded.
Finally, a small change is needed because the length of the text varies. The weather text, for instance, is sometimes shorter than the fixed length, and forcing a fixed-length Substring then causes errors. My source code segment:
// Capture the weather conditions
string strResponse = GetPageData("http://www.py121.com/weathe.jsp");
string strRef2 = @"marquee";
string str_last_index = strResponse.Substring(strResponse.IndexOf(strRef2, 2800) + 113, 70).Trim();
// "time" is the marker that ends the forecast text.
// 091110: this fails when the marker appears twice; see the update below.
if (str_last_index.IndexOf("time") > 1)
{
    // The text starts at IndexOf(strRef2, 2800) + 113 and ends at the marker.
    get_weathe = strResponse.Substring(strResponse.IndexOf(strRef2, 2800) + 113, str_last_index.IndexOf("time") + 2).Trim();
}
else
{
    get_weathe = strResponse.Substring(strResponse.IndexOf(strRef2, 2800) + 113, 60).Trim();
}
This time IndexOf locates the end marker, which fixes the cases where the text sometimes could not be displayed.
get_weathe is a global variable, so it can be read from JavaScript on the front end.

091110 update

if (str_last_index.IndexOf("time") > 1)
{
    if (str_last_index.IndexOf("time") < 50) // the first marker comes early, so there may be a second one
    {
        // The text starts at IndexOf(strRef2, 2800) + 113 and ends after the last marker.
        get_weathe = strResponse.Substring(strResponse.IndexOf(strRef2, 2800) + 113, str_last_index.LastIndexOf("time") + 10).Trim();
    }
    else
    {
        // The text starts at IndexOf(strRef2, 2800) + 113 and ends at the first marker.
        get_weathe = strResponse.Substring(strResponse.IndexOf(strRef2, 2800) + 113, str_last_index.IndexOf("time") + 2).Trim();
    }
}
else
{
    get_weathe = strResponse.Substring(strResponse.IndexOf(strRef2, 2800) + 113, 60).Trim();
}
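
The fixed offsets above (2800, 113, 60) are tied to one particular page, and Substring throws ArgumentOutOfRangeException when the page is shorter than expected. As a hedged alternative, the text can be cut between two markers with bounds checks. TextSlicer.Between is a hypothetical helper, and the marker strings in the usage note are assumptions about the page, not values taken from it.

```csharp
using System;

public static class TextSlicer
{
    // Returns the text between startMarker and endMarker, searching from
    // fromIndex. If endMarker is missing, returns at most maxLength
    // characters. Returns "" when startMarker is not found.
    public static string Between(string text, string startMarker, string endMarker,
                                 int fromIndex, int maxLength)
    {
        if (string.IsNullOrEmpty(text) || fromIndex < 0 || fromIndex >= text.Length)
            return "";

        int start = text.IndexOf(startMarker, fromIndex, StringComparison.Ordinal);
        if (start < 0)
            return "";
        start += startMarker.Length;

        int end = text.IndexOf(endMarker, start, StringComparison.Ordinal);
        int length = end >= 0 ? end - start : Math.Min(maxLength, text.Length - start);
        return text.Substring(start, length).Trim();
    }
}
```

A call such as TextSlicer.Between(strResponse, "<marquee>", "</marquee>", 0, 60) would return an empty string instead of throwing when the page layout changes.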
Of course, changes and updates to a site's data can also be captured; that requires writing to and reading from a database.
The key to this example is the application of string operations and regular expressions.


A brief introduction to WebClient:
The WebClient class provides common methods for sending data to and receiving data from any local, intranet, or Internet resource identified by a URI.
The WebClient class uses the WebRequest class to provide access to resources. A WebClient instance can access data through any WebRequest descendant that has been registered with the WebRequest.RegisterPrefix method.
Note:
By default, the .NET Framework supports URIs that begin with the http:, https:, ftp:, and file: scheme identifiers.

The following WebClient methods upload data to a resource:
OpenWrite retrieves a Stream used to send data to the resource.
OpenWriteAsync retrieves a Stream used to send data to the resource, without blocking the calling thread.
UploadData sends a byte array to the resource and returns a byte array containing any response.
UploadDataAsync sends a byte array to the resource without blocking the calling thread.
UploadFile sends a local file to the resource and returns a byte array containing any response.
UploadFileAsync sends a local file to the resource without blocking the calling thread.
UploadValues sends a NameValueCollection to the resource and returns a byte array containing any response.
UploadValuesAsync sends a NameValueCollection to the resource without blocking the calling thread.
UploadString sends a string to the resource and returns a string containing any response.
UploadStringAsync sends a string to the resource without blocking the calling thread.

The following WebClient methods download data from a resource:
OpenRead returns the data from a resource as a Stream.
OpenReadAsync returns the data from a resource without blocking the calling thread.
DownloadData downloads data from a resource and returns a byte array.
DownloadDataAsync downloads data from a resource and returns a byte array, without blocking the calling thread.
DownloadFile downloads data from a resource to a local file.
DownloadFileAsync downloads data from a resource to a local file without blocking the calling thread.
DownloadString downloads a string from a resource and returns the string.
DownloadStringAsync downloads a string from a resource without blocking the calling thread.

You can use the CancelAsync method to cancel asynchronous operations that have not completed.
By default, a WebClient instance does not send optional HTTP headers. If your request requires an optional header, you must add it to the Headers collection. For example, to retain queries in the response, you must add a user-agent header. In addition, the server may return 500 (Internal Server Error) if the user-agent header is missing.
In WebClient instances, AllowAutoRedirect is set to true.
Notes to inheritors: derived classes should call the base class implementation of WebClient to ensure that the derived class works as expected.
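
Adding the user-agent header to the Headers collection might look like the sketch below. ClientFactory is a hypothetical name, and the user-agent string in the usage note is only an illustrative value.

```csharp
using System;
using System.Net;

public static class ClientFactory
{
    // Hypothetical factory: returns a WebClient that sends a User-Agent header.
    public static WebClient Create(string userAgent)
    {
        WebClient wc = new WebClient();
        wc.Credentials = CredentialCache.DefaultCredentials;
        // Header names are case-insensitive.
        wc.Headers.Add("user-agent", userAgent);
        return wc;
    }
}
```

For example, ClientFactory.Create("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)") returns a client whose later requests carry that header.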
For example:

C#: using the WebClient class to scrape mobile phone number information from a site (http://www.ip138.com)

Everyone occasionally wants to look up where a mobile number is registered and what type of card it is. For this we use the WebClient, NameValueCollection, and Regex classes, which belong to the System.Net, System.Collections.Specialized, and System.Text.RegularExpressions namespaces respectively (Encoding is in System.Text). The following code was tested in the VS2005 environment:
private void InitWeaOne()
{
    WebClient wb = new WebClient();
    NameValueCollection myNameValueCollection = new NameValueCollection();

    myNameValueCollection.Add("mobile", "13777483912");
    myNameValueCollection.Add("action", "mobile");
    // POST the form values and read the response bytes
    byte[] pageData = wb.UploadValues("http://www.ip138.com:8080/search.asp", myNameValueCollection);
    string result = Encoding.Default.GetString(pageData);
    string pat = "tdc2>([^<]*)</TD>";
    Regex r = new Regex(pat, RegexOptions.IgnoreCase);
    Match m = r.Match(result);
    string[] strInfo = new string[3] { "", "", "" };
    int i = 0;
    while (m.Success)
    {
        if (i < strInfo.Length)
        {
            // Substring(5) drops the leading "tdc2>"
            strInfo[i] = m.ToString().Substring(5);
        }
        m = m.NextMatch();
        i++;
    }
    string a = strInfo[0];
    string g = strInfo[1];
    string f = strInfo[2];
}
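
Note that m.ToString().Substring(5) removes the leading "tdc2>" but keeps the trailing "</TD>" in each value. A hedged alternative is to read the capture group, which holds only the cell text. The sketch below runs on an in-memory string; ResultParser and Cells are hypothetical names, and the sample markup in the usage note is made up for illustration rather than taken from the site.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ResultParser
{
    // Collects the text of every cell whose opening tag ends in "tdc2>",
    // using the capture group rather than Substring(5).
    public static List<string> Cells(string html)
    {
        var cells = new List<string>();
        foreach (Match m in Regex.Matches(html, @"tdc2>([^<]*)</td>", RegexOptions.IgnoreCase))
        {
            cells.Add(m.Groups[1].Value.Trim());
        }
        return cells;
    }
}
```

Given a made-up fragment such as "<TD class=tdc2>first value</TD><TD class=tdc2>second value</TD>", Cells returns the two cell texts without any markup.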
