How to capture webpage data, parse it, and remove HTML tags (C#)

Source: Internet
Author: User

<@ Attention Content="This is an original article from this blog. If you repost or quote it, please credit the source." From="robby.cnblogs.com" @>

Since I implemented this functionality in my own search engine, today I will describe how to capture webpage data, parse it, and remove HTML tags, as a reference for others. My platform is Visual Studio 2005 with C#.

--------------------- Cut -------------------------

First, capture the entire webpage content. I will not cover that step, since it is not the focus of this article. Assume the captured data is stored in recvBuffer, a byte[] (data travels over the network as bytes, not as a string). The first step is to convert recvBuffer into a string so that it is easier to work with:

    // Add the received data to the response string.
    strResponse += Encoding.ASCII.GetString(recvBuffer, 0, nBytes);



strResponse is the string that accumulates the data. Here the System.Text.Encoding class converts recvBuffer. The first argument of GetString, recvBuffer, is the raw data, that is, the byte array containing the sequence of bytes to decode. The second argument, 0, is the index of the first byte to decode, normally 0. The third argument, nBytes, is the number of bytes to decode; adjust it as needed.
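As a minimal sketch of the decoding step described above, the helper below appends one received chunk to the response string. The class name ResponseDecoder, the method AppendChunk, and the encoding parameter are hypothetical and not part of the original code. One thing worth knowing: Encoding.ASCII cannot represent bytes above 0x7F, so pages in other encodings should be decoded with the matching Encoding (for example Encoding.UTF8). With multi-byte encodings a character can also be split across two chunks, in which case a Decoder obtained from Encoding.GetDecoder() is the safer tool.

```csharp
using System;
using System.Text;

public static class ResponseDecoder
{
    // Hypothetical helper: decodes the first nBytes of recvBuffer with the
    // given encoding and appends the text to the response accumulated so far.
    public static string AppendChunk(string strResponse, byte[] recvBuffer, int nBytes, Encoding encoding)
    {
        // GetString(bytes, index, count): start at index 0 and decode nBytes bytes.
        return strResponse + encoding.GetString(recvBuffer, 0, nBytes);
    }
}
```

A call such as ResponseDecoder.AppendChunk(strResponse, recvBuffer, nBytes, Encoding.UTF8) would replace the += line when the page is known to be UTF-8.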

With the data in string form, the next step is to parse the page (in practice, a mix of string operations and regular expressions). The following example shows how to find the links in a page:


    // Parse the page and find the links.
    // This needs extending; some forms of link are not recognized.
    string strRef = @"(href|src|action)[ ]*=[ ]*[""'][^""'#>]+[""']";
    // IgnoreCase lets the pattern also match HREF, Src, and so on.
    MatchCollection matches = new Regex(strRef, RegexOptions.IgnoreCase).Matches(strResponse);
    strStatus += "Found: " + matches.Count + " links\r\n";

The example above parses the links in a page. strRef holds the regular expression pattern, and matches holds the collection of successful matches. The Regex object built from strRef compiles the pattern, and its Matches(strResponse) call returns every substring of strResponse that fits it. The matches variable can then be queried for details such as the number of matches and the text and position of each one.
Only basic link forms are recognized here; links inside scripts and links whose values are not quoted are not supported. Extending the pattern to cover them is fairly simple.
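One possible extension, sketched under assumptions: the pattern below also accepts unquoted attribute values and captures only the URL rather than the whole attribute. The names LinkExtractor and ExtractLinks are hypothetical, and the pattern is an illustration rather than the author's code. It still treats the page as plain text, so it will not find links that scripts build at run time.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Three alternatives for the attribute value: "double quoted",
    // 'single quoted', or unquoted (ends at whitespace, a quote, or '>').
    private static readonly Regex LinkPattern = new Regex(
        @"\b(?:href|src|action)\s*=\s*(?:""([^""]+)""|'([^']+)'|([^\s""'>]+))",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractLinks(string html)
    {
        List<string> links = new List<string>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            // Only one alternative can match, so take the group that succeeded.
            for (int i = 1; i <= 3; i++)
            {
                if (m.Groups[i].Success)
                {
                    links.Add(m.Groups[i].Value);
                    break;
                }
            }
        }
        return links;
    }
}
```

Returning the captured group instead of the full match means the caller gets a.html rather than href="a.html", which saves a second round of string trimming.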
Here are a few more simple parsing examples:


    // Get the title.
    Match titleMatch = Regex.Match(strResponse, "<title>([^<]*)</title>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    title = titleMatch.Groups[1].Value;

    // Get the description.
    // The capture stops at the closing quote, so trailing attributes or " />" do not leak into the value.
    Match desc = Regex.Match(strResponse, "<meta name=\"description\" content=\"([^\"]*)\"[^>]*>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    strDesc = desc.Groups[1].Value;

    // Get the page size (in characters).
    size = strResponse.Length;
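The title and description lookups can be wrapped into small helpers. The following is a sketch under assumptions: PageInfo, GetTitle, and GetDescription are hypothetical names, and the patterns only handle a meta tag whose name attribute comes before content. RegexOptions.Singleline is used for the title because it lets "." match newlines, whereas RegexOptions.Multiline only changes how ^ and $ behave.

```csharp
using System.Text.RegularExpressions;

public static class PageInfo
{
    // Returns the trimmed text of the first <title> element, or "" if none is found.
    public static string GetTitle(string html)
    {
        Match m = Regex.Match(html, @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : "";
    }

    // Returns the content of <meta name="description" ...>, or "" if none is found.
    // Group 1 captures the opening quote; \1 requires the same quote to close the value,
    // so an apostrophe inside a double-quoted value does not end it early.
    public static string GetDescription(string html)
    {
        Match m = Regex.Match(html,
            @"<meta\s+name\s*=\s*[""']description[""']\s+content\s*=\s*([""'])(.*?)\1[^>]*>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[2].Value : "";
    }
}
```

Checking m.Success before reading the group makes the "not found" case explicit instead of relying on an empty group value.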

--------------------- Cut -------------------------

Now let's look at how to remove HTML tags, something beginners often need. It again comes down to regular expressions and basic string operations. Because this is such a common task, the example is written as a function so that it is easy to call:


    /// <summary>
    /// Converts HTML tags into spaces.
    /// </summary>
    /// <param name="strHtml">String to convert</param>
    /// <returns>Converted string</returns>
    private string StripHtml(string strHtml)
    {
        // Match anything between '<' and the nearest '>', including newlines.
        Regex objRegExp = new Regex("<(.|\n)+?>");

        // Replace each tag with a space so that adjacent words do not run together.
        string strOutput = objRegExp.Replace(strHtml, " ");

        // Escape any angle brackets that are left over.
        strOutput = strOutput.Replace("<", "&lt;");
        strOutput = strOutput.Replace(">", "&gt;");

        return strOutput;
    }

With this, the HTML tags are mostly gone, but some unusual input leaves tags behind, so I recommend running the conversion twice in a row. That is not quite the end, though. Look closely and you will see that the function above turns each HTML tag into a space, and long runs of consecutive spaces get in the way of later string operations. So add the following statements:


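    // Collapse every run of whitespace into a single space.
    // strResponse is assumed to hold the text after the tags have been stripped.
    Regex r = new Regex(@"\s+");
    wordsOnly = r.Replace(strResponse, " ");
    // Strings are immutable, so the result of Trim() must be assigned back.
    wordsOnly = wordsOnly.Trim();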

That's it. wordsOnly is the final result: the text with HTML tags removed and extra whitespace collapsed.
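The two steps above can be combined into a single helper. This is a minimal sketch, not the author's code: HtmlText and ToWordsOnly are hypothetical names, and the tag pattern <[^>]+> is swapped in as a simpler stand-in for <(.|\n)+?>. It shares the limits of the original approach: the text inside script and style elements is kept, and a stray "<" in the text can be mistaken for the start of a tag.

```csharp
using System.Text.RegularExpressions;

public static class HtmlText
{
    private static readonly Regex TagPattern = new Regex(@"<[^>]+>");
    private static readonly Regex WhitespacePattern = new Regex(@"\s+");

    // Strips tags, escapes leftover angle brackets, and collapses whitespace.
    public static string ToWordsOnly(string html)
    {
        // Replace each tag with a space so that neighbouring words stay separate.
        string text = TagPattern.Replace(html, " ");

        // Escape any angle brackets that were not part of a tag.
        text = text.Replace("<", "&lt;").Replace(">", "&gt;");

        // Collapse runs of spaces, tabs, and newlines, then trim both ends.
        text = WhitespacePattern.Replace(text, " ");
        return text.Trim();
    }
}
```

For example, HtmlText.ToWordsOnly("<p>Hello,   <b>world</b>!</p>") returns "Hello, world !". The space before the exclamation mark is a side effect of turning the closing b tag into a space.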
I hope this is useful to everyone!
