C#: Using a for Loop to Remove HTML Tags

Source: Internet
Author: User
Tags: comments, HTML tags, regular expression, static class

The most common way to remove HTML tags from a piece of text, and with them the styling and paragraph structure they carry, is a regular expression. Regular expressions do not handle every HTML document correctly, however, so an iterative approach such as a for loop is sometimes the better choice.
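One concrete case where a simple pattern falls short is a tag that spans more than one line. The sketch below assumes the lazy pattern `<.*?>` that the regex methods in this article use; the class and method names are hypothetical and exist only for illustration. By default `.` does not match a newline in .NET, so the first method leaves such a tag in place, while `RegexOptions.Singleline` changes that behavior.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the article's HtmlRemoval class.
public static class RegexNewlineDemo
{
    // Default options: '.' does not match '\n', so a tag that contains
    // a line break is never matched by "<.*?>".
    public static string StripDefault(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // RegexOptions.Singleline lets '.' match '\n' as well,
    // so multi-line tags are matched and removed.
    public static string StripSingleline(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty, RegexOptions.Singleline);
    }
}
```

For the input `"<a\nhref=\"x\">link</a>"`, `StripDefault` removes only the closing `</a>` and keeps the opening tag, whereas `StripSingleline` returns `link`. A character loop that only tracks `<` and `>` is not affected by line breaks at all.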

Look at the following code:

using System;
using System.Text.RegularExpressions;

/// <summary>
/// Methods to remove HTML from strings.
/// </summary>
public static class HtmlRemoval
{
    /// <summary>
    /// Remove HTML from string with Regex.
    /// </summary>
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    /// <summary>
    /// Compiled regular expression for performance.
    /// </summary>
    static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// Remove HTML from string with compiled Regex.
    /// </summary>
    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    /// <summary>
    /// Remove HTML tags from string using char array.
    /// </summary>
    public static string StripTagsCharArray(string source)
    {
        // The result can never be longer than the input.
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false; // true while between '<' and '>'

        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<') { inside = true; continue; }
            if (let == '>') { inside = false; continue; }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}

The class provides three methods based on two approaches: two use regular expressions, and one copies characters into a char array inside a for loop. The following program tests them:

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string html = "<p>There was a <b>.NET</b> programmer " +
            "and he stripped the <i>HTML</i> tags.</p>";

        Console.WriteLine(HtmlRemoval.StripTagsRegex(html));
        Console.WriteLine(HtmlRemoval.StripTagsRegexCompiled(html));
        Console.WriteLine(HtmlRemoval.StripTagsCharArray(html));
    }
}

The output is as follows:

There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.

All three methods of the HtmlRemoval class called above return the same result: the given string with its HTML markup removed. Of the two regular expression methods, the second is preferable: it reuses a static Regex object created with RegexOptions.Compiled, which is faster than the first method. RegexOptions.Compiled has drawbacks of its own, however, and in some cases it can make startup dozens of times slower. For details, see:

Regex Performance

In general, regular expressions are not the fastest option, which is why the HtmlRemoval class also offers an alternative that processes the string with a char array. The test program used 1000 HTML files of about 8,000 characters each, all read with File.ReadAllText. The results show that the char array method ran fastest.
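The original test harness is not shown. The following is only a minimal sketch of how such a timing could be taken with Stopwatch; the class name `StripBenchmark` and the method `Measure` are hypothetical, and the documents are assumed to have been loaded into memory beforehand.

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing helper; not the harness used for the published numbers.
public static class StripBenchmark
{
    // Runs one stripping function over every document and
    // returns the total elapsed time in milliseconds.
    public static long Measure(Func<string, string> strip, string[] documents)
    {
        Stopwatch watch = Stopwatch.StartNew();
        foreach (string document in documents)
        {
            strip(document);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }
}
```

A call such as `StripBenchmark.Measure(HtmlRemoval.StripTagsCharArray, documents)` would then time one method. Loading the files before starting the stopwatch keeps disk access out of the measurement.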

Performance test for HTML removal

HtmlRemoval.StripTagsRegex:         2404 ms
HtmlRemoval.StripTagsRegexCompiled: 1366 ms
HtmlRemoval.StripTagsCharArray:      287 ms [Fastest]

File length test for HTML removal

File length before:                 8085 chars
HtmlRemoval.StripTagsRegex:         4382 chars
HtmlRemoval.StripTagsRegexCompiled: 4382 chars
HtmlRemoval.StripTagsCharArray:     4382 chars

You can therefore save time by using the char array method when processing a large number of files. The method simply copies every character that is not part of a tag into an array buffer, then passes the array and a range to the string constructor, which is faster than using StringBuilder.
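For comparison, here is a sketch of what a StringBuilder version of the same loop might look like. The class and method names are hypothetical and do not appear in the original HtmlRemoval class. The logic is identical, and only the output buffer differs.

```csharp
using System.Text;

// Hypothetical variant for comparison; not part of the original class.
public static class HtmlRemovalStringBuilder
{
    // Same state machine as StripTagsCharArray, but appending to a
    // StringBuilder instead of writing into a preallocated char[].
    public static string StripTagsStringBuilder(string source)
    {
        StringBuilder builder = new StringBuilder(source.Length);
        bool inside = false; // true while between '<' and '>'

        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside)
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```

Both versions allocate a buffer the size of the input. The char array version writes by index and builds the string in one constructor call, while this version makes an Append call for every kept character.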

Self-closing HTML tags

In XHTML, some elements have no separate closing tag and are self-closed, such as <br/>. The code above should handle self-closing tags correctly. The following are some supported tags; note that the regular expression methods may not handle invalid HTML tags correctly.

Supported tags

< div >

Comments in an HTML document

The code in this article may not work correctly on HTML comments that themselves contain tags. Comments sometimes contain invalid markup, and that markup may be only partly removed. In such cases it may be necessary to scan for the invalid markup separately.
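To make the problem concrete: for the input `a<!-- x > y -->b`, a loop that only tracks `<` and `>` treats the first `>` inside the comment as the end of a tag and returns `a y --b`. The sketch below is one possible way to extend the char array loop so that a whole comment is skipped as a unit. The class and method names are hypothetical, and dropping the rest of the text after an unterminated comment is an assumption made here, not a behavior from the original article.

```csharp
using System;

// Hypothetical comment-aware variant of the char array method.
public static class HtmlCommentAwareRemoval
{
    public static string StripTagsAndComments(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false; // true while between '<' and '>'

        for (int i = 0; i < source.Length; i++)
        {
            // Outside a tag, treat "<!--" up to "-->" as one unit and skip it.
            if (!inside && string.CompareOrdinal(source, i, "<!--", 0, 4) == 0)
            {
                int end = source.IndexOf("-->", i + 4, StringComparison.Ordinal);
                if (end < 0)
                {
                    break; // assumption: unterminated comment, drop the rest
                }
                i = end + 2; // the loop increment then moves past the final '>'
                continue;
            }

            char let = source[i];
            if (let == '<') { inside = true; continue; }
            if (let == '>') { inside = false; continue; }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}
```

With this variant, `a<!-- x > y -->b` becomes `ab`, and `<p>hi</p><!-- <b>no</b> -->!` becomes `hi!`.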

How to verify

There are several ways to validate XHTML, and one of them is to iterate over the text in the same way as the code above. A simple check is to count the '<' and '>' characters and see whether they match; another is to use a regular expression. The following resources describe these methods:

HTML Brackets: Validation

Validate XHTML
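The bracket-matching idea can be sketched as a single pass over the string. This is a rough check under simple assumptions, not a real XHTML validator: it only confirms that every '<' is closed by a '>' before the next '<' appears and that no '>' occurs on its own. The class and method names are hypothetical.

```csharp
// Hypothetical helper; a rough bracket check, not an XHTML validator.
public static class HtmlBracketCheck
{
    // Returns true when '<' and '>' strictly alternate, starting with '<',
    // and the text does not end inside an open tag.
    public static bool HasBalancedBrackets(string source)
    {
        bool inside = false; // true while between '<' and '>'

        foreach (char c in source)
        {
            if (c == '<')
            {
                if (inside) return false; // '<' inside an open tag
                inside = true;
            }
            else if (c == '>')
            {
                if (!inside) return false; // '>' with no opening '<'
                inside = false;
            }
        }
        return !inside;
    }
}
```

Running this check before stripping gives an early warning that the char array method might drop or keep the wrong text for a given document.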

Several methods can remove HTML tags from a given string, and on the test input above they all returned the same correct result. In the benchmark, the char array loop was the fastest of the three.

