Regular expressions to remove or replace

Author: pmengal
Points: 500
Date: 05/12/2003 0:18 am PDT
Hello,

I want to use a regular expression to replace or remove some text.

Replace
-------

I want to be able to replace > with &gt; in the following HTML text:

"<P><strong>Superman is greater > than Spiderman</strong></P>"

The same code should also work for this text without any change:

"<SPAN class="thisname<isinvalid"><B>A > B?</B></span>"

As you can guess, this is for use with a custom HTML encoder.

Remove
------

I want to remove everything (tags included) that is between <SCRIPT> and </SCRIPT>, like:

"<SCRIPT> some malicious code </SCRIPT>"

The same code (without any change, but different from the replace one of course) should work on this too:

"<SCRIPT language="JavaScript">some malicious code</SCRIPT>"

And this one:

"<SCRIPT language='javascript'>some malicious code</SCRIPT>"

And this one too:

"<SCRIPT dull="dull" Language="JavaScript">some malicious code</SCRIPT>"

And this one too...

"<Script language =" JavaScript "> some malicious code </SCRIPT>"

Sorry to be so thorough, but I posted some 500- and 250-point questions and got incomplete answers because the questions were not complete enough.

Thanks in advance!

Comment from pmengal
Date: 05/12/2003 0:19 am PDT
Author comment
I forgot to say:

Can you provide all the code needed to achieve this? Giving me just the regular expression will not help me. I'm not familiar with regular expressions at all...

If you have time, pointers to some websites for learning them are welcome.

Comment from avonwyss
Date: 05/12/2003 05:22 am PDT
Comment
I will... stay tuned.

Comment from testn
Date: 05/12/2003 06:37 am PDT
Comment
Hi,

My previous RegEx should work with >

pattern = "((?<!(<([^>])*))(>))|((?<=(<(\/)?[^a-z,A-Z,/,]){1}([^>,<])*)(>))"

About tutorials:

http://www.c-sharpcorner.com/3/RegExpPSD.asp
http://www.wellho.net/regex/dotnet.html
http://windows.oreilly.com/news/csharp_0101.html

If you want a comprehensive book, you might consider buying the PDF from Amazon:
http://www.amazon.com/exec/obidos/tg/detail/-/B0000632ZU/102-4200309-1247344?vi=glance

Accepted answer from testn
Date: 05/12/2003 06:42 am PDT
Accepted answer
This is the code for removing malicious code.

using System.Text.RegularExpressions;

public string RemoveMaliciousCode(string oldStr) {
    // (?i) makes the match case-insensitive; (\w|\W)* matches any characters, including newlines
    string pattern = @"(?i)<script([^>])*>(\w|\W)*</script([^>])*>";
    string newStr = Regex.Replace(oldStr, pattern, "");
    return newStr;
}

This function returns the string with the malicious code removed.

Comment from testn
Date: 05/12/2003 06:47 am PDT
Comment
To explain the function:

(?i) means case-insensitive string matching.

<script([^>])*> means finding any string that starts with "<script" and contains 0 or more characters before ending with >

(\w|\W)* means there may be some string in between <SCRIPT> and </SCRIPT> (0 or more characters of anything).

</script([^>])*> means finding any string that starts with "</script" and contains 0 or more characters before ending with >

However, this one may be too greedy, since it will also match the whole string

<SCRIPT language="JavaScript">some malicious code</SCRIPT>Hello<SCRIPT></SCRIPT>

without leaving Hello.

Comment from testn
Date: 05/12/2003 07:22 am PDT
Comment
You can make it better by using

<script[^>]*>.*?</script[^>]*>

It will turn

<SCRIPT language="JavaScript">some malicious code</SCRIPT>Hello<SCRIPT></SCRIPT>

into

Hello

Since .*? means non-greedy matching, it will try to match the fewest possible characters that still satisfy the pattern.
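The difference between the greedy and the non-greedy pattern can be sketched as follows. This is only an illustration: the class and method names (GreedyVsLazyDemo, RemoveGreedy, RemoveLazy) are made up for the example, and RegexOptions.IgnoreCase is used in place of the inline (?i) from the earlier answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are not from the thread.
public static class GreedyVsLazyDemo
{
    // Greedy: .* grabs as much as possible, so the match runs to the LAST </script>.
    public static string RemoveGreedy(string html)
    {
        return Regex.Replace(html, @"<script[^>]*>.*</script[^>]*>", "", RegexOptions.IgnoreCase);
    }

    // Lazy: .*? stops at the FIRST </script> that lets the match succeed.
    public static string RemoveLazy(string html)
    {
        return Regex.Replace(html, @"<script[^>]*>.*?</script[^>]*>", "", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string input = "<SCRIPT language=\"JavaScript\">some malicious code</SCRIPT>Hello<SCRIPT></SCRIPT>";
        Console.WriteLine("greedy: [" + RemoveGreedy(input) + "]"); // greedy: []
        Console.WriteLine("lazy:   [" + RemoveLazy(input) + "]");   // lazy:   [Hello]
    }
}
```

With the greedy version the text between the two script blocks is swallowed along with them; the lazy version removes each block separately and keeps "Hello".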

Comment from testn
Date: 05/12/2003 07:33 am PDT
Comment
Please also test this against multi-line data.

You might need to change it to

<script[^>]*>(\w|\W)*?</script[^>]*>

Or

(?m)<script[^>]*>(\w|\W)*?</script[^>]*>
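A small sketch of why multi-line input matters. In .NET the dot does not match a newline unless RegexOptions.Singleline (inline (?s)) is set, whereas (\w|\W) matches any character regardless. As far as I know, the Multiline option ((?m)) only changes how ^ and $ behave, so it should not be what makes the pattern span lines. The class and method names below are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are not from the thread.
public static class MultiLineScriptDemo
{
    // Uses the dot; whether it crosses newlines depends on the extra options passed in.
    public static string RemoveWithDot(string html, RegexOptions extra)
    {
        return Regex.Replace(html, @"<script[^>]*>.*?</script[^>]*>", "", RegexOptions.IgnoreCase | extra);
    }

    // Uses (\w|\W), which matches every character, newlines included.
    public static string RemoveWithWordClass(string html)
    {
        return Regex.Replace(html, @"<script[^>]*>(\w|\W)*?</script[^>]*>", "", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string input = "before<script>\nalert(1);\n</script>after";
        // No match: the dot stops at the first newline, so the input comes back unchanged.
        Console.WriteLine(RemoveWithDot(input, RegexOptions.None) == input);      // True
        // Multiline does not change what the dot matches, so still no match.
        Console.WriteLine(RemoveWithDot(input, RegexOptions.Multiline) == input); // True
        // Singleline lets the dot match newlines.
        Console.WriteLine(RemoveWithDot(input, RegexOptions.Singleline));         // beforeafter
        Console.WriteLine(RemoveWithWordClass(input));                            // beforeafter
    }
}
```

Either the (\w|\W) form or the dot combined with Singleline handles script blocks that span several lines; the second is usually the more conventional choice.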

Comment from avonwyss
Date: 05/12/2003 PM PDT
Comment
private string ReplaceMatch(Match match) {
    // A whole <script>...</script> block was matched: drop it
    if (match.Groups["script"].Success)
        return "";
    // A lone > with no tag in front of it: encode it
    else if (match.Groups["gt"].Value == ">")
        return "&gt;";
    // A regular tag: leave it untouched
    else
        return match.Value;
}

public string CleanupHtml(string html) {
    return Regex.Replace(html, @"(?<script><script[^>]*>.*?</script[^>]*>)|(?<gt>(<(""[^""]*""|'[^']*'|[^>])+)?>)", new MatchEvaluator(ReplaceMatch), RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase | RegexOptions.Singleline);
}

This RegEx will do both of your tasks at the same time. The first part is pretty similar to testn's suggestion, but I also provide the code to find single > chars (with no matching < before).
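A self-contained sketch of the combined approach, run against the inputs from the question. The wrapper class HtmlCleanupDemo is invented for the example, and the * quantifiers inside the quoted-attribute alternatives are an assumption: the pattern as posted appears to have lost some characters, so this may not be exactly what was written.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class; only the two method names follow the post above.
public static class HtmlCleanupDemo
{
    // Assumed reconstruction of the pattern: the * after [^""] and [^'] is a guess.
    private const string Pattern =
        @"(?<script><script[^>]*>.*?</script[^>]*>)|(?<gt>(<(""[^""]*""|'[^']*'|[^>])+)?>)";

    private static string ReplaceMatch(Match match)
    {
        if (match.Groups["script"].Success)
            return "";            // whole script block: remove it
        if (match.Groups["gt"].Value == ">")
            return "&gt;";        // lone > outside any tag: encode it
        return match.Value;       // ordinary tag: keep as-is
    }

    public static string CleanupHtml(string html)
    {
        return Regex.Replace(html, Pattern, new MatchEvaluator(ReplaceMatch),
            RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        Console.WriteLine(CleanupHtml("<P><strong>Superman is greater > than Spiderman</strong></P>"));
        // <P><strong>Superman is greater &gt; than Spiderman</strong></P>
        Console.WriteLine(CleanupHtml("<SPAN class=\"thisname<isinvalid\"><B>A > B?</B></span>"));
        // <SPAN class="thisname<isinvalid"><B>A &gt; B?</B></span>
        Console.WriteLine("[" + CleanupHtml("<SCRIPT dull=\"dull\" Language=\"JavaScript\">some malicious code</SCRIPT>") + "]");
        // []
    }
}
```

The single pass works because each match is either a script block, a complete tag, or a bare >, and the evaluator decides per match what to emit. Note that a bare < followed later by a > in plain text would be treated as a tag by this pattern.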

Comment from osxmaster
Date: 02/23/2004 pm PST
Comment
Hi, do you know how I can clean this page of HTML and script tags?

http://www.wipo.org

Seems to be very complicated.

Thanks

Comment from avonwyss
Date: 02/24/2004 pm PST
Comment
Yes, I do. Just a few days ago I answered a very similar question; have a look at recent posts in C#: http: q_20892954.html

Or you can of course also post a new question.
