The pattern can be written like this:
string pattern = @"[\u4e00-\u9fa5]|[()《》—;,。“”<>!]";
The first alternative matches Chinese characters (the range \u4e00-\u9fa5), and the second is a character class listing the punctuation marks that need to be matched.
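As a rough illustration of how a pattern of this shape behaves, here is a minimal sketch using only System.Text.RegularExpressions. The class name ChinesePatternDemo, the helper ExtractMatches, and the shortened punctuation list are assumptions made for the example, not part of the original code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ChinesePatternDemo
{
    // Assumed sample pattern: the Chinese character range from the text,
    // plus a small, illustrative set of punctuation marks.
    public const string Pattern = @"[\u4e00-\u9fa5]|[《》。,;!]";

    // Hypothetical helper: concatenates every single-character match in order.
    public static string ExtractMatches(string input)
    {
        return string.Concat(
            Regex.Matches(input, Pattern).Cast<Match>().Select(m => m.Value));
    }

    public static void Main()
    {
        // Latin letters are skipped; Chinese characters and listed punctuation are kept.
        Console.WriteLine(ExtractMatches("Hello《世界》abc。"));
    }
}
```

Because each alternative matches exactly one character, the matches come back one character at a time. Add or remove punctuation marks inside the second bracket to suit the text being processed.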
Other
For processing HTML source code, HtmlAgilityPack is recommended. The following code removes script, style, and comment content.
using System.Linq;
using HtmlAgilityPack;

public static HtmlDocument InitializeHtmlDoc(string htmlString)
{
    if (string.IsNullOrEmpty(htmlString))
    {
        return null;
    }
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(htmlString);
    // ToList() snapshots the matching nodes so they can be removed while iterating.
    doc.DocumentNode.Descendants()
        .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "#comment")
        .ToList()
        .ForEach(n => n.Remove());
    return doc;
}
HtmlAgilityPack uses XPath syntax. In XPath, "//comment()" means "all comment nodes", so it can be used to select the same nodes that are matched by the name "#comment" in the code above. See http://www.cnblogs.com/rupeng/archive/2012/02/07/2342012.html
To read the content of a (static) web page from a URL, you can use the following code:
using System;
using System.IO;
using System.Net;
using System.Text;

public static string GetHtmlStr(string url)
{
    if (string.IsNullOrEmpty(url))
    {
        return string.Empty;
    }
    string html = string.Empty;
    try
    {
        WebRequest webRequest = WebRequest.Create(url);
        // The original sets webRequest.Timeout (in milliseconds) here;
        // the numeric values were lost in the source, so the default is used.
        using (WebResponse webResponse = webRequest.GetResponse())
        {
            if (((HttpWebResponse)webResponse).StatusCode == HttpStatusCode.OK)
            {
                Stream stream = webResponse.GetResponseStream();
                // Use the charset reported by the server, falling back to the system default.
                string coder = ((HttpWebResponse)webResponse).CharacterSet;
                StreamReader reader = new StreamReader(stream,
                    string.IsNullOrEmpty(coder) ? Encoding.Default : Encoding.GetEncoding(coder));
                html = reader.ReadToEnd();
            }
        }
    }
    catch (Exception)
    {
        // The request may time out sometimes; an empty string is returned in that case.
    }
    return html;
}
Regular expression matches Chinese characters and punctuation