10 .NET methods for removing whitespace from strings

Last Update:2017-01-19 Source: Internet

Author: User

There are countless ways to remove all whitespace from a string, but which one is fastest?

Introduction

If you ask what whitespace is, the answer is a bit messy. Many people think that whitespace means the space character (Unicode U+0020, ASCII 32, HTML &#32;), but it actually includes every character that produces horizontal or vertical space in a layout. In fact, it is a whole class of characters defined as such in the Unicode Character Database.
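To make this concrete, here is a minimal sketch that enumerates every `char` value the built-in `char.IsWhiteSpace` reports as whitespace. The class and method names (`WhitespaceSurvey`, `AllWhitespaceChars`, `Describe`) are hypothetical and not from the original article. The exact number of hits depends on the Unicode data shipped with the runtime, so no count is claimed here.

```csharp
using System;
using System.Collections.Generic;

public static class WhitespaceSurvey
{
    // Returns every UTF-16 code unit that char.IsWhiteSpace reports as whitespace.
    public static List<char> AllWhitespaceChars()
    {
        var result = new List<char>();
        for (int code = char.MinValue; code <= char.MaxValue; code++)
        {
            char c = (char)code;
            if (char.IsWhiteSpace(c))
                result.Add(c);
        }
        return result;
    }

    // Formats a character as U+XXXX (four uppercase hex digits) for printing.
    public static string Describe(char c)
    {
        return "U+" + ((int)c).ToString("X4");
    }
}
```

Printing `WhitespaceSurvey.Describe(c)` for each returned character should list the tab (U+0009), the no-break space (U+00A0), and the ideographic space (U+3000) alongside the plain space (U+0020).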

In this article, whitespace is used in its correct sense, with one exception: the string.Replace(" ", "") method, which removes only the space character.

The benchmarked methods below remove all whitespace, whether leading, trailing, or in the middle of the string. That is what "all whitespace" in the title of the article means.

Background

This article grew out of my own curiosity. In fact, I don't actually need the fastest algorithm for removing whitespace from a string.

Check for whitespace characters

It is easy to check for whitespace characters. All the code you need is:

char wp = ' ';
char a = 'a';
Assert.True(char.IsWhiteSpace(wp));
Assert.False(char.IsWhiteSpace(a));

However, when I implemented the hand-optimized removal methods, I realized that char.IsWhiteSpace did not perform as well as expected. Digging through Char.cs in Microsoft's Reference Source repository, I found:

public static bool IsWhiteSpace(char c) {
  if (IsLatin1(c)) {
    return (IsWhiteSpaceLatin1(c));
  }
  return CharUnicodeInfo.IsWhiteSpace(c);
}

CharUnicodeInfo.IsWhiteSpace, in turn, is:

internal static bool IsWhiteSpace(char c) {
  UnicodeCategory uc = GetUnicodeCategory(c);
  // In Unicode 3.0, U+2028 is the only character which is under the category "LineSeparator".
  // And U+2029 is the only character which is under the category "ParagraphSeparator".
  switch (uc) {
    case (UnicodeCategory.SpaceSeparator):
    case (UnicodeCategory.LineSeparator):
    case (UnicodeCategory.ParagraphSeparator):
      return (true);
  }
  return (false);
}

GetUnicodeCategory() calls InternalGetUnicodeCategory(), which is actually quite fast, but we now have four chained method calls! The following code, contributed by a commenter, is a fast custom replacement that asks the JIT to inline it:

// Whitespace detection method: very fast, a lot faster than char.IsWhiteSpace
// (requires: using System.Runtime.CompilerServices;)
[MethodImpl(MethodImplOptions.AggressiveInlining)] // If it's not inlined then it will be slow!!!
public static bool IsWhiteSpace(char ch) {
  // This is surprisingly faster than the equivalent if statement
  switch (ch) {
    case '\u0009': case '\u000a': case '\u000b': case '\u000c': case '\u000d':
    case '\u0020': case '\u0085': case '\u00a0': case '\u1680': case '\u2000':
    case '\u2001': case '\u2002': case '\u2003': case '\u2004': case '\u2005':
    case '\u2006': case '\u2007': case '\u2008': case '\u2009': case '\u200a':
    case '\u2028': case '\u2029': case '\u202f': case '\u205f': case '\u3000':
      return true;
    default:
      return false;
  }
}
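A hard-coded list is only as good as its agreement with the built-in check, so it may be worth verifying. The sketch below repeats the same character list and collects every `char` on which it disagrees with `char.IsWhiteSpace`. `WhitespaceCheck` and `Disagreements` are hypothetical names used only for illustration. Whether the result is empty depends on the runtime's Unicode tables: U+180E, for example, was classified as a space separator before Unicode 6.3, so a runtime with older data may report it.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

public static class WhitespaceCheck
{
    // Same character list as the custom method above.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool IsWhiteSpace(char ch)
    {
        switch (ch)
        {
            case '\u0009': case '\u000a': case '\u000b': case '\u000c': case '\u000d':
            case '\u0020': case '\u0085': case '\u00a0': case '\u1680': case '\u2000':
            case '\u2001': case '\u2002': case '\u2003': case '\u2004': case '\u2005':
            case '\u2006': case '\u2007': case '\u2008': case '\u2009': case '\u200a':
            case '\u2028': case '\u2029': case '\u202f': case '\u205f': case '\u3000':
                return true;
            default:
                return false;
        }
    }

    // Lists every char on which the custom method and char.IsWhiteSpace disagree.
    public static List<char> Disagreements()
    {
        var diffs = new List<char>();
        for (int code = char.MinValue; code <= char.MaxValue; code++)
        {
            char c = (char)code;
            if (IsWhiteSpace(c) != char.IsWhiteSpace(c))
                diffs.Add(c);
        }
        return diffs;
    }
}
```

Any character returned by `Disagreements()` is one that the custom method and the framework classify differently on the runtime in use.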

Different ways to remove whitespace

I implemented the removal of all whitespace from a string in a variety of ways.
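To compare such implementations, some kind of timing harness is needed. The following is a minimal sketch of one, not the harness used by the author: `TrimBenchmark` and `Measure` are hypothetical names, and a single `Stopwatch` loop like this gives only a rough indication. A dedicated benchmarking library such as BenchmarkDotNet would be a better choice for serious measurements.

```csharp
using System;
using System.Diagnostics;

public static class TrimBenchmark
{
    // Calls the given removal method `iterations` times on `input`
    // and returns the elapsed wall-clock time of the timed loop.
    public static TimeSpan Measure(Func<string, string> trimAll, string input, int iterations)
    {
        trimAll(input); // warm-up call so JIT compilation is not part of the timing

        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            trimAll(input);
        watch.Stop();

        return watch.Elapsed;
    }
}
```

A call such as `TrimBenchmark.Measure(s => s.Replace(" ", ""), " a b c ", 1000000)` returns the time for one candidate. Each method described below can be passed in the same way.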

Split-and-join method

This is a very simple method that I have long been using. It splits the string on whitespace characters, discarding empty entries, and then joins the resulting fragments back together. It sounds a bit silly, and at first glance it does look like a very wasteful solution:

public static string TrimAllWithSplitAndJoin(string str) {
  return string.Concat(str.Split(default(string[]), StringSplitOptions.RemoveEmptyEntries));
}
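The method above works because `String.Split` treats whitespace characters as the delimiters when the separator array is null. The sketch below isolates that step so the intermediate fragments can be inspected. `SplitDemo` and `Words` are hypothetical names.

```csharp
using System;

public static class SplitDemo
{
    // Splits on any whitespace (null separator array) and drops empty entries.
    public static string[] Words(string str)
    {
        return str.Split(default(string[]), StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For the input `" a \t b\nc "`, `SplitDemo.Words` returns the three fragments `"a"`, `"b"`, and `"c"`, and concatenating them gives `"abc"`. Each fragment is a separate string allocation, which is why the approach looks wasteful.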
 
LINQ

This is the elegant, declarative way to implement the procedure:

// requires: using System.Linq;
public static string TrimAllWithLinq(string str) {
  return new string(str.Where(c => !IsWhiteSpace(c)).ToArray());
}

Regular expressions

Regular expressions are a very powerful tool that every programmer should know about.

// requires: using System.Text.RegularExpressions;
static Regex whitespace = new Regex(@"\s+", RegexOptions.Compiled);

public static string TrimAllWithRegex(string str) {
  return whitespace.Replace(str, "");
}

In-place character array method

This method converts the input string into a character array and then scans the array in place, overwriting it with the non-whitespace characters (no intermediate buffers or strings are created). Finally, a new string is created from the "trimmed" front portion of the array.

public static string TrimAllWithInplaceCharArray(string str) {
  var len = str.Length;
  var src = str.ToCharArray();
  int dstIdx = 0;
  for (int i = 0; i < len; i++) {
    var ch = src[i];
    if (!IsWhiteSpace(ch))
      src[dstIdx++] = ch;
  }
  return new string(src, 0, dstIdx);
}

Character array copy method

This method is similar to the in-place character array method, but it uses Array.Copy to move contiguous runs of non-whitespace characters while skipping the whitespace between them. Finally, it creates a new string of the appropriate length from the array in the same way.

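public static string TrimAllWithCharArrayCopy(string str) {
  var len = str.Length;
  var src = str.ToCharArray();
  int srcIdx = 0, dstIdx = 0, count = 0;
  // Scan the full original length: len shrinks as whitespace is found, so using
  // it as the loop bound would leave whitespace near the end of the string.
  for (int i = 0; i < src.Length; i++) {
    if (IsWhiteSpace(src[i])) {
      count = i - srcIdx; // length of the non-whitespace run before this character
      Array.Copy(src, srcIdx, src, dstIdx, count);
      srcIdx += count + 1;
      dstIdx += count;
      len--;
    }
  }
  // Copy the trailing non-whitespace run, if any
  if (dstIdx < len)
    Array.Copy(src, srcIdx, src, dstIdx, len - dstIdx);
  return new string(src, 0, len);
}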
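The method above relies on `Array.Copy` handling overlapping source and destination ranges within the same array correctly. The following sketch shows that single step in isolation, using hypothetical names (`OverlapCopyDemo`, `RemoveAt`): it removes one character by shifting the tail of the array one position to the left.

```csharp
using System;

public static class OverlapCopyDemo
{
    // Removes the character at index `gap` by shifting the tail left with one
    // Array.Copy call whose source and destination ranges overlap.
    public static string RemoveAt(string str, int gap)
    {
        char[] src = str.ToCharArray();
        int tail = src.Length - gap - 1; // number of characters after the gap
        Array.Copy(src, gap + 1, src, gap, tail);
        return new string(src, 0, src.Length - 1);
    }
}
```

`OverlapCopyDemo.RemoveAt("ab cd", 2)` returns `"abcd"`. The benchmarked method performs the same kind of shift once per whitespace character, moving a whole run at a time instead of one character at a time.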

Loop with switch method

This method implements the loop by hand and uses the StringBuilder class to build the new string, relying on StringBuilder's internal optimizations. To keep other factors from interfering with this implementation, no other methods are called, and class-member accesses are avoided by caching values in local variables. Finally, the buffer is trimmed to the appropriate size by setting StringBuilder.Length.

Code suggested by http://www.codeproject.com/Members/TheBasketcaseSoftware

// requires: using System.Text;
public static string TrimAllWithLexerLoop(string s) {
  int length = s.Length;
  var buffer = new StringBuilder(s);
  var dstIdx = 0;
  for (int index = 0; index < s.Length; index++) {
    char ch = s[index];
    switch (ch) {
      case '\u0020': case '\u00a0': case '\u1680': case '\u2000': case '\u2001':
      case '\u2002': case '\u2003': case '\u2004': case '\u2005': case '\u2006':
      case '\u2007': case '\u2008': case '\u2009': case '\u200a': case '\u202f':
      case '\u205f': case '\u3000': case '\u2028': case '\u2029': case '\u0009':
      case '\u000a': case '\u000b': case '\u000c': case '\u000d': case '\u0085':
        length--;
        continue;
      default:
        break;
    }
    buffer[dstIdx++] = ch;
  }
  buffer.Length = length;
  return buffer.ToString();
}
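The last step of the method above depends on the fact that assigning a smaller value to `StringBuilder.Length` truncates the builder's contents. A minimal sketch of that behavior on its own, with hypothetical names (`BuilderTruncateDemo`, `KeepFirst`):

```csharp
using System.Text;

public static class BuilderTruncateDemo
{
    // Copies s into a StringBuilder and truncates it to the first `count` characters.
    // Assumes 0 <= count <= s.Length; a larger count would pad with '\0' instead.
    public static string KeepFirst(string s, int count)
    {
        var buffer = new StringBuilder(s);
        buffer.Length = count; // shrinking Length drops everything past `count`
        return buffer.ToString();
    }
}
```

`BuilderTruncateDemo.KeepFirst("abcdef", 3)` returns `"abc"`. In the benchmarked method the kept characters have already been compacted to the front of the buffer, so the truncation discards only leftovers.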

Loop with IsWhiteSpace method

This approach is almost the same as the previous one, which loops with a switch statement, but it uses an if statement that calls IsWhiteSpace() instead of the messy switch trick :).

public static string TrimAllWithLexerLoopCharIsWhitespce(string s) {
  int length = s.Length;
  var buffer = new StringBuilder(s);
  var dstIdx = 0;
  for (int index = 0; index < s.Length; index++) {
    char currentChar = s[index];
    if (IsWhiteSpace(currentChar))
      length--;
    else
      buffer[dstIdx++] = currentChar;
  }
  buffer.Length = length;
  return buffer.ToString();
}

In-place string modification method (unsafe)

This method uses unsafe character pointers and pointer arithmetic to change the string in place. I don't recommend this method in production, because it breaks a basic contract of the .NET Framework: strings are immutable.

public static unsafe string TrimAllWithStringInplace(string str) {
  fixed (char* pfixed = str) {
    char* dst = pfixed;
    for (char* p = pfixed; *p != 0; p++)
      if (!IsWhiteSpace(*p))
        *dst++ = *p;

    /*// reset the string size
     * ONLY IT DIDN'T WORK! A GARBAGE COLLECTION ACCESS VIOLATION OCCURRED AFTER USING IT
     * SO I HAD TO RESORT TO RETURN A NEW STRING INSTEAD, WITH ONLY THE PERTINENT BYTES
     * IT WOULD BE A LOT FASTER IF IT DID WORK THOUGH...
    Int32 len = (Int32)(dst - pfixed);
    Int32* pi = (Int32*)pfixed;
    pi[-1] = len;
    pfixed[len] = '\0';*/
    return new string(pfixed, 0, (int)(dst - pfixed));
  }
}

In-place string modification method V2 (unsafe)

This approach is almost the same as the previous one, but it accesses the characters through array-style indexing on the pointer. I was curious to know which of the two kinds of memory access would be faster.

public static unsafe string TrimAllWithStringInplaceV2(string str) {
  var len = str.Length;
  fixed (char* pstr = str) {
    int dstIdx = 0;
    for (int i = 0; i < len; i++)
      if (!IsWhiteSpace(pstr[i]))
        pstr[dstIdx++] = pstr[i];
    // Since the unsafe string length reset didn't work we need to resort to this slower compromise
    return new string(pstr, 0, dstIdx);
  }
}

String.Replace(" ", "")

This implementation is naive: it replaces only the space character, so it does not use the correct definition of whitespace and misses many other whitespace characters. Although it should be the fastest method in this article, it does less than the others.

But if you only need to remove actual space characters, it is hard to write pure .NET code that beats String.Replace. Most string methods fall back to hand-optimized native C++ code, and String.Replace itself calls a C++ method in comstring.cpp:

FCIMPL3(Object*,
  COMString::ReplaceString,
  StringObject* thisRefUNSAFE,
  StringObject* oldValueUNSAFE,
  StringObject* newValueUNSAFE)

Here is the method used in the benchmark suite:

public static string TrimAllWithStringReplace(string str) {
  // This method is NOT functionally equivalent to the others, as it will only trim "spaces".
  // Whitespace comprises lots of other characters.
  return str.Replace(" ", "");
}
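The difference in behavior is easy to see on an input that mixes spaces with other whitespace. The sketch below wraps the same `Replace` call in a hypothetical `ReplaceDemo.RemoveSpaces` helper purely for illustration.

```csharp
public static class ReplaceDemo
{
    // Removes only U+0020 space characters; tabs, newlines and other
    // whitespace characters are left untouched.
    public static string RemoveSpaces(string str)
    {
        return str.Replace(" ", "");
    }
}
```

`ReplaceDemo.RemoveSpaces("a b\tc\nd")` returns `"ab\tc\nd"`: the space is gone, but the tab and the newline remain, whereas the other methods in this article would return `"abcd"`.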

Those are the 10 ways to remove whitespace from a string in .NET. I hope they help you in your learning.
