There are countless ways to remove all whitespace from a string, but which one is fastest?
Introduction
If you ask what whitespace is, the answer is a bit messy. Many people think that whitespace is just the SPACE character (Unicode U+0020, ASCII 32, HTML &#32;), but it actually includes every character that produces horizontal or vertical space in the layout. In fact, it is a whole class of characters defined in the Unicode Character Database.
In this article, whitespace refers to this correct definition, although the benchmark also includes the naive string.Replace(" ", "") method for comparison.
The benchmarked methods remove all leading, trailing, and interior whitespace. That is the meaning of "all whitespace" in the title of the article.
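To make the "whole class of characters" concrete, here is a minimal sketch that lists every UTF-16 code unit the built-in char.IsWhiteSpace treats as whitespace. The class name WhitespaceSurvey and the method FindAll are hypothetical names chosen for this illustration, and the exact list printed depends on the Unicode data shipped with the runtime.

```csharp
using System;
using System.Collections.Generic;

public static class WhitespaceSurvey
{
    // Hypothetical helper: collects every code unit that char.IsWhiteSpace accepts.
    public static List<char> FindAll()
    {
        var result = new List<char>();
        // Iterate with an int so the loop can reach char.MaxValue without overflowing.
        for (int code = char.MinValue; code <= char.MaxValue; code++)
        {
            if (char.IsWhiteSpace((char)code))
                result.Add((char)code);
        }
        return result;
    }

    public static void Main()
    {
        foreach (char c in FindAll())
            Console.WriteLine("U+{0:X4}", (int)c);
    }
}
```

The output includes far more than U+0020, for example the tab, the no-break space U+00A0, and the ideographic space U+3000.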
Background
This article began out of curiosity. In fact, I don't actually need the fastest algorithm for removing whitespace from a string.
Check for whitespace characters
It is easy to check for whitespace characters. All the code you need is:
char wp = ' ';
char a = 'a';
Assert.True(char.IsWhiteSpace(wp));
Assert.False(char.IsWhiteSpace(a));
However, when I implemented the hand-optimized removal methods, I realized that char.IsWhiteSpace did not perform as well as expected. Digging into char.cs in Microsoft's Reference Source repository, I found:
public static bool IsWhiteSpace(char c) {
    if (IsLatin1(c)) {
        return (IsWhiteSpaceLatin1(c));
    }
    return CharUnicodeInfo.IsWhiteSpace(c);
}
CharUnicodeInfo.IsWhiteSpace in turn looks like this:
internal static bool IsWhiteSpace(char c) {
    UnicodeCategory uc = GetUnicodeCategory(c);
    // In Unicode 3.0, U+2028 is the only character which is under the category "LineSeparator".
    // And U+2029 is the only character which is under the category "ParagraphSeparator".
    switch (uc) {
        case (UnicodeCategory.SpaceSeparator):
        case (UnicodeCategory.LineSeparator):
        case (UnicodeCategory.ParagraphSeparator):
            return (true);
    }
    return (false);
}
The GetUnicodeCategory() method calls the InternalGetUnicodeCategory() method and is actually quite fast, but now we have 4 method calls in a row! The following code was provided by a commenter. It is a quick custom implementation that the JIT can inline:
// Whitespace detection method: very fast, a lot faster than Char.IsWhiteSpace
[MethodImpl(MethodImplOptions.AggressiveInlining)] // If it's not inlined then it will be slow!!!
public static bool IsWhiteSpace(char ch) {
    // This is surprisingly faster than the equivalent if statement
    switch (ch) {
        case '\u0009': case '\u000A': case '\u000B': case '\u000C': case '\u000D':
        case '\u0020': case '\u0085': case '\u00A0': case '\u1680': case '\u2000':
        case '\u2001': case '\u2002': case '\u2003': case '\u2004': case '\u2005':
        case '\u2006': case '\u2007': case '\u2008': case '\u2009': case '\u200A':
        case '\u2028': case '\u2029': case '\u202F': case '\u205F': case '\u3000':
            return true;
        default:
            return false;
    }
}
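A hand-written character list is easy to get subtly wrong, so it is worth checking it against the built-in method. The following is a rough sketch, not part of the original benchmark: the class WhitespaceCheck and the method FindDisagreements are hypothetical names, and the switch repeats the same 25 characters listed above. Whether any disagreements are reported depends on the Unicode data of the runtime in use, so no expected output is given.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

public static class WhitespaceCheck
{
    // Same character list as the switch-based detection method in the article.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool IsWhiteSpace(char ch)
    {
        switch (ch)
        {
            case '\u0009': case '\u000A': case '\u000B': case '\u000C': case '\u000D':
            case '\u0020': case '\u0085': case '\u00A0': case '\u1680': case '\u2000':
            case '\u2001': case '\u2002': case '\u2003': case '\u2004': case '\u2005':
            case '\u2006': case '\u2007': case '\u2008': case '\u2009': case '\u200A':
            case '\u2028': case '\u2029': case '\u202F': case '\u205F': case '\u3000':
                return true;
            default:
                return false;
        }
    }

    // Hypothetical helper: returns every code unit where the custom check
    // and the built-in char.IsWhiteSpace give different answers.
    public static List<char> FindDisagreements()
    {
        var result = new List<char>();
        for (int code = char.MinValue; code <= char.MaxValue; code++)
        {
            char ch = (char)code;
            if (IsWhiteSpace(ch) != char.IsWhiteSpace(ch))
                result.Add(ch);
        }
        return result;
    }

    public static void Main()
    {
        List<char> diffs = FindDisagreements();
        Console.WriteLine("Disagreements: {0}", diffs.Count);
        foreach (char ch in diffs)
            Console.WriteLine("U+{0:X4}", (int)ch);
    }
}
```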
Different ways to remove whitespace
I implemented the removal of all whitespace from a string in several different ways.
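To compare implementations like these, some kind of timing harness is needed. The sketch below is only an assumption about how such a measurement could be set up, not the benchmark code used for this article: the class TrimBenchmark, the helper Measure, and the stand-in candidate RemoveWithLinq are hypothetical names, and the input string and iteration count are arbitrary.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class TrimBenchmark
{
    // Hypothetical helper: runs one candidate repeatedly and returns elapsed milliseconds.
    public static long Measure(Func<string, string> trim, string input, int iterations)
    {
        trim(input); // warm-up call so JIT compilation is not part of the timing
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            trim(input);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    // Stand-in candidate so the sketch compiles on its own.
    public static string RemoveWithLinq(string s)
    {
        return new string(s.Where(c => !char.IsWhiteSpace(c)).ToArray());
    }

    public static void Main()
    {
        string input = "  some text\twith \r\n mixed   whitespace  ";
        long ms = Measure(RemoveWithLinq, input, 100000);
        Console.WriteLine("LINQ: {0} ms", ms);
    }
}
```

Any method with the signature string -> string can be passed to Measure in the same way. For results that can be trusted, run a Release build outside the debugger and repeat each measurement several times.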
Split and join method
This is a very simple method that I have been using for a long time. It splits the string on whitespace characters, discarding empty entries, and then joins the resulting fragments back together. It sounds a bit silly, and at first glance it does look like a very wasteful solution:
public static string TrimAllWithSplitAndJoin(string str) {
    // A null separator array makes Split break on whitespace characters
    return string.Concat(str.Split(default(string[]), StringSplitOptions.RemoveEmptyEntries));
}
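The trick in this method is the null separator: when String.Split receives a null separator array, it treats whitespace characters as the delimiters. The following small sketch shows that behaviour in isolation. The class name SplitJoinDemo and the method name TrimAll are made up for this example.

```csharp
using System;

public static class SplitJoinDemo
{
    public static string TrimAll(string str)
    {
        // Passing null as the separator array tells Split to break on whitespace.
        string[] parts = str.Split((string[])null, StringSplitOptions.RemoveEmptyEntries);
        // Concat glues the surviving fragments together with nothing in between.
        return string.Concat(parts);
    }

    public static void Main()
    {
        Console.WriteLine(TrimAll("  a b\tc\r\n d  ")); // prints "abcd"
    }
}
```

The cost is the intermediate string array: every run of non-whitespace characters becomes its own temporary string before the final result is built.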
LINQ
This is the elegant, declarative way to implement the procedure:
public static string TrimAllWithLinq(string str) {
    return new string(str.Where(c => !IsWhiteSpace(c)).ToArray());
}
Regular expressions
Regular expressions are a very powerful tool that every programmer should be aware of.
static Regex whitespace = new Regex(@"\s+", RegexOptions.Compiled);
public static string TrimAllWithRegex(string str) {
    return whitespace.Replace(str, "");
}
In-place character array method
This method converts the input string into a character array, and then scans the array in place to remove whitespace characters (no intermediate buffers or strings are created). Finally, a new string is created from the "truncated" array.
public static string TrimAllWithInplaceCharArray(string str) {
    var len = str.Length;
    var src = str.ToCharArray();
    int dstIdx = 0;
    for (int i = 0; i < len; i++) {
        var ch = src[i];
        if (!IsWhiteSpace(ch))
            src[dstIdx++] = ch;
    }
    return new string(src, 0, dstIdx);
}
Character array copy method
This method is similar to the in-place character array method, but it uses Array.Copy to move consecutive runs of non-whitespace characters while skipping the whitespace. Finally, it creates a new string from the correctly sized part of the character array, in the same way as before.
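public static string TrimAllWithCharArrayCopy(string str) {
    var len = str.Length;
    var src = str.ToCharArray();
    int srcIdx = 0, dstIdx = 0, count = 0;
    // Scan the full source length; len only tracks the length of the result.
    // (Using the shrinking len as the loop bound would skip whitespace near the end.)
    for (int i = 0; i < src.Length; i++) {
        if (IsWhiteSpace(src[i])) {
            // Move the run of non-whitespace characters that ended just before i
            count = i - srcIdx;
            Array.Copy(src, srcIdx, src, dstIdx, count);
            srcIdx += count + 1;
            dstIdx += count;
            len--;
        }
    }
    // Move the final run that follows the last whitespace character, if any
    if (dstIdx < len)
        Array.Copy(src, srcIdx, src, dstIdx, len - dstIdx);
    return new string(src, 0, len);
}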
Loop with switch method
This method implements the loop by hand and uses the StringBuilder class to create the new string, relying on the internal optimizations of StringBuilder. To keep other factors from interfering with this implementation, no other methods are called, and access to class members is avoided by caching them in local variables. Finally, the buffer is trimmed to the appropriate size by setting StringBuilder.Length.
// Code suggested by http://www.codeproject.com/Members/TheBasketcaseSoftware
public static string TrimAllWithLexerLoop(string s) {
    int length = s.Length;
    var buffer = new StringBuilder(s);
    var dstIdx = 0;
    for (int index = 0; index < s.Length; index++) {
        char ch = s[index];
        switch (ch) {
            case '\u0020': case '\u00A0': case '\u1680': case '\u2000': case '\u2001':
            case '\u2002': case '\u2003': case '\u2004': case '\u2005': case '\u2006':
            case '\u2007': case '\u2008': case '\u2009': case '\u200A': case '\u202F':
            case '\u205F': case '\u3000': case '\u2028': case '\u2029': case '\u0009':
            case '\u000A': case '\u000B': case '\u000C': case '\u000D': case '\u0085':
                length--;
                continue;
            default:
                break;
        }
        buffer[dstIdx++] = ch;
    }
    buffer.Length = length;
    return buffer.ToString();
}
Loop with IsWhiteSpace method
This approach is almost the same as the previous loop with switch method, but it uses an if statement that calls IsWhiteSpace() instead of the messy switch trick :).
public static string TrimAllWithLexerLoopCharIsWhiteSpce(string s) {
    int length = s.Length;
    var buffer = new StringBuilder(s);
    var dstIdx = 0;
    for (int index = 0; index < s.Length; index++) {
        char currentChar = s[index];
        if (IsWhiteSpace(currentChar))
            length--;
        else
            buffer[dstIdx++] = currentChar;
    }
    buffer.Length = length;
    return buffer.ToString();
}
In-place string modification method (unsafe)
This method uses unsafe character pointers and pointer arithmetic to change the string in place. I don't recommend using it in production, because it breaks a basic convention of the .NET Framework: strings are immutable.
public static unsafe string TrimAllWithStringInplace(string str) {
    fixed (char* pfixed = str) {
        char* dst = pfixed;
        for (char* p = pfixed; *p != 0; p++)
            if (!IsWhiteSpace(*p))
                *dst++ = *p;
        /* // Reset the string size
         * ONLY IT DIDN'T WORK! A GARBAGE COLLECTION ACCESS VIOLATION OCCURRED AFTER USING IT
         * SO I HAD TO RESORT TO RETURNING A NEW STRING INSTEAD, WITH ONLY THE PERTINENT BYTES
         * IT WOULD BE A LOT FASTER IF IT DID WORK THOUGH...
        Int32 len = (Int32)(dst - pfixed);
        Int32* pi = (Int32*)pfixed;
        pi[-1] = len;
        pfixed[len] = '\0'; */
        return new string(pfixed, 0, (int)(dst - pfixed));
    }
}
In-place string modification method V2 (unsafe)
This approach is almost the same as the previous one, but it accesses the pointer with array-style indexing. I was curious to know which of these two kinds of memory access would be faster.
public static unsafe string TrimAllWithStringInplaceV2(string str) {
    var len = str.Length;
    fixed (char* pStr = str) {
        int dstIdx = 0;
        for (int i = 0; i < len; i++)
            if (!IsWhiteSpace(pStr[i]))
                pStr[dstIdx++] = pStr[i];
        // Since the unsafe string length reset didn't work, we need to resort to this slower compromise
        return new string(pStr, 0, dstIdx);
    }
}
String.Replace(" ", "")
This implementation is naive: it replaces only the space character, so it does not use the correct definition of whitespace and will miss many other whitespace characters. Although it should be the fastest method in this article, it does less than the others.
But if you only need to remove actual space characters, it is hard to write pure .NET code that beats String.Replace. Most string methods fall back to hand-optimized native C++ code, and String.Replace itself calls a C++ method in comstring.cpp:
FCIMPL3(Object*,
        COMString::ReplaceString,
        StringObject* thisRefUNSAFE,
        StringObject* oldValueUNSAFE,
        StringObject* newValueUNSAFE)
Here is the method used in the benchmark suite:
public static string TrimAllWithStringReplace(string str) {
    // This method is NOT functionally equivalent to the others, as it will only trim "spaces"
    // Whitespace comprises lots of other characters
    return str.Replace(" ", "");
}
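The difference in behaviour is easy to see with an input that mixes a plain space with other whitespace. In this small illustration, the class name ReplaceDemo and the method name TrimSpacesOnly are hypothetical.

```csharp
using System;

public static class ReplaceDemo
{
    // Removes only U+0020, exactly like the benchmark method.
    public static string TrimSpacesOnly(string str)
    {
        return str.Replace(" ", "");
    }

    public static void Main()
    {
        // Input contains a space, a tab, and a no-break space (U+00A0).
        string result = TrimSpacesOnly("a b\tc\u00A0d");
        Console.WriteLine(result.Length);           // prints 6: only the space was removed
        Console.WriteLine(result.Contains("\t"));   // prints True: the tab is still there
    }
}
```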
These are the 10 methods for removing whitespace from a string in .NET. I hope they are helpful for your learning.