A Character Set Based on a Tree-Structured Bit-Compressed Array

Source: Internet
Author: User
Article directory
    • 3.1 ExceptWith operation
    • 3.2 IntersectWith operation
    • 3.3 SymmetricExceptWith operation
    • 3.4 UnionWith operation
    • 3.5 Calculate the number of 1 bits in a binary value

With the set built into .NET (HashSet<T>), it is difficult to optimize set operations effectively. If the data stored in the collection is restricted to characters, a bit-compressed array can be used instead, giving efficient insertion and deletion as well as efficient set operations.

I. Storing a character set in a tree-structured bit-compressed array

BitArray in C# is a bit-compressed array: it packs 32 Boolean values into each element of an int[], so that a single bit of an int represents one Boolean value and memory is saved. Here I use a tree-structured bit-compressed array instead; that is, the structure is actually a tree rather than a simple array.

First, consider the characteristics of a character set:

    1. Like any other set, the stored values are unique; as a character set, the stored data is the char structure.
    2. The character range is limited and relatively small: 0x0000 to 0xFFFF, that is 2^16 = 65536 values, so a flat bit-compressed array can hold all of them in 8 KB.
    3. The character distribution is fairly regular: 0 to 127 is the ASCII range and is used frequently, while the characters after it are used rarely and sparsely.

Given these three points, using a flat 8 KB bit-compressed array directly would waste a lot of space. A better approach is to layer the array (organize it into a tree), give the characters in the ASCII range special treatment to speed them up, and add the storage for later characters dynamically as needed, which effectively reduces wasted space. There must not be too many layers, or processing becomes complicated and slow; nor too few, or a lot of space is wasted.

In the end, I split the bit-compressed array into three layers, with uint as the underlying storage unit. There are 16 top-level units. Each top-level unit corresponds to 16 middle-level units (256 in total), and each middle-level unit corresponds to 8 bottom-level units (2048 in total), in which each bit corresponds to one character. The first eight bottom-level units cover 256 characters, which is exactly extended ASCII (EASCII), and can be handled separately for speed. The storage layout looks like this:

Figure 1 storage mode of charset

Initially, only the first eight storage units, those for EASCII, are allocated; all subsequent storage units are null. This avoids wasted space as far as possible, and the other units can be added dynamically when they are needed. The layers give the following character mask:

Figure 2 char mask
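The split behind the mask can be sketched as follows. This is a minimal illustration that assumes the 4/4/3/5-bit layout implied by the layer sizes above (16 top-level, 16 middle-level, 8 bottom-level units, 32 bits per uint); the class and method names are hypothetical and are not taken from the CharSet class.

```csharp
using System;

// Hypothetical helper illustrating the assumed 4/4/3/5-bit split of a 16-bit char.
public static class CharMaskSketch
{
    // Splits a character into its top, middle, bottom, and bit indices.
    public static (int Top, int Mid, int Bottom, int Bit) Split(char ch)
    {
        int c = ch;
        return (c >> 12, (c >> 8) & 0xF, (c >> 5) & 0x7, c & 0x1F);
    }

    // Rebuilds the character from the four indices.
    public static char Join(int top, int mid, int bottom, int bit)
    {
        return (char)((top << 12) | (mid << 8) | (bottom << 5) | bit);
    }
}
```

For example, 'A' (0x0041) splits into top 0, middle 0, bottom 2, bit 1, so it lives in bit 1 of the third EASCII unit.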

When storing a character, you only need bit operations to set the position selected by the mask, so efficiency is not low. In addition, because the set is stored as a bit-compressed array, some set operations can process 32 bits at a time, which effectively speeds them up. If the number of elements is very large, bit compression also greatly reduces memory consumption. HashSet<char> needs two int values (hashCode and next) and one char (the data) for each element, at least 10 bytes in total; when the set is close to full it occupies hundreds of KB, whereas the bit-compressed array occupies less than 10 KB.
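The memory figures can be checked with simple arithmetic. The sketch below only counts payload bytes; it ignores array headers, padding, and the bucket array of HashSet<char>, so the real numbers are somewhat higher on both sides. All names are illustrative.

```csharp
using System;

// Back-of-the-envelope payload sizes; object headers and padding are ignored.
public static class MemoryEstimateSketch
{
    public const int CharCount = 1 << 16;                // 65536 possible char values
    public const int FlatBitArrayBytes = CharCount / 8;  // one bit per character
    public const int BottomUnits = 16 * 16 * 8;          // top * middle * bottom units
    public const int TreePayloadBytes = BottomUnits * sizeof(uint);

    // Two ints (hash code, next) plus one char per stored element.
    public const int HashSetBytesPerElement = sizeof(int) + sizeof(int) + sizeof(char);
    public const int HashSetFullBytes = CharCount * HashSetBytesPerElement;
}
```

A fully populated tree therefore carries 8192 bytes of bits, while 65536 elements at 10 bytes each come to 655360 bytes, or 640 KB.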

If the character set is case-insensitive, things become much more complicated. Although every value can be converted to uppercase or lowercase when adding, removing, and searching, GetEnumerator and the set operations are harder. When the set is traversed, the original character must be output correctly (not just its lowercase or uppercase form), so whether the original character was uppercase or lowercase must be stored as well.

One storage policy is this: an uppercase character is saved in the set directly, while a lowercase character is saved as both its uppercase and its lowercase form (because the set is case-insensitive, an uppercase character and its lowercase counterpart would never otherwise both be in the set). Then, to tell whether the character originally stored was uppercase or lowercase, you only need to check whether the corresponding lowercase character is in the set. This policy handles traversal correctly, but during set operations a lowercase character is processed twice (because its uppercase form is stored as well), so the resulting count of the set is wrong.

Therefore, a different policy is needed: when the character actually stored is lowercase, only its uppercase form is saved in the set, and a second set marks that position as originally lowercase. In practice, I extended the 8 bottom-level units under each middle-level unit to 16. The first eight store the characters, and the last eight record whether the character at the corresponding position was originally lowercase. This doubles the storage, but set operations lose only a little efficiency, and since a bit-compressed array is already very compact, the extra space is not a problem.
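A minimal sketch of this second policy, limited to a single 256-character block so that it stays short. It assumes invariant-culture case mapping (the article's code uses a culture field instead), and the class and its members are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical one-block sketch: units[0..7] hold the character bits,
// units[8..15] hold "was originally lowercase" flags for the same positions.
public sealed class IgnoreCaseBlockSketch
{
    private readonly uint[] units = new uint[16];

    public bool Add(char ch)
    {
        char upper = char.ToUpperInvariant(ch);
        // This sketch only covers characters whose uppercase form stays within 0x00..0xFF.
        if (ch > 0xFF || upper > 0xFF) throw new ArgumentOutOfRangeException(nameof(ch));
        int idx = (upper >> 5) & 7;
        uint bit = 1u << (upper & 0x1F);
        if ((units[idx] & bit) != 0) return false;  // already present in either case
        units[idx] |= bit;
        if (ch != upper) units[idx + 8] |= bit;     // remember that the original was lowercase
        return true;
    }

    public bool Contains(char ch)
    {
        char upper = char.ToUpperInvariant(ch);
        if (ch > 0xFF || upper > 0xFF) return false;
        return (units[(upper >> 5) & 7] & (1u << (upper & 0x1F))) != 0;
    }

    // Yields each stored character in its original case.
    public IEnumerable<char> Items()
    {
        for (int idx = 0; idx < 8; idx++)
        {
            for (int b = 0; b < 32; b++)
            {
                uint bit = 1u << b;
                if ((units[idx] & bit) == 0) continue;
                char upper = (char)((idx << 5) | b);
                bool wasLower = (units[idx + 8] & bit) != 0;
                yield return wasLower ? char.ToLowerInvariant(upper) : upper;
            }
        }
    }
}
```

Because each character occupies exactly one bit in the first eight units, word-level set operations count it once, and the flag units only need the same mask applied.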

II. Basic operations

The core of adding to, removing from, and searching the set is finding the uint in which the character is stored and the corresponding bit mask. To access the character c, the expression data[c >> 12][(c >> 8) & 0xF][(c >> 5) & 7] & (1 << (c & 0x1F)) can be used. The specific code is as follows:

    // Finds the bottom-level array for c; returns null if it has not been allocated.
    private uint[] FindMask(int c, out int idx, out uint binIdx)
    {
        uint[] arr = null;
        if (c <= 0xFF)
        {
            arr = eascii;
        }
        else
        {
            idx = c >> 12;
            binIdx = 0;
            uint[][] arrMid = data[idx];
            if (arrMid == null) { return null; }
            idx = (c >> 8) & 0xF;
            arr = arrMid[idx];
            if (arr == null) { return null; }
        }
        idx = (c >> 5) & 7;
        binIdx = 1u << (c & 0x1F);
        return arr;
    }

    // Same lookup, but allocates the missing middle and bottom arrays on the way down.
    private uint[] FindAndCreateMask(int c, out int idx, out uint binIdx)
    {
        uint[] arr = null;
        if (c <= 0xFF)
        {
            arr = eascii;
        }
        else
        {
            idx = c >> 12;
            uint[][] arrMid = data[idx];
            if (arrMid == null)
            {
                arrMid = new uint[16][];
                data[idx] = arrMid;
            }
            idx = (c >> 8) & 0xF;
            arr = arrMid[idx];
            if (arr == null)
            {
                arr = new uint[8];
                arrMid[idx] = arr;
            }
        }
        idx = (c >> 5) & 7;
        binIdx = 1u << (c & IndexMask);
        return arr;
    }

The two methods are basically the same, except that one only searches, while the other creates the storage unit when the search fails. Note that for a case-insensitive character set, a middle-level unit corresponds to 16 bottom-level units.
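How adding, removing, and searching sit on top of this lookup can be sketched as below. This is a simplified, case-sensitive illustration rather than the CharSet class itself: it has no separate EASCII array, and the class and member names are assumptions.

```csharp
using System;

// Minimal case-sensitive sketch of the three-layer layout; names are illustrative.
public sealed class MiniCharSet
{
    private readonly uint[][][] data = new uint[16][][];

    public int Count { get; private set; }

    public bool Add(char ch)
    {
        int c = ch;
        // Walk down the tree, allocating missing nodes on the way.
        uint[][] mid = data[c >> 12];
        if (mid == null) { mid = new uint[16][]; data[c >> 12] = mid; }
        uint[] bottom = mid[(c >> 8) & 0xF];
        if (bottom == null) { bottom = new uint[8]; mid[(c >> 8) & 0xF] = bottom; }
        int idx = (c >> 5) & 7;
        uint bit = 1u << (c & 0x1F);
        if ((bottom[idx] & bit) != 0) return false;  // already in the set
        bottom[idx] |= bit;
        Count++;
        return true;
    }

    public bool Contains(char ch)
    {
        uint[] bottom = Find(ch);
        return bottom != null && (bottom[(ch >> 5) & 7] & (1u << (ch & 0x1F))) != 0;
    }

    public bool Remove(char ch)
    {
        uint[] bottom = Find(ch);
        if (bottom == null) return false;
        int idx = (ch >> 5) & 7;
        uint bit = 1u << (ch & 0x1F);
        if ((bottom[idx] & bit) == 0) return false;  // not in the set
        bottom[idx] &= ~bit;
        Count--;
        return true;
    }

    // Returns the bottom-level array for ch, or null if it was never allocated.
    private uint[] Find(char ch)
    {
        uint[][] mid = data[ch >> 12];
        return mid == null ? null : mid[(ch >> 8) & 0xF];
    }
}
```

Searching and removing never allocate, so probing for characters that were never added leaves the tree untouched.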

Also because the array is bit-compressed, enumerating the characters through IEnumerator<char> is cumbersome: each original character has to be reassembled from its indices. For a case-insensitive character set, the case information must also be read from the last eight uint values of the corresponding storage unit.

 
    public override IEnumerator<char> GetEnumerator()
    {
        for (int i = 0; i < 16; i++)
        {
            uint[][] arr1 = data[i];
            if (arr1 == null) { continue; }
            int c1 = i << 12;
            for (int j = 0; j < 16; j++)
            {
                uint[] arr2 = arr1[j];
                if (arr2 == null) { continue; }
                int c2 = c1 | (j << 8);
                for (int k = 0; k < 8; k++)
                {
                    int c3 = c2 | (k << 5);
                    uint value = arr2[k];
                    uint valueIg = ignoreCase ? arr2[k + 8] : 0u;
                    for (int n = -1; value > 0;)
                    {
                        // Distance from the previous position to the next set bit.
                        int oneIdx = (value & 1) == 1 ? 1 : value.BinTrailingZeroCount() + 1;
                        if (oneIdx == 32)
                        {
                            // A 32-bit shift count is masked to 0 in C#, so clear explicitly.
                            value = 0;
                        }
                        else
                        {
                            value = value >> oneIdx;
                        }
                        n += oneIdx;
                        char c4 = (char)(c3 | n);
                        if ((valueIg & (1u << n)) > 0)
                        {
                            c4 = char.ToLower(c4, culture);
                        }
                        yield return c4;
                    }
                }
            }
        }
    }

The code uses an extension method, BinTrailingZeroCount, which calculates the number of trailing zeros in the binary representation. With it, the next valid bit can be reached with a single shift instead of many single-bit right shifts. The code for this method is as follows:

 
    private static readonly int[] MultiplyDeBruijnBitPosition32 = new int[]
    {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };

    public static int BinTrailingZeroCount(this uint value)
    {
        // value & -value isolates the lowest set bit.
        return MultiplyDeBruijnBitPosition32[((uint)(value & -value) * 0x077CB531u) >> 27];
    }

This method needs only two bitwise operations, one multiplication, and one array access to obtain the number of trailing zeros. It is very efficient, but the principle behind it, a table lookup keyed by a de Bruijn sequence, is fairly complicated.
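As a worked example, the same lookup can drive a loop that lists the positions of all set bits in a uint. This sketch clears the lowest set bit with value &= value - 1, which is a different loop from the shift-based one in GetEnumerator above; the class and method names are illustrative. On .NET Core 3.0 and later, System.Numerics.BitOperations.TrailingZeroCount offers the same operation as a built-in.

```csharp
using System;
using System.Collections.Generic;

public static class TrailingZeroSketch
{
    private static readonly int[] Table =
    {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };

    // Number of trailing zero bits; the result is only meaningful for value != 0.
    public static int TrailingZeroCount(uint value)
    {
        unchecked
        {
            // ~value + 1 is the two's complement negation, so this isolates the lowest set bit.
            uint lowest = value & (~value + 1u);
            // The multiply-and-shift maps each isolated bit to a distinct table slot.
            return Table[(lowest * 0x077CB531u) >> 27];
        }
    }

    // Lists the positions of all set bits, lowest first.
    public static List<int> SetBitPositions(uint value)
    {
        var positions = new List<int>();
        while (value != 0)
        {
            positions.Add(TrailingZeroCount(value));
            value &= value - 1;  // clear the lowest set bit
        }
        return positions;
    }
}
```

For the value 0x28 (binary 101000) the loop visits bit 3 and then bit 5, skipping the zero bits in between.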

III. Set operations

The biggest advantage of CharSet is that its set operations are extremely fast. The tree storage avoids many unnecessary comparisons, and 32 characters can be processed at a time, so efficiency stays acceptable even when the data is sparse. Every operation first traverses the tree to find the corresponding nodes of the current set and of the other set, and then computes the node value with bit operations. The following sections describe the bitwise operations.

3.1 ExceptWith operation

The ExceptWith operation removes every character of the other set from the current set. You can use thisValue &= ~otherValue to clear each bit that is 1 in otherValue. You also need to count the characters removed, that is, the bits that are 1 in both thisValue and otherValue: compute thisValue & otherValue and count its binary 1s. The code for the ExceptWith method is given below. It applies only when the other collection is a CharSet of the same kind and cannot be used with other collection types.

 
    private void ExceptWith(CharSet other)
    {
        for (int i = 0; i < 16; i++)
        {
            uint[][] arrMid = data[i];
            if (arrMid == null) { continue; }
            uint[][] otherArrMid = other.data[i];
            if (otherArrMid == null) { continue; }
            for (int j = 0; j < 16; j++)
            {
                uint[] arrBottom = arrMid[j];
                if (arrBottom == null) { continue; }
                uint[] otherArrBottom = otherArrMid[j];
                if (otherArrBottom == null) { continue; }
                for (int k = 0; k < 8; k++)
                {
                    // Bits set in both units are the characters to remove.
                    uint removed = arrBottom[k] & otherArrBottom[k];
                    if (removed > 0)
                    {
                        this.count -= removed.BinOneCnt();
                        arrBottom[k] &= ~removed;
                        if (ignoreCase) { arrBottom[k + 8] &= ~removed; }
                    }
                }
            }
        }
    }

3.2 IntersectWith operation

The IntersectWith operation leaves the current set containing only the elements that are also in the specified set, the opposite of the previous operation. You can therefore use thisValue &= otherValue to clear each bit that is 0 in otherValue. The characters removed are the bits that are 1 in thisValue and 0 in otherValue, that is, thisValue & ~otherValue.
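For a single 32-character unit, the step can be sketched as follows. The helper name is hypothetical, and the bit count uses System.Numerics.BitOperations.PopCount (available from .NET Core 3.0) in place of the article's own counting method.

```csharp
using System;
using System.Numerics;

public static class IntersectSketch
{
    // Intersects one unit in place and returns how many characters were removed.
    public static int IntersectUnit(ref uint thisValue, uint otherValue)
    {
        uint removed = thisValue & ~otherValue;  // 1 in this, 0 in other
        thisValue &= otherValue;
        return BitOperations.PopCount(removed);
    }
}
```

With thisValue = 0xE (binary 1110) and otherValue = 0x6 (binary 0110), one character is removed and thisValue becomes 0x6.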

3.3 SymmetricExceptWith operation

The SymmetricExceptWith operation is similar to exclusive or: it leaves the current set containing only the elements that exist in either the current set or the specified set, but not in both. The bitwise operation is naturally thisValue ^= otherValue. Counting the changed characters is harder here, because the number of bits of thisValue that change from 0 to 1 must be added and the number that change from 1 to 0 must be subtracted. Either way the 1 bits have to be counted twice, so it is simplest to count the 1 bits before and after the XOR and take the difference.
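A sketch of the count-before-and-after approach for one unit. The helper name is hypothetical, and System.Numerics.BitOperations.PopCount (.NET Core 3.0 and later) stands in for the article's own counting method.

```csharp
using System;
using System.Numerics;

public static class SymmetricExceptSketch
{
    // XORs one unit in place and returns the net change in the element count.
    public static int SymmetricExceptUnit(ref uint thisValue, uint otherValue)
    {
        int before = BitOperations.PopCount(thisValue);
        thisValue ^= otherValue;
        return BitOperations.PopCount(thisValue) - before;
    }
}
```

The returned delta can be negative, so it would be added to the running count rather than subtracted from it.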

3.4 UnionWith operation

The UnionWith operation makes the current set contain all elements that exist in the current set, the specified set, or both, so thisValue |= otherValue is used. The characters added are the bits that are 0 in thisValue and 1 in otherValue, computed as ~thisValue & otherValue.
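The same step for one unit, as a sketch. The helper name is hypothetical, and System.Numerics.BitOperations.PopCount (.NET Core 3.0 and later) stands in for the article's own counting method.

```csharp
using System;
using System.Numerics;

public static class UnionSketch
{
    // ORs one unit in place and returns how many characters were added.
    public static int UnionUnit(ref uint thisValue, uint otherValue)
    {
        uint added = ~thisValue & otherValue;  // 0 in this, 1 in other
        thisValue |= otherValue;
        return BitOperations.PopCount(added);
    }
}
```

Unlike the previous operations, a union may also have to allocate nodes that exist only in the other set before this step can run.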

3.5 Calculate the number of 1 bits in a binary value

To count the 1 bits in a binary value, there is no need to loop 32 times testing each bit. A divide-and-conquer approach produces the result with 15 bitwise operations and 5 additions:

    public static int BinOneCnt(this uint value)
    {
        // Each step adds neighbouring bit fields of width 1, 2, 4, 8, and 16.
        value = (value & 0x55555555) + ((value >> 1) & 0x55555555);
        value = (value & 0x33333333) + ((value >> 2) & 0x33333333);
        value = (value & 0x0F0F0F0F) + ((value >> 4) & 0x0F0F0F0F);
        value = (value & 0x00FF00FF) + ((value >> 8) & 0x00FF00FF);
        value = (value & 0x0000FFFF) + ((value >> 16) & 0x0000FFFF);
        return (int)value;
    }
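One way to gain confidence in the masks is to compare the result against a plain bit-by-bit loop. The sketch below restates the method as an ordinary static method next to a naive reference; the class and method names are illustrative.

```csharp
using System;

public static class PopCountSketch
{
    // Divide-and-conquer count: sums adjacent fields of 1, 2, 4, 8, and 16 bits.
    public static int BinOneCnt(uint value)
    {
        value = (value & 0x55555555u) + ((value >> 1) & 0x55555555u);
        value = (value & 0x33333333u) + ((value >> 2) & 0x33333333u);
        value = (value & 0x0F0F0F0Fu) + ((value >> 4) & 0x0F0F0F0Fu);
        value = (value & 0x00FF00FFu) + ((value >> 8) & 0x00FF00FFu);
        value = (value & 0x0000FFFFu) + ((value >> 16) & 0x0000FFFFu);
        return (int)value;
    }

    // Reference implementation: tests one bit per iteration.
    public static int NaiveOneCnt(uint value)
    {
        int count = 0;
        for (; value != 0; value >>= 1)
        {
            count += (int)(value & 1);
        }
        return count;
    }
}
```

Values with bits in the top byte, such as 0xFF000000, are worth including in any comparison, because a mask that is too short in the third step would silently drop them.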

The idea and basic operations of CharSet are as described above. For the complete code, refer to the CharSet class. Its insertion and deletion speed is about 70% of that of HashSet<char> (mainly because the array is split into three layers), but its set operations are generally faster than those of HashSet<char>; when the set is relatively large (several thousand elements), they can be several times faster.

Later, I tested a two-layer array (top layer of length 64, bottom layer of length 32) and found that memory usage was not a problem and efficiency improved slightly. It seems that simpler data structures can bring real benefits.
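A two-layer layout with those lengths could look like the sketch below. The 6/5/5-bit index split is an assumption derived from the stated lengths (64 × 32 × 32 = 65536), and the class is illustrative rather than the code that was tested.

```csharp
using System;

// Hypothetical two-layer variant: 64 top-level slots, each holding 32 uints (1024 characters).
public sealed class TwoLayerCharSetSketch
{
    private readonly uint[][] data = new uint[64][];

    public bool Add(char ch)
    {
        int top = ch >> 10;            // upper 6 bits: 0..63
        int idx = (ch >> 5) & 0x1F;    // next 5 bits: 0..31
        uint bit = 1u << (ch & 0x1F);  // lowest 5 bits select the bit
        uint[] bottom = data[top];
        if (bottom == null) { bottom = new uint[32]; data[top] = bottom; }
        if ((bottom[idx] & bit) != 0) return false;
        bottom[idx] |= bit;
        return true;
    }

    public bool Contains(char ch)
    {
        uint[] bottom = data[ch >> 10];
        return bottom != null && (bottom[(ch >> 5) & 0x1F] & (1u << (ch & 0x1F))) != 0;
    }
}
```

Each lookup follows one reference fewer than in the three-layer tree, at the cost of allocating 128 bytes of bits, instead of 32, the first time a region is touched.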
