Transferred from: http://www.cnblogs.com/gaochundong/p/3813252.html#!comments
| Data structure | Add | Find | Delete | GetByIndex |
| --- | --- | --- | --- | --- |
| Array (T[]) | O(n) | O(n) | O(n) | O(1) |
| Linked list (LinkedList<T>) | O(1) | O(n) | O(n) | O(n) |
| Resizable array list (List<T>) | O(1) | O(n) | O(n) | O(1) |
| Stack (Stack<T>) | O(1) | - | O(1) | - |
| Queue (Queue<T>) | O(1) | - | O(1) | - |
| Hash table (Dictionary<K,T>) | O(1) | O(1) | O(1) | - |
| Tree-based dictionary (SortedDictionary<K,T>) | O(log n) | O(log n) | O(log n) | - |
| Hash table based set (HashSet<T>) | O(1) | O(1) | O(1) | - |
| Tree based set (SortedSet<T>) | O(log n) | O(log n) | O(log n) | - |
How to choose a data structure
Array (T[])
- When the number of elements is fixed and access by index is needed.
Linked list (LinkedList<T>)
- When elements need to be added at both ends of the list. Otherwise, use List<T>.
Resizable array list (List<T>)
- When the number of elements is not fixed and access by index is needed.
Stack (Stack<T>)
- When you need LIFO (last in, first out) behavior.
Queue (Queue<T>)
- When you need FIFO (first in, first out) behavior.
Hash table (Dictionary<K,T>)
- When you need to add and look up key-value pairs quickly, and the elements have no particular order.
Tree-based dictionary (SortedDictionary<K,T>)
- When you need to add and look up key-value pairs quickly, and the elements must be sorted by key.
Hash table based set (HashSet<T>)
- When you need to store a set of unique values, and the elements have no particular order.
Tree based set (SortedSet<T>)
- When you need to store a set of unique values, and the elements need to be sorted.
Array
In computer programming, the array is one of the simplest and most widely used data structures. In any programming language, arrays share some common traits:
- The contents of an array are stored in contiguous memory.
- All elements in an array must be of the same type, or of a type derived from it. The array is therefore considered a homogeneous data structure.
- The elements of an array can be accessed directly. For example, to access the i-th element of an array, you can use arrayName[i].
General operations on arrays include the following:
- Allocation
- Accessing
In C#, you can declare array variables in the following way.

    int allocationSize = 10;
    bool[] booleanArray = new bool[allocationSize];
    FileInfo[] fileInfoArray = new FileInfo[allocationSize];

The code above allocates a contiguous block of memory on the CLR managed heap, large enough to hold allocationSize elements of the array's element type. If the element type is a value type, allocationSize unboxed values of that type are created. If the element type is a reference type, allocationSize references of that type are created.
If we assign values to some positions in the FileInfo[] array, those positions hold references to FileInfo objects on the heap, while the remaining positions stay null.
Arrays in .NET support direct read and write operations on elements. The syntax is as follows:

    // Read an array element
    bool b = booleanArray[7];

    // Write an array element
    booleanArray[0] = false;

The time complexity of accessing an array element is O(1), so access time is constant for arrays. That is, the time to access an element does not depend on the number of elements the array contains.
ArrayList
Because arrays have a fixed length and can only store elements of one type (or of types derived from it), they are restrictive in use. .NET provides the ArrayList data structure to address these issues.

    ArrayList countdown = new ArrayList();
    countdown.Add(3);
    countdown.Add(2);
    countdown.Add(1);
    countdown.Add("Blast off!");
    countdown.Add(new ArrayList());

ArrayList is a variable-length array, and it can store elements of different types.
But this flexibility comes at the expense of performance. As described above, an array stores value types unboxed. Because the Add method of ArrayList accepts a parameter of type object, a boxing operation occurs whenever a value-type value is added. This incurs additional overhead when an ArrayList is read and written frequently, which degrades performance.
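The boxing cost described above can be made visible with a small sketch. It reuses the countdown values from the earlier example; the summing loop is only an illustration, and it assumes a C# version that supports top-level statements and pattern matching.

```csharp
using System;
using System.Collections;

// ArrayList stores every element as object, so each int below is boxed on Add.
ArrayList countdown = new ArrayList();
countdown.Add(3);
countdown.Add(2);
countdown.Add(1);
countdown.Add("Blast off!");

int total = 0;
foreach (object item in countdown)
{
    // A run-time type check and an unboxing conversion are needed to get the int back.
    if (item is int number)
    {
        total += number;
    }
}

Console.WriteLine(total); // prints 6
```

The type check in the loop is needed only because the collection cannot promise that every element is an int.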
List<T>
When generics were introduced in .NET, the performance cost of ArrayList described above could be eliminated. .NET provides a new array-like type, List<T>.
Generics allow developers to defer the choice of data type when creating a data structure until the structure is actually used. The main benefits of generics include:
- Type safety: a structure defined with a generic type can only be used with the specified type or types derived from it.
- Performance: generics remove run-time type checks and eliminate the overhead of boxing and unboxing.
- Reusability: generics break the tight coupling between a data structure and the type of data it stores, which improves the reusability of the data structure.
List<T> is a homogeneous, self-redimensioning array. It can read elements as quickly as an array while remaining flexible in length.

    // Create a list of int
    List<int> myFavoriteIntegers = new List<int>();

    // Create a list of string
    List<string> friendsNames = new List<string>();

List<T> is also implemented with an internal array, but it hides that complexity. There is no need to specify an initial length when creating a List<T>, and there is no need to worry about resizing the array when adding elements.

    List<int> powersOf2 = new List<int>();

    powersOf2.Add(1);
    powersOf2.Add(2);

    powersOf2[1] = 10;

    // Only indexes 0 and 1 exist at this point
    int sum = powersOf2[0] + powersOf2[1];

The asymptotic running time of List<T> operations is the same as that of an array.
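How a list can hide resizing behind an Add operation can be sketched as follows. This is a minimal illustration, not the actual List<T> source: the names items, count and resizes, the starting capacity of 4, and the doubling policy are all assumptions made for the example.

```csharp
using System;

// Hypothetical backing store for a growable list of int.
int[] items = new int[4];
int count = 0;
int resizes = 0;

void Add(int value)
{
    if (count == items.Length)
    {
        // Out of room: allocate a larger array and copy the old elements over.
        int[] larger = new int[items.Length * 2];
        Array.Copy(items, larger, count);
        items = larger;
        resizes++;
    }
    items[count] = value;
    count++;
}

for (int i = 0; i < 10; i++)
{
    Add(i * i);
}

Console.WriteLine($"count={count}, capacity={items.Length}, resizes={resizes}");
// prints count=10, capacity=16, resizes=2
```

Most calls to Add only write one slot; the occasional copy is what makes an individual Add O(n) in the worst case.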
LinkedList<T>
In a linked list, each element points to the next element, forming a chain.
When creating a linked list, we only need to hold a reference to the head node; every other node can then be reached by following the next references one by one.
Searching a linked list takes linear time O(n), the same as searching an array. For example, in a list whose head node is Scott, finding the Sam node means starting at Scott and following the next references one by one until Sam is reached.
Similarly, the asymptotic time to remove a node from a linked list is also linear, O(n), because we still have to traverse from the head to find the node to delete. The delete operation itself is simple: the next pointer of the node to the left of the deleted node is made to point to the node on its right.
The asymptotic time of inserting a new node depends on whether the linked list is ordered. If the list does not need to be kept in order, insertion takes constant time O(1): the new node is added at the head or tail of the list. If the list must stay sorted, the position for the new node has to be found by traversing from the head one node at a time, so the operation becomes O(n).
The difference between a linked list and an array is that the contents of an array are laid out contiguously in memory and can be accessed by index, while the order of a linked list's contents is determined by the pointers in each node. The contents are therefore not necessarily contiguous and cannot be accessed by index. If you need faster lookups, an array may be the better choice.
The main advantage of a linked list is that inserting or deleting nodes does not require resizing the structure. By contrast, the capacity of an array is fixed; if more data has to be stored, the array must be resized, which involves a series of costly operations such as allocating a new array and copying the data. Even the List<T> class, although it hides the complexity of capacity adjustment, cannot escape this performance penalty.
Another advantage of a linked list is that it is especially suitable for adding new elements dynamically in sorted order. To add a new element somewhere in the middle of an array, you not only have to move all the elements after it, you may also have to resize the array.
In summary, arrays suit data whose element count has a known upper bound, and linked lists suit cases where the number of elements is not fixed.
.NET has a built-in LinkedList<T> class that implements a doubly linked list, which means each node holds references to both its left and right neighbors. For deletion, Remove(T) has complexity O(n), where n is the length of the linked list, while Remove(LinkedListNode<T>) has complexity O(1).
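The two Remove overloads mentioned above can be compared in a short usage sketch. The names in the list are sample values chosen for illustration.

```csharp
using System;
using System.Collections.Generic;

LinkedList<string> names = new LinkedList<string>();
names.AddLast("Scott");
LinkedListNode<string> sam = names.AddLast("Sam"); // keep the node reference
names.AddLast("Jisun");
names.AddFirst("Alice");              // O(1): insert at the head

names.Remove(sam);                    // O(1): the node is already known
bool removed = names.Remove("Jisun"); // O(n): searches from the head first

Console.WriteLine(string.Join(", ", names)); // prints Alice, Scott
```

Holding on to the LinkedListNode<string> returned by AddLast is what makes the constant-time removal possible.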
Queue<T>
When we need a first-in, first-out (FIFO) data structure, .NET gives us Queue<T>. The Queue<T> class provides the Enqueue and Dequeue methods for adding and removing elements.
Internally, Queue<T> maintains a circular array of T objects, with the head and tail variables pointing to the head and tail of the queue within that array.
The non-generic Queue class has a default initial capacity of 32; for both Queue and Queue<T>, the capacity can also be specified through the constructor.
The Enqueue method checks whether the queue has enough capacity to hold the new element. If so, it adds the element directly and increments the tail index. Here tail is advanced with a modulo operation so that it never exceeds the array length. If the capacity is insufficient, the queue expands its array according to a growth factor.
By default, the growth factor is 2.0, so the length of the internal array doubles. The non-generic Queue class also lets you specify the growth factor through its constructor; Queue<T> does not expose this parameter. The capacity of a Queue<T> can be reduced with the TrimExcess method.
The Dequeue method returns the element at the head index, sets that slot to null, and then increments the head value.
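The head and tail bookkeeping described above can be sketched with a fixed-size circular array. This is a simplified illustration, not the Queue<T> implementation: the names buffer, head, tail and size are assumptions, and growth is deliberately left out.

```csharp
using System;

// Hypothetical fixed-capacity ring buffer; a real queue would grow instead of throwing.
string[] buffer = new string[4];
int head = 0;   // index of the next element to dequeue
int tail = 0;   // index where the next element will be stored
int size = 0;

void Enqueue(string item)
{
    if (size == buffer.Length)
    {
        throw new InvalidOperationException("Queue is full; a real queue would grow here.");
    }
    buffer[tail] = item;
    tail = (tail + 1) % buffer.Length;   // wrap around the end of the array
    size++;
}

string Dequeue()
{
    if (size == 0)
    {
        throw new InvalidOperationException("Queue is empty.");
    }
    string item = buffer[head];
    buffer[head] = null;                 // clear the slot
    head = (head + 1) % buffer.Length;
    size--;
    return item;
}

Enqueue("a"); Enqueue("b"); Enqueue("c");
string first = Dequeue();    // "a"
Enqueue("d"); Enqueue("e");  // "e" wraps around into slot 0
Console.WriteLine($"{first} head={head} tail={tail} size={size}");
// prints a head=1 tail=1 size=4
```

When the buffer is full, head and tail are equal, which is why the sketch tracks size separately.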
Stack<T>
When we need a last-in, first-out (LIFO) data structure, .NET provides Stack<T>. The Stack<T> class provides the Push and Pop methods for adding and removing elements.
The elements stored in a Stack<T> can be pictured as a vertical pile. When a new element is pushed onto the stack (Push), it is placed on top of all the other elements. When an element is popped (Pop), it is removed from the top.
The default capacity is 10 (the figure documented for the non-generic Stack class). As with Queue<T>, the initial capacity of a Stack<T> can also be specified in the constructor. The capacity of a Stack<T> expands automatically as needed, and it can be reduced with the TrimExcess method.
If the number of elements in the Stack<T> (Count) is less than its capacity, the complexity of the Push operation is O(1). If the capacity has to be expanded, the complexity of that Push becomes O(n). The complexity of the Pop operation is always O(1).
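A brief usage sketch of the operations mentioned above. The initial capacity of 16 and the pushed values are arbitrary choices for illustration.

```csharp
using System;
using System.Collections.Generic;

Stack<int> stack = new Stack<int>(16); // initial capacity given up front
stack.Push(1);
stack.Push(2);
stack.Push(3);

int top = stack.Pop();     // 3: the last element pushed is the first one out
int next = stack.Peek();   // 2: look at the top without removing it
stack.TrimExcess();        // release unused capacity

Console.WriteLine($"{top} {next} {stack.Count}"); // prints 3 2 2
```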
Hashtable
Now suppose we want to store employees using each employee's Social Security number as a unique identifier. A Social Security number has the form DDD-DD-DDDD (where each D is a digit from 0 to 9).
If employee information is stored in an array, then to find the employee with Social Security number 111-22-3333 you have to traverse all positions in the array, a query with asymptotic time O(n). A better approach is to sort by Social Security number, which reduces the asymptotic query time to O(log n). Ideally, though, we would like the query to take O(1).
One option is to create a very large array with positions ranging from 000-00-0000 to 999-99-9999.
The disadvantage of this scheme is that it wastes space. If we only need to store information for 1000 employees, we use only 0.0001% of the space.
The second option is to compress the range with a hash function.
We choose the last four digits of the Social Security number as the index, which reduces the span of the range. The range then runs from 0000 to 9999.
Mathematically, this conversion from nine digits to four digits is called hashing. It compresses an array's index space (indexers space) into a correspondingly smaller hash table.
In the example above, the input of the hash function is a nine-digit Social Security number, and the output is its last four digits.
H(x) = last four digits of x
This example also shows a common occurrence with hash functions: hash collisions. For instance, two different Social Security numbers may both end in 0000.
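The hash function H(x) can be written out in a few lines. In this sketch the function name Hash is an assumption, and the inputs are Social Security numbers that appear elsewhere in this article.

```csharp
using System;

// H(x) = last four digits of x, for a number written as DDD-DD-DDDD.
int Hash(string ssn)
{
    string digits = ssn.Replace("-", "");
    return int.Parse(digits.Substring(digits.Length - 4));
}

int a = Hash("111-22-3333");   // 3333
int b = Hash("333-33-1234");   // 1234
int c = Hash("444-44-1234");   // 1234: collides with the previous number
Console.WriteLine($"{a} {b} {c}"); // prints 3333 1234 1234
```

Two different inputs producing 1234 is exactly the kind of collision a hash table has to handle.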
When you add a new element to a Hashtable, hash collisions are a complicating factor. If no collision occurs, the element is inserted successfully. If a collision occurs, it has to be resolved. Hash collisions therefore increase the cost of operations, and one of Hashtable's design goals is to minimize them.
There are two ways to handle hash collisions: avoidance and resolution, that is, collision avoidance and collision resolution.
One way to avoid hash collisions is to choose an appropriate hash function. The probability of a collision depends on the distribution of the data. For example, if the last four digits of Social Security numbers are uniformly distributed, then using the last four digits is appropriate. However, if the last four digits encode the employee's year of birth, which is clearly not uniformly distributed, then choosing the last four digits will cause many collisions. Choosing an appropriate hash function in this way is called collision avoidance.
There are a number of strategies for dealing with collisions when they do occur, called collision resolution. One approach is to put the inserted element in a different slot, because its hash position is already occupied.
A commonly used collision resolution strategy is open addressing, in which all elements are still stored in the array inside the hash table.
One of the simplest implementations of open addressing is linear probing, with the following steps:
1. When a new element is inserted, use the hash function to locate its position in the hash table.
2. Check whether an element already exists at that position. If the position is empty, insert the element and return; otherwise go to step 3.
3. If the position is i, check whether i+1 is empty; if it is occupied, check i+2, and so on, until an empty position is found.
Now suppose we want to insert information for five employees into a hash table:
- Alice (333-33-1234)
- Bob (444-44-1234)
- Cal (555-55-1237)
- Danny (000-00-1235)
- Edward (111-00-1235)
The elements are inserted as follows:
- Alice's Social Security number hashes to 1234, so she is stored at position 1234.
- Bob's Social Security number also hashes to 1234, but position 1234 already holds Alice's information, so the next position, 1235, is checked. It is empty, so Bob is placed at 1235.
- Cal's Social Security number hashes to 1237. Position 1237 is empty, so Cal is placed at 1237.
- Danny's Social Security number hashes to 1235, which is occupied, so position 1236 is checked. It is empty, so Danny is placed at 1236.
- Edward's Social Security number hashes to 1235, which is occupied. Position 1236 is also occupied, and so is 1237. Position 1238 is empty, so Edward is placed at 1238.
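The insertion walkthrough above can be reproduced with a small linear-probing sketch. The names slots, Hash and Insert are assumptions for illustration, and the table has one slot per four-digit hash value.

```csharp
using System;

const int TableSize = 10000;             // one slot per four-digit hash value
string[] slots = new string[TableSize];

// Last four digits of a number written as DDD-DD-DDDD.
int Hash(string ssn) => int.Parse(ssn.Substring(ssn.Length - 4));

// Inserts a name and returns the slot it ended up in.
int Insert(string ssn, string name)
{
    int index = Hash(ssn);
    for (int probes = 0; probes < TableSize; probes++)
    {
        if (slots[index] == null)
        {
            slots[index] = name;
            return index;
        }
        index = (index + 1) % TableSize;  // occupied: try the next slot
    }
    throw new InvalidOperationException("Hash table is full.");
}

int alice = Insert("333-33-1234", "Alice");
int bob = Insert("444-44-1234", "Bob");
int cal = Insert("555-55-1237", "Cal");
int danny = Insert("000-00-1235", "Danny");
int edward = Insert("111-00-1235", "Edward");
Console.WriteLine($"{alice} {bob} {cal} {danny} {edward}");
// prints 1234 1235 1237 1236 1238
```

Edward ends up three slots away from his hash position because earlier collisions filled the run of slots from 1235 to 1237.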
Linear probing is simple, but it is not the best collision resolution strategy, because it causes elements with nearby hashes to bunch together (primary clustering). As a result, collisions persist when the hash table is searched. For example, to access Edward's information in the hash table above, we hash his Social Security number 111-00-1235 to 1235, but at position 1235 we find Bob, so we search 1236 and find Danny, and so on until we find Edward.
An improvement is quadratic probing, in which the step to each checked position grows as a square. That is, if position s is occupied, first check s + 1², then s − 1², s + 2², s − 2², s + 3² and so on, instead of growing as s + 1, s + 2, ... as in linear probing. However, quadratic probing can also cause a similar clustering problem (secondary clustering).
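The quadratic probe order can be listed with a short sketch. The function name ProbeSequence and the wrap-around at the table boundary are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Returns the first `count` slots probed after the home slot s is found occupied.
List<int> ProbeSequence(int s, int tableSize, int count)
{
    var sequence = new List<int>();
    for (int i = 1; sequence.Count < count; i++)
    {
        int step = i * i;
        sequence.Add((s + step) % tableSize);
        if (sequence.Count < count)
        {
            // Normalize the remainder so the index is never negative.
            sequence.Add(((s - step) % tableSize + tableSize) % tableSize);
        }
    }
    return sequence;
}

Console.WriteLine(string.Join(", ", ProbeSequence(1235, 10000, 6)));
// prints 1236, 1234, 1239, 1231, 1244, 1226
```

Compared with linear probing, successive probes quickly move away from the crowded region around the home slot.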
The Hashtable class in .NET requires that when an element is added, you provide not only the element (Item) but also a key (Key) for it. For example, the Key is the employee's Social Security number and the Item is the employee information object. You can then look up the Item by using the Key as an index.

    Hashtable employees = new Hashtable();

    // Add some values to the Hashtable, indexed by a string key
    employees.Add("111-22-3333", "Scott");
    employees.Add("222-33-4444", "Sam");
    employees.Add("333-44-55555", "Jisun");

    // Access a particular key
    if (employees.ContainsKey("111-22-3333"))
    {
        string empName = (string)employees["111-22-3333"];
        Console.WriteLine("Employee 111-22-3333's name is: " + empName);
    }
    else
        Console.WriteLine("Employee 111-22-3333 is not in the hash table...");

The hash function in the Hashtable class is more complex than the Social Security number example described earlier. A hash function must return an ordinal value. For the Social Security number, this was achieved by taking the last four digits. The Hashtable class, however, accepts a value of any type as the Key, thanks to the GetHashCode method defined in System.Object. The default implementation of GetHashCode returns an integer that remains constant for the lifetime of the object (the value is not guaranteed to be unique across objects).
The hash function in the Hashtable class is defined as follows:
H(key) = [GetHash(key) + 1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1))] % hashsize
Here GetHash(key) by default calls the key's GetHashCode method to obtain the hash value, and hashsize is the length of the hash table. Because of the final modulo operation, the result of H(key) lies between 0 and hashsize - 1.
When you add or get an element in a hash table, a hash collision may occur. Earlier we briefly introduced two collision resolution strategies:
- Linear probing
- Quadratic probing
The Hashtable class uses a completely different technique called rehashing, which some sources refer to as double hashing.
Rehashing works as follows:
There is a set of hash functions H1 ... Hn. When you need to add an element to the hash table or get one from it, the hash function H1 is used first. If that leads to a collision, H2 is tried, and so on, up to Hn. All the hash functions are very similar to H1; they differ only in the multiplicative factor they use.
In general, the hash function Hk is defined as follows:
Hk(key) = [GetHash(key) + k * (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1)))] % hashsize
With rehashing, it is important that after hashsize probes every position in the hash table has been visited exactly once. That is, for a given key, no position in the hash table is probed by both Hi and Hj (i ≠ j). The Hashtable class guarantees this with its rehashing formula by always keeping (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1))) and hashsize relatively prime (two numbers are relatively prime if they share no common prime factor).
Rehashing uses Θ(m²) probe sequences, while linear probing and quadratic probing use Θ(m) probe sequences, where m is the number of slots, so rehashing provides a better strategy for avoiding collisions.
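The claim that each slot is visited exactly once can be checked with a sketch of the Hk formula. The function name Probe, the table size of 11 and the sample key are assumptions for illustration; this is not the Hashtable source code.

```csharp
using System;
using System.Collections.Generic;

// Hk(key) = [hash + k * (1 + (((hash >> 5) + 1) % (hashsize - 1)))] % hashsize
int Probe(int hash, int k, int hashsize)
{
    int increment = 1 + (((hash >> 5) + 1) % (hashsize - 1));
    // Use long arithmetic so the sum cannot overflow before the modulo.
    return (int)(((long)hash + (long)k * increment) % hashsize);
}

int hashsize = 11;  // a prime, so every increment from 1 to 10 is relatively prime to it
int hash = "111-22-3333".GetHashCode() & 0x7FFFFFFF;  // keep the value non-negative

var visited = new HashSet<int>();
for (int k = 1; k <= hashsize; k++)
{
    visited.Add(Probe(hash, k, hashsize));
}

Console.WriteLine(visited.Count); // prints 11: each slot was probed exactly once
```

Because the increment and the table size share no common factor, the multiples of the increment cycle through every slot before repeating.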
The Hashtable class contains a private member variable, loadFactor, which specifies the maximum ratio between the number of elements in the hash table and the number of positions (slots). For example, if loadFactor equals 0.5, then at most half of the slots in the hash table hold elements and the other half are empty.
The Hashtable constructor allows the user to specify the loadFactor value, in the range 0.1 to 1.0. However, no matter what value you provide, the actual ratio will not exceed 72%. Even if you pass 1.0, the effective loadFactor of the Hashtable class is still 0.72. Microsoft considers 0.72 the best value for loadFactor, balancing speed and space. So although the default loadFactor is 1.0, it is automatically scaled to 0.72 internally. It is therefore recommended that you use the default value of 1.0 (which is effectively 0.72).
When you add a new element to a Hashtable, a check is made to ensure that the ratio of elements to slots does not exceed the maximum ratio. If it would be exceeded, the hash table is expanded. The steps are as follows:
- The number of slots in the hash table is approximately doubled. More precisely, the number of slots increases from the current prime value to the next larger prime value.
- Because of rehashing, the position of every element in the hash table depends on the number of slots, so all values in the table have to be rehashed as well.
As a result, expanding the hash table comes at a performance cost. We should therefore estimate in advance how many elements the hash table is likely to hold and pass an appropriate capacity when constructing the Hashtable, to avoid unnecessary expansions.
Dictionary<K,T>
The Hashtable class is a loosely typed data structure: developers can use any type as the Key or the Item. When .NET introduced generics, the type-safe Dictionary<K,T> class appeared. Dictionary<K,T> uses strong typing to restrict the Key and the Item, and when you create a Dictionary<K,T> instance, you must specify the types of both.

    Dictionary<keyType, valueType> variableName = new Dictionary<keyType, valueType>();

Continuing with the Social Security number and employee example described above, we can create an instance of Dictionary<K,T>:

    Dictionary<int, Employee> employeeData = new Dictionary<int, Employee>();

This allows us to add and remove employee information.

    // Add some employees
    employeeData.Add(455110189, new Employee("Scott Mitchell"));
    employeeData.Add(455110191, new Employee("Jisun Lee"));

    // See if employee with SSN 123-45-6789 works here
    if (employeeData.ContainsKey(123456789))
    {
        // ...
    }

Dictionary<K,T> differs from Hashtable in more than one way. In addition to being strongly typed, Dictionary<K,T> uses a different collision resolution strategy, called chaining.
With the probing techniques described earlier, when a collision occurs the next position in the probe sequence is tried; with rehashing, the hash is recalculated with the next hash function. Chaining instead uses an additional data structure to handle collisions. Each position (slot) in a Dictionary<K,T> maps to a linked list. When a collision occurs, the colliding element is added to the list of its bucket.
Each bucket in a Dictionary<K,T> holds a list of the elements whose keys hash to that bucket.
For example, picture a Dictionary with 8 buckets to which a number of Employee objects have been added. If a new Employee is added to the Dictionary, it goes into the bucket that corresponds to the hash of its Key. If an Employee already exists at that position, the new element is added to the front of the list.
Adding an element to a Dictionary involves a hash calculation and a list operation, but it still takes constant time: the asymptotic time is O(1).
When querying and deleting from a Dictionary, the average time depends on the number of elements in the Dictionary and the number of buckets. Specifically, the running time is O(n/m), where n is the total number of elements and m is the number of buckets. But Dictionary is almost always implemented so that n = O(m), that is, the total number of elements never exceeds the total number of buckets, so O(n/m) becomes the constant O(1).
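Chaining can be sketched with an array of linked lists. This is a simplified illustration, not the Dictionary<K,T> implementation: the names buckets, BucketOf, Add and Find are assumptions, the third key and its value are made up for the example, and duplicate keys are not checked.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical chained table: 8 buckets, each a linked list of (key, value) pairs.
var buckets = new LinkedList<KeyValuePair<int, string>>[8];

int BucketOf(int key) => (key & 0x7FFFFFFF) % buckets.Length;

void Add(int key, string value)
{
    int index = BucketOf(key);
    buckets[index] ??= new LinkedList<KeyValuePair<int, string>>();
    // New elements go to the front of the bucket's list, an O(1) operation.
    buckets[index].AddFirst(new KeyValuePair<int, string>(key, value));
}

string Find(int key)
{
    LinkedList<KeyValuePair<int, string>> chain = buckets[BucketOf(key)];
    if (chain != null)
    {
        // Only the elements in this one bucket are examined.
        foreach (KeyValuePair<int, string> pair in chain)
        {
            if (pair.Key == key) return pair.Value;
        }
    }
    return null;
}

Add(455110189, "Scott Mitchell");
Add(455110191, "Jisun Lee");
Add(455110197, "Sample Employee");  // same bucket as 455110189 (both are 5 mod 8)
Console.WriteLine(Find(455110197)); // prints Sample Employee
```

A lookup walks only one bucket's list, which is where the O(n/m) average cost comes from.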
Resources
- An Extensive Examination of Data Structures Using C# 2.0, Part 1: An Introduction to Data Structures
- An Extensive Examination of Data Structures Using C# 2.0, Part 2: The Queue, Stack, and Hashtable
- Examining Data Structures, Part 1: An Introduction to Data Structures [translation]
- Examining Data Structures, Part 2: Queues, Stacks, and Hash Tables [translation]
Time complexity of commonly used data structures