Understanding data structures at a macro level
1. Why are data structures so important to programming?
Let me explain, based on my own experience, why data structures matter so much to programming. When you first start learning to program there is no concept of a data structure: you can write plenty of programs without ever using one. But those are toy programs. They process almost no data, so no matter how you store it, they run quickly. Once you write a program for a real engineering application, it has to process a large amount of data, and you can no longer store that data arbitrarily; you must choose a data structure appropriate to the situation, which can greatly improve the efficiency of processing. Sorting is a familiar example of data processing: different sorting algorithms have different efficiency. When the data volume is small you cannot feel the difference, but with a large volume of data the difference in efficiency becomes obvious. The strategy used while sorting certainly matters, but those strategies sometimes depend on particular data structures: insertion sort, selection sort, and quicksort need only a linear list, whereas heap sort relies on a heap. Choosing a good data structure can therefore greatly improve a program's efficiency, and some problem-solving strategies also depend on a specific data structure.
2. What is a data structure?
We now know how important data structures are to programming, so what exactly is a data structure? Consider first why data structures exist at all. The real world contains a huge amount of data, and however it is stored, we need a structure that can represent not only the data elements themselves but also the relationships between them, preferably using as little storage space as possible. Strictly speaking, what we have described so far is only the logical structure of the data: it is an abstraction in our minds, and the physical storage still has to express this logical structure in order to represent the real thing. Since most of what a computer does is store and process data, the computer as a carrier of data is closely tied to these structures, and it can store information according to the data structure we require. We can now give a more academic definition: a data structure is a logical structure that describes a collection of data elements and the relationships between those elements. Of course, different books on data structures define the term differently; some also count the basic operations on the data as part of the data structure. It depends on how you look at it; after all, data structures and algorithms cannot really be separated.
3. How the computer describes data
The previous section described what a data structure is; what basic means does a computer have for describing data? First, we know that most data lives on disk and in memory, and the CPU must read data from disk into memory before it can process it. Because memory is a precious resource, it is worth choosing data structures that save memory space. When it comes to storing data in memory, every programmer knows that a program needs a certain amount of memory to run, and that memory can roughly be divided into a code area and a data area. The code area holds the program's instructions, and we cannot manage that part of the space. The data area holds the data the program needs to process, and our job is simply to place data there in a reasonable way.
We all know that the computer manages memory by giving every byte an address, so that we can access data in memory through its address. With direct addressing, we store data at a specified address and later retrieve it by going straight to that address. With indirect addressing, the value found at an address is not the data itself but the location where the data is stored; we follow it to reach the real data. Indirection can of course be applied several times, which is the origin of multi-level pointers. What do direct and indirect addressing have to do with data structures? They are the two basic ways a computer organizes data: through them, our data ends up in memory, either in a contiguous address space or in a scattered one. Because of this, different ways of describing data appear. The common descriptive forms are: formulaic description, linked-list description, indirect description, and simulated pointers.
A formulaic description uses a formula to calculate the position of an element, so the element can be accessed directly. This description requires that the space used be contiguous, because only within a contiguous address space can a fixed offset be computed to find the address of the data. Arrays in programming languages work this way: each array occupies a contiguous block of space, and the array name marks the first address of that block, so to access an element you simply add an offset to the first address. The array is therefore a formulaically described data structure, with the formula f(i) = i - 1 giving the offset of the i-th element (counting from 1). A multidimensional array also occupies one contiguous block of memory, and the compiler again describes it with a formula; C++, for example, uses row-major mapping, so a two-dimensional array uses f(i, j) = i*n + j, where i is the row number, j the column number, and n the number of columns. Other data structures also use a formulaic description, such as hash tables and complete binary trees. The advantage of this description is that it often saves space and speeds up access to the data. Its disadvantage is that it is limited: many problems simply cannot be described by a formula, and the requirement of contiguous space can be inflexible, for example inserting or deleting data requires moving other data.
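As a minimal sketch (the helper names offset1D, offset2D, and at are my own, not from any library), the formulas above translate directly into index arithmetic over a flat buffer:

```cpp
#include <cstddef>

// Formulaic description: positions are computed, not stored.
// 1-D array: the i-th element (counting from 1) lives at offset f(i) = i - 1.
// Row-major 2-D array with n columns: element (i, j) lives at f(i, j) = i*n + j.
std::size_t offset1D(std::size_t i) { return i - 1; }
std::size_t offset2D(std::size_t i, std::size_t j, std::size_t n) { return i * n + j; }

// Example: treating a flat buffer as an i-by-n matrix.
int at(const int* buf, std::size_t i, std::size_t j, std::size_t n) {
    return buf[offset2D(i, j, n)];   // one multiplication and one addition: O(1)
}
```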
The linked-list description stores the data in discrete (scattered) space. Since the space is not contiguous, an element cannot be reached through a fixed offset; instead, the address of each element is stored in the element before it, forming a linked list. Because storage is discrete, a linked list is more flexible for some operations on the data, but it also has drawbacks, such as the inability to access a node at random and the extra space taken by the pointers.
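A minimal sketch of this idea (ListNode and pushFront are illustrative names, not part of any standard API):

```cpp
// Linked-list description: each node stores the data and the address of the
// next node, so the elements may live anywhere in memory.
struct ListNode {
    int value;
    ListNode* next;
};

// Insert at the head: O(1) and no elements have to be moved,
// but reaching the k-th element requires walking k links.
ListNode* pushFront(ListNode* head, int value) {
    return new ListNode{value, head};
}
```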
The indirect description stores the addresses of the data in a table, while the actual data lives elsewhere in memory; to access an element you first look up its address in the table and then follow it to the real data. This approach is usually a combination of the formulaic and linked-list descriptions, and it works well when the actual data elements are relatively large.
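A small sketch of why this helps when elements are large (Record is a made-up example type, chosen only to illustrate the point):

```cpp
#include <string>
#include <utility>
#include <vector>

// Indirect description: the table holds only pointers; the large records
// themselves live elsewhere. Reordering the table moves pointers, never data.
struct Record {
    std::string name;
    char payload[4096];   // pretend this is expensive to copy
};

int main() {
    std::vector<Record*> table;           // formulaic table of addresses
    table.push_back(new Record{"a", {}});
    table.push_back(new Record{"b", {}});
    std::swap(table[0], table[1]);        // swaps two pointers, not 8 KB of data
    std::string first = table[0]->name;   // access: follow the pointer to the real record
    for (Record* r : table) delete r;
}
```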
Simulated pointers use an integer to imitate a pointer's access to data. The storage can also be regarded as discrete, but the discreteness is limited: we take one contiguous block of space to simulate a heap area, number its cells, and use an integer to represent an address. Two linked lists are also maintained, a free list and a data list. This is essentially the same work an operating system does when it allocates memory to a program.
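A sketch of the simulated-pointer idea, assuming a small fixed-capacity pool (Pool, Cell, allocate, and release are illustrative names; -1 plays the role of a null pointer):

```cpp
#include <vector>

// Simulated pointers: a contiguous array plays the role of the heap and an
// int index plays the role of a pointer. Free cells are chained on a free
// list, much like a real allocator. Capacity must be greater than zero.
struct Pool {
    struct Cell { int value; int next; };   // next is an index, not a pointer
    std::vector<Cell> cells;
    int freeHead = -1;

    explicit Pool(int capacity) : cells(capacity) {
        for (int i = 0; i < capacity; ++i) cells[i].next = i + 1;
        cells[capacity - 1].next = -1;
        freeHead = 0;                        // all cells start on the free list
    }
    int allocate() {                         // take a cell off the free list
        int idx = freeHead;
        if (idx != -1) freeHead = cells[idx].next;
        return idx;                          // -1 means the pool is exhausted
    }
    void release(int idx) {                  // put a cell back on the free list
        cells[idx].next = freeHead;
        freeHead = idx;
    }
};
```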
4. A macro-level understanding of common data structures
To make it easier to relate the various data structures to one another, I group the common structures into three categories: linear lists, trees, and graphs. The remaining data structures are extensions of these three, built up according to actual needs. Indeed, if three categories still feels like too many, everything can be reduced further to the graph: any data structure can be viewed as a graph, though each has its own characteristics. Below, the common data structures are analyzed within these three categories. Since this article only looks at data structures at a macro level, the implementation details of each structure are not described in depth; for those, please consult a book on data structures.
4.1 Linear lists
Many data structures belong to the linear-list category: arrays, matrices, linked lists, stacks, queues, skip lists, hash tables, and so on. A one-dimensional array is the typical linear list, a multidimensional array can be viewed as a combination of several linear lists, and arrays are generally described formulaically. A matrix can be regarded as a two-dimensional array, but because there are many special kinds of matrix, such as triangular and sparse matrices, a more suitable description can save space, for example using linked lists in which only the non-zero elements are stored as nodes. Stacks and queues add restrictions to the linear list: a stack is LIFO and a queue is FIFO, and in fact both can be seen as special priority queues, just with different priority rules (a stack gives priority to the most recently inserted element, a queue to the earliest). They can be described formulaically or with linked lists, but the efficiency differs. For a stack the formulaic description works well, with O(1) access, while a linked-list description wastes some space; if several stacks must share space, however, the linked description is better. For a queue the linked-list description is appropriate, because adding an element at the tail and deleting one at the head are both O(1), whereas with a formulaic description every deletion requires moving elements, adding overhead.
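A brief sketch of the formulaically described stack (ArrayStack is an illustrative name; std::vector stands in for the contiguous block):

```cpp
#include <vector>

// Formulaically described stack: the top is simply the last occupied slot of
// a contiguous array, so push and pop are O(1).
class ArrayStack {
    std::vector<int> data;
public:
    void push(int x)   { data.push_back(x); }
    void pop()         { data.pop_back(); }
    int  top() const   { return data.back(); }
    bool empty() const { return data.empty(); }
};
// A queue, by contrast, is more naturally a linked list with head and tail
// pointers: enqueue at the tail and dequeue at the head are both O(1),
// whereas deleting from the front of an array would shift every element.
```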
Skip lists and hash tables are two data structures often used to describe a dictionary. The common dictionary operations are find, insert, delete, ordered output, and so on. A dictionary could also be implemented with an ordinary array or linked list, but the efficiency would be low. A skip list is an improvement on the linked list: the linked list's strength is that insertion and deletion are more efficient than in an array, but lookup is slow, so extra pointers are added to speed up searching. The idea behind the skip list comes from binary search: we know that binary search in a sorted array takes O(log n), so extra pointers are added to a sorted linked list to imitate that search pattern. On closer analysis, however, implementing a true binary search this way is not easy, because the elements of a skip list are not fixed; we cannot predict which elements should carry the extra pointers, that is, which elements should appear on several levels of the list, and this complicates the implementation. In practice a random method is used to decide how many levels an element appears on; for the implementation details, see a book on data structures.
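A very small sketch of the node layout and the randomized level choice this paragraph alludes to (SkipNode and randomLevel are illustrative names; the search and insert routines are omitted):

```cpp
#include <cstdlib>
#include <vector>

// Skip-list idea: each node carries several forward pointers, one per level,
// and the number of levels is chosen at random so that the higher levels act
// like the "midpoints" of a binary search.
struct SkipNode {
    int key;
    std::vector<SkipNode*> forward;   // forward[k] skips ahead on level k
    SkipNode(int k, int levels) : key(k), forward(levels, nullptr) {}
};

// Coin-flipping level choice: level i is used with probability about 1/2^i.
int randomLevel(int maxLevel) {
    int lvl = 1;
    while (lvl < maxLevel && (std::rand() & 1)) ++lvl;
    return lvl;
}
```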
A hash table determines an element's position from its key via a hash function, and is therefore also a formulaic description. In the ideal case, find, insert, and delete in a hash table all take O(1); in practice, however, the range of keys is far too large, and the ideal hash table would require an enormous, mostly wasted amount of space. So hash functions are used that may map different keys to the same location, and the question becomes: how do we handle a collision when different keys map to the same place? The two common approaches are linear open addressing and chaining. With linear open addressing, an element is placed at the location the function maps it to if possible, and otherwise the nearest empty bucket after it is used; with chaining, all colliding elements are placed on a linked list. Each method has its own advantages and disadvantages.
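A minimal sketch of the chaining approach (ChainedHashSet is an illustrative name, and key % bucket_count stands in for a real hash function):

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Chained hash table: the hash function maps a key to a bucket, and keys
// that collide are kept on that bucket's linked list.
class ChainedHashSet {
    std::vector<std::list<int>> buckets;
    std::size_t bucket(int key) const {
        return static_cast<std::size_t>(key) % buckets.size();
    }
public:
    explicit ChainedHashSet(std::size_t nBuckets) : buckets(nBuckets) {}
    bool contains(int key) const {
        for (int k : buckets[bucket(key)]) if (k == key) return true;
        return false;                       // scan only one bucket's chain
    }
    void insert(int key) {
        if (!contains(key)) buckets[bucket(key)].push_back(key);
    }
    void erase(int key) { buckets[bucket(key)].remove(key); }
};
```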
Comparing the performance of these two dictionary structures: a skip list's find, insert, and delete take O(k + log n) in the best case, where k is the number of levels, and O(k + n) in the worst case; a hash table that maps multiple keys to the same location achieves O(1) find, insert, and delete in the best case but O(n) in the worst. It might therefore seem that the hash table is always better than the skip list, but it really depends on the problem; for ordered output, for example, the skip list is clearly better than the hash table.
Looking at these linear-list data structures, we find that each takes the plain linear list and adds something: some add restrictive rules, some add auxiliary information, and some combine several linear lists. But however they change, they remain linear lists. In actual development we can therefore choose among them according to the characteristics of each.
4.2 Trees
A tree can describe anything with a hierarchy, and the tree structure is remarkably versatile: adding different restrictions to a tree yields different data structures. If every node of the tree has at most two children we call it a binary tree, and adding further restrictions to the binary tree produces many data structures, such as the complete binary tree, the heap, the leftist tree, the AVL tree, the red-black tree, the binary search tree, and so on. Let us look at these tree-shaped data structures in more detail.
First, consider a question: why do we tend to use binary trees to store data in memory rather than multi-way trees? The goal, of course, is processing speed: when searching for a node, the fewer comparisons the better. Consider that binary search in a sorted array is more efficient than ternary search, quaternary search, or search with even more branches, and the question answers itself.
A complete binary tree is a binary tree with an additional restriction on its shape. What is the benefit of such a structure? One advantage is that it is very convenient to describe formulaically, which greatly saves space.
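A minimal sketch of that formulaic description (the helper names parent, leftChild, and rightChild are my own): number the nodes 1..n level by level and store them in an array, and the tree structure is implied by arithmetic alone, with no pointers stored at all.

```cpp
// Complete binary tree stored in an array, root at index 1.
int parent(int i)     { return i / 2; }
int leftChild(int i)  { return 2 * i; }
int rightChild(int i) { return 2 * i + 1; }   // valid only while the index <= n
```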
Restricting a complete binary tree further to be a max (or min) tree produces a heap, which is very efficient for describing a priority queue: with a heap, insertion and deletion in a priority queue are both O(log n), and the formulaic description saves a great deal of space. A priority queue could also be described with a plain linear list, but that is inefficient. However, if two priority queues need to be merged, a heap describes them poorly, and another data structure must be chosen. The leftist tree is a binary tree whose left and right subtrees are restricted by a priority; as for what to use as the measure of that priority, the height can serve as the measure, or the number of nodes can, producing the height-biased leftist tree and the weight-biased leftist tree respectively. The reason for putting the priority restriction on the left and right subtrees is that, with this restriction, two leftist trees are easy to merge into one. Adding the max-tree restriction to a leftist tree forms the max leftist tree; the max (or min) leftist tree can also describe a priority queue and is well suited to merging two queues, but it is less space-efficient than the heap.
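A sketch of a max-heap insertion using the index formulas above (heapPush is an illustrative helper, with the root at index 1 and heap[0] kept as an unused placeholder):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Insert into an array-stored max-heap: append at the end, then bubble the
// new element up until its parent is no smaller. At most the height of the
// tree, i.e. O(log n), swaps are needed.
void heapPush(std::vector<int>& heap, int value) {
    if (heap.empty()) heap.push_back(0);   // reserve the unused slot 0
    heap.push_back(value);
    std::size_t i = heap.size() - 1;
    while (i > 1 && heap[i / 2] < heap[i]) {
        std::swap(heap[i / 2], heap[i]);   // parent of node i is node i / 2
        i /= 2;
    }
}
```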
Next, search trees are another family of efficient data structures for describing a dictionary. Start with the binary search tree: it restricts the values of the nodes of a binary tree, requiring each node's value to be larger than everything in its left subtree and smaller than everything in its right subtree. With this restriction, searching for an element is relatively efficient; in the best case find, insert, and delete all take O(log n), but the worst case degrades to O(n), caused by an extremely unbalanced tree. To solve this, balanced trees were introduced. Doesn't the balanced binary search tree, the AVL tree, solve the problem nicely? It does, but after every insertion or deletion an AVL tree may need several rotations to stay balanced, which reduces efficiency. The red-black tree addresses this well: although it is not a fully balanced binary tree, it is basically balanced, and the cost of restoring its properties after an insertion or deletion is not high. In practice, many dictionaries are described with red-black trees. Besides binary search trees, multi-way search trees are used in many places. When reading data from disk, for example, a B-tree can be used to build the index: because the cost of a disk read is relatively high, the fewer reads the better, so in theory the lower the tree, the better. To improve indexing speed, many databases build their indexes with B+ trees. In addition, because English words are spelled from letters, storing them in a multi-way tree keyed by letters can greatly speed up word lookup; this is the famous trie.
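A minimal sketch of why the binary-search-tree restriction makes lookup efficient (TreeNode and contains are illustrative names):

```cpp
// Lookup in a binary search tree: every node's key is larger than everything
// in its left subtree and smaller than everything in its right subtree, so
// each comparison discards a whole subtree. This is O(log n) on a balanced
// tree but O(n) on a degenerate one.
struct TreeNode {
    int key;
    TreeNode* left;
    TreeNode* right;
};

bool contains(const TreeNode* root, int key) {
    while (root != nullptr) {
        if      (key < root->key) root = root->left;
        else if (key > root->key) root = root->right;
        else return true;
    }
    return false;
}
```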
From the tree-based data structures above, we can see that adding various restrictions to a tree so that it maintains a particular shape is very effective. In real applications we should choose the appropriate data structure according to the actual need, or transform an existing structure into one that really suits the problem.
4.3 Graphs
Graphs are the most convenient way to describe the countless things that are connected to one another. Depending on the specific problem, however, the graph that arises differs: some things are best described with a directed graph, some with an undirected graph, some with a complete graph, some with a connected graph, and some with a bipartite graph. Whatever form the real problem takes, solving it usually also requires combining the graph with the appropriate algorithm.
Graphs are usually described with adjacency matrices or adjacency lists.
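A small sketch of both descriptions side by side (Graph and addEdge are illustrative names; a real program would keep only one of the two representations): an adjacency matrix answers "is there an edge (u, v)?" in O(1) but uses O(n^2) space, while adjacency lists use space proportional to the number of edges and suit sparse graphs better.

```cpp
#include <vector>

// Two common descriptions of a directed graph with n vertices.
struct Graph {
    std::vector<std::vector<bool>> matrix;   // matrix[u][v] == true if edge u -> v
    std::vector<std::vector<int>>  adj;      // adj[u] lists the neighbours of u

    explicit Graph(int n) : matrix(n, std::vector<bool>(n, false)), adj(n) {}
    void addEdge(int u, int v) {             // add directed edge u -> v
        matrix[u][v] = true;
        adj[u].push_back(v);
    }
};
```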