Note: This blog reflects my own understanding of data structures. Some points may be imprecise, so please read critically.

Understanding data structures from the macro level

Much of the time we work hard without ever knowing why ...

A year into my job, I thought back to the data structures course from university and found that little of it remained. When a data structure is mentioned, usually only a bare definition comes to mind. Take the skip list: I vaguely remember that it adds extra pointers to a sorted linked list to speed up searching, but almost nothing else, even though my understanding of skip lists was fairly deep when I first studied them. Why did I forget so quickly? Partly because I rarely reviewed the material, but another important reason is that I never understood data structures at the macro level: I never connected the skip list to related knowledge, so that isolated piece of knowledge was quickly forgotten.

Learning works this way: if you study something without understanding its principles or linking it to knowledge you are already familiar with, it will soon be forgotten. If you can make the connection, then using familiar knowledge in everyday life will naturally bring the new knowledge to mind, which in turn consolidates it. Knowledge accumulates continuously, and thinking is itself a kind of information storage and processing; if you do not choose a suitable way to store information, you either store less or process it very inefficiently. Inefficient information processing is affected not only by your physiology but also, to a great degree, by the way you think. By analogy with a computer's ability to process information, the influencing factors are not only the hardware's performance but also how the software is implemented. Data structures is precisely the course that teaches us how to improve a program's performance beyond what the hardware provides.

That is a lot of rambling, but it is still the plain truth. Below, drawing on my own experience, I will explain how to understand data structures at the macro level, so that we gain a big-picture view of the data structures we have learned, which in turn makes it easier to build on that knowledge.

1. Why are data structures so important to programming?

Drawing on my own experience, let me explain why data structures matter so much to our programming. When I first learned to program, I had no concept of a data structure; programming seemed fine without one, and I could still write plenty of programs. But those were only toy programs: they handled almost no data, so no matter how the data was stored, they ran quickly. When you write programs for real engineering applications, however, you must process large amounts of data, and then you cannot store it arbitrarily; you have to choose a data structure suited to the actual situation, which can greatly improve the efficiency of data processing. Take sorting, an operation we use constantly: different sorting algorithms have different efficiency. When the amount of data is small the difference is imperceptible, but when we sort a large amount of data the difference becomes obvious. Of course, the sorting strategy matters too, yet these strategies often depend on particular data structures: insertion sort, selection sort, and quicksort may depend only on a linear list, while heapsort depends on a heap. So choosing a good data structure can greatly improve a program's efficiency, and the strategy for solving a problem may itself depend on a specific data structure.

2. What is a data structure?

Now that we know why data structures matter to programming, what exactly is a data structure? Consider first why data structures came into being. The real world is full of data, and that data, however it is stored, needs a structure to represent it: a structure that captures not only the data elements themselves but also the relationships between them, and ideally one that occupies as little storage space as possible. Strictly speaking, what is described here is only the logical structure of the data, something abstract that exists in our minds; when the data is actually stored, this logical structure must be mapped onto physical storage. Since most of what a computer does is store and process data, computers and data structures are closely related: the computer can store data according to the structure we request. At this point we can give a fairly academic definition: a data structure is the logical structure used to describe data elements and the relationships between them. Of course, different books define it differently, and some also count the basic operations on the data as part of the data structure; that depends on how you look at it, since after all, data structures and algorithms are inseparable.

3. How the computer describes data

Having described what a data structure is, what basic means does the computer have to represent one? First, most data in a computer lives on disk and in memory, and before the CPU can process data it must be read from disk into memory. Because memory is a precious resource, it is worth choosing an appropriate data structure to store data in memory and save space. Speaking of storing data in memory, we programmers should know that a running program needs a certain amount of memory, which can be roughly divided into a code area and a data area. The code area holds our program's code, and we cannot manage that part of the space. The data area is where the data we need to process is stored, and that is the place where we should adopt a sensible storage strategy.

We all know how the computer manages memory: it assigns an address to each byte, so we can access data in memory through its address. To store a datum, we put it into the space at a given address; to fetch it, we find the corresponding data through the address. This method is called direct addressing. There is also indirect addressing, in which what we find at an address is not the data itself but the location where the data is stored, and through that location we reach the real data. Indirection can, of course, be applied repeatedly, which is the origin of multi-level pointers. Why spend so long on direct and indirect addressing when the topic is data structures? Because these are the two most basic ways a computer organizes data: through them, our data is stored in memory, whether in a contiguous address space or a discrete one. And because of this, computers have several different forms for describing data. Common forms include: formula-based descriptions, linked-list descriptions, indirect descriptions, and simulated pointers.

A formula-based description uses a formula to compute the position of an element, thereby giving direct access to it. This description requires the space used to be contiguous, because only in a contiguous address space can an element's address be found at a fixed offset. Arrays in the various programming languages work this way: each array occupies a contiguous block of space, and the array name marks the first address of that block, so to access an element you simply add an offset to the first address. An array is thus a formula-based data structure with the mapping f(i) = i - 1, where i is the position of the element counting from 1 and f(i) is its offset from the first address. A multidimensional array also occupies a contiguous block of memory, and the compiler again describes it with a formula; in C++, for example, row-major mapping is used, and the formula for a two-dimensional array is f(i, j) = i*n + j, where i is the row number, j the column number, and n the number of columns. Many other data structures, such as hash tables and complete binary trees, also use formula-based descriptions; the advantage is that in many situations they save space and speed up access. But this description has drawbacks: it is often limited, since many relationships cannot be captured by a formula, and requiring contiguous space is sometimes inflexible. For example, inserting or deleting data requires moving elements.
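The row-major mapping above can be sketched in a few lines. This is an illustrative example in Python (the article itself is language-agnostic): a 3x4 matrix is flattened into one contiguous list, and the formula f(i, j) = i*n + j recovers any element directly from the flat storage.

```python
def row_major_index(i, j, n):
    """Map a 2-D coordinate (row i, column j) to a 1-D offset,
    for a matrix with n columns stored row by row."""
    return i * n + j

rows, cols = 3, 4
matrix = [[r * cols + c for c in range(cols)] for r in range(rows)]
flat = [x for row in matrix for x in row]  # contiguous storage

# Direct access: one multiplication and one addition, no searching.
assert flat[row_major_index(2, 1, cols)] == matrix[2][1]
```

This is exactly why array indexing is O(1): the position is computed, not searched for.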

A linked-list description stores the data in discrete spaces; since the space is discrete, an element cannot be reached via a fixed offset. Instead, the address of each element is saved in the previous element, forming a linked list. Because of its discrete storage, the linked list is more flexible for some data operations. But this also brings shortcomings: a node cannot be accessed randomly, and extra space is consumed by the pointers.
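A minimal sketch of this description in Python (references play the role of addresses): insertion at the front is O(1) because nothing moves, while any access must follow the pointers one by one.

```python
class Node:
    """One element of a singly linked list: a value plus a reference
    to the next element, which may live anywhere in memory."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def prepend(head, value):
    """Insert at the front in O(1): no elements have to move."""
    return Node(value, head)

head = None
for v in (3, 2, 1):
    head = prepend(head, v)

# No fixed-offset formula exists, so traversal is O(n).
values = []
node = head
while node is not None:
    values.append(node.value)
    node = node.next
assert values == [1, 2, 3]
```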

An indirect description saves the addresses of the data in a table, while the actual data is stored elsewhere in memory; to access a datum, we first look up its address in the table and then access the actual data. This description is often a combination of the formula-based and linked-list methods. When the actual data elements are large, it is an appropriate choice.
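A small illustration of why indirection pays off for large elements (an assumed scenario, not from the original text): sorting rearranges only the cheap handles in the table, never the heavy payloads themselves.

```python
# Large payloads stay put; only small integer handles move.
payloads = ["x" * 1000 + str(k) for k in (3, 1, 2)]  # big records
index = list(range(len(payloads)))                   # table of handles

# Sort the table by each record's last character; the 1000-char
# payloads are never copied or shifted.
index.sort(key=lambda h: payloads[h][-1])

assert [payloads[h][-1] for h in index] == ["1", "2", "3"]
```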

Simulated pointers use integers to imitate pointers for accessing data that is stored discretely, but the discreteness is confined to a certain range: we first request one contiguous block of space to simulate the heap, number the cells of this block, and use an integer index to represent each cell's address. Two linked lists are then maintained within the block: a free list and a list of cells holding data. This is essentially the same work an operating system does when it allocates memory to programs.
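The free-list idea can be sketched as follows, with a Python list standing in for the contiguous block and -1 standing in for the null pointer (an illustrative sketch, not a full allocator):

```python
class Pool:
    """Simulated pointers: a contiguous array plays the role of memory,
    and integer indices play the role of addresses."""
    NULL = -1

    def __init__(self, capacity):
        self.value = [None] * capacity
        # Initially every cell is chained into the free list.
        self.next = [i + 1 for i in range(capacity)]
        self.next[capacity - 1] = self.NULL
        self.free = 0  # head of the free list

    def allocate(self, value):
        if self.free == self.NULL:
            raise MemoryError("pool exhausted")
        cell = self.free
        self.free = self.next[cell]      # unlink from the free list
        self.value[cell] = value
        self.next[cell] = self.NULL
        return cell                      # the "pointer" is an integer

    def release(self, cell):
        self.value[cell] = None
        self.next[cell] = self.free      # push back onto the free list
        self.free = cell

pool = Pool(4)
a = pool.allocate("a")
b = pool.allocate("b")
pool.release(a)
c = pool.allocate("c")  # the freed cell is reused
assert c == a
```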

4. Macro-understanding of various data structures

To make it easier to connect the various data structures, I divide the common ones into three broad categories: linear lists, trees, and graphs. Tracing everything back to its source, the other data structures are all extensions of these three, grown according to actual needs. And if three categories still feels like too many, you can push further to the graph alone: any data structure can be regarded as a graph, though each has its own characteristics. Below I analyze the common data structures within these three categories. Since this article looks at data structures only from the macro level, I will not explain the implementation details of each one; for those, consult a book on data structures.

4.1 Linear lists

There are many linear-list data structures: arrays, matrices, linked lists, stacks, queues, skip lists, hash tables, and so on. A one-dimensional array is the typical linear list, and a multidimensional array can be viewed as a combination of several linear lists; arrays are generally given a formula-based description. A matrix can be regarded as a two-dimensional array, but since there are many kinds of matrices, such as triangular matrices and sparse matrices, a sensible description should be chosen to save space; for example, a linked list can be used so that only the non-zero elements are saved in nodes. Stacks and queues are linear lists with an added restriction: the stack is LIFO and the queue is FIFO. In fact both are special priority queues, just with different priority rules. They can be described with formulas or with linked lists, but the efficiency differs. For a stack, a formula-based description is preferable: both push and pop are O(1), and a linked-list description wastes some space, though for multiple stacks a linked description is better. The queue is better suited to a linked description, since adding at one end and removing at the other are both O(1); with a formula-based description over a plain array, each deletion from the front requires moving elements, which adds needless overhead.
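The stack/queue contrast above can be demonstrated with Python's built-ins (illustrative only; `list` is array-backed, `collections.deque` is a linked block structure):

```python
from collections import deque

# Stack (LIFO): array-backed description; push and pop at the
# tail are both O(1).
stack = []
for x in (1, 2, 3):
    stack.append(x)
assert stack.pop() == 3  # last in, first out

# Queue (FIFO): deque gives O(1) at both ends, whereas popping
# from the front of a plain list shifts every remaining element.
queue = deque()
for x in (1, 2, 3):
    queue.append(x)
assert queue.popleft() == 1  # first in, first out
```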

Skip lists and hash tables are two data structures often used to describe dictionaries. The common dictionary operations are find, insert, delete, output in order, and so on. A dictionary can be implemented with an ordinary array or linked list, but that is inefficient. The skip list is an improvement on the linked list. The linked list itself inserts and deletes more efficiently than an array but searches slowly, so search efficiency can be improved by adding extra pointers. The skip list is based on the idea of binary search: we know binary search on a sorted array takes O(log n), so we try to achieve a similar search by adding extra pointers over a sorted linked list. On closer analysis, though, achieving a true binary search this way is not easy, because the elements of a skip list are not immutable: which elements should carry extra pointers, and how many elements those pointers should skip, cannot be predicted. This complicates the implementation, so in practice a randomized method is used to decide each element's level; for the implementation details, see a data structures book.
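A minimal skip-list sketch, assuming the usual randomized-level scheme with p = 1/2 (constants like MAX_LEVEL are arbitrary choices for illustration): search starts at the top level and drops down a level each time it would overshoot.

```python
import random

class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # one pointer per level

class SkipList:
    MAX_LEVEL = 16

    def __init__(self):
        self.head = SkipNode(None, self.MAX_LEVEL)  # sentinel
        self.level = 1

    def _random_level(self):
        # Each extra level is granted with probability 1/2.
        lvl = 1
        while random.random() < 0.5 and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node  # last node before `key` at level i
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = SkipNode(key, lvl)
        for i in range(lvl):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def contains(self, key):
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

sl = SkipList()
for k in (5, 1, 9, 3):
    sl.insert(k)
assert sl.contains(3) and not sl.contains(4)
```

With randomized levels the expected search cost is O(log n), without having to predict in advance which elements deserve extra pointers.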

A hash table determines an element's position from its key using a hash function, and is a formula-based description. In the ideal case, search, insert, and delete in a hash table all take O(1). In reality, however, the range of keys is too large: an ideal hash table would require an enormous amount of space, most of it wasted, so hash functions inevitably map different keys to the same position. The question then becomes: since different keys map to the same position, how do we handle the conflict? Two common approaches are linear open addressing and chaining. With linear open addressing, an element is placed at the position the function maps it to if possible, and if that position is already occupied, the search proceeds to the nearest empty bucket; with chaining, the conflicting elements are placed on a linked list at that position. Each approach has its own merits and drawbacks.
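The chaining approach can be sketched as follows (an illustration; the deliberately tiny table forces collisions, and note that Python's own `dict` uses open addressing instead):

```python
class ChainedHashTable:
    """Hashing with chaining: keys that map to the same bucket
    share that bucket's list."""
    def __init__(self, buckets=4):
        self.buckets = [[] for _ in range(buckets)]

    def _bucket(self, key):
        # The formula part: hash(key) mod table size.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite existing key
                return
        bucket.append((key, value))       # chain the new entry

    def get(self, key):
        for k, v in self._bucket(key):    # walk the chain
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
for i in range(10):          # 10 keys, 4 buckets: collisions guaranteed
    t.put(i, i * i)
assert t.get(7) == 49
```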

Comparing the performance of these two dictionary structures: the skip list's find, insert, and delete take O(k + log n) in the best case, where k is the number of pointer levels, and O(k + n) in the worst case; for a hash table in which several keys may map to the same position, find, insert, and delete are O(1) in the best case but reach O(n) in the worst. This does not mean the hash table is always better than the skip list; it depends on the actual problem. For in-order output, for example, the skip list is clearly better than the hash table.

Among the linear-list data structures, all are built from the ordinary linear list: some add rules, some add extra auxiliary information, and some combine several linear lists. But however they vary, each is still a linear list. In actual development, then, we can choose among them according to the characteristics of the different structures.

4.2 Trees

Trees are good for describing things with a hierarchy, and the structure of the tree is remarkable: adding different restrictions to the tree yields different data structures. A tree whose nodes have only left and right children is called a binary tree, and from the binary tree, by adding various restrictions, many data structures arise: the complete binary tree, the heap, the leftist tree, the AVL tree, the red-black tree, the binary search tree. The following describes these tree-based data structures in more detail.

First, consider a question: why does the computer use binary trees to store data in memory, rather than multiway trees? To improve processing speed, of course: when searching a binary tree for a node, the fewer comparisons the better. Ask yourself whether binary search on a sorted array is more efficient than ternary search, four-way search, or an even higher split; the answer to our question then becomes clear.

A complete binary tree restricts the shape of the binary tree considerably, and one benefit is that this data structure is very convenient to describe with formulas, saving a great deal of space.
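Concretely, a complete binary tree stored level by level in an array needs no child pointers at all; simple index formulas replace them (shown here 0-indexed, an illustrative sketch):

```python
def parent(i):
    return (i - 1) // 2

def left_child(i):
    return 2 * i + 1

def right_child(i):
    return 2 * i + 2

# Level-order storage of a complete binary tree with 5 nodes:
#        a
#      /   \
#     b     c
#    / \
#   d   e
tree = ["a", "b", "c", "d", "e"]
assert tree[left_child(0)] == "b"
assert tree[right_child(1)] == "e"
assert parent(4) == 1
```

This is the formula-based description at work: the tree's shape guarantees no gaps in the array, so no pointer space is wasted.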

The heap is what you get by further restricting the complete binary tree to be a max (or min) tree, and it describes a priority queue very efficiently: using a heap, insertion and deletion in a priority queue are O(log n), and the formula-based description is very space-efficient. A priority queue can also be described with a plain linear list, but that is inefficient. If, however, you want to merge two priority queues, the heap description is very inefficient, and a different data structure is needed. The leftist tree is a binary tree with a priority restriction between the left and right subtrees; the factor used to evaluate priority can be the height or the number of nodes, giving the height-biased leftist tree and the weight-biased leftist tree. The point of restricting the left and right subtrees this way is that it makes it easy to merge two leftist trees into one. Adding the max-tree restriction to a leftist tree forms the max leftist tree; a max (or min) leftist tree can also describe a priority queue and is well suited to merging two trees, but it is less space-efficient than the heap.
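Python's standard library ships exactly this structure: `heapq` maintains a binary min-heap inside a plain list. A short illustration of the priority queue it gives us:

```python
import heapq

# A min-heap in a plain list: O(log n) push and pop, and because
# the heap is a complete binary tree, no pointer space is needed.
pq = []
for priority, task in [(3, "low"), (1, "urgent"), (2, "normal")]:
    heapq.heappush(pq, (priority, task))

# Elements come out in priority order, regardless of insert order.
order = [heapq.heappop(pq)[1] for _ in range(len(pq))]
assert order == ["urgent", "normal", "low"]
```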

Next, the search trees, another family of efficient data structures that can describe a dictionary. Start with the binary search tree, which restricts the values of the binary tree's nodes: each node's value must be larger than every value in its left subtree and smaller than every value in its right subtree. With this restriction, searching for an element is quite efficient: in the best case, find, insert, and delete are O(log n). In the worst case, however, they reach O(n), because the binary tree can become extremely unbalanced. To solve this, balanced trees were introduced: doesn't a balanced binary search tree solve the problem nicely? The AVL tree, however, may need several rotations after each insertion or deletion to maintain balance, which reduces efficiency. The red-black tree solves this well: although it is not perfectly balanced, it is basically balanced, and the cost of restoring the red-black properties after an insertion or deletion is low. In real applications, many dictionaries are described with red-black trees. Beyond binary search trees, multiway search trees also have many applications. For example, when reading data from disk, a B-tree can be used as an index: since each disk read is expensive, the fewer reads the better, so in theory the tree should be as shallow as possible. To improve indexing speed, many databases index with B+ trees. Furthermore, because English words share common letter prefixes, storing English words in a multiway tree can greatly improve word-lookup efficiency; this is the famous trie.
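A minimal binary search tree sketch (unbalanced, for illustration): the ordering restriction lets each comparison discard an entire subtree.

```python
class BSTNode:
    """Every key in the left subtree is smaller than this node's key;
    every key in the right subtree is larger."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def contains(root, key):
    # Each comparison halves the search space when the tree is balanced.
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False

root = None
for k in (5, 2, 8, 1):
    root = insert(root, k)
assert contains(root, 8) and not contains(root, 7)
```

Note that inserting already-sorted keys (1, 2, 5, 8) would degenerate this tree into a linked list, which is exactly the O(n) worst case that AVL and red-black trees exist to prevent.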

Looking at the tree-derived data structures above, adding various restrictions to the tree to maintain a particular structure is a very efficient way to solve certain specific problems. So in real applications we should choose the appropriate data structure according to actual needs, or transform an existing one into the data structure that truly suits us.

4.3 Graphs

Graphs are the most convenient way to describe the myriad connections among things, but the graph produced depends on the specific problem: some things are better described with a directed graph, some with an undirected graph, some with a complete graph, and some with a bipartite graph. Whatever form the actual problem calls for, solving it usually also requires combining the graph with the necessary algorithms.

Graphs are usually described with adjacency matrices or adjacency lists.
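The two descriptions side by side for the same tiny undirected graph (an illustrative example with edges 0-1 and 0-2): the matrix is a formula-based description with O(1) edge lookup but O(V^2) space, while the list stores only the edges that exist.

```python
# Adjacency matrix: matrix[u][v] == 1 means an edge u-v exists.
matrix = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]

# Adjacency list: each vertex keeps only its actual neighbors,
# which saves space on sparse graphs.
adj = {0: [1, 2], 1: [0], 2: [0]}

# Both answer the same question; they differ only in cost profile.
assert (matrix[0][2] == 1) == (2 in adj[0])
assert (matrix[1][2] == 1) == (2 in adj[1])
```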
