This is a reprint: the original article is very well written and I learned a lot from it.
On the .NET platform there are many data storage and retrieval solutions: ADO.NET, Entity Framework, ASP.NET Dynamic Data, XML, NHibernate, LINQ to SQL, and so on. For various reasons, however, you may still need to use a DataTable as the data storage structure: platform limitations (for example, code that must run on .NET Framework 2.0 or earlier), or legacy and third-party data interfaces that expose a DataTable. The DataTable is also relatively easy to use, so some data access interfaces adopt it directly. There are some things to be aware of when retrieving data from a DataTable, because they can seriously affect retrieval efficiency.
I recently worked on a task involving a large amount of DataTable splicing. The interface receives its data as DataSets containing DataTables: for each request the server returns a DataSet whose DataTables all contain a common "keyword" column. The task is to merge the columns from these DataTables into a single DataTable, keyed on that column, and display it in the UI.
At first I used the DataTable.Select method to find matching rows while splicing, and found it very slow. This article summarizes the experience I gained about querying DataTables and related operations.
1. Scenario
To simplify the problem, suppose there are two DataTables, table A and table B. Table A stores each stock's highest price, and table B stores each stock's lowest price:
Table A:

| SecurityCode | High |
|---|---|
| 000001.SZ | |
| 000002.SZ | |

Table B:

| SecurityCode | Low |
|---|---|
| 000001.SZ | 18.5 |
| 000002.SZ | 56 |
Now the two tables need to be merged into one table with three columns: SecurityCode, High, and Low. The original approach was to create a DataTable C containing the three fields, copy the SecurityCode column into it, and then traverse the other two tables, using the Select method to find the row matching each SecurityCode and copying its value into the corresponding field of C. This turned out to be slow; the problem lies in the Select method, and it needs to be optimized.
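A minimal sketch of the original, slow approach described above (the method name and column types are illustrative, not from the source):

```csharp
using System;
using System.Data;

static class NaiveMerge
{
    // Sketch of the original approach: for every row of the target table,
    // call DataTable.Select to locate the matching row in the other table.
    // Select scans the whole table on each call, so the merge is O(n^2).
    public static DataTable Merge(DataTable tableA, DataTable tableB)
    {
        var c = new DataTable("C");
        c.Columns.Add("SecurityCode", typeof(string));
        c.Columns.Add("High", typeof(double));
        c.Columns.Add("Low", typeof(double));

        // Copy the keyword column (and High) from table A.
        foreach (DataRow rowA in tableA.Rows)
            c.Rows.Add(rowA["SecurityCode"], rowA["High"], DBNull.Value);

        // For each row of C, find the matching row in B with Select --
        // this per-row full-table scan is the slow part.
        foreach (DataRow rowC in c.Rows)
        {
            DataRow[] matches = tableB.Select(
                $"SecurityCode = '{rowC["SecurityCode"]}'");
            if (matches.Length > 0)
                rowC["Low"] = matches[0]["Low"];
        }
        return c;
    }
}
```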
2. Query Efficiency of the DataTable
The DataTable provides two interfaces for querying data: DataTable.Select and DataTable.Rows.Find.
DataTable.Select takes a filter expression and returns a DataRow[]: it traverses the entire table, matches the criteria row by row, and returns all matching rows. This is clearly where the earlier splicing code went wrong, because we only need the first matching record.
DataTable.Rows.Find returns the single record matching the primary key. Because it searches a binary-tree index built on the primary key, it is efficient. To use it, you must first set the PrimaryKey property of the DataTable, as follows:
dta.PrimaryKey = new DataColumn[] { dta.Columns["SecurityCode"] };
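With the primary key in place, a single row can be fetched with Rows.Find. A small sketch (column names follow the example tables; the price value is illustrative):

```csharp
using System.Data;

var dta = new DataTable("A");
dta.Columns.Add("SecurityCode", typeof(string));
dta.Columns.Add("High", typeof(double));
dta.Rows.Add("000001.SZ", 18.5);  // illustrative value

// Rows.Find requires the primary key to be set first; calling it
// without one throws MissingPrimaryKeyException.
dta.PrimaryKey = new DataColumn[] { dta.Columns["SecurityCode"] };

DataRow row = dta.Rows.Find("000001.SZ");  // single matching row, or null
```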
Of course, creating the primary key itself takes time, and it can be created either before or after the data is populated. With large data volumes, this cost must be taken into account. The following table shows the time required when the primary key is created before populating the data, when it is created afterwards, and, for comparison, the time needed to build an equivalent dictionary:
| ArraySize | Preindex Creation Time (ms) | Postindex Creation Time (ms) | Dictionary Creation Time (ms) |
|---|---|---|---|
| 10 | 0 | 0 | 0 |
| 50 | 0 | 0 | 0 |
| 100 | 1 | 0 | 0 |
| 500 | 6 | 1 | 0 |
| 1000 | 15 | 2 | 0 |
| 5000 | 107 | 16 | 2 |
| 10000 | 261 | 42 | 5 |
| 50000 | 1727 | 271 | 31 |
| 100000 | 3525 | 544 | 47 |
| 500000 | 20209 | 2895 | 240 |
| 1000000 | 43382 | 5919 | 517 |
[Figure: chart of the index-creation timings above]

From these results we can see:
- Creating the primary key before populating the data takes more time than creating it after the data is populated. This is because, once the primary key exists, every row added forces the index to be maintained; it is the same as in a database, where it is unwise to index a frequently changing column. On my notebook (Win7 32-bit, CPU T6600 2.0 GHz, 2 GB RAM), creating the index for a DataTable of 1 million records takes roughly 5 seconds, so index creation time must be considered for large data volumes.
- Building a dictionary with the same structure takes far less time than populating a DataTable and then creating a primary key, and its cost barely grows with the number of records.
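The pre-fill versus post-fill comparison above can be reproduced with a sketch like the following (helper and method names are mine, not from the source):

```csharp
using System;
using System.Data;
using System.Diagnostics;

static class IndexTiming
{
    static DataTable NewTable()
    {
        var t = new DataTable();
        t.Columns.Add("SecurityCode", typeof(string));
        return t;
    }

    static void Fill(DataTable t, int n)
    {
        for (int i = 0; i < n; i++)
            t.Rows.Add(i.ToString("D6") + ".SZ");
    }

    public static void Compare(int n)
    {
        // Primary key created BEFORE filling: every Add maintains the index.
        var pre = NewTable();
        pre.PrimaryKey = new DataColumn[] { pre.Columns["SecurityCode"] };
        var sw = Stopwatch.StartNew();
        Fill(pre, n);
        Console.WriteLine($"pre-index fill:  {sw.ElapsedMilliseconds} ms");

        // Primary key created AFTER filling: the index is built once.
        var post = NewTable();
        Fill(post, n);
        sw.Restart();
        post.PrimaryKey = new DataColumn[] { post.Columns["SecurityCode"] };
        Console.WriteLine($"post-index build: {sw.ElapsedMilliseconds} ms");
    }
}
```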
With the keys in place, the next test measures the retrieval efficiency of the DataTable in several scenarios: DataTable.Select with and without a primary key, DataTable.Rows.Find, dictionary lookup, and a LINQ query, each at different data sizes. Because individual DataTable operations are fast, elapsed time does not display well in milliseconds, so the tests are measured in Stopwatch ticks. Each method is executed 10 times at each data size and the results averaged:
| ArraySize | Dictionary Create | Dictionary Search | Table Select | Indexed Table Select | Table Rows Find | LINQ |
|---|---|---|---|---|---|---|
| 10 | 13 | 3 | 40 | 25 | 8 | 16 |
| 50 | 27 | 2 | 69 | 37 | 8 | 27 |
| 100 | 51 | 3 | 112 | 38 | 9 | 39 |
| 500 | 210 | 3 | 589 | 51 | 11 | 155 |
| 1000 | 461 | 4 | 1175 | 60 | 14 | 328 |
| 5000 | 2264 | 14 | 8412 | 85 | 17 | 1540 |
| 10000 | 6235 | 7 | 16806 | 99 | 20 | 3354 |
| 50000 | 23768 | 8 | 150133 | 138 | 26 | 15824 |
| 100000 | 49133 | 7 | 259794 | 147 | 26 | 31525 |
| 500000 | 252103 | 51 | 1547935 | 181 | 30 | 158317 |
| 1000000 | 494647 | 9 | 2736616 | 209 | 30 | 315716 |

(All values are in Stopwatch ticks.)
[Figure: chart of the retrieval timings above]

From these results we can see:
- Performing a Select on a DataTable without a primary key is inefficient. Once the primary key is established, a Select on the key column is much faster, especially for large data: at a collection size of 1000 it is already nearly 20 times faster.
- LINQ to DataTable queries are more efficient than DataTable.Select, but still less efficient than DataTable.Rows.Find.
- When the primary key uniquely identifies a row, Find is 3-6 times faster than Select even on an indexed table. This is probably because Select must parse and evaluate the filter string internally: it accepts multi-condition queries and fairly complex expressions, and processing them takes time. Select is also, in general, a full search: it scans the entire collection to find all records meeting the criteria. Find, by contrast, only searches the primary-key index, and calling it with no primary key set raises an error.
- Retrieval through a dictionary instead of a DataTable is fastest of all, and almost unaffected by collection size. With large data volumes, however, converting a DataTable into the corresponding dictionary structure takes time. If the operation is frequent, for example when concatenating multiple DataTables on a keyword, storing the target DataTable as a Dictionary<string, DataRow> and probing it with the hash-based ContainsKey greatly improves efficiency. If the keyword column contains duplicate values and no primary key can be established, a Dictionary<string, List<DataRow>> solves that problem, at some cost in lookup performance.
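The dictionary index from the last point can be sketched as follows (the class and method names are mine; the Dictionary<string, List<DataRow>> variant tolerates the duplicate keys that would prevent a primary key):

```csharp
using System.Collections.Generic;
using System.Data;

static class DictionaryIndex
{
    // Build a hash index over a keyword column. Keeping a List<DataRow>
    // per key handles duplicate SecurityCode values, which a DataTable
    // primary key cannot.
    public static Dictionary<string, List<DataRow>> Build(
        DataTable table, string keyColumn)
    {
        var index = new Dictionary<string, List<DataRow>>();
        foreach (DataRow row in table.Rows)
        {
            string key = (string)row[keyColumn];
            if (!index.TryGetValue(key, out var rows))
            {
                rows = new List<DataRow>();
                index[key] = rows;
            }
            rows.Add(row);
        }
        return index;
    }
}

// Usage: an O(1) hash lookup instead of a full-table Select scan, e.g.
// if (index.TryGetValue("000001.SZ", out var rows)) { /* use rows */ }
```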
3. Implementation Results
Based on the above analysis, the Select calls were replaced in practice. A structure C of type Dictionary<string, DataRow> is built from all rows of the target merged DataTable, where the key is the SecurityCode and each DataRow holds the three columns SecurityCode, High, and Low. During the merge, all rows of table A are traversed directly; for each row, if its SecurityCode is contained in C, the value is assigned directly. Table B is then traversed the same way. Overall, this made the DataTable merge at least 10 times faster.
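The optimized merge described above might look like this (a sketch under the assumption that SecurityCode is unique in the target table; names are illustrative):

```csharp
using System.Collections.Generic;
using System.Data;

static class FastMerge
{
    public static DataTable Merge(DataTable tableA, DataTable tableB)
    {
        var c = new DataTable("C");
        c.Columns.Add("SecurityCode", typeof(string));
        c.Columns.Add("High", typeof(double));
        c.Columns.Add("Low", typeof(double));

        // Structure C: target rows indexed by keyword for O(1) lookup.
        var index = new Dictionary<string, DataRow>();
        foreach (DataRow rowA in tableA.Rows)
        {
            string key = (string)rowA["SecurityCode"];
            DataRow rowC = c.Rows.Add(key, rowA["High"], System.DBNull.Value);
            index[key] = rowC;
        }

        // Traverse table B once; the hash lookup replaces the Select scan,
        // turning the O(n^2) merge into O(n).
        foreach (DataRow rowB in tableB.Rows)
        {
            if (index.TryGetValue((string)rowB["SecurityCode"], out var rowC))
                rowC["Low"] = rowB["Low"];
        }
        return c;
    }
}
```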
4. Conclusion
This article briefly introduced the two methods of retrieving data from a DataTable: DataTable.Select and DataTable.Rows.Find. Before testing their execution efficiency, it described how to set a primary key on a DataTable and compared the cost of setting it before versus after the data is populated; the results show that setting the primary key after populating is much more efficient. It then compared the performance of DataTable.Select with and without a primary key: when retrieving only on the key column, Select with a primary key set is much faster than without one. Under the same conditions, if only one record is needed, DataTable.Rows.Find is much faster than DataTable.Select. When DataTable queries are frequent, avoid calling DataTable.Select inside a loop body; converting the DataTable to an equivalent dictionary also sidesteps the problem that a primary key cannot be created when key values are duplicated, and the Dictionary's hash-table lookup greatly improves query efficiency.
Original title: Performance Analysis of DataTable Data Retrieval [repost]