This is a reprint: the original article is very well written and I learned a lot from it.
On the .NET platform there are many data storage and retrieval solutions: ADO.NET, Entity Framework, ASP.NET Dynamic Data, XML, NHibernate, LINQ to SQL, and so on. For various reasons, however, you may still need to use a DataTable as the data storage structure: platform limitations (for example, code that must run on .NET Framework 2.0 or earlier), or legacy and third-party data interfaces that expose a DataTable. The DataTable is also relatively easy to use, so some data access interfaces adopt it directly. There are some things to be aware of when retrieving data from a DataTable, because they can seriously affect retrieval efficiency.
I recently worked on a task involving a large amount of DataTable splicing. The interface receives its data as DataSets containing DataTables: for each request the server returns a DataSet whose DataTables all contain a common "keyword" column. The task is to merge the columns from these DataTables into a single DataTable, keyed on that column, and display it in the UI.
At first I used the DataTable.Select method to find matching rows while splicing, and found it very slow. This article summarizes the experience I gained about querying DataTables and related operations.
1. Scenario
To simplify the problem, suppose there are two DataTables, table A and table B. Table A stores each stock's highest price, and table B stores each stock's lowest price:
Table A:

| SecurityCode | High |
|---|---|
| 000001.SZ | |
| 000002.SZ | |

Table B:

| SecurityCode | Low |
|---|---|
| 000001.SZ | 18.5 |
| 000002.SZ | 56 |
Now the two tables need to be merged into one table with three columns: SecurityCode, High, and Low. The original approach was to create a DataTable C containing the three fields, copy the SecurityCode column into it, and then traverse the other two tables, using the Select method to find the row matching each SecurityCode and copying its value into the corresponding field of C. This turned out to be slow; the problem lies in the Select method, and it needs to be optimized.
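A minimal sketch of the original, slow approach described above (the method name and column types are illustrative, not from the source):

```csharp
using System;
using System.Data;

static class NaiveMerge
{
    // Sketch of the original approach: for every row of the target table,
    // call DataTable.Select to locate the matching row in the other table.
    // Select scans the whole table on each call, so the merge is O(n^2).
    public static DataTable Merge(DataTable tableA, DataTable tableB)
    {
        var c = new DataTable("C");
        c.Columns.Add("SecurityCode", typeof(string));
        c.Columns.Add("High", typeof(double));
        c.Columns.Add("Low", typeof(double));

        // Copy the keyword column (and High) from table A.
        foreach (DataRow rowA in tableA.Rows)
            c.Rows.Add(rowA["SecurityCode"], rowA["High"], DBNull.Value);

        // For each row of C, find the matching row in B with Select --
        // this per-row full-table scan is the slow part.
        foreach (DataRow rowC in c.Rows)
        {
            DataRow[] matches = tableB.Select(
                $"SecurityCode = '{rowC["SecurityCode"]}'");
            if (matches.Length > 0)
                rowC["Low"] = matches[0]["Low"];
        }
        return c;
    }
}
```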
2. Query Efficiency of the DataTable
The DataTable provides two interfaces for querying data: DataTable.Select and DataTable.Rows.Find.
DataTable.Select takes a filter expression and returns a DataRow[]: it traverses the entire table, matches the criteria row by row, and returns all matching rows. This is clearly where the earlier splicing code went wrong, because we only need the first matching record.
DataTable.Rows.Find returns the single record matching the primary key. Because it searches a binary-tree index built on the primary key, it is efficient. To use it, you must first set the PrimaryKey property of the DataTable, as follows:
dta.PrimaryKey = new DataColumn[] { dta.Columns["SecurityCode"] };
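With the primary key in place, a single row can be fetched with Rows.Find. A small sketch (column names follow the example tables; the price value is illustrative):

```csharp
using System.Data;

var dta = new DataTable("A");
dta.Columns.Add("SecurityCode", typeof(string));
dta.Columns.Add("High", typeof(double));
dta.Rows.Add("000001.SZ", 18.5);  // illustrative value

// Rows.Find requires the primary key to be set first; calling it
// without one throws MissingPrimaryKeyException.
dta.PrimaryKey = new DataColumn[] { dta.Columns["SecurityCode"] };

DataRow row = dta.Rows.Find("000001.SZ");  // single matching row, or null
```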
Of course, creating the primary key itself takes time, and it can be created either before or after the data is populated. With large data volumes, this cost must be taken into account. The following table shows the time required when the primary key is created before populating the data, when it is created afterwards, and, for comparison, the time needed to build an equivalent dictionary:
| ArraySize | Preindex Creation Time (ms) | Postindex Creation Time (ms) | Dictionary Creation Time (ms) |
|---|---|---|---|
| 10 | 0 | 0 | 0 |
| 50 | 0 | 0 | 0 |
| 100 | 1 | 0 | 0 |
| 500 | 6 | 1 | 0 |
| 1000 | 15 | 2 | 0 |
| 5000 | 107 | 16 | 2 |
| 10000 | 261 | 42 | 5 |
| 50000 | 1727 | 271 | 31 |
| 100000 | 3525 | 544 | 47 |
| 500000 | 20209 | 2895 | 240 |
| 1000000 | 43382 | 5919 | 517 |
[Figure: chart of the index-creation timings above]

From these results we can see:
- Creating the primary key before populating the data takes more time than creating it after the data is populated. This is because, once the primary key exists, every row added forces the index to be maintained; it is the same as in a database, where it is unwise to index a frequently changing column. On my notebook (Win7 32-bit, CPU T6600 2.0 GHz, 2 GB RAM), creating the index for a DataTable of 1 million records takes roughly 5 seconds, so index creation time must be considered for large data volumes.
- Building a dictionary with the same structure takes far less time than populating a DataTable and then creating a primary key, and its cost barely grows with the number of records.
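The pre-fill versus post-fill comparison above can be reproduced with a sketch like the following (helper and method names are mine, not from the source):

```csharp
using System;
using System.Data;
using System.Diagnostics;

static class IndexTiming
{
    static DataTable NewTable()
    {
        var t = new DataTable();
        t.Columns.Add("SecurityCode", typeof(string));
        return t;
    }

    static void Fill(DataTable t, int n)
    {
        for (int i = 0; i < n; i++)
            t.Rows.Add(i.ToString("D6") + ".SZ");
    }

    public static void Compare(int n)
    {
        // Primary key created BEFORE filling: every Add maintains the index.
        var pre = NewTable();
        pre.PrimaryKey = new DataColumn[] { pre.Columns["SecurityCode"] };
        var sw = Stopwatch.StartNew();
        Fill(pre, n);
        Console.WriteLine($"pre-index fill:  {sw.ElapsedMilliseconds} ms");

        // Primary key created AFTER filling: the index is built once.
        var post = NewTable();
        Fill(post, n);
        sw.Restart();
        post.PrimaryKey = new DataColumn[] { post.Columns["SecurityCode"] };
        Console.WriteLine($"post-index build: {sw.ElapsedMilliseconds} ms");
    }
}
```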
With the keys in place, the next test measures the retrieval efficiency of the DataTable in several scenarios: DataTable.Select with and without a primary key, DataTable.Rows.Find, dictionary lookup, and a LINQ query, each at different data sizes. Because individual DataTable operations are fast, elapsed time does not display well in milliseconds, so the tests are measured in Stopwatch ticks. Each method is executed 10 times at each data size and the results averaged:
| ArraySize | Dictionary Create | Dictionary Search | Table Select | Indexed Table Select | Table Rows Find | LINQ |
|---|---|---|---|---|---|---|
| 10 | 13 | 3 | 40 | 25 | 8 | 16 |
| 50 | 27 | 2 | 69 | 37 | 8 | 27 |
| 100 | 51 | 3 | 112 | 38 | 9 | 39 |
| 500 | 210 | 3 | 589 | 51 | 11 | 155 |
| 1000 | 461 | 4 | 1175 | 60 | 14 | 328 |
| 5000 | 2264 | 14 | 8412 | 85 | 17 | 1540 |
| 10000 | 6235 | 7 | 16806 | 99 | 20 | 3354 |
| 50000 | 23768 | 8 | 150133 | 138 | 26 | 15824 |
| 100000 | 49133 | 7 | 259794 | 147 | 26 | 31525 |
| 500000 | 252103 | 51 | 1547935 | 181 | 30 | 158317 |
| 1000000 | 494647 | 9 | 2736616 | 209 | 30 | 315716 |

(All values are in Stopwatch ticks.)
[Figure: chart of the retrieval timings above]

From these results we can see:
- Performing a Select on a DataTable without a primary key is inefficient. Once the primary key is established, a Select on the key column is much faster, especially for large data: at a collection size of 1000 it is already nearly 20 times faster.
- LINQ to DataTable queries are more efficient than DataTable.Select, but still less efficient than DataTable.Rows.Find.
- When the primary key uniquely identifies a row, Find is 3-6 times faster than Select even on an indexed table. This is probably because Select must parse and evaluate the filter string internally: it accepts multi-condition queries and fairly complex expressions, and processing them takes time. Select is also, in general, a full search: it scans the entire collection to find all records meeting the criteria. Find, by contrast, only searches the primary-key index, and calling it with no primary key set raises an error.
- Retrieval through a dictionary instead of a DataTable is fastest of all, and almost unaffected by collection size. With large data volumes, however, converting a DataTable into the corresponding dictionary structure takes time. If the operation is frequent, for example when concatenating multiple DataTables on a keyword, storing the target DataTable as a Dictionary<string, DataRow> and probing it with the hash-based ContainsKey greatly improves efficiency. If the keyword column contains duplicate values and no primary key can be established, a Dictionary<string, List<DataRow>> solves that problem, at some cost in lookup performance.
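The dictionary index from the last point can be sketched as follows (the class and method names are mine; the Dictionary<string, List<DataRow>> variant tolerates the duplicate keys that would prevent a primary key):

```csharp
using System.Collections.Generic;
using System.Data;

static class DictionaryIndex
{
    // Build a hash index over a keyword column. Keeping a List<DataRow>
    // per key handles duplicate SecurityCode values, which a DataTable
    // primary key cannot.
    public static Dictionary<string, List<DataRow>> Build(
        DataTable table, string keyColumn)
    {
        var index = new Dictionary<string, List<DataRow>>();
        foreach (DataRow row in table.Rows)
        {
            string key = (string)row[keyColumn];
            if (!index.TryGetValue(key, out var rows))
            {
                rows = new List<DataRow>();
                index[key] = rows;
            }
            rows.Add(row);
        }
        return index;
    }
}

// Usage: an O(1) hash lookup instead of a full-table Select scan, e.g.
// if (index.TryGetValue("000001.SZ", out var rows)) { /* use rows */ }
```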
3. Implementation Results
Based on the above analysis, the Select calls were replaced in practice. A structure C of type Dictionary<string, DataRow> is built from all rows of the target merged DataTable, where the key is the SecurityCode and each DataRow holds the three columns SecurityCode, High, and Low. During the merge, all rows of table A are traversed directly; for each row, if its SecurityCode is contained in C, the value is assigned directly. Table B is then traversed the same way. Overall, this made the DataTable merge at least 10 times faster.
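The optimized merge described above might look like this (a sketch under the assumption that SecurityCode is unique in the target table; names are illustrative):

```csharp
using System.Collections.Generic;
using System.Data;

static class FastMerge
{
    public static DataTable Merge(DataTable tableA, DataTable tableB)
    {
        var c = new DataTable("C");
        c.Columns.Add("SecurityCode", typeof(string));
        c.Columns.Add("High", typeof(double));
        c.Columns.Add("Low", typeof(double));

        // Structure C: target rows indexed by keyword for O(1) lookup.
        var index = new Dictionary<string, DataRow>();
        foreach (DataRow rowA in tableA.Rows)
        {
            string key = (string)rowA["SecurityCode"];
            DataRow rowC = c.Rows.Add(key, rowA["High"], System.DBNull.Value);
            index[key] = rowC;
        }

        // Traverse table B once; the hash lookup replaces the Select scan,
        // turning the O(n^2) merge into O(n).
        foreach (DataRow rowB in tableB.Rows)
        {
            if (index.TryGetValue((string)rowB["SecurityCode"], out var rowC))
                rowC["Low"] = rowB["Low"];
        }
        return c;
    }
}
```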
4. Conclusion
This article briefly introduced the two methods of retrieving data from a DataTable: DataTable.Select and DataTable.Rows.Find. Before testing their execution efficiency, it described how to set a primary key on a DataTable and compared the cost of setting it before versus after the data is populated; the results show that setting the primary key after populating is much more efficient. It then compared the performance of DataTable.Select with and without a primary key: when retrieving only on the key column, Select with a primary key set is much faster than without one. Under the same conditions, if only one record is needed, DataTable.Rows.Find is much faster than DataTable.Select. When DataTable queries are frequent, avoid calling DataTable.Select inside a loop body; converting the DataTable to an equivalent dictionary also sidesteps the problem that a primary key cannot be created when key values are duplicated, and the Dictionary's hash-table lookup greatly improves query efficiency.
Original title: Performance Analysis of DataTable Data Retrieval [repost]