An Efficient Algorithm for De-duplicating Data in a Generic Collection

Source: Internet
Author: User

I am responsible for operating and maintaining the air-ticket travel analysis report project. The data the reports need (order data and basic dimension data) is extracted from the business database, and this includes a synchronization program for user account data. Today, while checking the production logs, I found that the synchronization program had thrown an exception:

Exception caught while executing the ExecuteSqlCommand method: System.Data.SqlClient.SqlException: Violation of PRIMARY KEY constraint "PK_BASEUSERACCOUNT". Cannot insert duplicate key in object "dbo.BaseUserAccount". The duplicate key value is (105487). Violation of PRIMARY KEY constraint "PK_BASEUSERACCOUNT". Cannot insert duplicate key in object "dbo.BaseUserAccount". The duplicate key value is (105488). The statement has been terminated. The statement has been terminated., SQL: insert BaseUserAccount (AccountId, AccountName, LoginName, EntId, EntName, DeptId, DeptName, CreateTime) values (74188, 'xiaoyan', 'xiaoyan', 49261, 'tai Chi Computer Co., Ltd.-smart city SBU department 1', 49265, 'sales Department', '2017/19 16:11:23'); insert BaseUserAccount (AccountId, AccountName, LoginName, EntId, EntName, DeptId, DeptName, CreateTime) values (74205, 'xu Lin', 'xu Lin', 49261, 'tai Chi Computer Co., Ltd.-smart city SBU department 1', 49265, 'sales Department', '2017/19

The program uses Entity Framework (EF) at the data-access layer. The account synchronization logic reads the data from the data source into a List collection; when persisting locally, it first clears the table, then converts the List data and inserts it in batches. Analysis showed that the data fetched from the data source contains duplicate records, which causes the primary key conflict on insert.
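One way to confirm a diagnosis like this is to group the fetched records by key and list the keys that occur more than once. The following is only a minimal sketch: the `Account` type, the `DuplicateCheck` helper, and its method name are hypothetical stand-ins, not the project's actual entity or code.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the synchronized entity; only the key matters here.
public class Account
{
    public int AccountId { get; set; }
    public string AccountName { get; set; }
}

public static class DuplicateCheck
{
    // Returns every AccountId that appears more than once in the source data.
    public static List<int> FindDuplicateIds(IEnumerable<Account> accounts)
    {
        return accounts
            .GroupBy(a => a.AccountId)      // bucket records by primary key
            .Where(g => g.Count() > 1)      // keep only keys seen more than once
            .Select(g => g.Key)
            .ToList();
    }
}
```

Running such a check against the list before the batch insert would show which AccountId values would violate the primary key.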

 

The data source system stores its data in a messy way that we cannot change, so the fix has to be made on our side. The proposed improvement is therefore to de-duplicate the collection by AccountId.

I told a developer on the team that the data source returns more than 60,000 records, so the de-duplication algorithm needed to be optimized. With a conventional approach, de-duplication might take 5 minutes; I hoped that optimization would bring it under half a minute.
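To make the cost difference concrete, the sketch below contrasts two ways of de-duplicating by key. It assumes that the "conventional approach" means a nested scan, where each record is compared against every record already kept (quadratic work), and sets it against a single pass that remembers the keys it has seen in a `HashSet<int>` (roughly linear work). The `Row` type and the method names are hypothetical, and the timings quoted in the text are not reproduced here.

```csharp
using System.Collections.Generic;

// Hypothetical record type; only the key is relevant to the comparison.
public class Row
{
    public int AccountId { get; set; }
}

public static class DedupSketch
{
    // Nested scan: every row is compared with all rows kept so far, O(n^2).
    public static List<Row> DedupNaive(List<Row> rows)
    {
        var result = new List<Row>();
        foreach (var row in rows)
        {
            bool alreadyKept = false;
            foreach (var kept in result)
            {
                if (kept.AccountId == row.AccountId) { alreadyKept = true; break; }
            }
            if (!alreadyKept) result.Add(row);
        }
        return result;
    }

    // Hash-based pass: HashSet<int>.Add returns false if the key is already present.
    public static List<Row> DedupHashed(List<Row> rows)
    {
        var seenIds = new HashSet<int>();
        var result = new List<Row>();
        foreach (var row in rows)
        {
            if (seenIds.Add(row.AccountId)) result.Add(row);
        }
        return result;
    }
}
```

Both methods keep the first record seen for each AccountId; they differ only in how much work it takes to decide whether a key has already been seen.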

The next day, the developer delivered the result: for a List of 60,000 records, de-duplication takes no more than 15 ms on average. Well done!

The implementation calls the Distinct method on the List and overrides the Equals and GetHashCode methods of the entity class. The code is as follows:

    namespace EntOlap.ETL.EF
    {
        // The entity class is generated by EF, so a new partial class is created here.
        public partial class BaseUserAccount
        {
            // Two accounts are considered equal when their AccountId values match.
            public override bool Equals(object obj)
            {
                BaseUserAccount bua = obj as BaseUserAccount;
                if (bua == null)
                {
                    return false;
                }
                else
                {
                    return this.AccountId == bua.AccountId;
                }
            }

            // Must be consistent with Equals: equal objects return the same hash code.
            public override int GetHashCode()
            {
                return AccountId.GetHashCode();
            }
        }
    }
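A minimal usage sketch of this approach: once `Equals` and `GetHashCode` are overridden, LINQ's `Distinct()` falls back to the default equality comparer, which calls those overrides, so records with the same `AccountId` collapse into one. The `UserAccount` class below is a simplified stand-in rather than the EF-generated entity, and the `DistinctUsage` helper is hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for the EF entity, with equality based on AccountId only.
public class UserAccount
{
    public int AccountId { get; set; }
    public string AccountName { get; set; }

    public override bool Equals(object obj)
    {
        var other = obj as UserAccount;
        return other != null && AccountId == other.AccountId;
    }

    public override int GetHashCode()
    {
        return AccountId.GetHashCode();
    }
}

public static class DistinctUsage
{
    // Distinct() with no arguments uses the overridden Equals/GetHashCode.
    public static List<UserAccount> Deduplicate(IEnumerable<UserAccount> source)
    {
        return source.Distinct().ToList();
    }
}
```

Overriding equality on the entity affects every dictionary, set, and comparison that uses the type. If that is undesirable, passing a dedicated `IEqualityComparer<T>` to the `Distinct` overload that accepts one is an alternative that leaves the entity class untouched.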

 
