Find duplicate data

Source: Internet
Author: User
  1. There is a batch of address-book data (more than 10,000 people's address books). For every pair of people, I need to find the overlapping part of their address books (that is, which contacts the two of them have in common), comparing every person against every other person.
    For example, with five people A, B, C, D, E, find the number of duplicate contacts for each of the pairs AB, AC, AD, AE, BC, BD, BE, CD, CE, DE.

If the same phone number appears in both, the two address books are considered to have a duplicate.
The data table has more than 10,000 rows, one per person.

The JSON stored in the list field is the address-book content.
A person's address book contains 100 to 1000 entries.

What I am trying now is to load everyone's address book, then compare the first person against all the rest (a foreach with a nested foreach), then the second person against all the rest, and so on.
Script Code


The script ran for more than 20 hours and had only finished about half the pairs. Memory and CPU usage were also high; the script is far too inefficient.
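The nested-foreach approach described above can be sketched as follows (a minimal Python sketch with made-up sample data, not the original script):

```python
# Naive pairwise comparison: intersect every pair of address books.
# `books` is hypothetical sample data: owner -> set of phone numbers.
books = {
    "A": {"111", "222", "333"},
    "B": {"222", "444"},
    "C": {"555"},
}

overlaps = {}
names = sorted(books)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        a, b = names[i], names[j]
        common = books[a] & books[b]  # set intersection of phone numbers
        if common:
            overlaps[(a, b)] = len(common)

print(overlaps)  # {('A', 'B'): 1}
```

With more than 10,000 people this is roughly 50 million pair comparisons, each intersecting two lists of 100 to 1000 numbers, which is consistent with the 20-hour runtime.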

Is there a better way to find the duplicates in this batch of data, or how can the script be optimized?

Thank you!

Reply content:

$data = array(
    array('id' => 1, 'name' => 1),
    array('id' => 2, 'name' => 2),
    array('id' => 3, 'name' => 3),
    array('id' => 1, 'name' => 2),
);
$ret = array();
# Traverse the data once, using the duplicated field as the array key:
# if the key already exists, add 1 to its value; if not, set it to 1.
foreach ($data as $k => $v) {
    $_id = $v['id'];
    if (array_key_exists($_id, $ret)) {
        $ret[$_id]++;
    } else {
        $ret[$_id] = 1;
    }
}
# Traverse the result.
foreach ($ret as $k => $v) {
    echo "{$k} appears {$v} times\n";
}
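The same single-pass counting idea extends to the pairwise problem with an inverted index: instead of comparing every pair of address books, scan each book once and record, for every phone number, which owners have it. A number held by k owners then contributes one shared contact to each of its k*(k-1)/2 owner pairs. A Python sketch with hypothetical sample data:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: owner -> set of phone numbers.
books = {
    "A": {"111", "222", "333"},
    "B": {"222", "444"},
    "C": {"222", "333"},
}

# Build the inverted index: phone number -> set of owners.
owners_by_phone = defaultdict(set)
for owner, phones in books.items():
    for phone in phones:
        owners_by_phone[phone].add(owner)

# Each phone shared by k owners adds 1 to every one of its owner pairs.
pair_counts = defaultdict(int)
for phone, owners in owners_by_phone.items():
    for a, b in combinations(sorted(owners), 2):
        pair_counts[(a, b)] += 1

print(dict(pair_counts))  # {('A', 'B'): 1, ('A', 'C'): 2, ('B', 'C'): 1}
```

This scans each address book once, so the work is proportional to the total number of entries plus the number of actually overlapping pairs, rather than to all 50 million pairs. The same effect can be achieved in SQL by normalizing the JSON into a (person_id, phone) table and self-joining on phone.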
