Find duplicate data

Source: Internet
Author: User
  1. There is a batch of address-book data (more than 10,000 people's address books). For every pair of people, I need to find the overlapping part of their address books (that is, which contacts the two of them have in common), comparing every person against every other person.
    For example, with five people A, B, C, D, E, find the number of duplicate contacts for each of the pairs AB, AC, AD, AE, BC, BD, BE, CD, CE, DE.

If the same phone number appears in both, the two address books are considered to have a duplicate.
The data table has more than 10,000 rows, one per person.

The JSON stored in the list field is the address-book content.
A person's address book contains 100 to 1000 entries.

What I am trying now is to load everyone's address book, then compare the first person against all the rest (a foreach with a nested foreach), then the second person against all the rest, and so on.
Script Code


The script ran for more than 20 hours and had only finished about half the pairs. Memory and CPU usage were also high; the script is far too inefficient.
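The nested-foreach approach described above can be sketched as follows (a minimal Python sketch with made-up sample data, not the original script):

```python
# Naive pairwise comparison: intersect every pair of address books.
# `books` is hypothetical sample data: owner -> set of phone numbers.
books = {
    "A": {"111", "222", "333"},
    "B": {"222", "444"},
    "C": {"555"},
}

overlaps = {}
names = sorted(books)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        a, b = names[i], names[j]
        common = books[a] & books[b]  # set intersection of phone numbers
        if common:
            overlaps[(a, b)] = len(common)

print(overlaps)  # {('A', 'B'): 1}
```

With more than 10,000 people this is roughly 50 million pair comparisons, each intersecting two lists of 100 to 1000 numbers, which is consistent with the 20-hour runtime.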

Is there a better way to find the duplicates in this batch of data, or how can the script be optimized?

Thank you!

Reply content:

$data = array(
    array('id' => 1, 'name' => 1),
    array('id' => 2, 'name' => 2),
    array('id' => 3, 'name' => 3),
    array('id' => 1, 'name' => 2),
);
$ret = array();
# Traverse the data once, using the duplicated field as the array key:
# if the key already exists, add 1 to its value; if not, set it to 1.
foreach ($data as $k => $v) {
    $_id = $v['id'];
    if (array_key_exists($_id, $ret)) {
        $ret[$_id]++;
    } else {
        $ret[$_id] = 1;
    }
}
# Traverse the result.
foreach ($ret as $k => $v) {
    echo "{$k} appears {$v} times\n";
}
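The same single-pass counting idea extends to the pairwise problem with an inverted index: instead of comparing every pair of address books, scan each book once and record, for every phone number, which owners have it. A number held by k owners then contributes one shared contact to each of its k*(k-1)/2 owner pairs. A Python sketch with hypothetical sample data:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: owner -> set of phone numbers.
books = {
    "A": {"111", "222", "333"},
    "B": {"222", "444"},
    "C": {"222", "333"},
}

# Build the inverted index: phone number -> set of owners.
owners_by_phone = defaultdict(set)
for owner, phones in books.items():
    for phone in phones:
        owners_by_phone[phone].add(owner)

# Each phone shared by k owners adds 1 to every one of its owner pairs.
pair_counts = defaultdict(int)
for phone, owners in owners_by_phone.items():
    for a, b in combinations(sorted(owners), 2):
        pair_counts[(a, b)] += 1

print(dict(pair_counts))  # {('A', 'B'): 1, ('A', 'C'): 2, ('B', 'C'): 1}
```

This scans each address book once, so the work is proportional to the total number of entries plus the number of actually overlapping pairs, rather than to all 50 million pairs. The same effect can be achieved in SQL by normalizing the JSON into a (person_id, phone) table and self-joining on phone.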
