In everyday programming, searching comes up constantly, and how to make searches faster is a topic that many programmers and software researchers keep returning to.
1. The problem:
Suppose we have a data type S holding a student record:
student name (name), gender (sex), age (age), ...
Now suppose there is the following requirement:
Files A and B each store a large number of S records, and we need to remove the records that appear in both A and B.
We will use C code to illustrate today's topic:
typedef struct tagStudent
{
    char *name;
    BOOL  bSex;
    int   age;
    ...
} S;
For such a problem, the straightforward approach is as follows:
Read the records of A into a list (a pointer-based linked list or an array list will both do), listA; read the records of B into listB; then use a double loop to do the lookup:
int i, j;
int nCountA = listA.GetCount();
int nCountB = listB.GetCount();
for (i = 0; i < nCountA; i++)
{
    S *pA = (S *)listA.GetAt(i);
    for (j = 0; j < nCountB; j++)
    {
        S *pB = (S *)listB.GetAt(j);
        // compare, match
        if (strcmp(pA->name, pB->name) == 0)
        {
            listA.DeleteAt(i);
            listB.DeleteAt(j);
            i--;
            nCountB--;
            nCountA--;
            break;
        }
    }
}
Suppose the comparison (the strcmp-style match) has time complexity O(m).
This algorithm then has time complexity O(n1*n2*m) and space complexity O(n1) + O(n2),
where n1 is the size of listA, n2 is the size of listB, and m is the length of a name.
Now let's optimize this algorithm.
First, let's discuss whether a pointer-based linked list or an array list is the better choice here.
The advantage of a linked list is fast insertion and deletion, but slow positioning.
The advantage of an array list is fast positioning, but slow insertion and deletion.
The first instinct is that the linked list should be better; indeed, deleting a duplicate record the moment we find it is more convenient there.
In fact, in this situation, the array list has the greater advantage.
Let's analyze how insertion works in each structure.
With a tail pointer, a linked list inserts each new record directly at the tail in O(1) time (ignoring allocation and other minor overhead).
The reason array insertion is slow is that if the inserted element is not the last one, every element behind the insertion point must be moved back one position; deletion is the same, and the time complexity is O(n). However, if the element is inserted at the end, this drawback disappears: insertion and deletion at the tail are fast, completing in O(1) time just like the linked list with a tail pointer.
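To make the contrast concrete, here is a minimal dynamic-array sketch (the type and function names are illustrative, not from the article): appending at the tail moves nothing, while inserting in the middle has to shift every later element back one slot.

```c
#include <stdlib.h>
#include <string.h>

/* A minimal dynamic array of ints; names are illustrative only. */
typedef struct {
    int *data;
    int  count;
    int  capacity;
} IntArray;

void array_init(IntArray *a) { a->data = NULL; a->count = 0; a->capacity = 0; }

/* Appending at the tail is O(1) amortized: no existing element moves. */
void array_push_back(IntArray *a, int v) {
    if (a->count == a->capacity) {
        a->capacity = a->capacity ? a->capacity * 2 : 4;
        a->data = realloc(a->data, a->capacity * sizeof(int));
    }
    a->data[a->count++] = v;
}

/* Inserting in the middle is O(n): everything after pos shifts right. */
void array_insert_at(IntArray *a, int pos, int v) {
    array_push_back(a, v);                        /* grow by one slot        */
    memmove(&a->data[pos + 1], &a->data[pos],
            (a->count - 1 - pos) * sizeof(int));  /* shift the tail by one   */
    a->data[pos] = v;
}
```

The memmove in array_insert_at is exactly the O(n) cost the article is talking about; array_push_back never pays it.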
With this analysis of array insertion in mind, we modify the lookup process: instead of deleting each duplicate as soon as it is found, we only mark it. When the lookup is finished, we copy the unmarked records into another array in one pass. This avoids the array's weakness (mid-array deletion) while keeping its strengths. The C code demonstrates this as follows:
typedef struct tagStudent
{
    char *name;
    BOOL  bSex;
    int   age;
    BOOL  bRepeat;  // added marker bit
    ...
} S;
int i, j;
int nCountA = listA.GetCount();
int nCountB = listB.GetCount();
for (i = 0; i < nCountA; i++)
{
    S *pA = (S *)listA.GetAt(i);
    if (!pA->bRepeat)
    {
        for (j = 0; j < nCountB; j++)
        {
            S *pB = (S *)listB.GetAt(j);
            if (!pB->bRepeat)
            {
                // compare, match
                if (strcmp(pA->name, pB->name) == 0)
                {
                    pA->bRepeat = TRUE;
                    pB->bRepeat = TRUE;
                    break;
                }
            }
        }
    }
}
One might object: a linked list can also be marked instead of deleted, so why not use a linked list? True, a linked list would work, but every list operation is a pointer operation; even the fastest positioning still has to follow a pointer (p = p->next), while an array only needs i++. The two operations are not equally fast. (The difference is small, of course, but don't programmers often obsess over a single instruction in a loop?)
So far we have only adjusted the data structure design: the key optimization is not to delete duplicates immediately, but to mark them and handle them all in one pass afterwards. This raises the space complexity from O(n) to O(2*n).
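The final "handle them all at once" pass can be sketched as follows. This is a minimal illustration, assuming the records sit in a plain array and using a trimmed-down version of the S struct; each surviving record is appended at the tail of a fresh array, so the whole pass is O(n) instead of repeated O(n) mid-array deletions.

```c
#include <stdlib.h>

typedef int BOOL;

/* Simplified record type from the article (fields trimmed for brevity). */
typedef struct tagStudent {
    char *name;
    BOOL  bSex;
    int   age;
    BOOL  bRepeat;   /* marker bit set during matching */
} S;

/* One compaction pass: copy every record NOT marked as a duplicate into a
 * destination array.  Each survivor is a simple tail append, so the pass
 * costs O(n) total. */
int compact(const S *src, int n, S *dst) {
    int kept = 0;
    for (int i = 0; i < n; i++) {
        if (!src[i].bRepeat)
            dst[kept++] = src[i];   /* struct copy; O(1) tail append */
    }
    return kept;                    /* number of unique records kept */
}
```

This is the extra O(n) array that pushes the space complexity to O(2*n), as noted above.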
So is there any way to further optimize this algorithm?
Yes, that's the point I'm going to describe today.
As you can imagine, the algorithm above repeats a lot of work. Every time we take a record from listA, we have to traverse all of listB to look for it, so listB is scanned far too often, and the overall algorithm is slow. So now we will optimize this traversal of B.
Here we match on the student name as the key. For simplicity, assume for now that no name is duplicated within file A, and that file B satisfies the same condition; we will consider the more complicated situation later.
Since each student name is unique, it can be represented by a numeric value: for example, 1 for "Zhang San", 2 for "Li Si". Then, if we want to know whether "Zhang San" appears in B, we only need to check whether the value 1 appears in B. Some will ask: what is the point of this? There are two points:
1) Looking up a numeric value (a DWORD) is faster than looking up a string, because comparing two strings costs O(m) in the length of the string, while comparing two numeric values costs O(1).
2) Understanding this idea makes the optimizations that follow much easier to accept.
So how do we map "Zhang San" and "Li Si" to numeric values?
We can use a CRC (cyclic redundancy check)-style checksum to build this correspondence. Suppose the hex bytes of "Zhang San" are D5 C5 C8 FD; then we can define DWORD dwIndex = D5 + C5 + C8 + FD, and treat that as the CRC value. How the value is computed is entirely up to your needs: if you find that plain addition produces too many repeated values, you can use multiplication, shifts and similar operations to spread the values out and reduce the probability of collision. Collisions can never be ruled out entirely, though, so whenever a value matches, you should go on to check that the student names themselves match. The algorithm looks like this:
typedef struct tagStudent
{
    unsigned long ulIndex;  // filled in when reading data A and B
    char *name;
    BOOL  bSex;
    int   age;
    BOOL  bRepeat;  // added marker bit
    ...
} S;
int i, j;
int nCountA = listA.GetCount();
int nCountB = listB.GetCount();
for (i = 0; i < nCountA; i++)
{
    S *pA = (S *)listA.GetAt(i);
    if (!pA->bRepeat)
    {
        for (j = 0; j < nCountB; j++)
        {
            S *pB = (S *)listB.GetAt(j);
            if (!pB->bRepeat)
            {
                // compare, match
                if (pA->ulIndex == pB->ulIndex)
                {
                    pA->bRepeat = TRUE;
                    pB->bRepeat = TRUE;
                    break;
                }
            }
        }
    }
}
With the O(m) string comparison replaced by an O(1) numeric comparison, the algorithm is optimized to O(n1*n2).
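The index computation described above can be sketched as follows. This is an illustrative hash, not a true CRC: it follows the article's advice of using a multiply/shift instead of plain addition so that anagrams such as "abc" and "acb" no longer collide the way a pure byte sum would.

```c
/* Compute a numeric index from the bytes of a name, as described above.
 * h = h * 33 + byte is a classic string-hash recurrence; the multiply
 * spreads the values out so fewer names share an index.  Illustrative
 * sketch only, not a standard CRC. */
unsigned long compute_index(const char *name) {
    unsigned long h = 0;
    for (; *name; name++)
        h = (h << 5) + h + (unsigned char)*name;  /* h * 33 + byte */
    return h;
}
```

This ulIndex would be filled in once per record while reading files A and B, so its cost is paid only once rather than on every comparison.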
At this step, many readers have probably already seen where this is going. Next, we use a hash table (HashTable) to optimize the data storage itself.
As described above, we used to read the data into a list or array; now we store it in a hash table instead, and the index into the hash table is the numeric value computed from the student name. The size of the hash table can be chosen to suit the situation. The hash table can be designed with a linked list per slot (chained buckets), and its advantages are:
1) The stored data can grow without limit (each chain can hold arbitrarily many records).
2) The collision-handling algorithm is simple and fast.
typedef struct tagStudent
{
    unsigned long ulIndex;  // filled in when reading data A and B
    char *name;
    BOOL  bSex;
    int   age;
    BOOL  bRepeat;  // added marker bit
    ...
} S;
Iterator itA = hashA.begin();
for (; itA != hashA.end(); ++itA)
{
    S *pA = (S *)itA;
    S *pB = hashB.Find(pA->ulIndex);
    if (pB)
    {
        pA->bRepeat = TRUE;
        pB->bRepeat = TRUE;
    }
}
Now let's look at the time complexity. Assuming the hash table has m slots, each lookup walks a chain of average length n/m, so a single match costs O(n/m) and the complete run costs O(n1*n2/m).
Back to the question we set aside: what if student names can be duplicated? We can use the chained design: records whose index values collide go into the same chain, so when matching, after the index value is found we go one step further and use a string comparison to find the exact matching student record.
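That two-stage check can be sketched like this (the Entry type and field names are illustrative): the cheap integer comparison filters the chain first, and only on an index match do we pay for the O(m) strcmp to rule out a collision.

```c
#include <string.h>
#include <stddef.h>

/* Chain entry as assumed in the text; names are illustrative. */
typedef struct Entry {
    unsigned long index;   /* numeric index computed from the name */
    const char   *name;    /* full name, checked only on a collision */
    struct Entry *next;    /* records sharing this bucket            */
} Entry;

/* Walk the chain: integer compare first, strcmp only to confirm. */
const Entry *find_exact(const Entry *chain, unsigned long index,
                        const char *name) {
    for (const Entry *e = chain; e; e = e->next)
        if (e->index == index && strcmp(e->name, name) == 0)
            return e;
    return NULL;
}
```

Because index mismatches are rejected in O(1), the expensive string comparison runs only for the rare entries that genuinely collide.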
Note, however, that if you need to support fuzzy matching, this algorithm cannot be optimized in this way. But if the data itself allows a feature value to be extracted, fuzzy matching can also be supported; it depends on the actual situation.
As for the linked list and hash table data structures, you can design them yourself or use the STL, but in my view a data structure you design yourself is more convenient to use. Slower? Well, the STL uses no special algorithms and no assembly, so why should your own design be slow? If it is, that only shows there is a problem in your data structure design.
Additional knowledge:
1. Comparison of the binary search algorithm and the binary search tree lookup algorithm
First, what is a binary search tree?
Binary search tree properties: 1. if its left subtree is not empty, all node values in the left subtree are less than its root's value; 2. if its right subtree is not empty, all node values in the right subtree are greater than its root's value; 3. its left and right subtrees are themselves binary search trees.
The binary search algorithm has time complexity O(log n), while the binary search tree lookup's worst-case complexity is the same as a sequential scan, O(n); its best case matches binary search at O(log n).
This is determined by the shape of the tree: if the binary search tree is balanced, lookup is cheap; if its structure is very unbalanced, close to a linear chain, the efficiency of the algorithm cannot be guaranteed.
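A standard BST lookup makes the point concrete (a minimal sketch, with illustrative names): each step descends one level, so the cost is the tree's height, which is O(log n) when balanced and O(n) when the tree degenerates into a chain.

```c
#include <stddef.h>

/* Minimal binary search tree node; illustrative, not from the article. */
typedef struct TreeNode {
    int value;
    struct TreeNode *left, *right;
} TreeNode;

/* Descend the tree: smaller keys go left, larger go right.
 * Cost is proportional to the height of the tree. */
const TreeNode *bst_find(const TreeNode *root, int value) {
    while (root) {
        if (value == root->value)
            return root;
        root = (value < root->value) ? root->left : root->right;
    }
    return NULL;
}
```

Inserting already-sorted keys one by one produces exactly the degenerate, chain-shaped tree where this loop degrades to O(n).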
2. Time complexity of hash table lookup algorithm
If n elements are stored in a hash table whose address range (number of slots) is m, the average time complexity of a lookup is O(n/m).
For an example of applying a hash table in practice, see the article "Quick Find Method for Data Records".