Asking for an algorithm to remove duplicate lines from a text file with tens of thousands of lines

Category: Delphi / Windows SDK/API
http://www.delphi2007.net/DelphiBase/html/delphi_20061211021429203.html
I have a text file with tens of thousands of lines, one record per line. Which algorithm removes the duplicate lines most efficiently? What I have now is very slow. Thank you for your help.

Bumping the thread; still looking for advice.

I only know the clumsy ways:

1: Load the data into a database and let it remove the duplicates for you.

2: Sort the lines first, then find the duplicates in a single loop. There are many sorting methods to choose from.

Yes, sort first, then eliminate the duplicates.

For the sorting, would you load the lines into a TStringList, for example?

SELECT DISTINCT

Sorting and then removing the duplicates is relatively simple; the complexity is O(N log N).
(I don't know how high the duplication rate is?)
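One way to sketch the sort-then-remove idea in Delphi is to let TStringList do both steps through its Sorted and Duplicates properties. This is only a minimal illustration, not code from any poster: the program name, the UniqueLines helper and the sample strings are made up for the example, and the result comes out in sorted order rather than in the original file order.

```pascal
program DedupSorted;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: returns a new sorted list that holds each distinct
// line of Source once. The caller owns the result and must free it.
function UniqueLines(Source: TStrings): TStringList;
var
  I: Integer;
begin
  Result := TStringList.Create;
  try
    Result.CaseSensitive := True;   // treat 'A' and 'a' as different lines
    Result.Duplicates := dupIgnore; // Add skips strings that are already present
    Result.Sorted := True;          // Duplicates only takes effect on a sorted list
    for I := 0 to Source.Count - 1 do
      Result.Add(Source[I]);        // binary search, then insert if new
  except
    Result.Free;
    raise;
  end;
end;

var
  Lines, Unique: TStringList;
begin
  Lines := TStringList.Create;
  try
    // Sample data; a real program would call Lines.LoadFromFile instead.
    Lines.Add('pear');
    Lines.Add('apple');
    Lines.Add('pear');
    Lines.Add('fig');
    Unique := UniqueLines(Lines);
    try
      WriteLn(Unique.Count); // 3 distinct lines remain
      Write(Unique.Text);
    finally
      Unique.Free;
    end;
  finally
    Lines.Free;
  end;
end.
```

Each Add on a sorted list does a binary search followed by an insertion that shifts the later entries, so this is not strictly O(N log N), but for tens of thousands of lines the shifting is usually cheap.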

A hash table removes duplicates quickly; the complexity is close to O(N):
1. Create an array of TNode (at least as large as the number of distinct lines; larger is fine), where PNode = ^TNode; TNode = record Str: PChar; PNext: PNode; end; (a singly linked list).
2. Compute the hash of each line S and map it to array element I. If I.Str = nil, set I.Str := S; otherwise compare S with every string in that element's linked list and, if S is not there yet, append it to the end of the list.
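The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the poster's implementation: the names BucketCount, HashOf, AddIfNew and FreeTable are invented here, the hash is a simple multiply-by-31 scheme chosen for brevity, nodes hold a Delphi string instead of a PChar so that memory management stays simple, and a new node is put at the head of its chain rather than at the end.

```pascal
program DedupHash;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  BucketCount = 65536; // assumed table size; roughly the number of distinct lines

type
  PNode = ^TNode;
  TNode = record
    Str: string;
    Next: PNode;
  end;

var
  Buckets: array[0..BucketCount - 1] of PNode; // global, so every slot starts as nil

// Illustrative hash; reducing at every step keeps the value small, so no overflow.
function HashOf(const S: string): Cardinal;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    Result := (Result * 31 + Cardinal(Ord(S[I]))) mod BucketCount;
end;

// Returns True if S was new and has been stored, False if it was already present.
function AddIfNew(const S: string): Boolean;
var
  Slot: Cardinal;
  Node: PNode;
begin
  Slot := HashOf(S);
  Node := Buckets[Slot];
  while Node <> nil do          // walk the chain for this bucket
  begin
    if Node^.Str = S then
    begin
      Result := False;
      Exit;
    end;
    Node := Node^.Next;
  end;
  New(Node);                    // not found: link a new node at the head
  Node^.Str := S;
  Node^.Next := Buckets[Slot];
  Buckets[Slot] := Node;
  Result := True;
end;

procedure FreeTable;
var
  I: Integer;
  Node, NextNode: PNode;
begin
  for I := 0 to BucketCount - 1 do
  begin
    Node := Buckets[I];
    while Node <> nil do
    begin
      NextNode := Node^.Next;
      Dispose(Node);
      Node := NextNode;
    end;
    Buckets[I] := nil;
  end;
end;

var
  Lines, Unique: TStringList;
  I: Integer;
begin
  Lines := TStringList.Create;
  Unique := TStringList.Create;
  try
    // Sample data; a real program would call Lines.LoadFromFile instead.
    Lines.Add('pear');
    Lines.Add('apple');
    Lines.Add('pear');
    Lines.Add('fig');
    for I := 0 to Lines.Count - 1 do
      if AddIfNew(Lines[I]) then
        Unique.Add(Lines[I]);   // first occurrences, in their original order
    WriteLn(Unique.Count);      // 3
  finally
    FreeTable;
    Unique.Free;
    Lines.Free;
  end;
end.
```

Unlike the sorting approaches, this keeps the surviving lines in their original order, at the cost of the extra table and nodes.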

I agree with the post above.
Hash + sorting.

procedure TForm1.Button1Click(Sender: TObject);
var
  aInput: TStringList;
  aOutput: TStringList;
  iLoop: Integer;
  sTemp: string;
begin
  aInput := TStringList.Create;
  try
    aInput.LoadFromFile('C:\Input.TXT');
    aOutput := TStringList.Create;
    try
      // Keep the first occurrence of each line. IndexOf on an unsorted
      // list is a linear scan, so the loop as a whole is O(N^2).
      for iLoop := 0 to aInput.Count - 1 do
      begin
        sTemp := aInput.Strings[iLoop];
        if aOutput.IndexOf(sTemp) < 0 then aOutput.Add(sTemp);
      end;
      aOutput.SaveToFile('C:\Outpt.TXT');
    finally
      aOutput.Free;
    end;
  finally
    aInput.Free;
  end;
end;

jadeluo (xiufeng)'s algorithm is the most straightforward to implement.

The hash approach is the fastest, but it is more work.

I recommend jadeluo (xiufeng)'s approach.

var
  myl, mym: TStringList;
  mys: string;
  myi: Integer;
begin
  myl := TStringList.Create;
  mym := TStringList.Create;
  try
    myl.LoadFromFile('input.txt');
    myl.Sort;
    mys := '';
    // After sorting, duplicates are adjacent: keep a line only when it
    // differs from the last line kept (the first line is always kept).
    for myi := 0 to myl.Count - 1 do
      if (myi = 0) or (mys <> myl.Strings[myi]) then
      begin
        mys := myl.Strings[myi];
        mym.Add(mys);
      end;
    mym.SaveToFile('out.txt');
  finally
    myl.Free;
    mym.Free;
  end;
end;

How efficient is this method?

Don't use TStrings; SQL is the fastest. First create a table, then import the file into it with bcp. The bcp syntax can remove duplicates during the import; then use the bcp command again to export the result.

TStrings is fine when the number of rows is small; with tens of thousands of lines or more, it slows down.

Another method:
Use the file system. Create a temporary directory and create one file per line, using the line's content as the file name. A duplicate name simply lands on the file that already exists, so a single pass from start to end is enough. Afterwards, save the directory listing to a file with the dir command!

Tell me your email address and I will send you a solution that doesn't use a database. It is extremely fast.

808886@gmail.com

Thank you!

I have implemented an algorithm that checks 100,000 rows for duplicates in 15 seconds. My machine is a dual-core 3.5G, and I don't know whether that meets your requirements.

Adding a Sort call can speed this up.
Delphi ships the source code for TStringList's Sort and IndexOf. Specifically, Sort uses quicksort, and IndexOf performs a binary search when the list's Sorted property is True; otherwise it searches sequentially.
So the efficiency should not be bad.

procedure TForm1.Button1Click(Sender: TObject);
var
  aInput: TStringList;
  aOutput: TStringList;
  iLoop: Integer;
  sTemp: string;
begin
  aInput := TStringList.Create;
  try
    aInput.LoadFromFile('C:\Input.TXT');
    aInput.Sort;
    aOutput := TStringList.Create;
    try
      // IndexOf only uses binary search when Sorted is True, so mark the
      // output list as sorted; the output is then in sorted order.
      aOutput.Sorted := True;
      for iLoop := 0 to aInput.Count - 1 do
      begin
        sTemp := aInput.Strings[iLoop];
        if aOutput.IndexOf(sTemp) < 0 then aOutput.Add(sTemp);
      end;
      aOutput.SaveToFile('C:\Outpt.TXT');
    finally
      aOutput.Free;
    end;
  finally
    aInput.Free;
  end;
end;

My algorithm trades memory for time.

unit uCheckDup;

interface

uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls,
  Dialogs, StdCtrls;

procedure StartCheckDup;
function CheckDup(AStr: string): Boolean;

implementation

var
  StrListArray: array of TStringList; // one bucket per 16-bit key

const
  BufSize = 65536; // 64 K

// Empties every bucket; call once before each run.
procedure StartCheckDup;
var
  I: Integer;
begin
  SetLength(StrListArray, BufSize);
  for I := 0 to BufSize - 1 do
    StrListArray[I].Clear;
end;

// Returns True if AStr has been seen before, False the first time.
function CheckDup(AStr: string): Boolean;
var
  Key: Word;
  I, L: Integer;
  AStrList: TStringList;
begin
  // Hash: add up the string two bytes at a time into a 16-bit key.
  Key := 0;
  L := Length(AStr);
  if L = 1 then
    Key := Ord(AStr[1])
  else
    for I := (L shr 1) - 1 downto 0 do
      Key := Key + PWordArray(PChar(AStr))^[I]; // PWordArray is from SysUtils

  // Odd length: add the last character as well.
  if (L and 1) <> 0 then
    Key := Key + Ord(AStr[L]);

  AStrList := StrListArray[Key];
  if (AStrList.Count = 0) or (AStrList.IndexOf(AStr) < 0) then
  begin
    AStrList.Append(AStr);
    Result := False;
  end
  else
    Result := True;
end;

procedure GenerateArray;
var
  I: Integer;
begin
  SetLength(StrListArray, BufSize);
  for I := 0 to BufSize - 1 do
    StrListArray[I] := TStringList.Create;
end;

procedure FreeArray;
var
  I: Integer;
begin
  for I := 0 to BufSize - 1 do
    FreeAndNil(StrListArray[I]);
end;

initialization
  GenerateArray;
finalization
  FreeArray;
end.

Usage:

// sl and sl2 are TStringLists declared elsewhere.
procedure TForm1.Button1Click(Sender: TObject);
var
  ATick: DWORD;
  I: Integer;
begin
  ATick := GetTickCount;
  StartCheckDup;
  sl2.Clear;
  for I := 0 to sl.Count - 1 do
  begin
    if not CheckDup(sl[I]) then
      sl2.Append(sl[I]);
    Caption := IntToStr(I);
  end;
  ShowMessage('time: ' + IntToStr(GetTickCount - ATick)
    + 'ms, Remains: ' + IntToStr(sl2.Count));
end;

Bump.

A practical discussion of algorithms.

// 100,000 rows: 1.2 seconds on a dual-core 1.6G

var
  I: Integer;
  vTickCount: Longword;
begin
  Randomize; // test
  with TStringList.Create do try
    // LoadFromFile('input.txt'); // load the file
    for I := 1 to 100000 do Add(IntToStr(Random(MaxInt))); // generate 100,000 lines of text

    vTickCount := GetTickCount;
    Sort; // sort

    // Walk backwards and delete each line that equals its predecessor.
    for I := Count - 1 downto 0 do
      if (I >= 1) and (Strings[I] = Strings[I - 1]) then
        Delete(I);
    Caption := IntToStr(GetTickCount - vTickCount); // output the elapsed time
  finally
    Free;
  end;
end;

For this kind of test it is better to give the strings a fixed length; I recommend 20. Test data:
var
  I: Integer;
begin
  Randomize;
  sl.BeginUpdate;
  sl.Clear;
  for I := 0 to 100000 do
  begin
    sl.Append('test' + Format('%.11d', [Random(50000)]));
    Caption := IntToStr(I);
  end;
  sl.Sort;
  sl.EndUpdate;
end;

I found that Caption := IntToStr(I); is a speed killer. After removing it, my algorithm processes 100,000 rows in 2 seconds! That still seems slower than zswang's. However, I suggest zswang test with data in which roughly half of the lines are duplicates (that is, 100,000 calls to Random(50000)) to see how the performance compares.

Breaking the latest record: I modified the algorithm. On zswang's test data, for I := 1 to 100000 do Add(IntToStr(Random(MaxInt))), it takes only 0.3 seconds to check 100,000 records. On the high-duplicate data I generate myself, 100,000 records take 2.3 seconds!
Thanks, everyone. Note: the function's return value has changed. It now returns True when the string is not a duplicate, which is the opposite of the previous version.
unit uCheckDup;

interface

uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls,
  Dialogs, StdCtrls;

procedure StartCheckDup;
function CheckDup(AStr: string): Boolean;

implementation

const
  BufSize = 65536; // 64 K

var
  StrListArray: array of TStringList;
  Crc16Tab: array[0..$FF] of Word =
    ($00000, $01021, $02042, $03063, $04084, $050a5, $060c6, $070e7,
     $08108, $09129, $0a14a, $0b16b, $0c18c, $0d1ad, $0e1ce, $0f1ef,
     $01231, $00210, $03273, $02252, $052b5, $04294, $072f7, $062d6,
     $09339, $08318, $0b37b, $0a35a, $0d3bd, $0c39c, $0f3ff, $0e3de,
     $02462, $03443, $00420, $01401, $064e6, $074c7, $044a4, $05485,
     $0a56a, $0b54b, $08528, $09509, $0e5ee, $0f5cf, $0c5ac, $0d58d,
     $03653, $02672, $01611, $00630, $076d7, $066f6, $05695, $046b4,
     $0b75b, $0a77a, $09719, $08738, $0f7df, $0e7fe, $0d79d, $0c7bc,
     $048c4, $058e5, $06886, $078a7, $00840, $01861, $02802, $03823,
     $0c9cc, $0d9ed, $0e98e, $0f9af, $08948, $09969, $0a90a, $0b92b,
     $05af5, $04ad4, $07ab7, $06a96, $01a71, $00a50, $03a33, $02a12,
     $0dbfd, $0cbdc, $0fbbf, $0eb9e, $09b79, $08b58, $0bb3b, $0ab1a,
     $06ca6, $07c87, $04ce4, $05cc5, $02c22, $03c03, $00c60, $01c41,
     $0edae, $0fd8f, $0cdec, $0ddcd, $0ad2a, $0bd0b, $08d68, $09d49,
     $07e97, $06eb6, $05ed5, $04ef4, $03e13, $02e32, $01e51, $00e70,
     $0ff9f, $0efbe, $0dfdd, $0cffc, $0bf1b, $0af3a, $09f59, $08f78,
     $09188, $081a9, $0b1ca, $0a1eb, $0d10c, $0c12d, $0f14e, $0e16f,
     $01080, $000a1, $030c2, $020e3, $05004, $04025, $07046, $06067,
     $083b9, $09398, $0a3fb, $0b3da, $0c33d, $0d31c, $0e37f, $0f35e,
     $002b1, $01290, $022f3, $032d2, $04235, $05214, $06277, $07256,
     $0b5ea, $0a5cb, $095a8, $08589, $0f56e, $0e54f, $0d52c, $0c50d,
     $034e2, $024c3, $014a0, $00481, $07466, $06447, $05424, $04405,
     $0a7db, $0b7fa, $08799, $097b8, $0e75f, $0f77e, $0c71d, $0d73c,
     $026d3, $036f2, $00691, $016b0, $06657, $07676, $04615, $05634,
     $0d94c, $0c96d, $0f90e, $0e92f, $099c8, $089e9, $0b98a, $0a9ab,
     $05844, $04865, $07806, $06827, $018c0, $008e1, $03882, $028a3,
     $0cb7d, $0db5c, $0eb3f, $0fb1e, $08bf9, $09bd8, $0abbb, $0bb9a,
     $04a75, $05a54, $06a37, $07a16, $00af1, $01ad0, $02ab3, $03a92,
     $0fd2e, $0ed0f, $0dd6c, $0cd4d, $0bdaa, $0ad8b, $09de8, $08dc9,
     $07c26, $06c07, $05c64, $04c45, $03ca2, $02c83, $01ce0, $00cc1,
     $0ef1f, $0ff3e, $0cf5d, $0df7c, $0af9b, $0bfba, $08fd9, $09ff8,
     $06e17, $07e36, $04e55, $05e74, $02e93, $03eb2, $00ed1, $01ef0);

// Table-driven 16-bit CRC of the string, used as the bucket index.
function CRCValue(AStr: string): Word;
var
  I: Integer;
begin
  Result := 0;
  for I := Length(AStr) downto 1 do
    Result := Hi(Result) xor Crc16Tab[Byte(AStr[I]) xor Lo(Result)];
end;

// Empties every bucket; call once before each run.
procedure StartCheckDup;
var
  I: Integer;
begin
  SetLength(StrListArray, BufSize);
  for I := 0 to BufSize - 1 do
    StrListArray[I].Clear;
end;

// Returns True if AStr is new (and stores it), False if it is a duplicate.
function CheckDup(AStr: string): Boolean;
begin
  with StrListArray[CRCValue(AStr)] do
  begin
    Result := (Count = 0) or (IndexOf(AStr) < 0);
    if Result then
      Append(AStr);
  end;
end;

procedure GenerateArray;
var
  I: Integer;
begin
  SetLength(StrListArray, BufSize);
  for I := 0 to BufSize - 1 do
    StrListArray[I] := TStringList.Create;
end;

procedure FreeArray;
var
  I: Integer;
begin
  for I := 0 to BufSize - 1 do
    FreeAndNil(StrListArray[I]);
end;

initialization
  GenerateArray;
finalization
  FreeArray;
end.

Thank you all for your enthusiastic help, especially yangfl (yangfl), jadeluo (xiufeng) and zswang (with clear water).
