Quickly scan text files, count rows, and return the index position of each row (Delphi, C#)


The project needs to scan text files containing 12 million lines. With guidance from other users and some testing, I found that the gap between C# and Delphi is not large. Without further ado, here is the test code:

Here's the Delphi code:

// Scan the file for line breaks and return the start offset of every row.
// TInt64Array (array of Int64), rowLeast (minimum row length) and perArray
// (array growth step) are declared elsewhere in the project.
function ScanEnterFile(const fileName: string): TInt64Array;
var
  myFile: TMemoryStream;  // file contents held in memory
  rArray: TInt64Array;    // row index result set
  size, curIndex: Int64;  // file size, current position in the stream
  enterCount: Int64;      // number of line breaks found
  pc: PAnsiChar;          // byte-wise view of the data (PChar is two bytes wide in Unicode Delphi)
  arrayCount: Int64;      // current capacity of the index array
  addStep: Integer;       // distance from the detected character to the start of the next row
begin
  Result := nil;
  if fileName = '' then
    Exit;
  if not FileExists(fileName) then
    Exit;
  myFile := TMemoryStream.Create;  // create the stream
  try
    myFile.LoadFromFile(fileName);  // load the whole file into memory
    size := myFile.Size;
    pc := myFile.Memory;            // point the character pointer at the memory stream
    curIndex := rowLeast;           // no line break can occur before the minimum row length
    SetLength(rArray, perArray);
    arrayCount := perArray;
    enterCount := 0;
    rArray[0] := 0;                 // the first row starts at offset 0
    // Stop at curIndex >= size; the original tested curIndex > size and could read one byte past the end.
    while curIndex < size do
    begin
      addStep := 0;
      if Ord(pc[curIndex]) = 13 then       // CR: the next row starts after the CR LF pair
        addStep := 2
      else if Ord(pc[curIndex]) = 10 then  // LF: the next row starts at the next byte
        addStep := 1;
      // Handle a line break.
      if addStep <> 0 then
      begin
        Application.ProcessMessages;
        Inc(enterCount);  // add a row record
        // Grow the array when it is full.
        if enterCount mod perArray = 0 then
        begin
          arrayCount := arrayCount + perArray;
          SetLength(rArray, arrayCount);
        end;
        rArray[enterCount] := curIndex + addStep;
        curIndex := curIndex + addStep + rowLeast;  // skip the next row's minimum length
      end
      else
        curIndex := curIndex + 2;  // CR LF is two bytes, so a stride of 2 cannot skip over it
    end;
    SetLength(rArray, enterCount + 1);  // trim unused capacity (the original returned the untrimmed array)
    Result := rArray;
  finally
    FreeAndNil(myFile);
  end;
end;

Calling code:

procedure TMainForm.btn2Click(Sender: TObject);
var
  datasIndex: TInt64Array;  // data file index
  t1: Cardinal;             // start tick; not declared in the original listing, where it may have been a field or global
begin
  t1 := GetTickCount;
  datasIndex := ScanEnterFile('R:\201201_dataFile.txt');
  Caption := Caption + '::' + IntToStr(GetTickCount - t1);
end;

The execution result is: 16782 ms

Here's the C# code:

/// <summary>
/// Scans a text file, counts its rows, and returns the start position of each row
/// (on 12 million rows, using a list is about 10 seconds faster than using an array).
/// </summary>
/// <param name="fileName">File name</param>
/// <param name="rowCount">Number of rows</param>
/// <param name="rowLeast">Minimum row length</param>
/// <param name="progress">Progress callback, invoked with the current row count</param>
/// <returns>Index list</returns>
public static IList<long> ScanEnterFile(string fileName, out int rowCount, int rowLeast, ThreadProgress progress)
{
    rowCount = 0;
    if (string.IsNullOrEmpty(fileName))
        return null;
    if (!System.IO.File.Exists(fileName))
        return null;
    IList<long> rList = new List<long>();
    int enterCount = 0; // number of line breaks found
    int checkValue;
    int addStep;
    // Open the file as a stream (the buffer size is 8 bytes, as in the measured version).
    using (FileStream myFile = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8))
    {
        myFile.Position = rowLeast;
        checkValue = myFile.ReadByte();
        while (checkValue != -1)
        {
            Application.DoEvents();
            addStep = -1;
            // ReadByte has already advanced the position by one byte.
            // So if the byte is the first character of the line break (CR), one more byte (LF) must be skipped;
            // if it is the second character (LF), nothing more needs to be skipped.
            if (checkValue == 13)
                addStep = 1;
            else if (checkValue == 10)
                addStep = 0;
            if (addStep >= 0)
            {
                enterCount++;
                rList.Add(myFile.Position + addStep);
                myFile.Seek(rowLeast + addStep, SeekOrigin.Current);
                progress(enterCount);
            }
            else
            {
                // ReadByte moved 1 byte; 1 more gives a stride of 2, matching the Delphi version.
                // The original Seek(2) gave a stride of 3, which can jump over a CR LF pair.
                myFile.Seek(1, SeekOrigin.Current);
            }
            checkValue = myFile.ReadByte();
        }
    }
    rowCount = enterCount + 1;
    return rList;
}

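Calling code:

Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
int rowCount;
// rowLeast (the minimum row length) is missing from the call in the original listing; its value is not given.
FileHelper.ScanEnterFile(@"R:\201201_dataFile.txt", out rowCount, rowLeast, OutputProgress);
useTime = stopwatch.ElapsedMilliseconds;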

The execution result is: 124925 ms
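
For comparison, here is a minimal C# sketch of the in-memory idea behind the Delphi version: it reads the file in large blocks and scans each block for line feeds, instead of calling ReadByte and Seek once per probe. The name ScanLineOffsets, the 1 MB block size, and the choice to check every byte (rather than using the rowLeast stride) are assumptions made for illustration. It is written as a .NET 6+ top-level program and has not been timed against the figures above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not from the original post: returns the byte offset at
// which each row starts. Works for "\n" and "\r\n" line endings, because the
// next row always begins right after the '\n' byte.
static List<long> ScanLineOffsets(Stream stream, int blockSize = 1 << 20)
{
    var offsets = new List<long> { 0 };  // the first row starts at offset 0
    var block = new byte[blockSize];
    long blockStart = 0;                 // file offset of block[0]
    int read;
    while ((read = stream.Read(block, 0, block.Length)) > 0)
    {
        for (int i = 0; i < read; i++)
        {
            if (block[i] == (byte)'\n')
                offsets.Add(blockStart + i + 1);
        }
        blockStart += read;
    }
    return offsets;  // if the data ends with '\n', the last entry equals its length
}

// Demo on in-memory data; for a real file, pass File.OpenRead(path) instead.
byte[] sample = Encoding.UTF8.GetBytes("ab\r\n中文\r\nxyz");
using var ms = new MemoryStream(sample);
Console.WriteLine(string.Join(",", ScanLineOffsets(ms)));  // prints 0,4,12
```

Because the offsets are counted in bytes, they can be passed directly to Stream.Seek when a particular row has to be read later.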

(After criticism and guidance from many readers: the byte-by-byte C# method does not load the file into memory but reads it one byte at a time, which is much slower than the Delphi method that reads the whole file into memory. It only suits old machines that are short on memory. Memory is cheap today, so the method is now obsolete. Following readers' advice, the version below uses ReadLine instead, and it takes about 6 seconds.)

public static IList<long> ScanEnterFile(string fileName, ThreadProgress progress)
{
    if (string.IsNullOrEmpty(fileName))
        return null;
    if (!System.IO.File.Exists(fileName))
        return null;
    IList<long> rList = new List<long>();
    rList.Add(0); // the first row starts at offset 0
    using (StreamReader sr = File.OpenText(fileName))
    {
        string rStr = sr.ReadLine();
        while (null != rStr)
        {
            // Start of the next row = start of this row + row length + 2 for CR LF.
            rList.Add(rList[rList.Count - 1] + rStr.Length + 2);
            rStr = sr.ReadLine();
            progress(rList.Count);
        }
    }
    return rList;
}

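Testing showed that this method returns wrong positions when the file contains Chinese characters: rStr.Length counts characters, while a Chinese character takes more than one byte in the file. I will update this post once I find a workaround.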
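
One possible workaround, sketched here under stated assumptions rather than taken from the original post: file positions are measured in bytes, so the offsets can be accumulated with Encoding.GetByteCount instead of string.Length. The helper name ScanWithByteCount is hypothetical, and the sketch assumes that the caller knows the file's encoding, that every row ends with the same line break (2 bytes for CR LF), and that the file has no byte order mark. It is written as a .NET 6+ top-level program.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical variant of the ReadLine approach: adds the encoded byte length
// of each row instead of its character count.
static List<long> ScanWithByteCount(Stream stream, Encoding encoding, int newlineBytes = 2)
{
    var offsets = new List<long> { 0 };  // the first row starts at offset 0
    using var reader = new StreamReader(stream, encoding);
    string? line;
    while ((line = reader.ReadLine()) != null)
    {
        long previous = offsets[offsets.Count - 1];
        offsets.Add(previous + encoding.GetByteCount(line) + newlineBytes);
    }
    return offsets;  // like the original, this includes one entry past the last row
}

// Demo: "中文" is 2 characters but 6 bytes in UTF-8.
byte[] data = Encoding.UTF8.GetBytes("ab\r\n中文\r\nxyz\r\n");
var result = ScanWithByteCount(new MemoryStream(data), Encoding.UTF8);
Console.WriteLine(string.Join(",", result));  // prints 0,4,12,17
```

With string.Length in place of GetByteCount, the same input would give 0,4,8,13, which is the kind of drift described above. Scanning the raw bytes for the line-feed byte avoids these assumptions for ASCII-compatible encodings such as UTF-8, since positions are then counted directly in bytes.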

In my tests, using IList<T> in C# was faster than using arrays.

Summary: every tool has its own value. Which one you choose depends on your own needs, and I am not taking either side here. In the end, what matters is getting the work done.

This is an original work by "efforts to lazy". When reposting, please credit the source: http://blog.csdn.net/kfarvid or http://www.cnblogs.com/kfarvid/

Http://www.cnblogs.com/kfarvid/archive/2012/01/12/2320692.html
