Quickly scan a text file, count its rows, and return the start offset of each row (Delphi, C#)

Source: Internet
Author: User

The project needs to scan a text file of 12 million lines. Based on guidance and tests from other readers, the gap between C# and Delphi is not large. Without further ado, here is the test code:

The following is the Delphi code:

// Scan a file for line breaks and return the start offset of each line.
// RowLeast (minimum row length) and PerArray (array growth step) are
// constants defined elsewhere in the unit. CRLF line endings are assumed.
function ScanEnterFile(const FileName: string): TInt64Array;
var
  MyFile: TMemoryStream;  // in-memory copy of the file
  RArray: TInt64Array;    // row index result set
  Size, CurIndex: Int64;  // file size, current position in the stream
  EnterCount: Int64;      // number of line breaks found
  PC: PAnsiChar;          // byte pointer (PChar is two bytes wide in Unicode Delphi)
  ArrayCount: Int64;      // current size of the index array
  AddStep: Integer;       // distance from the detected byte to the next line start
begin
  Result := nil;
  if (FileName = '') or not FileExists(FileName) then
    Exit;
  MyFile := TMemoryStream.Create;  // create the stream
  try
    MyFile.LoadFromFile(FileName);  // load the whole file into memory
    Size := MyFile.Size;
    PC := MyFile.Memory;  // point the byte pointer at the stream's memory
    CurIndex := RowLeast;
    EnterCount := 0;
    ArrayCount := PerArray;
    SetLength(RArray, ArrayCount);
    RArray[0] := 0;  // the first line starts at offset 0
    // Step two bytes at a time: a CRLF pair is two bytes wide,
    // so one of its bytes is always hit.
    while CurIndex < Size do
    begin
      AddStep := 0;
      if Ord(PC[CurIndex]) = 13 then
        AddStep := 2   // on CR: skip CR and LF
      else if Ord(PC[CurIndex]) = 10 then
        AddStep := 1;  // on LF: skip LF only
      // Handle the line break
      if AddStep <> 0 then
      begin
        Application.ProcessMessages;
        // Add a record
        Inc(EnterCount);
        // Grow the array when it is full
        if EnterCount mod PerArray = 0 then
        begin
          ArrayCount := ArrayCount + PerArray;
          SetLength(RArray, ArrayCount);
        end;
        RArray[EnterCount] := CurIndex + AddStep;
        CurIndex := CurIndex + AddStep + RowLeast;
      end
      else
        CurIndex := CurIndex + 2;
    end;
    SetLength(RArray, EnterCount + 1);  // trim the unused slots
    Result := RArray;
  finally
    MyFile.Free;
  end;
end;

Code to run it:

procedure TMainForm.btn2Click(Sender: TObject);
var
  DatasIndex: TInt64Array;  // data file index
  T1: Cardinal;             // tick count at start
begin
  T1 := GetTickCount;
  DatasIndex := ScanEnterFile('R:\201201_datafile.txt');
  Caption := Caption + '::' + IntToStr(GetTickCount - T1);
end;

The execution result is 16782 ms.

Below is the C# code:

// Requires: using System.Collections.Generic; using System.IO;
/// <summary>
/// Scans a text file, counts the rows, and returns the start offset of each row.
/// (On 12 million rows, using IList was about 10 seconds faster than using an array.)
/// </summary>
/// <param name="fileName">File name</param>
/// <param name="rowCount">Number of rows</param>
/// <param name="rowLeast">Minimum length of a row</param>
/// <param name="progress">Progress callback, called with the running line-break count</param>
/// <returns>Index list</returns>
public static IList<long> ScanEnterFile(string fileName, out int rowCount, int rowLeast, ThreadProgress progress)
{
    rowCount = 0;
    if (string.IsNullOrEmpty(fileName))
        return null;
    if (!System.IO.File.Exists(fileName))
        return null;
    IList<long> rList = new List<long>();
    int enterCount = 0; // number of line breaks
    int checkValue;
    int addStep;
    // Open the file as a stream with an 8-byte buffer
    using (FileStream myFile = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8))
    {
        myFile.Position = rowLeast;
        checkValue = myFile.ReadByte();
        while (checkValue != -1)
        {
            // Application.DoEvents();
            addStep = -1;
            // ReadByte has already advanced the file position by one byte.
            // On CR (the first byte of CRLF), one more byte must be skipped;
            // on LF (the second byte), nothing more needs to be skipped.
            if (checkValue == 13)
                addStep = 1;
            else if (checkValue == 10)
                addStep = 0;
            if (addStep >= 0)
            {
                enterCount++;
                rList.Add(myFile.Position + addStep);
                myFile.Seek(rowLeast + addStep, SeekOrigin.Current);
                progress(enterCount);
            }
            else
            {
                // Together with the byte just read this is a stride of 2, as in the
                // Delphi version; skipping 2 here (a stride of 3) could miss a CRLF pair.
                myFile.Seek(1, SeekOrigin.Current);
            }
            checkValue = myFile.ReadByte();
        }
    }
    rowCount = enterCount + 1;
    return rList;
}

Code to run it:

Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
int rowCount;
FileHelper.ScanEnterFile(@"R:\201201_datafile.txt", out rowCount, 35, OutputProgress);
useTime = stopwatch.ElapsedMilliseconds;

The execution result is 124925 ms.

(After criticism and guidance from many readers: this method does not load the file into memory but reads it one byte at a time, which is much slower than the Delphi approach of loading all the bytes into memory. It only suits old machines that are short on memory; memory is cheap now, so the method is outdated. Following readers' advice, the version below uses the ReadLine method and runs in about 6 seconds.)

public static IList<long> ScanEnterFile(string fileName, ThreadProgress progress)
{
    if (string.IsNullOrEmpty(fileName))
        return null;
    if (!System.IO.File.Exists(fileName))
        return null;
    IList<long> rList = new List<long>();
    rList.Add(0); // the first line starts at offset 0
    using (StreamReader sr = File.OpenText(fileName))
    {
        string rStr = sr.ReadLine();
        while (rStr != null)
        {
            // Next line start = previous start + line length + 2 (CRLF)
            rList.Add(rList[rList.Count - 1] + rStr.Length + 2);
            rStr = sr.ReadLine();
            progress(rList.Count);
        }
    }
    return rList;
}
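
The in-memory strategy of the Delphi version can also be tried in C#. The following is only a minimal sketch, not the author's code: the class name LineIndexSketch and both method signatures are hypothetical. It assumes the file fits in a single byte array and uses an ASCII-compatible encoding such as UTF-8 or GBK, in which the byte values 13 and 10 occur only as line breaks. Unlike the versions above, it checks every byte instead of skipping rowLeast bytes per line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the original post.
public static class LineIndexSketch
{
    // Returns the byte offset at which each line starts.
    // A line ends at CRLF, at a lone LF, or at a lone CR.
    public static List<long> ScanLineStarts(byte[] data)
    {
        var starts = new List<long> { 0 }; // the first line starts at offset 0
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == 13)
            {
                // Treat CR LF as a single line break.
                if (i + 1 < data.Length && data[i + 1] == 10)
                    i++;
                starts.Add(i + 1);
            }
            else if (data[i] == 10)
            {
                starts.Add(i + 1);
            }
        }
        return starts;
    }

    // Assumption: the whole file fits in one byte array.
    public static List<long> ScanLineStarts(string fileName)
    {
        return ScanLineStarts(File.ReadAllBytes(fileName));
    }
}
```

If the file ends with a line break, the last entry equals the file length and marks an empty final line, which matches what the ReadLine version above records.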

 

Testing shows that if the file contains Chinese characters, the positions this method computes are incorrect, because rStr.Length counts characters rather than bytes. I will update this post once I find a solution.
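
One possible way to keep the ReadLine approach while still getting byte positions is to count the encoded bytes of each line instead of its characters. This is a sketch under stated assumptions, not a tested fix from the author: the class name EncodedLineIndexSketch and the newlineBytes parameter are hypothetical, the caller must know the file's encoding, the file must not start with a byte-order mark, and every line (including the last) must end with the same terminator of newlineBytes bytes, for example 2 for CRLF.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original post.
public static class EncodedLineIndexSketch
{
    // Returns the byte offset of each line start, computed from the encoded
    // size of each line plus a fixed line-terminator size.
    public static List<long> ScanLineStarts(TextReader reader, Encoding encoding, int newlineBytes)
    {
        var starts = new List<long> { 0 }; // the first line starts at offset 0
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // GetByteCount measures the encoded bytes, so multi-byte
            // characters are counted at their real size.
            long lineBytes = encoding.GetByteCount(line);
            starts.Add(starts[starts.Count - 1] + lineBytes + newlineBytes);
        }
        return starts;
    }
}
```

For a UTF-8 file with CRLF endings, a call could look like ScanLineStarts(new StreamReader(fileName, Encoding.UTF8), Encoding.UTF8, 2). Because ReadLine discards the terminator, this cannot detect mixed line endings; scanning the raw bytes avoids that limitation.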

Testing also shows that using IList in C# is faster than using an array.

Summary: every tool has its own value. Which one to choose depends on your own needs, and I am not biased toward either side here. The rest is not important.

This is an original work. When reprinting, please cite the source: http://blog.csdn.net/kfarvid or http://www.cnblogs.com/kfarvid/
