R: How to filter large text files

Filtering file data in R is a common task, but sometimes a file is too large to be read into memory in one piece; it must then be read in batches, filtered batch by batch, and the results merged. The following example shows how R can filter data from a big file.
There is a 1 GB file, sales.txt, that stores a large number of order records. The task is to select the records whose amount field is between 2000 and 3000. The column delimiter of the file is "\t". The first few rows look like this:
[Figure: the first rows of sales.txt]
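Before writing the filter, the file layout can be confirmed directly in R (a minimal check, using the same path as in the solution below):

readLines("E:\\sales.txt", n = 5)    # the header line plus the first few records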
R language solution:
con <- file("E:\\sales.txt", "r")
readLines(con, n = 1)
result <- read.table(con, nrows = 100000, sep = "\t")
result <- result[result$V4 >= 2000 & result$V4 <= 3000, ]
repeat {
    # read.table raises an error at end of file; turn it into NULL
    databatch <- tryCatch(read.table(con, header = FALSE, nrows = 100000, sep = "\t"),
                          error = function(e) NULL)
    if (is.null(databatch)) break
    databatch <- databatch[databatch$V4 >= 2000 & databatch$V4 <= 3000, ]
    result <- rbind(result, databatch)
}
close(con)
Partial results:
[Figure: part of the filtered result]
Code explanation:
Line 1: open the file handle.
Line 2: discard the first line, that is, the column names.
Lines 3-4: read the first batch of 100,000 rows, filter it, and store the result.
Lines 5-12: read the remaining data cyclically, 100,000 rows per batch; each filtered batch is appended to result before the next batch is read. read.table raises an error once the file is exhausted, so tryCatch turns end of file into NULL and the loop ends.
Line 13: close the file handle.
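Once the loop finishes, result is an ordinary data frame, so the outcome can be inspected like any other (a hypothetical session; the actual values depend on the data):

nrow(result)    # number of orders with an amount between 2000 and 3000
head(result)    # the first few filtered records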
Note:
If the file is small, a single line of code is enough to read all of the data, and the first row can also be set as the column names of the data frame. This does not hold for a large file: when the data must be read in batches, the first row cannot be set as the column names for the second and later batches, which get the default column names V1, V2, V3, and so on.
Because a large file has to be read in batches, the algorithm must be implemented with an explicit loop statement, and column names cannot be used conveniently, which makes the whole code slightly more complicated.
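For contrast, here is a minimal sketch of the small-file case (assuming the header row names the amount column "amount"; the real name comes from the first line of the file): the whole file is read in one call, the first row becomes the column names, and the filter can refer to the column by name:

data <- read.table("E:\\sales.txt", header = TRUE, sep = "\t")   # one-shot read, first row = column names
result <- data[data$amount >= 2000 & data$amount <= 3000, ]      # filter by column name

From the second batch of a large file onward, only the default positional names such as V4 are available, which is why the solution above filters on V4.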
Alternative solutions:
The same algorithm can be implemented in Python, the Set calculator, Perl, and other languages. Like R, all of them can filter file data and compute over structured data. Below is a brief look at the Set calculator and Python solutions.
The Set calculator handles batching automatically, so the programmer does not have to control the batches with loop statements, and the code is very concise:
[Figure: the Set calculator code]
The cursor is the data type the Set calculator uses for structured data computation. It is used much like a data frame, but it is better suited to big files and complex computations. In addition, a cursor can read the first row of the file as the column names via the @t option.
Python's code structure is similar to R's, with the loop likewise controlled manually. However, plain Python lacks a structured data type such as the data frame or the cursor, so its code operates at a lower level:
result = []
myfile = open("E:\\sales.txt", 'r')
BUFSIZE = 10240000
myfile.readline()                  # discard the first line, i.e., the column names
lines = myfile.readlines(BUFSIZE)
while lines:
    for line in lines:
        record = line.split('\t')
        amount = float(record[3])  # the fourth field is the amount
        if 2000 <= amount <= 3000:
            result.append(record)
    lines = myfile.readlines(BUFSIZE)
myfile.close()
Python can also implement the algorithm above with a third-party package. For example, pandas offers a structured data object similar to the data frame. However, pandas' support for large files is likewise limited, so it is hard to simplify the code much further.