R: How to filter large text files

Filtering file data in R is a common task, but sometimes a file is too large to be read into memory in one piece; it must then be read in batches, filtered batch by batch, and the results merged. The following example shows how R can filter data from a big file.
There is a 1 GB file, sales.txt, that stores a large number of order records. The task is to select the records whose amount field is between 2000 and 3000. The column delimiter of the file is "\t". The first few rows look like this:
[Figure: the first rows of sales.txt]
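Before writing the filter, the file layout can be confirmed directly in R (a minimal check, using the same path as in the solution below):

readLines("E:\\sales.txt", n = 5)    # the header line plus the first few records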
R language solution:
con <- file("E:\\sales.txt", "r")
readLines(con, n = 1)
result <- read.table(con, nrows = 100000, sep = "\t")
result <- result[result$V4 >= 2000 & result$V4 <= 3000, ]
repeat {
    # read.table raises an error at end of file; turn it into NULL
    databatch <- tryCatch(read.table(con, header = FALSE, nrows = 100000, sep = "\t"),
                          error = function(e) NULL)
    if (is.null(databatch)) break
    databatch <- databatch[databatch$V4 >= 2000 & databatch$V4 <= 3000, ]
    result <- rbind(result, databatch)
}
close(con)
Partial results:
[Figure: part of the filtered result]
Code explanation:
Line 1: open the file handle.
Line 2: discard the first line, that is, the column names.
Lines 3-4: read the first batch of 100,000 rows, filter it, and store the result.
Lines 5-12: read the remaining data cyclically, 100,000 rows per batch; each filtered batch is appended to result before the next batch is read. read.table raises an error once the file is exhausted, so tryCatch turns end of file into NULL and the loop ends.
Line 13: close the file handle.
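Once the loop finishes, result is an ordinary data frame, so the outcome can be inspected like any other (a hypothetical session; the actual values depend on the data):

nrow(result)    # number of orders with an amount between 2000 and 3000
head(result)    # the first few filtered records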
Note:
If the file is small, a single line of code is enough to read all of the data, and the first row can also be set as the column names of the data frame. This does not hold for a large file: when the data must be read in batches, the first row cannot be set as the column names for the second and later batches, which get the default column names V1, V2, V3, and so on.
Because a large file has to be read in batches, the algorithm must be implemented with an explicit loop statement, and column names cannot be used conveniently, which makes the whole code slightly more complicated.
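For contrast, here is a minimal sketch of the small-file case (assuming the header row names the amount column "amount"; the real name comes from the first line of the file): the whole file is read in one call, the first row becomes the column names, and the filter can refer to the column by name:

data <- read.table("E:\\sales.txt", header = TRUE, sep = "\t")   # one-shot read, first row = column names
result <- data[data$amount >= 2000 & data$amount <= 3000, ]      # filter by column name

From the second batch of a large file onward, only the default positional names such as V4 are available, which is why the solution above filters on V4.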
Alternative solutions:
The same algorithm can be implemented in Python, the Set calculator, Perl, and other languages. Like R, all of them can filter file data and compute over structured data. Below is a brief look at the Set calculator and Python solutions.
The Set calculator handles batching automatically, so the programmer does not have to control the batches with loop statements, and the code is very concise:
[Figure: the Set calculator code]
The cursor is the data type the Set calculator uses for structured data computation. It is used much like a data frame, but it is better suited to big files and complex computations. In addition, a cursor can read the first row of the file as the column names via the @t option.
Python's code structure is similar to R's, with the loop likewise controlled manually. However, plain Python lacks a structured data type such as the data frame or the cursor, so its code operates at a lower level:
result = []
myfile = open("E:\\sales.txt", 'r')
BUFSIZE = 10240000
myfile.readline()                  # discard the first line, i.e., the column names
lines = myfile.readlines(BUFSIZE)
while lines:
    for line in lines:
        record = line.split('\t')
        amount = float(record[3])  # the fourth field is the amount
        if 2000 <= amount <= 3000:
            result.append(record)
    lines = myfile.readlines(BUFSIZE)
myfile.close()
Python can also implement the algorithm above with a third-party package. For example, pandas offers a structured data object similar to the data frame. However, pandas' support for large files is likewise limited, so it is hard to simplify the code much further.