New packages for reading data into R, fast


Good news, everyone: on April 10, 2015, Hadley Wickham (the developer of the well-known ggplot2 and plyr packages, among others) and the RStudio team released two new packages, readr and readxl, which read text data and Excel spreadsheet data into R respectively. R already has plenty of functions for reading data, such as the read.table family and its many variants, so why did these experts develop two more packages? The reason is simple: both read data faster than R's built-in functions, and by a wide margin. If you don't believe it, try it yourself. Those who usually read small datasets may not notice a difference, but when the data is large, speed becomes a very visible advantage. Without further ado, let's dig in!

1) readr package example

The readr package provides several functions for reading tabular text data into R, adding extra functionality and speed. These jobs were usually done with the read.table family of functions; now they can be a lot easier!

First, take a look at read_table, the readr counterpart of read.table. The key point is that it is faster, and speed is an important reason this package exists, perhaps driven by the big data trend. Let's do an experiment: have the two functions read the same file of weather data with about 4 million records (data address: http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NYNEWYOR.txt) and see what we find!

Step 1

Look at the data format: there are four columns representing the day, month, year, and a numeric value (the temperature).

Step 2

Open R and run the following commands to see how long each one takes!

> library(readr)

> system.time(read_table(file = 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NYNEWYOR.txt', col_names = c('DAY', 'MONTH', 'YEAR', 'TEMP')))

   user  system elapsed
   3.30   11.06   14.43

> system.time(read.table(file = 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NYNEWYOR.txt', col.names = c('DAY', 'MONTH', 'YEAR', 'TEMP')))

   user  system elapsed
   1.92    1.62   96.10

The two commands look similar, but read.table takes about 96.1 seconds to finish while read_table finishes in under 15 seconds. (My slow computer may be to blame; the official figures are roughly 30 seconds for the former and less than a second for the latter. There is simply no comparison!) Why the difference? read_table treats the data as a fixed format and processes it quickly in C++ under the hood. By contrast, read.table supports any number of spaces between columns, whereas read_table requires every column to be neatly aligned, with no value sticking out of its column. That said, in practice the restriction is not as strict as it sounds.
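The timings above include downloading the file over the network, which can dominate the elapsed time. As a rough sketch, the comparison can be repeated on a locally generated file so that only parsing is measured. The file contents, the row count n, and the variable names here are made up for illustration, and the timings will vary with the machine and the readr version.

```r
library(readr)

set.seed(1)
n <- 100000  # illustrative row count, far smaller than the original file

# Synthetic values chosen so that every field has a constant width
# and the columns line up without any padding.
rows <- sprintf(
  "%d %d %d %.1f",
  sample(10:28, n, replace = TRUE),      # DAY
  sample(10:12, n, replace = TRUE),      # MONTH
  sample(1995:2014, n, replace = TRUE),  # YEAR
  round(runif(n, 10, 89.9), 1)           # TEMP
)
path <- tempfile(fileext = ".txt")
writeLines(rows, path)

cols <- c("DAY", "MONTH", "YEAR", "TEMP")

# Time base R and readr on the same local file.
t_base  <- system.time(a <- read.table(path, col.names = cols))
t_readr <- system.time(b <- read_table(path, col_names = cols))

print(t_base)
print(t_readr)
```

Running each call several times and comparing the elapsed column gives a steadier picture than a single run.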

Base R also has a function for reading fixed-width data, read.fwf. Let's compare it with readr's read_fwf and witness the magic of readr once more!

> system.time(dat <- read_fwf('http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NYNEWYOR.txt',

+     fwf_widths(c(3, 15, 16, 12),

+         col_names = c("DAY", "MONTH", "YEAR", "TEMP"))))

   user  system elapsed
   0.67    1.70    2.40

> system.time(dat2 <- read.fwf('http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NYNEWYOR.txt', c(3, 15, 16, 12),

+     col.names = c("DAY", "MONTH", "YEAR", "TEMP")))

   user  system elapsed
   0.73    0.49   89.03

This comparison shows just how powerful the readr package is!
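To see how the width vector maps onto the text, here is a small self-contained sketch that reads the same fixed-width file with both functions. The three records and the widths 3, 3, 5, 6 are invented for illustration and are not taken from the weather file.

```r
library(readr)

# Three made-up records in fixed-width fields of 3, 3, 5 and 6 characters.
sample_lines <- c(
  "  1  1 1995  50.7",
  "  2  1 1995  43.2",
  "  3  1 1995  30.1"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

widths <- c(3, 3, 5, 6)
cols   <- c("DAY", "MONTH", "YEAR", "TEMP")

# readr: widths and column names are bundled together by fwf_widths().
dat <- read_fwf(path, fwf_widths(widths, col_names = cols))

# Base R: widths and column names are passed as separate arguments.
dat2 <- read.fwf(path, widths = widths, col.names = cols)

print(dat)
print(dat2)
```

Both calls describe the layout with the same width vector; the difference is only in how the column names are attached.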

Of course, the above is only a simple example of the readr package. Other functions in readr include:

readr::read_csv: read a delimited file into a data frame.

readr::read_file: read a file into a string.

readr::fwf_empty: helper for reading a fixed-width file (used with read_fwf to guess column positions).

readr::read_lines: read lines from a file or string.

readr::read_log: read a common/combined log file.

readr::read_table: read a text file where columns are separated by whitespace.
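As a quick illustration of how three of the functions listed above differ in what they return, here is a minimal sketch that reads one small file three ways. The file contents and variable names are invented for the example.

```r
library(readr)

# A tiny made-up CSV file written to a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(
  c("DAY,MONTH,YEAR,TEMP",
    "1,1,1995,50.7",
    "2,1,1995,43.2"),
  path
)

df        <- read_csv(path)    # parsed into a data frame, header used for names
txt_lines <- read_lines(path)  # character vector, one element per line
txt_all   <- read_file(path)   # the whole file as a single string

print(df)
print(txt_lines)
print(nchar(txt_all))
```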

2) readxl package example

For data in Excel format there is the readxl package, which reads Excel files with the .xls and .xlsx extensions.

Note that the readxl package is hosted at https://github.com/hadley/readxl, so when installing you need to point at the readxl repository on GitHub.

> install.packages("devtools")  # install this package first; it makes installing readxl quick

> library(devtools)

> devtools::install_github("hadley/readxl")

The main function of the readxl package is read_excel, which is called as follows:

read_excel(path, sheet = 1, na = "", ...)
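A minimal sketch of a call, assuming a recent readxl release is already installed. It relies on readxl_example(), a helper in current readxl versions that returns the path of a sample workbook shipped with the package; that helper and the file name datasets.xlsx are not mentioned in the original text.

```r
library(readxl)

# Path to a sample workbook bundled with readxl (assumption: recent readxl).
path <- readxl_example("datasets.xlsx")

# Read the first sheet; cells containing "NA" or nothing become missing values.
first_sheet <- read_excel(path, sheet = 1, na = c("", "NA"))

print(head(first_sheet))
```

For your own data, replace path with the location of an .xls or .xlsx file; sheet also accepts a sheet name instead of a position.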

The usage is clear at a glance, so I won't go on about it. If you're interested, go and explore it yourself!
