What is Feather?
Feather is a file format for fast data exchange between R and Python. It currently supports the R data.frame and the Python pandas DataFrame.
Feather is built on Apache Arrow, a new open source, top-level project of the Apache Software Foundation. Arrow is designed as a cross-platform data layer that speeds up big data analytics projects.
Features of Feather
Feather is a fast, lightweight, easy-to-use binary file format for storing data frames. The main features are as follows:
- Lightweight, with a minimal API
- Language-independent: supports Python and R, and the files can also be read from other languages
- High read and write performance
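The "easy-to-use" part largely comes down to two functions in the R package, write_feather() and read_feather(). Below is a small sketch of a round trip, assuming the feather package is installed. The data frame `scores` and the temporary file path are made up for illustration.

```r
library(feather)

# Hypothetical example data with mixed column types
scores <- data.frame(
  id    = 1:3,
  name  = c("a", "b", "c"),
  value = c(0.5, NA, 2.5),
  stringsAsFactors = FALSE
)

# Write to a temporary .feather file and read it back
path <- tempfile(fileext = ".feather")
write_feather(scores, path)
restored <- read_feather(path)

# read_feather() returns a tibble; convert before comparing with the original
roundtrip_ok <- isTRUE(all.equal(as.data.frame(restored), scores))
print(roundtrip_ok)

# Remove the temporary file
unlink(path)
```

The same file could then be opened from Python, which is the interchange use case the format is aimed at.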
Code Demo
My hardware configuration: Windows 7 (64-bit), 8 GB of memory, dual-core A6 CPU. Read and write times vary from machine to machine, so readers can run the code below and see for themselves.
The feather package was introduced on March 29 on RStudio's official blog. Because it had only just been posted on GitHub, Windows users had to compile and install it with the GCC 4.9.3 toolchain, which was cumbersome. The feather package was officially released on CRAN today, so under R 3.3.0 it can now be installed with install.packages(). Windows users who have not yet upgraded to R 3.3.0 can refer to the article "Hand in hand: teaching you to upgrade R in a Windows environment". Let's try out how fast the feather package is in R:
library(feather)

x <- runif(1e7)
x[sample(1e7, 1e6)] <- NA  # 10% NA values
df <- as.data.frame(replicate(10, x))

# Memory footprint
format(object.size(df), "MB")
# [1] "762.9 Mb"

# Write the data out
system.time(write_feather(df, "test.feather"))
#    user  system elapsed
#    3.97    3.37   29.47

# Read the data back in
system.time(read_feather("test.feather"))
#    user  system elapsed
#    3.83    3.51   50.39

# Look at the first few rows
data <- read_feather("test.feather")
head(data)
class(data)
# [1] "tbl_df"     "tbl"        "data.frame"
I originally wanted to compare the speed of the feather package with that of the readr package, but my computer was not up to it: writing the data out with readr ran for nearly an hour without finishing, so I gave up. Readers interested in the readr package can refer to its introduction here.
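The comparison that was abandoned above could be attempted on a much smaller data frame. The following is only a sketch, assuming both feather and readr are installed. The row count of 1e5 and the temporary file names are arbitrary choices for illustration, not figures from the original test, and no timings are shown because they depend on the machine.

```r
library(feather)
library(readr)

# Much smaller than the 1e7-row test above, so it finishes on modest hardware
n <- 1e5
x <- runif(n)
x[sample(n, n / 10)] <- NA  # 10% NA values
small_df <- as.data.frame(replicate(10, x))

feather_path <- tempfile(fileext = ".feather")
csv_path <- tempfile(fileext = ".csv")

# Time writing: binary feather file vs. text CSV via readr
t_write_feather <- system.time(write_feather(small_df, feather_path))
t_write_csv <- system.time(write_csv(small_df, csv_path))

# Time reading the same files back
t_read_feather <- system.time(from_feather <- read_feather(feather_path))
t_read_csv <- system.time(from_csv <- read_csv(csv_path))

# Collect the elapsed times (in seconds) side by side
timings <- data.frame(
  package = c("feather", "readr"),
  write   = c(t_write_feather[["elapsed"]], t_write_csv[["elapsed"]]),
  read    = c(t_read_feather[["elapsed"]], t_read_csv[["elapsed"]])
)
print(timings)

# Remove the temporary files
unlink(c(feather_path, csv_path))
```

Note that this compares a binary format with a text format, so it measures the two packages on their typical use rather than on identical work.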
Summary
Feather is fast, but it is still under development. Its developers say it is not suitable for long-term storage and do not guarantee compatibility between versions. Even so, it works well for exchanging data between R and Python. Importing 762.9 MB of data took only 50.39 seconds, so the feather package is well worth having.
Reference article:
- Feather: A Fast On-Disk Format for Data Frames for R and Python, Powered by Apache Arrow
- Feather: an on-disk storage format for exchanging data between R and Python
This article was compiled by Xueqing Data Network. When reprinting, please cite this article's link: http://www.xueqing.tv/cms/article/210