R Study Notes (4): use external data

Source: Internet
Author: User
Tags eol odbc connection one more line

In view of the non-persistent and capacity constraints of the memory, an effective data processing tool must be able to use external data: be able to obtain a large amount of data from the outside, and be able to save the processing results. R provides a series of functions for external data processing. The types of external data can be divided into files, databases, and networks. file operations can also be divided into import/export operations and stream operations.

Table of Contents
  • 1 Data box

    • 1.1 List
    • 1.2 data frame
    • 1.3 edit a Data box
  • 2. Import and export CSV files
    • 2.1 File Format
    • 2.2 read. table () and write. table ()
    • 2.3 scan () and cat ()
  • 3. Use connection)
    • 3.1 connection type
    • 3.2 output to connection
    • 3.3 input from connection
    • 3.4 binary connection
  • 4. Some specific file formats
  • 5. Use relational databases
    • 5.1 packages of DBI and RMySQL
    • 5.2 RODBC
  • 6. Network Interfaces and external tools
  • 7. Process Big Data
1 Data box

As mentioned above:

Lists and data frames are the generalization of vectors and matrices. A list can contain different types of elements or even objects as elements; the Data box allows different types of elements for each column. For lists and data boxes, the elements are usually called components ).

Because external data processing involves data boxes, the list and data boxes are described in more detail here.

1.1 List

The list components can be of different types. You can use the list () function to create a list:

> x = list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9))> x$name[1] "Fred"$wife[1] "Mary"$no.children[1] 3$child.ages[1] 4 7 9

The list elements can be accessed through several different indexes:

> x[[1]][1] "Fred"> x[1][1]$name[1] "Fred"> x["name"][1]$name[1] "Fred"> x$name[1] "Fred"
1.2 data frame

A Data box is a special list and a structure similar to a matrix. In the data box, columns can be different objects. You can think of a Data box as a 'data matrix 'that represents an observed individual and (possibly) has both numerical and classification variables. rows and columns can be accessed through a matrix index.

The following is an example:

> L3 = LETTERS[1:3]> L3[1] "A" "B" "C"> d = data.frame(cbind(x = 1, y = 1:10), fac = sample(L3, 10, replace = TRUE))> d   x  y fac1  1  1   B2  1  2   A3  1  3   C4  1  4   C5  1  5   B6  1  6   A7  1  7   A8  1  8   B9  1  9   B10 1 10   A> d$x [1] 1 1 1 1 1 1 1 1 1 1> d[[2]] [1]  1  2  3  4  5  6  7  8  9 10> d[1][1]   x1  12  13  14  15  16  17  18  19  110 1

Data boxes are very useful tools. R also provides functions for merging data boxes. You can use the merge () function to merge two data boxes with the same columns. You can specify which columns to install for merge:

> x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)> y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)> x  k1 k2 data1 NA  1    12 NA NA    23  3 NA    34  4  4    45  5  5    5> y  k1 k2 data1 NA NA    12  2 NA    23 NA  3    34  4  4    45  5  5    5> merge(x,y)  k1 k2 data1  4  4    42  5  5    5> merge(x,y,by='k1')  k1 k2.x data.x k2.y data.y1  4    4      4    4      42  5    5      5    5      53 NA    1      1   NA      14 NA    1      1    3      35 NA   NA      2   NA      16 NA   NA      2    3      3> merge(x, y, by = c("k1","k2"))  k1 k2 data.x data.y1  4  4      4      42  5  5      5      53 NA NA      2      1
1.3 edit a Data box

The data. entry () function can open the data editor, but it is not applicable to data boxes. If data. entry () is used to modify the data box, it is converted to the list type.

You need to use the edit () function to edit the data box:

> The xnew = edit (xold) edit () function is only for editing and does not assign values. To directly modify the data box, use the following format:> x = edit (x)> fix (x) # equivalent to the preceding format
2. Import and export CSV files

In R, the read. table () function is used to process text files and read data into data frames. To control the import method in a complicated way, the old scan () function can be used (). Note: In addition to processing file imports, scan can also directly accept keyboard input.

At the beginning of this series, we mentioned the workspace. You can use the getwd () and setwd () functions to obtain/set the workspace directory. Use list. files () to view the files in the current directory.

For text files in a workspace, you can use relative paths. For other files, use absolute paths.

2.1 File Format

R supports a wide range of file formats, including CSV, FIX, DIF, XML, and binary formats such as DBF, XLS, HDF5, and netCDF.

For CSV files, R considers the most ideal format as follows:

  Price Floor Area Rooms Age Cent. heat
01 52.00 111.0 830 5 6.2 No
02 54.75 128.0 710 5 7.5 No
03 57.50 101.0 1000 5 4.2 No
04 57.50 131.0 690 6 8.8 No
05 59.75 93.0 900 5 1.9 Yes

That is, the name of each component in the first behavior data box, the first item in each subsequent row is the row label, and the rest is the data.

If the default format is not met, you must specify specific parameters in the import function.

2.2 read. table () and write. table ()

The most common method is to use the read. table () function and write. table () function to process the import and export of CSV files. The read () and write () functions can only process specific columns of a matrix or vector, while read. table () and write. table () functions can process data boxes containing row and column tags.

The read. talbe () function reads the file and returns a Data box:

read.table(file, header = FALSE, sep = "", quote = "\"'",           dec = ".", row.names, col.names,           as.is = !stringsAsFactors,           na.strings = "NA", colClasses = NA, nrows = -1,           skip = 0, check.names = TRUE, fill = !blank.lines.skip,           strip.white = FALSE, blank.lines.skip = TRUE,           comment.char = "#",           allowEscapes = FALSE, flush = FALSE,           stringsAsFactors = default.stringsAsFactors(),           fileEncoding = "", encoding = "unknown", text)

Some main parameters:

  • File: the file to be processed. You can use a string to specify a file name or a function, for example, file ('file. dat ', encoding = 'utf-8 ')
  • Header: whether the first line is a field name. If this parameter is not specified, read. table () is determined based on the row label. That is, if the first row has one fewer column than the following row, it is the header row.
  • Col. names: if specified, replace the column name in the first row with the specified name
  • Sep: Specifies the separator. By default, blank characters (spaces, tabs, line breaks, etc.) are used ). Can be specified as '', '\ t', etc.
  • Quote: Specifies a string separator, such as "or'
  • Na. strings: Specifies the defect value. The default value is NA.
  • Fill: whether or not the last line field is ignored in the file. If yes, it must be set to TRUE.
  • Strip. white: whether to remove the white space at the beginning and end of the string field
  • Blank. lines. skip: whether to ignore blank lines. The default value is TRUE. If you want to set this parameter to FALSE, both fill = TRUE must be specified.
  • ColClasses: Specifies the Data Type of each column.
  • Comment. char: annotator. # Is used as the annotation symbol by default. If there is no annotation in the file, specifying comment. char = "" is safer (or faster)

For ease of use, the read. table () function also provides some variants that set the default values for some parameters of read. table:

Read.csv (file, header = TRUE, sep = ",", quote = "\" ", dec = ". ", fill = TRUE, comment. char = "", ...20.read.csv 2 (file, header = TRUE, sep = ";", quote = "\" ", dec =", ", fill = TRUE, comment. char = "",...) read. delim (file, header = TRUE, sep = "\ t", quote = "\" ", dec = ". ", fill = TRUE, comment. char = "",...) read. delim2 (file, header = TRUE, sep = "\ t", quote = "\" ", dec =", ", fill = TRUE, comment. char = "",...) write. table () has fewer parameters: write. table (x, file = "", append = FALSE, quote = TRUE, sep = "", eol = "\ n", na = "NA", dec = ". ", row. names = TRUE, col. names = TRUE, qmethod = c ("escape", "double"), fileEncoding = "")

Some parameters are described as follows:

  • X name of the object to be written
  • File Name (the missing time object is directly "written" on the screen)
  • Whether append is an incremental write
  • Quote a logical or numeric vector: if it is TRUE, the numeric variables and factors are written in the double quotation mark; if quote is a numeric vector, it indicates the column label of the columns to be written in. (In both cases, the variable name is written in ""; if quote = FALSE, the variable name is not included in double quotation marks)
  • Field Separator in the sep File
  • Eol specifies the end of the line. The default value is '\ n'
  • Na indicates characters of missing data
  • Decimal point characters
  • Row. names: A logical value that determines whether the row name is written to a file.
  • Col. names a logical value (determines whether the column name is written to the file); or specify a vector type to be written to the file as the column name
  • Qmethod if quote = TRUE, this parameter is used to specify the double quotation marks in the variable type. "How to Deal with it: if the parameter value is" escape "(or" e ", the default value), each" replaces; if the value is "d", replace each "".

Similarly, write. table () provides some variants:

Write.csv (...)

Write.csv 2 (...)

Example: Save the preceding example as a file in the workspace and run the following command:

> x = read.table('sample.csv',sep='\t')> x  V1    V2    V3   V4    V5  V6        V71 NA Price Floor Area Rooms Age Cent.heat2  1 52.00 111.0  830     5 6.2        no3  2 54.75 128.0  710     5 7.5        no4  3 57.50 101.0 1000     5 4.2        no5  4 57.50 131.0  690     6 8.8        no6  5 59.75  93.0  900     5 1.9       yes

Use fix(x(after editing data, use write.table(x,'sample1.csv ') to save.

2.3 scan () and cat ()

Read. table () is very convenient, but the efficiency of processing large matrices is very low. For example, you can experiment with a matrix operation that is not very big (200x2000:

> Write. table (matrix (rnorm (200*2000), 200), "matrix. dat ", row. names = F, col. names = F)> A <-. matrix (read. table ("matrix. dat ") # It takes about 7 seconds

Read. table () calls scan () to read the file and process the result. If you use scan () to read data directly, the efficiency will be higher:

> A <-matrix (scan ("matrix. dat", n = 200*2000), 200,200 0, byrow = TRUE) # It takes about 2 seconds

When the matrix size is larger, this difference is more prominent. The scan () function is more flexible than read. table (). A major difference is that scan () can specify the type of the variable to avoid overhead caused by type verification:

scan(file = "", what = double(), nmax = -1, n = -1, sep = "",     quote = if(identical(sep, "\n")) "" else "'\"", dec = ".",     skip = 0, nlines = 0, na.strings = "NA",     flush = FALSE, fill = FALSE, strip.white = FALSE,     quiet = FALSE, blank.lines.skip = TRUE, multi.line = TRUE,     comment.char = "", allowEscapes = FALSE,     fileEncoding = "", encoding = "unknown", text)

Similarly, cat () functions are more flexible than write. table:

cat(... , file = "", sep = " ", fill = FALSE, labels = NULL,    append = FALSE)
3. Use connection)

Connections in R provide a set of functions to flexibly point to interfaces similar to file objects, instead of using file names. Follow these steps:

  1. Create a connection
  2. Open connection
  3. Operation data
  4. Close connection

In R, the showConnections () function can be used to list connections opened by the current user. The showConnections (all = TRUE) function allows you to view the summary of all connections, including closed or terminated connections.

3.1 connection type

R can regard many data sources as connections, including:

  • The file () function creates a file connection to open a text file or binary file. For gzip or bzip2 compressed files, you can use the gzfile () and bzfile () functions to create connections.
  • In standard I/o r, stdin (), stdout (), and stderr () functions can be used to establish connections to standard I/O. These connections can be used directly without being opened and cannot be closed.
  • In character vector R, a character vector can even be used as input or output. Use the textConnection () function to create a connection to the Character Vector.
  • Pipelines in pipelines (Pipes) UNIX have extraordinary significance and can easily implement inter-process communication. The R function pipe () can create pipeline connections.
  • URL

URL-type http: //, ftp: // and // localhost/can read content through the function url. For convenience, file can also accept such file specifications and call URLs.

  • Socket

Function socketConnection () can create a socket connection

3.2 output to connection

Let's look at the example:

Zz <-file ("ex. data "," w ") # open an output file and connect to cat (" TITLE extra line "," 2 3 5 7 "," "," 11 13 17 ", file = zz, sep = "\ n") cat ("One more line \ n", file = zz) close (zz) # use a pipe (Unix) convert the decimal point to comma # R string and (possibly) in the output) in the SHELL script, the \ write zz <-pipe (paste ("sed s /\\\\. //,/> "," outfile ")," w ") cat (format (round (rnorm (100), 4), sep =" \ n ", file = zz) close (zz) # view the output file: file. show ("outfile", delete. file = TRUE) # capture R output: Use the zz <-textConnection ("ex. lm. out "," w ") sink (zz) example (lm, prompt. echo = ">") sink () close (zz) # Now 'ex. lm. out' contains the output content to be further processed # view the content, such as cat (ex. lm. out, sep = "\ n ")
3.3 input from connection

The basic functions for reading data from a connection are scan and readLines. These functions have a string as the input parameter. When a function is called, a file connection is opened, but the explicit file connection allows a file to be read in different formats consecutively. Other functions that call scan can also use connections, especially read. table. Some simple examples are as follows:

# Read the file readLines ("ex. data ") unlink (" ex. data ") # Read the current directory list (Unix) readLines (pipe (" ls-1 ") # Remove the trailing comma from the input file. # Assume that we have a file containing the following 'data', 450,390,467,654 # And then read the scan (pipe ("sed-e s/, $/data") using the following command "), sep = ",")

You can also use the pressure stack operation to input data from the connection. Similar to the ungetc function in C, the pushBack () function in R can push any data to the connection. The pushed data is stored in a stack (FILO ). If the stack is not empty, data is retrieved from the stack. If the stack is empty, data is input from the connection.

Example:

> zz <- textConnection(LETTERS)     > readLines(zz, 2)     [1] "A" "B"     > scan(zz, "", 4)     Read 4 items     [1] "C" "D" "E" "F"     > pushBack(c("aa", "bb"), zz)     > scan(zz, "", 4)     Read 4 items     [1] "aa" "bb" "G"  "H"     > close(zz)

The pressure stack operation is only applicable to text input mode connections.

3.4 binary connection

When you open a connection, use 'B' to set the binary mode, such as 'rb' and 'wb '. You can use the readBin () and writeBin () functions to read and write binary data. Function Description:

readBin(con, what, n = 1, size = NA, endian = .Platform$endian)writeBin(object, con, size = NA, endian = .Platform$endian)

Where:

  • The connection that con wants to open. If a string is given, it is assumed to be the file name.
  • What indicates the type/mode of the vector. For example, numeric, integer, logical, character, complex, or raw. You can use functions such as integer () or strings such as 'integer' as parameters.
  • N maximum number of elements to be read
  • Size indicates the number of bytes. For example, you can set the size to read and write 16-bit integers or single-precision real numbers.
  • The object to be written must be an atomic vector object, that is, a vector in the numeric, integer, logical, character, complex, or raw mode without attributes. By default, these write files with the same byte stream as the memory.
4. Some specific file formats

DBF File: read and write using the read. dbf () and write. dbf () functions.

XLS file: it is best to convert the file to csv and then import it. If you must use XLS directly, you can use RODBC for the operation. Refer to the database section below;

FIX file: Use read. fwf () and read. fortran () to import;

DIF file: Use read. DIF () to import; you can also use read. DIF ('clipboard') to read data from the clipboard;

XML file: the package XML provides support for xml files.

HDF5 file: Use the package hdf5 for processing

NetCDF file: Use the package RNetCDF for processing

Foreign Package provides some functions that can be imported into EpiInfo, Minitab, S-PLUS, SAS, SPSS, Stata, Systat, Ave ave and other software data files; Stata and SPSS data files can be exported.

5. Use relational databases

R provides packages for connecting to databases at different abstraction levels, such as DBI at the underlying layer, RMySQL at the upper layer, ROracle, RSQlite, and RODBC.

There is also a project that embeds R into PostgreSQL: http://www.joeconway.com/plr.

5.1 packages of DBI and RMySQL

MySQL is a common open source database. CRAN package RMySQL provides access to the MySQL database:

  • Use dbDriver ("MySQL") to obtain the database connection management object. Similarly, other packages also provide the dbDriver ("Oracle") and dbDriver ("SQLite") methods.
  • Call dbConnect to open a database connection
  • Use dbSendQuery () or dbGetQuery () to send a query. DbGetQuery transmits the query statement and returns the result in a data box. DbSendQuery transmits a query. The returned result is an object that inherits a subclass of "DBIResult. The "DBIResult" class can be used to obtain results. You can also call dbClearResult to clear the results.
  • Use the fetch () function to obtain some or all rows of the query results and return the results in a list. The function dbHasCompleted determines whether all rows have been obtained, while dbGetRowCount returns the number of rows in the result.
  • The functions dbReadTable and dbWriteTable can transmit data between the R Data box and the database table. The row name of the Data box is mapped to the rownames field of the MySQL table.
  • DbDisconnect () is used to close the database connection.

The following is an example:

> Library (RMySQL) # will load DBI as well # open a MySQL database connection> con <-dbConnect (dbDriver ("MySQL"), dbname = "test ") # list database tables> dbListTables (con) # import a data box to the database and delete any existing copies> data (USArrests)> dbWriteTable (con, "arrests ", USArrests, overwrite = TRUE) TRUE> dbListTables (con) [1] "arrests" # obtain the entire table> dbReadTable (con, "arrests ") murder Assault UrbanPop RapeAlabama 13.2 236 58 21.2 Alaska 10.0 263 48 44.5 Arizona 8.1 294 80 31.0 Arkansas 8.8 190 50 19. 5... # query from the imported table> dbGetQuery (con, paste ("select row_names, Murder from arrests", "where Rape> 30 order by Murder ")) row_names Murder1 Colorado 7.92 Arizona 8.13 California 9.04 Alaska 10.05 New Mexico ICO 11.46 Michigan 12.17 Nevada 12.28 Florida 15.4> dbRemoveTable (con, "arrests")> dbDisconnect (con)
5.2 RODBC

The RODBC package in CRAN provides the ODBC access interface:

  • OdbcConnect or odbcDriverConnect (in the Windows graphical interface, you can select a database through the dialog box) can open a connection and return a handle for subsequent database access control ). Printing a connection gives the ODBC connection details, while calling odbcGetInfo gives the client and server details.
  • Detailed information about a table in a connection can be obtained through the sqlTables function.
  • Function sqlSave copies the R Data box to a database table, and function sqlFetch copies the table in a database to an R Data box.
  • When querying through sqlQuery, the returned result is the R Data box. (SqlCopy transmits a query to the database, and the returned results are saved as tables in the database .) A better control method is to first call odbcQuery and then use sqlGetResults to obtain the result. The latter can be used to obtain finite rows each time in a loop, just like the sqlFetchMore function.
  • The connection can be closed by calling the function close or odbcClose. The two functions can be called to close a connection without the R object or not after the R session, but there is a warning.

Sample Code:

> Library (RODBC) # Let the function map the name to lowercase> channel <-odbcConnect ("testdb", uid = "ripley", case = "tolower ") # import a data box to the database> data (USArrests)> sqlSave (channel, USArrests, rownames = "state", addPK = TRUE)> rm (USArrests) # list database tables> sqlTables (channel) TABLE_QUALIFIER TABLE_OWNER TABLE_NAME TABLE_TYPE REMARKS1 usarrests TABLE # list tables> sqlFetch (channel, "USArrests", rownames = "state ") murder assault urbanpop rapeAlabama 13.2 236 58 21.2 Alaska 10.0 263 48 44.5... # SQL query, originally in a row> sqlQuery (channel, "select state, murder from USArrests where rape> 30 order by murder ") state murder1 Colorado 7.92 Arizona 8.13 California 9.04 Alaska 10.05 New Mexico 11.46 Michigan 12.17 Nevada 12.28 Florida 15.4 # delete a table> sqlDrop (channel, "USArrests ") # Close connection> odbcClose (channel)

As a simple example of connecting to an Excel worksheet using ODBC in Windows, we can read the workbook as follows:

> Library (RODBC)> channel <-odbcConnectExcel ("bdr.xls") # list workbooks> sqlTables (channel) TABLE_CAT TABLE_SCHEM TABLE_NAME TABLE_TYPE REMARKS1 C: \ bdr NA Sheet1 $ system table NA2 C: \ bdr NA Sheet2 $ system table NA3 C: \ bdr NA Sheet3 $ system table NA4 C: \ bdr NA Sheet1 $ Print_Area table na # obtain the content of Form 1. You can use any of the following methods> sh1 <-sqlFetch (channel, "Sheet1")> sh1 <-sqlQuery (channel, "select * from [Sheet1 $]")

Note that the database table specification is different from the name returned by sqlTables: sqlFetch can be mapped to this difference.

6. Network Interfaces and external tools

R provides very limited support for data exchange on the bottom layer of the network connection. However, it also provides some support:

The make. socket, read. socket, write. socket and close. socket functions provide socket support;

Function download. file Reads files from network resources through FTP or HTTP, and then writes them to a file;

Function read. both table and scan can directly read content from a URL. They either explicitly use a url to open a connection, or implicitly set a URL for the file parameter, without saving the file locally;

The CORBA package provides support for the CORBA protocol;

In addition, the DCOM protocol is also supported by a third party.

According to the UNIX philosophy, we do not recommend that you use these interfaces directly in R, but rather by using external tools. Here is an example of an external tool:

> Files <-system ("ls x *", intern = T) # You must specify intern.

7. Process Big Data

We have introduced some methods for R to use external data. This is usually enough. However, data obtained from the outside will be stored in the memory by R, which may cause problems when processing big data. When processing big data, you can use the following methods:

  1. The database reads a part of data from the database each time for processing.
  2. Although text files can be connected without a connection, the connection provides a "stream" method: Batch read/write.
  3. You can use UNIX tools such as grep, awk, and wc in Unix to pre-process text files and then read them into R. For example:
> howmany <- as.numeric(system ("grep -c ',C,' file.dat"))> totalrows <- as.numeric(strsplit(system("wc -l Week.txt", intern=T), split=" ")[[1]][1])
  1. If a large amount of data cannot be split using the virtual memory, it must be processed together. You can also use the "virtual memory ".

    Package filehash can store variables on disks rather than in memory.

    You can also use the database: Read files into the database, and then load the database as an environment to replace the file read into the memory. You can use the with () function to specify the environment.

> DumpDF (read. table ("large.txt", header = T), dbName = "mydb")> myenv <-db2env (db = "mydb")> with (mydb, z <-y + x) # specify mydb as the operating environment

Date: 2013-05-16 10:37:42 CST

Author: Holbrook

Org version 7.8.11 with Emacs version 24

Validate XHTML 1.0

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.