Creating and manipulating vectors
c() creates a vector
length() returns the number of elements in a vector
mode() returns the type of a vector
rbind() combines vectors by row
cbind() combines vectors by column

Vector statistics: min(), max(), sum(), var() (variance), mean(), sd() (standard deviation), prod() (product of the elements)
is.vector(x) tests whether x is a vector

Reading vector values
1:9*2-1 multiplies each value from 1 to 9 by 2 and then subtracts 1
a[1:5] reads the first five values of vector a
a[c(3,5,9)] reads the values at positions 3, 5 and 9 of vector a. Note: the c() function is required here.
a[a>30 & a<=45] reads the values of a that are greater than 30 and less than or equal to 45

Generating sequences
seq(5,20) is the sequence from 5 to 20 with step 1
seq(5,20,by=2) is the sequence from 5 to 20 with step 2
seq(5,121,length.out=10) is a sequence from 5 to 121 of length 10; the step is calculated automatically

Character vectors
letters[1:30]: letters holds the 26 lowercase letters, so positions 27 to 30 are filled with NA

The which functions return the positions of the elements that satisfy a condition
which.max(a) returns the index of the maximum value in vector a
which(a==2) returns the indices of the elements of a equal to 2
which(a>5) returns the indices of the elements of a greater than 5
rev() reverses a vector
sort() sorts a vector
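The indexing and sequence calls above can be tried together. This is a minimal sketch; the vector a is made-up data chosen only for illustration.

```r
# Hypothetical data, used only to illustrate the calls
a <- c(12, 31, 45, 50, 38, 2, 44, 60, 2)

odd    <- 1:9 * 2 - 1          # ':' binds tighter than '*', so this is (1:9) * 2 - 1
first5 <- a[1:5]               # first five elements
picked <- a[c(3, 5, 9)]        # positions 3, 5 and 9; c() is required
mid    <- a[a > 30 & a <= 45]  # values greater than 30 and at most 45

s <- seq(5, 121, length.out = 10)  # 10 equally spaced values, step computed for us

big  <- which.max(a)   # index of the largest value
twos <- which(a == 2)  # indices of all elements equal to 2
desc <- rev(sort(a))   # sort ascending, then reverse
tail_letters <- letters[25:28]  # positions past 26 come back as NA
```

Logical indexing (a[a > 30 & a <= 45]) returns values, while which() returns positions; pick whichever the next step needs.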
Creating and operating on matrices
matrix(c(1:12), nrow=3, ncol=4) creates a 3-row, 4-column matrix from the values 1 to 12
matrix(c(1:12), nrow=3, ncol=4, byrow=TRUE): the byrow option indicates whether the matrix is filled by rows (byrow=TRUE) or by columns (byrow=FALSE, the default)
t(a) transposes a matrix: rows become columns and columns become rows
a+b is matrix addition, adding the corresponding elements
a-b is matrix subtraction, subtracting the corresponding elements
a%*%b is matrix multiplication

diag() works with the diagonal
diag(a) returns the vector of values on the diagonal of a
diag(diag(a)) returns the diagonal matrix built from those values
diag(4) returns the 4x4 identity matrix

rnorm() generates normally distributed random numbers; for example, rnorm(16) generates 16 random numbers from the standard normal distribution
matrix(rnorm(16), nrow=4, ncol=4) generates 16 normal random numbers and arranges them in a 4x4 matrix

solve(a) returns the inverse of matrix a
solve(a, b) solves the linear system a %*% x = b for x

eigen() computes the eigenvalues of a matrix, for example:
a <- diag(4)+1 creates the 4x4 identity matrix and adds 1 to every element
a.e <- eigen(a, symmetric=TRUE) computes the eigenvalues and eigenvectors of a
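As a quick check of the matrix functions above, the following sketch builds the same diag(4)+1 matrix and verifies the inverse and the linear-system solution. The right-hand side b is an arbitrary vector assumed for illustration.

```r
m <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)  # filled row by row
mt <- t(m)                                           # 4 x 3 transpose

a <- diag(4) + 1          # identity matrix with 1 added to every element
b <- c(1, 2, 3, 4)        # arbitrary right-hand side (assumption)

a_inv <- solve(a)         # inverse of a
x     <- solve(a, b)      # solves a %*% x = b
a.e   <- eigen(a, symmetric = TRUE)

a %*% a_inv               # approximately the identity matrix
a.e$values                # eigenvalues, largest first
```

Calling solve(a, b) directly is generally preferable to computing solve(a) %*% b, since it avoids forming the inverse explicitly.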
An array is similar to a matrix, but it can have more than two dimensions. Arrays are created with the array function in the following form: myarray <- array(vector, dimensions, dimnames), where vector contains the data for the array, dimensions is a numeric vector giving the maximum index of each dimension, and dimnames is an optional list of dimension labels. Like a matrix, an array can hold data of only one mode. Elements are selected from an array in the same way as from a matrix.
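A small sketch of the array() call described above; the dimension labels are made up for illustration.

```r
# Hypothetical dimension labels
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")

# 2 x 3 x 4 array filled with 1..24 (first index varies fastest)
z <- array(1:24, c(2, 3, 4), dimnames = list(dim1, dim2, dim3))

by_pos  <- z[1, 2, 3]            # select by position
by_name <- z["A1", "B2", "C3"]   # select the same element by label
```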
A data frame is created with the function data.frame(): mydata <- data.frame(col1, col2, col3, ...). The column vectors col1, col2, col3, ... can be of any type, such as character, numeric, or logical. The name of each column can be set with the function names().
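For instance, a data frame mixing numeric, character and logical columns might be built as follows. The column names and values are invented for this sketch.

```r
# Hypothetical columns of different types
patientID <- c(1, 2, 3, 4)
age       <- c(25, 34, 28, 52)
diabetes  <- c("Type1", "Type2", "Type1", "Type1")
improved  <- c(TRUE, FALSE, TRUE, TRUE)

patientdata <- data.frame(patientID, age, diabetes, improved)

# Rename the second column with names()
names(patientdata)[2] <- "ageYears"
str(patientdata)
```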
Operations on data sets

1. Merging data sets

A. Adding columns: merge(), cbind()
Usage: c3 <- merge(c1, c2, by=c("id1", "id2"))
The two data frames are joined on one or more common variables (an inner join).
cbind(c1, c2) joins the objects c1 and c2 horizontally.

B. Adding rows: rbind()
c3 <- rbind(dataframe1, dataframe2)
The two data frames must have the same variables, but the variables do not have to be in the same order. If dataframeA has variables that dataframeB does not, do one of the following before joining them: delete the extra variables in dataframeA, or create the additional variables in dataframeB and set their values to NA (missing). Vertical joins are typically used to add observations to a data frame.

2. Taking subsets

A. Selecting variables
Usage: dataframe[row indices, column indices]
For example, c1[,6:10] takes variables 6 to 10 of c1.

B. Dropping variables
myvars <- names(leadership) %in% c("q3", "q4")
newdata <- leadership[!myvars]
How this works:
(1) names(leadership) produces a character vector containing all the variable names: c("managerID", "testDate", "country", "gender", "age", "q1", "q2", "q3", "q4", "q5").
(2) names(leadership) %in% c("q3", "q4") returns a logical vector that is TRUE for each element of names(leadership) that matches q3 or q4 and FALSE otherwise: c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE).
(3) The NOT operator (!) reverses the logical values: c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE).
(4) leadership[c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)] selects the columns whose logical value is TRUE, so q3 and q4 are dropped.
If you know that q3 and q4 are the 8th and 9th variables, you can drop them with the statement newdata <- leadership[c(-8,-9)]. This works because a minus sign (-) in front of a column index excludes that column.
Finally, the same variables can be deleted with leadership$q3 <- leadership$q4 <- NULL. This sets the columns q3 and q4 to undefined (NULL). Note that NULL is different from NA (missing).
Dropping variables is the converse of keeping variables. Which approach to use depends on which is easier to code. If many variables need to be dropped, it may be simpler to list the ones to keep, and vice versa.

C. Selecting observations
newdata <- leadership[which(leadership$gender=="M" & leadership$age>30),]
Note the trailing comma: the condition selects rows, and the empty column position keeps all columns.
The simplest method is subset():
newdata <- subset(leadership, age>=45 | age<24, select=c(q1,q2,q3,q4))
newdata <- subset(leadership, age>=45 | age<24, select=gender:q4)
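The merge, drop and subset operations above can be exercised on a small stand-in for the leadership data frame. Everything below is a sketch: the rows, and the regions lookup table, are invented for illustration and are not the original data.

```r
# Hypothetical stand-in for the leadership data frame
leadership <- data.frame(
  managerID = c(1, 2, 3, 4, 5),
  gender    = c("M", "F", "F", "M", "F"),
  age       = c(32, 45, 25, 39, 99),
  q1 = c(5, 3, 3, 3, 2),
  q2 = c(4, 5, 5, 3, 2),
  q3 = c(5, 2, 5, 4, 1),
  q4 = c(5, 5, 5, NA, 2),
  stringsAsFactors = FALSE
)

# Drop q3 and q4 with a logical mask over the column names
myvars  <- names(leadership) %in% c("q3", "q4")
dropped <- leadership[!myvars]

# Rows for men older than 30 (trailing comma: rows first, then all columns)
men30 <- leadership[which(leadership$gender == "M" & leadership$age > 30), ]

# subset(): filter rows and keep a range of columns in one call
extremes <- subset(leadership, age >= 45 | age < 24, select = gender:q4)

# Inner join with a hypothetical lookup table on the common key
regions <- data.frame(managerID = c(1, 2, 6), region = c("US", "UK", "DE"))
joined  <- merge(leadership, regions, by = "managerID")
```

By default merge() keeps only keys present in both data frames, so manager 6 (only in regions) and managers 3 to 5 (only in leadership) do not appear in joined.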
Data sampling
mysample <- leadership[sample(1:nrow(leadership), m, replace=FALSE),]
The first argument to sample() is the vector of elements to sample from; here it is the integers from 1 to the number of observations in the data frame. The second argument, m, is the number of elements to draw, and the third argument, replace=FALSE, requests sampling without replacement. sample() returns the randomly drawn elements, which are then used to select rows of the data frame.

Manipulating data frames with SQL
After loading the sqldf package you can manipulate data frames with SQL statements, for example:
newdf <- sqldf("select * from mtcars where am=1 order by mpg", row.names=TRUE)
The argument row.names=TRUE carries the row names of the original data frame over to the new data frame.
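A minimal sketch of row sampling without replacement. The data frame df and the sample size m are assumptions made for illustration.

```r
set.seed(1234)   # fix the seed so the draw is repeatable

# Hypothetical data frame with 10 observations
df <- data.frame(id = 1:10, score = seq(10, 100, by = 10))

m   <- 3                                       # number of rows to draw
idx <- sample(1:nrow(df), m, replace = FALSE)  # m distinct row numbers
mysample <- df[idx, ]                          # trailing comma keeps all columns
```

With replace = TRUE the same row could be drawn more than once, which is what a bootstrap sample needs; with replace = FALSE every selected row is distinct.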
Entering data from the keyboard
(1) Create an empty data frame:
mydataframe <- data.frame(age=numeric(0), gender=character(0))
(2) Invoke the text editor:
mydataframe <- edit(mydataframe)
The result has to be assigned back to the original variable; the shorthand fix(mydataframe) does this for you.

Reading a delimited text file
mydataframe <- read.table(file, header=logical_value, sep="delimiter", row.names="name")
Here file is a delimited ASCII text file, and header is a logical value (TRUE or FALSE) indicating whether the first row contains the variable names. sep specifies the delimiter that separates the data values, and row.names is an optional parameter naming one or more variables that serve as row identifiers.

Reading Excel files
library(RODBC)
z <- odbcConnectExcel2007("c:\\rtest\\mytest.xlsx")
w <- sqlFetch(z, "Sheet1")
odbcClose(z)
These RODBC Excel functions are available on Windows only; odbcConnectExcel() is the counterpart for older .xls files.

Reading database tables
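To try read.table() without a file on disk, the text argument can stand in for the file. The sketch below uses made-up comma-separated data; with a real file you would pass its path as the first argument instead.

```r
# Hypothetical delimited data; in practice this would live in a file
csv_text <- "id,age,gender
p1,25,female
p2,34,male
p3,28,female"

mydataframe <- read.table(text = csv_text,
                          header = TRUE,      # first row holds variable names
                          sep = ",",          # comma-delimited
                          row.names = "id")   # use the id column as row names

mydataframe["p2", "age"]   # look up a value by row name
```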
Function    Description
odbcConnect(dsn, uid="", pwd="")    opens a connection to an ODBC database
sqlFetch(channel, sqltable)    reads a table from the ODBC database into a data frame
sqlQuery(channel, query)    submits a query to the ODBC database and returns the results
sqlSave(channel, mydf, tablename=sqltable, append=FALSE)    writes a data frame to a table in the ODBC database, or updates the table (append=TRUE)
sqlDrop(channel, sqltable)    removes a table from the ODBC database
close(channel)    closes the connection
Example workflow:
library(RODBC)
channel <- odbcConnect("Sigbi", "ETL", "etl_etl213")
etlinterface <- sqlFetch(channel, "etl_interface")  # copy table etl_interface into data frame etlinterface
fundata <- sqlQuery(channel, "select * from etl_l_log")
close(channel)
Advanced Data Management
(a) Mathematical functions
abs(x) absolute value; abs(-4) returns 4
sqrt(x) square root; sqrt(25) returns 5 and is equivalent to 25^(0.5)
ceiling(x) smallest integer not less than x; ceiling(3.475) returns 4
floor(x) largest integer not greater than x; floor(3.475) returns 3
trunc(x) integer formed by truncating x toward 0; trunc(5.99) returns 5
round(x, digits=n) rounds x to the specified number of decimal places; round(3.475, digits=2) returns 3.48
signif(x, digits=n) rounds x to the specified number of significant digits; signif(3.475, digits=2) returns 3.5
cos(x), sin(x), tan(x) cosine, sine and tangent; cos(2) returns -0.416
acos(x), asin(x), atan(x) inverse cosine, inverse sine and inverse tangent; acos(-0.416) returns approximately 2
cosh(x), sinh(x), tanh(x) hyperbolic cosine, hyperbolic sine and hyperbolic tangent; sinh(2) returns 3.627
acosh(x), asinh(x), atanh(x) inverse hyperbolic cosine, inverse hyperbolic sine and inverse hyperbolic tangent; asinh(3.627) returns approximately 2
log(x, base=n) logarithm of x to base n. For convenience, log(x) is the natural logarithm and log10(x) is the common logarithm; log(10) returns 2.3026 and log10(10) returns 1
exp(x) exponential function; exp(2.3026) returns approximately 10
(b) Statistical functions
mean(x) mean; mean(c(1,2,3,4)) returns 2.5
mean(x, trim=0.05, na.rm=TRUE) trimmed mean: drops the highest 5% and lowest 5% of the values and all missing values, then takes the arithmetic mean
median(x) median; median(c(1,2,3,4)) returns 2.5
sd(x) standard deviation; sd(c(1,2,3,4)) returns 1.29
var(x) variance; var(c(1,2,3,4)) returns 1.67
mad(x) median absolute deviation; mad(c(1,2,3,4)) returns 1.48
quantile(x, probs) quantiles, where x is the numeric vector for which quantiles are wanted and probs is a numeric vector of probabilities in [0,1]
# 30th and 84th percentiles of x
y <- quantile(x, c(.3,.84))
range(x) range; with x <- c(1,2,3,4), range(x) returns c(1,4) and diff(range(x)) returns 3
sum(x) sum; sum(c(1,2,3,4)) returns 10
diff(x, lag=n) lagged differences, with lag giving the lag to use (the default is 1); with x <- c(1, 5, 23, 29), diff(x) returns c(4, 18, 6)
min(x) minimum; min(c(1,2,3,4)) returns 1
max(x) maximum; max(c(1,2,3,4)) returns 4
scale(x, center=TRUE, scale=TRUE) column-wise centers (center=TRUE) or standardizes (center=TRUE, scale=TRUE) the data object x
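The statistical functions above can be checked on small vectors. This is a sketch; the vectors x, y and w are illustrative values, not data from the text.

```r
x <- c(1, 2, 3, 4)

avg <- mean(x)
med <- median(x)
s   <- round(sd(x), 2)
v   <- round(var(x), 2)
m   <- round(mad(x), 2)

q <- quantile(x, c(.3, .84))   # 30th and 84th percentiles
spread <- diff(range(x))       # max minus min

y  <- c(1, 5, 23, 29)          # illustrative values
d1 <- diff(y)                  # differences between neighbours
d2 <- diff(y, lag = 2)         # differences two positions apart

# Trimmed mean: drop NA, then the lowest and highest 20% before averaging
w  <- c(1, 2, 3, 4, 100, NA)
tm <- mean(w, trim = 0.2, na.rm = TRUE)
```

The trimmed mean ignores the outlier 100 entirely here, which is why it is often used as a robust alternative to mean().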
Standardizing data
By default, the function scale() standardizes the specified columns of a matrix or data frame to a mean of 0 and a standard deviation of 1: newdata <- scale(mydata)
To standardize each column to an arbitrary mean M and standard deviation SD instead, use: newdata <- scale(mydata)*SD + M
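A short sketch of both forms of standardization. The data frame and the target values M = 50 and SD = 10 are assumptions chosen for illustration.

```r
# Hypothetical numeric data
mydata <- data.frame(a = c(2, 4, 6, 8), b = c(10, 20, 40, 50))

z <- scale(mydata)        # each column now has mean 0 and sd 1

M  <- 50                  # target mean (assumption)
SD <- 10                  # target standard deviation (assumption)
newdata <- scale(mydata) * SD + M

colMeans(newdata)         # both columns centred on M
apply(newdata, 2, sd)     # both columns with spread SD
```

Note that scale() returns a matrix, not a data frame; wrap the result in as.data.frame() if a data frame is needed downstream.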
Probability functions
Probability function names combine a one-letter prefix with a distribution abbreviation (for example dnorm, pnorm, qnorm, rnorm):
d = density function
p = distribution function
q = quantile function
r = random generation (random deviates)

Distribution                 Abbreviation   Distribution                 Abbreviation
Beta                         beta           Logistic                     logis
Binomial                     binom          Multinomial                  multinom
Cauchy                       cauchy         Negative binomial            nbinom
Chi-squared (noncentral)     chisq          Normal                       norm
Exponential                  exp            Poisson                      pois
F                            f              Wilcoxon signed rank         signrank
Gamma                        gamma          t                            t
Geometric                    geom           Uniform                      unif
Hypergeometric               hyper          Weibull                      weibull
Lognormal                    lnorm          Wilcoxon rank sum            wilcox
Generating pseudo-random numbers
runif(5) generates 5 pseudo-random numbers uniformly distributed on the interval 0 to 1. Each call produces different values; to reproduce the same values, set the same random seed before each call:
set.seed(1234)
runif(5)
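The d/p/q/r prefixes and the effect of set.seed() can be seen in a few lines. This is a sketch using the normal and uniform distributions; the seed value and the mean and sd passed to rnorm() are arbitrary.

```r
d0 <- dnorm(0)        # density of the standard normal at 0
p  <- pnorm(1.96)     # P(Z <= 1.96), roughly 0.975
qv <- qnorm(0.975)    # 97.5th percentile, roughly 1.96

set.seed(1234)
r1 <- runif(5)        # five uniform draws on [0, 1]
set.seed(1234)
r2 <- runif(5)        # same seed, so the same five draws

same <- identical(r1, r2)

draws <- rnorm(50, mean = 50, sd = 10)   # 50 normal deviates
```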
Multivariate normal distribution
library(MASS)
options(digits=3)
set.seed(1234)
mean <- c(230.7, 146.7, 3.6)
sigma <- matrix(c(15360.8, 6721.2, -47.1, 6721.2, 4700.9, -16.5, -47.1, -16.5, 0.3), nrow=3, ncol=3)
mydata <- mvrnorm(500, mean, sigma)
mydata <- as.data.frame(mydata)
names(mydata) <- c("y", "x1", "x2")
(c) Character processing functions
nchar(x) counts the number of characters in x. With x <- c("ab", "cde", "fghij"), length(x) returns 3 (see Table 5-7) and nchar(x[3]) returns 5.
substr(x, start, stop) extracts or replaces substrings in a character vector. With x <- "abcdef", substr(x, 2, 4) returns "bcd"; after substr(x, 2, 4) <- "22222", x is "a222ef".
grep(pattern, x, ignore.case=FALSE, fixed=FALSE) searches for pattern in x. If fixed=FALSE, pattern is a regular expression; if fixed=TRUE, pattern is a text string. The return value is the indices of the matches. grep("A", c("b", "A", "c"), fixed=TRUE) returns 2.
sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE) finds pattern in x and replaces it with the replacement text. If fixed=FALSE, pattern is a regular expression; if fixed=TRUE, pattern is a text string. sub("\\s", ".", "Hello There") returns "Hello.There". Note that "\s" is the regular expression for whitespace; it is written "\\s" because "\" is the escape character in R strings.
strsplit(x, split, fixed=FALSE) splits the elements of the character vector x at split. If fixed=FALSE, split is a regular expression; if fixed=TRUE, split is a text string. y <- strsplit("abc", "") returns a list with one component holding the three elements "a" "b" "c"; unlist(y)[2] and sapply(y, "[", 2) both return "b".
paste(..., sep=" ") concatenates strings, separated by sep (a single space by default). paste("x", 1:3, sep="") returns c("x1", "x2", "x3"); paste("x", 1:3, sep="M") returns c("xM1", "xM2", "xM3"); paste("Today is", date()) returns, for example, Today is Thu Jun 25 14:17:32 2011.
toupper(x) converts to uppercase; toupper("abc") returns "ABC".
tolower(x) converts to lowercase; tolower("ABC") returns "abc".
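The character functions above, collected into one runnable sketch with the same example strings:

```r
x <- c("ab", "cde", "fghij")
lens <- nchar(x)                   # characters per element

s <- "abcdef"
piece <- substr(s, 2, 4)           # extract positions 2 to 4
substr(s, 2, 4) <- "22222"         # replace in place; extra characters are dropped

hit <- grep("A", c("b", "A", "c"), fixed = TRUE)  # index of the match
dotted <- sub("\\s", ".", "Hello There")          # first whitespace becomes "."

parts  <- strsplit("abc", "")      # list with one character vector
second <- unlist(parts)[2]
also_second <- sapply(parts, "[", 2)

labels <- paste("x", 1:3, sep = "")
upper  <- toupper("abc")
lower  <- tolower("ABC")
```

sub() replaces only the first match in each string; gsub() takes the same arguments and replaces every match.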
(d) Utility functions
length(x) returns the length of object x. With x <- c(2, 5, 6, 9), length(x) returns 4.
seq(from, to, by) generates a sequence. indices <- seq(1, 10, 2) gives indices equal to c(1, 3, 5, 7, 9).
rep(x, n) repeats x n times. y <- rep(1:3, 2) gives y equal to c(1, 2, 3, 1, 2, 3).
cut(x, n) divides the continuous variable x into a factor with n levels. Use the option ordered_result=TRUE to create an ordered factor.
pretty(x, n) creates pretty breakpoints: it divides the continuous variable x into about n intervals by selecting n+1 equally spaced, rounded values. It is often used in plotting.
cat(..., file="myfile", append=FALSE) concatenates the objects in ... and outputs them to the screen or to a file (if one is declared). For example:
firstname <- c("Jane")
cat("Hello", firstname, "\n")
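A brief sketch of the utility functions above. The ages vector is invented for illustration, and the exact breakpoints chosen by cut() and pretty() depend on the data, so they are not spelled out here.

```r
indices <- seq(1, 10, 2)           # 1, 3, 5, 7, 9
y <- rep(1:3, 2)                   # 1, 2, 3, 1, 2, 3

ages <- c(5, 17, 23, 41, 68, 80)   # hypothetical continuous variable

# Three equal-width bins, returned as an ordered factor
grp <- cut(ages, 3, ordered_result = TRUE)

# Rounded, equally spaced breakpoints covering the range of ages
breaks <- pretty(ages, 4)

firstname <- c("Jane")
cat("Hello", firstname, "\n")      # prints to the console
```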
Applying functions
R provides the apply() function to apply an arbitrary function over any dimension of a matrix, array, or data frame. The format of the apply function is:
apply(x, MARGIN, FUN, ...) where x is the data object, MARGIN is the dimension index, FUN is the function you specify, and ... holds any arguments you want to pass to FUN. For a matrix or data frame, MARGIN=1 refers to rows and MARGIN=2 refers to columns.
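A minimal sketch of apply() over rows and columns of a small matrix; the matrix contents are arbitrary.

```r
m <- matrix(1:6, nrow = 2)       # 2 x 3 matrix: rows (1, 3, 5) and (2, 4, 6)

row_means <- apply(m, 1, mean)   # MARGIN = 1: one result per row
col_sums  <- apply(m, 2, sum)    # MARGIN = 2: one result per column

# Extra arguments after FUN are passed through to it
row_trimmed <- apply(m, 1, mean, trim = 0.2)
```

For the common cases of row or column sums and means, rowSums(), colSums(), rowMeans() and colMeans() do the same job and are usually faster.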
Conditional execution
ifelse(cond, statement1, statement2) returns statement1 where cond is TRUE and statement2 where it is FALSE. It is a simple, practical choice when the outcome is binary.
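For example, ifelse() can label every element of a vector in one call. The scores and the 0.5 cutoff below are assumptions for illustration.

```r
score <- c(0.2, 0.7, 0.5, 0.9)   # hypothetical scores

# Vectorized: the condition is evaluated for each element
outcome <- ifelse(score > 0.5, "Passed", "Failed")
```

Unlike an if/else statement, which tests a single condition, ifelse() works element by element and returns a vector the same length as its condition.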
Custom functions
myfunction <- function(arg1, arg2, ...) {
  statements
  return(object)
}
Objects created inside the function are local to the function. The returned object can be of any data type, from a scalar to a list.
Use the function warning() to generate a warning message, message() to generate a diagnostic message, or stop() to halt execution of the current expression and signal an error.
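The pieces above fit together as in the following sketch. The function mystats and its robust argument are hypothetical, written only to show stop(), warning() and a list return value.

```r
# Hypothetical helper: returns a centre and a spread for a numeric vector
mystats <- function(x, robust = FALSE) {
  if (!is.numeric(x)) {
    stop("x must be numeric")            # halts and signals an error
  }
  if (anyNA(x)) {
    warning("missing values removed")    # continues after warning
    x <- x[!is.na(x)]
  }
  if (robust) {
    center <- median(x)
    spread <- mad(x)
  } else {
    center <- mean(x)
    spread <- sd(x)
  }
  return(list(center = center, spread = spread))
}

res <- mystats(c(1, 2, 3, 4))
res$center
```

The variables center and spread exist only inside the function; callers see them only through the returned list.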
Aggregating data
The aggregate() function collapses data using one or more by variables and a defined function:
aggregate(x, by, FUN)
Here x is the data object to be collapsed, by is a list of variables that are crossed to form the new observations, and FUN is a scalar function used to calculate the summary statistics that make up the values of the new observations. The by variables must be supplied in a list, even if there is only one, and FUN is computed separately for each combination of their values.
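As an illustration, the built-in mtcars data set can be collapsed by cylinder count and gear count. The choice of columns here is an assumption made for the sketch.

```r
# Mean mpg and hp for every combination of cyl and gear in mtcars
aggdata <- aggregate(mtcars[c("mpg", "hp")],
                     by = list(cyl = mtcars$cyl, gear = mtcars$gear),
                     FUN = mean)
aggdata
```

Naming the elements of the by list (cyl = ..., gear = ...) gives the grouping columns readable names; unnamed elements come back as Group.1, Group.2, and so on.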
The reshape package
1. Melting. Melting a data set restructures it into a format in which each measured variable is in its own row, together with the identifier variables needed to uniquely identify that measurement.
library(reshape)
md <- melt(mydata, id=c("id", "time"))
2. Casting. The cast() function takes the melted data and reshapes it using a formula you provide and an (optional) function used to aggregate the data:
newdata <- cast(md, formula, FUN)
Here md is the melted data, formula describes the desired end result, and FUN is the (optional) aggregating function. The formula takes the form:
rowvar1 + rowvar2 + ... ~ colvar1 + colvar2 + ...
In this formula, rowvar1 + rowvar2 + ... defines the set of crossed variables that determine the rows, while colvar1 + colvar2 + ... defines the set of crossed variables that determine the columns.
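A small melt-then-cast round trip, as a sketch. It assumes the reshape package is installed (the code skips the reshaping step otherwise), and the four-row data frame is invented for illustration.

```r
# Hypothetical data: two subjects measured at two times
mydata <- data.frame(id   = c(1, 1, 2, 2),
                     time = c(1, 2, 1, 2),
                     x1   = c(5, 3, 6, 2),
                     x2   = c(6, 5, 1, 4))

# Assumption: the reshape package is available
if (requireNamespace("reshape", quietly = TRUE)) {
  library(reshape)

  # Melt: one row per (id, time, variable) with the measurement in 'value'
  md <- melt(mydata, id = c("id", "time"))

  # Cast: one row per id, one column per variable, averaging over time
  wide <- cast(md, id ~ variable, mean)
  print(wide)
}
```

Changing the formula changes the shape of the result; for example id + time ~ variable needs no aggregating function because each cell then holds a single value.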
R Language Basics