I. Factors and Levels
1. A simple, intuitive view of factors and levels
A factor can be understood simply as a vector that carries extra information; that is, factor = vector + levels (although the internal mechanism is actually different). The levels record the distinct values that occur in the vector. Take the following code as an example:
> x <- c(5, 12, 13, 12)
> x
[1]  5 12 13 12
> xf <- factor(x)
> xf
[1] 5  12 13 12
Levels: 5 12 13
But when we talk about the length of a factor, it's defined as the length of the data, not the number of levels.
> length(xf)
[1] 4
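The "vector + levels" picture can be checked directly. The following is a minimal sketch, assuming the same four-element vector as above; it shows that a factor stores integer codes that point into its levels, and that length() counts the data rather than the levels.

```r
# Same example vector as in the text
x <- c(5, 12, 13, 12)
xf <- factor(x)

lv <- levels(xf)        # distinct values, stored as character strings
codes <- as.integer(xf) # integer codes indexing into lv
n_data <- length(xf)    # number of data values
n_lev <- nlevels(xf)    # number of distinct levels

# Recover the original numbers from the codes and the levels
restored <- as.numeric(levels(xf))[xf]

print(lv)
print(codes)
print(c(n_data, n_lev))
print(restored)
```

Note that as.numeric(xf) alone would return the codes, not the original values, which is why the levels are converted first.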
2. Adding, deleting, modifying, and querying factor levels (actually, only adding is covered here)
A new level must be declared in advance before it can be used; we cannot simply append a new value as we can with a matrix or a list.
> x <- c(5, 12, 13, 12)
> xf <- factor(x)
> xff <- factor(x, levels = c(5, 12, 13, 88))
> xff
[1] 5  12 13 12
Levels: 5 12 13 88
For example, the following assignment uses a value that is not a declared level, so R warns that it is invalid and stores NA instead.
> xff[3] <- 6
Warning message:
In `[<-.factor`(`*tmp*`, 3, value = 6) :
  invalid factor level, NA generated
> xff
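One way to avoid the warning is to extend the levels before assigning. The sketch below assumes the same xff as above; extending levels() with c() is one common approach, not the only one.

```r
x <- c(5, 12, 13, 12)
xff <- factor(x, levels = c(5, 12, 13, 88))

# 88 was declared in advance, so this assignment is legal
xff[2] <- 88
after_88 <- as.character(xff)

# 6 was not declared: add it to the levels first, then assign
levels(xff) <- c(levels(xff), "6")
xff[3] <- 6
after_6 <- as.character(xff)

print(after_88)
print(after_6)
print(levels(xff))
```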
II. Common Functions for Factors
1. The tapply() function
A typical call is tapply(x, f, g), where x is a vector, f is a factor or a list of factors, and g() is the function to apply to x.
tapply() works as follows: it first groups x by the factor f, giving a number of subvectors, then applies g() to each subvector, and finally returns the results arranged by group (a vector for one factor, a matrix for two).
> ages <- c(25, 26, 55, 37, 21, 42)
> affils <- c("R", "D", "D", "R", "U", "D")
> tapply(ages, affils, mean)
 D  R  U
41 31 21
This example computes the average age for each party affiliation (Democrats, Republicans, and unaffiliated voters, coded D, R, and U).
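The group-then-apply process described above can be spelled out by hand. This is a rough sketch, using the same ages and affils vectors, that performs the two steps with split() and sapply() and compares the result with tapply().

```r
ages <- c(25, 26, 55, 37, 21, 42)
affils <- c("R", "D", "D", "R", "U", "D")

# Step 1: group ages by affiliation (a list with elements D, R, U)
groups <- split(ages, affils)

# Step 2: apply the function to each subvector
by_hand <- sapply(groups, mean)

# tapply() does both steps in one call
via_tapply <- tapply(ages, affils, mean)

print(by_hand)
print(via_tapply)
```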
The next example goes a step further: with two or more factors, they must be passed to tapply() as a list of factors.
> d <- data.frame(list(gender = c("M", "M", "F", "M", "F", "F"),
+   age = c(47, 59, 21, 32, 33, 24),
+   income = c(55000, 88000, 32450, 76500, 123000, 45650)))
> d
  gender age income
1      M  47  55000
2      M  59  88000
3      F  21  32450
4      M  32  76500
5      F  33 123000
6      F  24  45650
> d$over25 <- ifelse(d$age > 25, 1, 0)
> d
  gender age income over25
1      M  47  55000      1
2      M  59  88000      1
3      F  21  32450      0
4      M  32  76500      1
5      F  33 123000      1
6      F  24  45650      0
> tapply(d$income, list(d$gender, d$over25), mean)
      0         1
F 39050 123000.00
M    NA  73166.67
The code above computes mean income broken down by gender and by age group (two factors), so the data is divided into four subvectors:
- Men under 25 years of age
- Women under 25 years of age
- Men over 25 years old
- Women over 25 years old
The clever step here is adding the over25 column, which reduces age to a simple two-way split and makes the later tapply() call much easier.
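As a variation, the grouping condition can be passed to tapply() directly, without storing a helper column. This is only a sketch: the ages below are assumed values, chosen to be consistent with the over25 flags shown above, and the logical condition replaces the 0/1 column.

```r
# Ages are assumed for illustration; incomes and genders follow the text
d <- data.frame(
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(47, 59, 21, 32, 33, 24),
  income = c(55000, 88000, 32450, 76500, 123000, 45650)
)

# The logical vector d$age > 25 acts as the second grouping factor;
# the columns of the result are labelled FALSE and TRUE
res <- tapply(d$income, list(d$gender, d$age > 25), mean)
print(res)
```

The empty group (men aged 25 or under) still appears in the result, filled with NA, just as in the version with the over25 column.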
2. The split() function
split() groups a vector by factor level and returns a list. We continue with the data frame d from above.
> split(d$income, list(d$gender, d$over25))
$F.0
[1] 32450 45650

$M.0
numeric(0)

$F.1
[1] 123000

$M.1
[1] 55000 88000 76500
Another example concerns the sex of abalone: with split() we can quickly find the positions at which each sex (F, I, or M) occurs.
> g <- c("M", "F", "F", "I", "M", "M", "F")
> split(1:7, g)
$F
[1] 2 3 7

$I
[1] 4

$M
[1] 1 5 6
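Because split() returns an ordinary list, the usual list tools apply to its result. A small sketch, assuming the same g vector as in the output above, that gets the positions and then counts how many fall in each group:

```r
g <- c("M", "F", "F", "I", "M", "M", "F")

# Positions at which each sex occurs
positions <- split(seq_along(g), g)

# Number of positions per group
counts <- lengths(positions)

print(positions)
print(counts)
```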
3. The by() function
The by() function is similar to tapply(), but it operates on more than just vectors: its first argument can be a matrix or a data frame. The following is an example of regression analysis using by(). The data file comes from the link provided with the book (unfortunately, that data is incomplete).
> aba2 <- read.csv("E:/files_for_r/abalone.data", header = FALSE)
> # read.table() vs. read.csv(): read.table() splits on whitespace by default (sep = ""); read.csv() defaults to ","
> colnames(aba2) <- c("gender", "length", "diameter", "height", "wholewt", "shuckedwt", "viscwt", "shellwt", "rings")
> by(aba2, aba2$gender, function(m) lm(m[,2] ~ m[,3]))
aba2$gender: F

Call:
lm(formula = m[, 2] ~ m[, 3])

Coefficients:
(Intercept)       m[, 3]
    0.04288      1.17918

---------------------------------------------------------
aba2$gender: I

Call:
lm(formula = m[, 2] ~ m[, 3])

Coefficients:
(Intercept)       m[, 3]
    0.02997      1.21833

---------------------------------------------------------
aba2$gender: M

Call:
lm(formula = m[, 2] ~ m[, 3])

Coefficients:
(Intercept)       m[, 3]
    0.03653      1.19480
The data given in the book is incomplete and lacks a header, so I added one myself, as follows.
colnames(aba2) <- c("gender", "length", "diameter", "height", "wholewt", "shuckedwt", "viscwt", "shellwt", "rings")
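Since the abalone file is not bundled with R, the same pattern can be tried on a built-in data set. The sketch below substitutes iris for the abalone data (an assumption made purely for illustration), fits one regression per species with by(), and collects the coefficients into a matrix.

```r
# iris stands in for the abalone data; columns are referred to by name
fits <- by(iris, iris$Species,
           function(m) lm(Sepal.Length ~ Sepal.Width, data = m))

# One column of coefficients (intercept, slope) per species
coefs <- sapply(fits, coef)
print(coefs)
```

Referring to columns by name inside the formula, rather than by position as in m[,2] ~ m[,3], makes the printed model easier to read.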
III. Working with Tables
1. About the table functions in R
So far we have met two functions related to tables: read.table() and table(). read.table() reads a data file, and its default delimiter is "" (that is, whitespace). table() takes a factor or a list of factors and produces a contingency table, which is a way of recording frequencies.
2. The table() function in detail
First, let's build a data frame:
> ct <- data.frame(
+   vote.for.x = factor(c("yes", "yes", "no", "not sure", "no")),
+   voted.for.x = factor(c("yes", "no", "no", "yes", "no"))
+ )
> ct
  vote.for.x voted.for.x
1        yes         yes
2        yes          no
3         no          no
4   not sure         yes
5         no          no
Applying table() to it gives the following frequency table.
> cttab <- table(ct)
> cttab
          voted.for.x
vote.for.x no yes
  no        2   0
  not sure  0   1
  yes       1   1
Similarly, table() also handles three-dimensional data, which it prints as a series of two-dimensional tables. I won't give an example here (too lazy...).
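For completeness, here is a rough sketch of the three-dimensional case. The region column is a hypothetical third factor invented for this illustration; the other two columns repeat the data above.

```r
ct3 <- data.frame(
  vote.for.x  = c("yes", "yes", "no", "not sure", "no"),
  voted.for.x = c("yes", "no", "no", "yes", "no"),
  region      = c("east", "west", "east", "east", "west")  # hypothetical
)

tab3 <- table(ct3)

print(dim(tab3))          # one extent per factor
print(tab3[, , "east"])   # a two-dimensional slice for one region
print(tab3["no", "no", "east"])
```

Printing tab3 itself shows one two-dimensional layer per level of the third factor.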
3. Matrix- and array-like operations on tables
3.1 Accessing cell frequencies
Indexing here works just as it does for a matrix. We again take cttab from above as the example.
> class(cttab)
[1] "table"
> cttab[,1]
      no not sure      yes
       2        0        1
> cttab[1,1]
[1] 2
3.2 Scaling cell frequencies by a constant
> cttab/5
          voted.for.x
vote.for.x  no yes
  no       0.4 0.0
  not sure 0.0 0.2
  yes      0.2 0.2
3.3 Getting the marginal values of a table
- The marginal value of a variable: the value obtained by holding that variable constant and summing over the values of the other variable.
- One direct approach is to use the apply() function.
> apply(cttab, 1, sum)
      no not sure      yes
       2        1        2
- An even simpler approach is the function addmargins(), which adds the marginal values for both dimensions at once.
> addmargins(cttab)
          voted.for.x
vote.for.x no yes Sum
  no        2   0   2
  not sure  0   1   1
  yes       1   1   2
  Sum       3   2   5
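The two approaches can be compared side by side. A minimal sketch, rebuilding the same cttab so that it runs on its own, that computes both row and column margins with apply() and checks them against addmargins():

```r
ct <- data.frame(
  vote.for.x  = factor(c("yes", "yes", "no", "not sure", "no")),
  voted.for.x = factor(c("yes", "no", "no", "yes", "no"))
)
cttab <- table(ct)

row_tot <- apply(cttab, 1, sum)  # margin over vote.for.x (rows)
col_tot <- apply(cttab, 2, sum)  # margin over voted.for.x (columns)
full <- addmargins(cttab)        # both margins plus the grand total

print(row_tot)
print(col_tot)
print(full)
```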
4. Extended example: finding the highest-frequency cells in a table
The function can be designed along the following lines:
- Convert the table to a data frame with a Freq column holding the frequency of each combination (as.data.frame() does this)
- Sort the rows by frequency (using the order() function)
- Take the first k rows.
The code is as follows:
tabdom <- function(tbl, k) {
  # create a data frame representing tbl, with a Freq column
  tblframe <- as.data.frame(tbl)
  # determine the positions of the frequencies in decreasing order
  tblfreord <- order(tblframe$Freq, decreasing = TRUE)
  # rearrange the data frame and keep the first k rows
  dom <- tblframe[tblfreord, ][1:k, ]
  return(dom)
}
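To see the function in action, the sketch below applies it to the voting table from earlier. It repeats the definition so that it runs on its own, and it uses head() to take the first k rows, which is a small variation on the original indexing that avoids NA rows when k exceeds the number of cells.

```r
tabdom <- function(tbl, k) {
  tblframe <- as.data.frame(tbl)                  # adds a Freq column
  ord <- order(tblframe$Freq, decreasing = TRUE)  # most frequent first
  head(tblframe[ord, ], k)                        # variation: head() instead of [1:k, ]
}

ct <- data.frame(
  vote.for.x  = factor(c("yes", "yes", "no", "not sure", "no")),
  voted.for.x = factor(c("yes", "no", "no", "yes", "no"))
)

top <- tabdom(table(ct), 1)
print(top)
```

With k = 1 this returns the single most frequent combination, which for this data is "no" on both questions.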
The Art of R Programming, Chapter 6: Factors and Tables