The Art of R Programming, Chapter 6: Factors and Tables

Source: Internet
Author: User

I. Factors and Levels

1. An intuitive look at factors and levels

A factor can be thought of simply as a vector that carries extra information: factor = vector + levels. (Internally, the two are stored differently.) The levels record the distinct values that occur in the vector. Take the following code as an example:

> x <- c(5, 12, 13, 12)
> x
[1]  5 12 13 12
> xf <- factor(x)
> xf
[1] 5  12 13 12
Levels: 5 12 13

But when we talk about the length of a factor, it's defined as the length of the data, not the number of levels.

> length(xf)
[1] 4
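The idea "factor = vector + levels" can be inspected directly. The following is a minimal sketch using only base R; it shows that the levels are stored as character strings and that the data themselves are integer codes indexing into those levels.

```r
x <- c(5, 12, 13, 12)
xf <- factor(x)

# The levels are the distinct values, stored as character strings
levels(xf)        # "5" "12" "13"

# Internally, the data are integer codes that index into the levels
as.integer(xf)    # 1 2 3 2

# length() counts data elements; nlevels() counts levels
length(xf)        # 4
nlevels(xf)       # 3
```

This is why length(xf) is 4 even though there are only three levels.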

2. Adding, deleting, modifying, and querying factor levels (well, only adding is actually covered here)

To give a factor a new level, we have to declare that level in advance; unlike a matrix or a list, a factor cannot simply be extended on the fly by assignment.

> x <- c(5, 12, 13, 12)
> xf <- factor(x)
> xff <- factor(x, levels = c(5, 12, 13, 88))
> xff
[1] 5  12 13 12
Levels: 5 12 13 88

For example, assigning a value that is not among the declared levels produces a warning about an invalid level:

> xff[3] <- 6
Warning message:
In `[<-.factor`(`*tmp*`, 3, value = 6) :
  invalid factor level, NA generated
> xff
[1] 5    12   <NA> 12
Levels: 5 12 13 88
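As a hedged sketch of the two legal routes: assigning a value that was declared as a level up front works, and a level can also be appended to an existing factor with the base R replacement function levels<-() before assigning. The choice of 88 and 6 simply follows the examples above.

```r
x <- c(5, 12, 13, 12)
xff <- factor(x, levels = c(5, 12, 13, 88))

# 88 was declared as a level in advance, so this assignment is legal
xff[2] <- 88

# 6 is not yet a level: extend the level set first, then assign
levels(xff) <- c(levels(xff), "6")
xff[3] <- 6

as.character(xff)   # "5" "88" "6" "12"
```

Neither assignment generates a warning, and no NA is produced.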

II. Common Functions for Factors

1. The tapply() function

A typical call has the form tapply(x, f, g), where x is a vector, f is a factor or a list of factors, and g() is the function to be applied to x.

tapply() works as follows: it first groups x according to the factor f, which yields several subvectors; it then applies g() to each subvector; finally it returns the results labeled by group (a vector for one factor, a matrix for two).

> ages <- c(25, 26, 55, 37, 21, 42)
> affils <- c("R", "D", "D", "R", "U", "D")
> tapply(ages, affils, mean)
 D  R  U 
41 31 21 

This example computes the average age within each party affiliation (Democrats, Republicans, independents).
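The group-then-apply process described above can be spelled out by hand. This is only an illustrative sketch of the idea using split() and sapply(), not a description of how tapply() is implemented internally; the names groups, by_hand, and direct are introduced here for illustration.

```r
ages <- c(25, 26, 55, 37, 21, 42)
affils <- c("R", "D", "D", "R", "U", "D")

# Step 1: group ages by affiliation, giving one subvector per level
groups <- split(ages, affils)

# Step 2: apply the function to each subvector
by_hand <- sapply(groups, mean)

# tapply() does both steps in one call
direct <- tapply(ages, affils, mean)

by_hand   # D = 41, R = 31, U = 21
```

Both routes give the same group means; tapply() is just the more compact way to write it.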

The next example goes a step further: with two or more factors, we have to pass a list of factors.

> d <- data.frame(list(gender = c("M", "M", "F", "M", "F", "F"),
+                      age = c(47, 59, 21, 32, 33, 24),
+                      income = c(55000, 88000, 32450, 76500, 123000, 45650)))
> d
  gender age income
1      M  47  55000
2      M  59  88000
3      F  21  32450
4      M  32  76500
5      F  33 123000
6      F  24  45650
> d$over25 <- ifelse(d$age > 25, 1, 0)
> d
  gender age income over25
1      M  47  55000      1
2      M  59  88000      1
3      F  21  32450      0
4      M  32  76500      1
5      F  33 123000      1
6      F  24  45650      0
> tapply(d$income, list(d$gender, d$over25), mean)
      0         1
F 39050 123000.00
M    NA  73166.67

The code above computes mean income broken down by two factors, gender and age group. The income vector is therefore divided into four subvectors:

    • Men under 25 years of age
    • Women under 25 years of age
    • Men over 25 years old
    • Women over 25 years old

The neat trick here is the added column over25, a simple indicator that separates the two age groups and makes the later call to tapply() much easier.
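The NA in the tapply() result appears because one of the four groups is empty. A quick way to check this is to count the observations per cell with table(). In this sketch the gender and income values come from the example, while the over25 flags are written out directly (an assumption made so that the block runs without the age column).

```r
# gender and income are taken from the example above; the over25 flags
# are entered by hand here (an assumption for this sketch)
gender <- c("M", "M", "F", "M", "F", "F")
income <- c(55000, 88000, 32450, 76500, 123000, 45650)
over25 <- c(1, 1, 0, 1, 1, 0)

# Number of observations in each gender-by-age-group cell
counts <- table(gender, over25)
counts

# Mean income per cell; the empty cell (M, 0) comes back as NA
means <- tapply(income, list(gender, over25), mean)
means
```

The cell for men with over25 equal to 0 has a count of zero, so there is nothing to average there.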

2. The split() function

split() groups a vector by the levels of a factor and returns the groups as a list. Continuing with the data frame d from above:

> split(d$income, list(d$gender, d$over25))
$F.0
[1] 32450 45650

$M.0
numeric(0)

$F.1
[1] 123000

$M.1
[1] 55000 88000 76500

Another example concerns the gender of abalone: with split() we can quickly find out at which positions each gender occurs.

> g <- c("M", "F", "F", "I", "M", "M", "F")
> split(1:7, g)
$F
[1] 2 3 7

$I
[1] 4

$M
[1] 1 5 6
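A small sketch of how the result of split() can be used afterwards. The vector g is inferred from the output shown above, and pos is a name introduced here for illustration.

```r
# g is reconstructed from the split() output above
g <- c("M", "F", "F", "I", "M", "M", "F")

# Positions of each gender, as a named list
pos <- split(seq_along(g), g)

pos$M              # 1 5 6
which(g == "M")    # the same positions, for a single level
lengths(pos)       # group sizes: F = 3, I = 1, M = 3
```

split() returns all groups at once, whereas which() answers the question for one level at a time.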

3. The by() function

The by() function is similar to tapply(), but it operates on a matrix or a data frame rather than only on a vector. Below is an example of regression analysis with by(). The data file is read from the link attached to the text (unfortunately, the data provided there is incomplete).

> aba2 <- read.csv("E:/files_for_r/abalone.data", header = FALSE)
> # read.table vs. read.csv: read.table() splits on whitespace by default, read.csv() on ","
> colnames(aba2) <- c("gender", "length", "diameter", "height", "wholewt", "shuckedwt", "viscwt", "shellwt", "rings")
> by(aba2, aba2$gender, function(m) lm(m[,2] ~ m[,3]))
aba2$gender: F

Call:
lm(formula = m[, 2] ~ m[, 3])

Coefficients:
(Intercept)       m[, 3]
    0.04288      1.17918

---------------------------------------------------------
aba2$gender: I

Call:
lm(formula = m[, 2] ~ m[, 3])

Coefficients:
(Intercept)       m[, 3]
    0.02997      1.21833

---------------------------------------------------------
aba2$gender: M

Call:
lm(formula = m[, 2] ~ m[, 3])

Coefficients:
(Intercept)       m[, 3]
    0.03653      1.19480

The data file that goes with the book has no header row, so I added the column names myself, as follows.

> colnames(aba2) <- c("gender", "length", "diameter", "height", "wholewt", "shuckedwt", "viscwt", "shellwt", "rings")
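Since the abalone file may not be at hand, here is a self-contained sketch of the same by() pattern. It uses the built-in iris data set as a stand-in (an assumption of this sketch, not the data used above) and fits one regression per species.

```r
# iris stands in for the abalone data, which is not bundled with R
fits <- by(iris, iris$Species,
           function(m) lm(Sepal.Length ~ Sepal.Width, data = m))

# One fitted model per species; collect the coefficients in a matrix
coefs <- sapply(fits, coef)
coefs
```

Referring to columns by name with a formula and data = m, instead of by position as in m[,2] ~ m[,3], keeps the call readable if the column order ever changes.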

III. Working with Tables

1. The table functions in R

So far we have met two functions whose names involve "table": read.table() and table(). read.table() reads a data file, and its default delimiter is "" (which means any whitespace). table() takes a factor or a list of factors and produces a contingency table, that is, a record of how often each combination occurs.

2. The table() function in detail

First, let's create a data frame:

> ct <- data.frame(
+   Vote.for.X = factor(c("Yes", "Yes", "No", "Not Sure", "No")),
+   Voted.for.X = factor(c("Yes", "No", "No", "Yes", "No"))
+ )
> ct
  Vote.for.X Voted.for.X
1        Yes         Yes
2        Yes          No
3         No          No
4   Not Sure         Yes
5         No          No

Applying table() to it gives the following frequency table.

> cttab <- table(ct)
> cttab
          Voted.for.X
Vote.for.X No Yes
  No        2   0
  Not Sure  0   1
  Yes       1   1

Similarly, for three-dimensional data, table() prints the result as a series of two-dimensional tables. I won't give an example here (too lazy...).
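For completeness, a minimal sketch of the three-dimensional case. The Age.Group column and its values are invented purely for illustration; they are not part of the original data.

```r
# Hypothetical data: the two columns from above plus an invented Age.Group column
ct3 <- data.frame(
  Vote.for.X  = c("Yes", "Yes", "No", "Not Sure", "No"),
  Voted.for.X = c("Yes", "No", "No", "Yes", "No"),
  Age.Group   = c("Young", "Old", "Old", "Young", "Young")
)

tab3 <- table(ct3)
dim(tab3)    # 3 2 2: a three-way array
tab3         # printed as one two-dimensional table per Age.Group level
```

The result is a three-way array of counts, and the printout shows one two-dimensional slice for each level of the third factor.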

3. Matrix-like and array-like operations on tables

3.1 Accessing cell frequency

The operations here are the same as for a matrix. We again use cttab from above as the example.

> class(cttab)
[1] "table"
> cttab[,1]
      No Not Sure      Yes 
       2        0        1 
> cttab[1,1]
[1] 2

3.2 Scaling cell frequencies proportionally

> cttab/5
          Voted.for.X
Vote.for.X  No Yes
  No       0.4 0.0
  Not Sure 0.0 0.2
  Yes      0.2 0.2
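Dividing by 5 works because the table holds five observations in total. A slightly more general sketch divides by sum(cttab) instead, or uses the base R function prop.table(), which computes the same proportions. The name props is introduced here for illustration.

```r
ct <- data.frame(
  Vote.for.X  = factor(c("Yes", "Yes", "No", "Not Sure", "No")),
  Voted.for.X = factor(c("Yes", "No", "No", "Yes", "No"))
)
cttab <- table(ct)

# Divide by the total count rather than a hard-coded 5
props <- cttab / sum(cttab)
props

# prop.table() yields the same proportions
all.equal(props, prop.table(cttab))
```

This way the code keeps working if rows are added to the data frame.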

3.3 Getting the marginal values of a table

    • The marginal value of a variable: the sum obtained by holding that variable constant and adding up the counts over the other variable.
    • A straightforward approach is to use the apply() function directly.
> apply(cttab, 1, sum)
      No Not Sure      Yes 
       2        1        2 
    • An even more direct approach is addmargins(), which appends the marginal values of both dimensions at once.
> addmargins(cttab)
          Voted.for.X
Vote.for.X No Yes Sum
  No        2   0   2
  Not Sure  0   1   1
  Yes       1   1   2
  Sum       3   2   5
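The same marginal counts can be obtained for the other dimension, or with the base R function margin.table(). A short sketch, in which row_margins and col_margins are names introduced for illustration:

```r
ct <- data.frame(
  Vote.for.X  = factor(c("Yes", "Yes", "No", "Not Sure", "No")),
  Voted.for.X = factor(c("Yes", "No", "No", "Yes", "No"))
)
cttab <- table(ct)

# Column margins: hold Voted.for.X fixed and sum over Vote.for.X
col_margins <- apply(cttab, 2, sum)
col_margins               # No = 3, Yes = 2

# margin.table() returns the same marginal counts as a table
row_margins <- margin.table(cttab, 1)
row_margins               # No = 2, Not Sure = 1, Yes = 2
```

These values match the Sum row and Sum column that addmargins() appends.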

4. Extended example: finding the highest-frequency cells in a table

The function can be designed along the following lines:

    • Convert the table to a data frame with a Freq column that holds the frequency of each combination (as.data.frame() does this).
    • Sort the rows by frequency (using the order() function).
    • Take the first k rows, as requested.
    • The code is as follows:
tabdom <- function(tbl, k) {
  # create a data frame representation of tbl, which adds a Freq column
  tblframe <- as.data.frame(tbl)
  # determine the positions of the frequencies in decreasing order
  tblfreord <- order(tblframe$Freq, decreasing = TRUE)
  # rearrange the data frame and take the first k rows
  dom <- tblframe[tblfreord, ][1:k, ]
  return(dom)
}
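A usage sketch for the function, applied to the voting table from earlier. The function definition is repeated here only so that the block runs on its own; top is a name introduced for illustration.

```r
tabdom <- function(tbl, k) {
  tblframe <- as.data.frame(tbl)                        # adds a Freq column
  tblfreord <- order(tblframe$Freq, decreasing = TRUE)  # row order by frequency
  tblframe[tblfreord, ][1:k, ]                          # first k rows
}

ct <- data.frame(
  Vote.for.X  = factor(c("Yes", "Yes", "No", "Not Sure", "No")),
  Voted.for.X = factor(c("Yes", "No", "No", "Yes", "No"))
)
cttab <- table(ct)

# The three most frequent cells of the table
top <- tabdom(cttab, 3)
top
```

The first row returned is the (No, No) cell with frequency 2; the other two rows are cells with frequency 1. Which of the tied cells appear depends on how order() breaks ties.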

  
