Basic data frame operations in the R language - Data Mining

Source: Internet
Author: User

R is commonly used for data analysis and data mining, and how well it serves you often depends on how familiar you are with its functions. This article summarizes the basic data frame operations most often used in R.

The concept of a data frame

A data frame is R's counterpart of a table: it consists of rows and columns. Unlike a matrix, whose elements must all share one data type, each column of a data frame can have a different data type.

Every column of a data frame has a column name, and each row can also be given a row name. If you do not specify row names, the rows are identified by a sequence starting at 1.
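
The difference from a matrix can be checked with a small sketch. The variable names m and df are illustrative only, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version:

```r
# A matrix holds a single type, so mixing numbers and strings
# coerces every element to character.
m <- cbind(c(11, 12), c("Devin", "Edward"))
typeof(m)

# A data frame keeps a separate type for each column.
df <- data.frame(ID = c(11, 12), Name = c("Devin", "Edward"),
                 stringsAsFactors = FALSE)
sapply(df, class)

# Without explicit row names, the rows are labelled "1", "2", ...
row.names(df)
```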


You can create a data frame with the data.frame function. For example, to create a student data frame that contains the ID, name, gender, and birthdate, the code is:

student <- data.frame(ID = c(11, 12, 13), Name = c("Devin", "Edward", "Wenli"), Gender = c("M", "M", "F"), Birthdate = c("1984-12-29", "1983-5-6", "1986-8-8"))

Alternatively, you can use read.table() or read.csv() to read a text file, which returns a data frame. Reading from a database also returns a data frame.
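
As a rough sketch of the read.csv() route, the example below parses CSV content from an inline string through the text argument instead of a file, so it runs without any external data. The names csv_text and student_csv are made up for this illustration:

```r
# Inline CSV content standing in for a text file on disk.
csv_text <- "ID,Name,Gender,Birthdate
11,Devin,M,1984-12-29
12,Edward,M,1983-5-6
13,Wenli,F,1986-8-8"

# read.csv() returns a data frame; 'text' supplies the content directly.
student_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)

dim(student_csv)    # number of rows and columns
names(student_csv)  # column names taken from the header line
```

With a real file, the first argument would be the file path instead of text.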

Printing student shows:

  ID   Name Gender  Birthdate
1 11  Devin      M 1984-12-29
2 12 Edward      M   1983-5-6
3 13  Wenli      F   1986-8-8

Only the column names ID, Name, Gender, and Birthdate are specified here. The names function returns the column names; to see the row names, use the row.names function. Here we want to use the ID as the row name, so we can write:

row.names(student) <- student$ID

A simpler way is to pass the row.names parameter when creating the data frame with data.frame; it accepts a vector of row names.
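
A minimal sketch of the row.names parameter, using a reduced two-column version of the student data for brevity (the reduction is an assumption made for this example):

```r
# Row names are supplied as a vector when the data frame is created.
student <- data.frame(
  ID = c(11, 12, 13),
  Name = c("Devin", "Edward", "Wenli"),
  row.names = c("11", "12", "13"),
  stringsAsFactors = FALSE
)

names(student)      # column names
row.names(student)  # row names

# Row names can then be used for lookups by label.
student["12", "Name"]
```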

Accessing elements

As with a matrix, you can access specific elements using the form [row index, column index].

For example, to access the first row:

student[1, ]

To access the second column:

student[, 2]

You can select which columns to access by column index or column name. For example, to get ID and Name, the code is:

idname <- student[1:2]

Or:

idname <- student[c("ID", "Name")]

If you access only one column and want a vector back, use [[ ]] or $. For example, to get the names of all students:

name <- student[[2]] or name <- student[["Name"]] or name <- student$Name
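
The access forms above differ in what they return. The sketch below, on a three-column version of the student data, shows which forms give a data frame and which give a vector. The drop = FALSE argument is not mentioned in the article and is included here only as a related option:

```r
student <- data.frame(
  ID = c(11, 12, 13),
  Name = c("Devin", "Edward", "Wenli"),
  Gender = c("M", "M", "F"),
  stringsAsFactors = FALSE
)

first_row  <- student[1, ]                # data frame with one row
second_col <- student[, 2]                # single column simplified to a vector
as_df      <- student[, 2, drop = FALSE]  # single column kept as a data frame
idname     <- student[c("ID", "Name")]    # data frame with two columns
name       <- student[["Name"]]           # vector, same as student$Name
```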

The attach and detach functions let you access columns without prefixing them with the data frame's name each time.

For example, to print all names, you can write:

attach(student)
print(Name)
detach(student)

The with function offers a simpler way:

with(student, {
  n <- Name
  print(n)
})

The scope of n here is limited to the curly braces. To assign to a global variable inside with, you need an operator such as <<-.
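
As an alternative to <<-, with returns the value of the last expression in its block, so the result can simply be assigned. A small sketch, where the variable name labels is made up for this example:

```r
student <- data.frame(
  Name = c("Devin", "Edward", "Wenli"),
  Gender = c("M", "M", "F"),
  stringsAsFactors = FALSE
)

# n exists only inside the braces; the value of the last
# expression is returned and stored in 'labels'.
labels <- with(student, {
  n <- Name
  paste(n, Gender, sep = ":")
})

labels
```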

Modifying column data types

Next we look at the type of each column, using str(student), which gives the following result:

'data.frame':	3 obs. of  4 variables:
 $ ID       : num  11 12 13
 $ Name     : Factor w/ 3 levels "Devin","Edward",..: 1 2 3
 $ Gender   : Factor w/ 2 levels "F","M": 2 2 1
 $ Birthdate: Factor w/ 3 levels "1983-5-6","1984-12-29",..: 2 1 3

By default, string vectors are automatically converted to factors in R versions before 4.0.0, where stringsAsFactors defaults to TRUE; since R 4.0.0 the default is FALSE and strings stay character. In the output above, ID is numeric and the other three columns are factors. Name should clearly be a character type and Birthdate a date type, so we need to change the column types:

student$Name <- as.character(student$Name)
student$Birthdate <- as.Date(student$Birthdate)

Running str(student) again shows the modified result:

'data.frame':	3 obs. of  4 variables:
 $ ID       : num  11 12 13
 $ Name     : chr  "Devin" "Edward" "Wenli"
 $ Gender   : Factor w/ 2 levels "F","M": 2 2 1
 $ Birthdate: Date, format: "1984-12-29" "1983-05-06" "1986-08-08"
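
The conversion can be reproduced on any R version with the sketch below. It sets stringsAsFactors = TRUE explicitly so the columns start out as factors, and uses a two-column version of the data for brevity:

```r
student <- data.frame(
  Name = c("Devin", "Edward", "Wenli"),
  Birthdate = c("1984-12-29", "1983-5-6", "1986-8-8"),
  stringsAsFactors = TRUE
)
class(student$Name)  # factor before conversion

# Convert the factor columns to character and Date.
student$Name <- as.character(student$Name)
student$Birthdate <- as.Date(as.character(student$Birthdate))

sapply(student, class)
format(student$Birthdate)  # dates are normalized to yyyy-mm-dd
```

Going through as.character first is a cautious habit with factors, since the underlying values of a factor are integer level codes rather than the labels.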

Adding a new column

For the existing student object, we want to add an Age column calculated from Birthdate. First we need to know how to calculate age. We can use Sys.Date() to get the current date, extract the year with the format function, and subtract the two years to get the age. Base R does not seem to offer many date helper functions, so we use format to extract the year part and then subtract the values as integers.

student$Age <- as.integer(format(Sys.Date(), "%Y")) - as.integer(format(student$Birthdate, "%Y"))

That is rather long to write. We can use the within function, which is similar to the with function mentioned earlier in that it lets us omit the data frame name, but unlike with it can modify the data frame, which here means adding the Age column:

student <- within(student, {
  Age <- as.integer(format(Sys.Date(), "%Y")) - as.integer(format(Birthdate, "%Y"))
})
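
Subtracting years gives the difference in calendar years, which overstates the age of anyone whose birthday has not yet occurred in the current year. The sketch below is one possible refinement, not the article's method. It uses a fixed, hypothetical reference date (ref_date) instead of Sys.Date() so the result is reproducible, and the helper columns YearDiff and HadBirthday are names invented for this example:

```r
student <- data.frame(
  Name = c("Devin", "Edward", "Wenli"),
  Birthdate = as.Date(c("1984-12-29", "1983-05-06", "1986-08-08")),
  stringsAsFactors = FALSE
)

ref_date <- as.Date("2014-06-01")  # hypothetical stand-in for "today"

student <- within(student, {
  # Difference in calendar years, as in the article.
  YearDiff <- as.integer(format(ref_date, "%Y")) -
    as.integer(format(Birthdate, "%Y"))
  # TRUE when the birthday (month and day) is on or before the reference date.
  HadBirthday <- format(ref_date, "%m%d") >= format(Birthdate, "%m%d")
  # Subtract one year when the birthday is still to come.
  Age <- YearDiff - ifelse(HadBirthday, 0L, 1L)
})

student[c("Name", "YearDiff", "Age")]
```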



Querying and subsetting

Querying a data frame and returning the subset that meets some conditions, which corresponds to a table query in a database, is a very common operation. The simplest way to get a subset is the row and column indexing mentioned earlier. With a Boolean vector, we can filter rows together with the which function. For example, to query all rows where Gender is F, we first get the Boolean vector for student$Gender == "F", which is FALSE FALSE TRUE, and then use which to return the indices of the TRUE values. The complete query is:

student[which(student$Gender == "F"), ]

Note that the column index is left empty here, so all columns are returned. If we only want the ages of all the girls, we can write:

student[which(student$Gender == "F"), "Age"]
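
One reason to wrap the condition in which is how missing values are handled. The sketch below uses a hypothetical variant of the data in which one Gender value is NA. Indexing with the raw Boolean vector then produces an extra all-NA row, while which drops the NA:

```r
# Hypothetical data: Edward's gender is missing.
student <- data.frame(
  Name = c("Devin", "Edward", "Wenli"),
  Gender = c("M", NA, "F"),
  stringsAsFactors = FALSE
)

is_f <- student$Gender == "F"   # comparison with NA yields NA
which(is_f)                     # indices of TRUE only

with_which <- student[which(is_f), ]  # only the matching row
with_bool  <- student[is_f, ]         # matching row plus an all-NA row
```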

This way of writing queries is still somewhat cumbersome. The subset function makes the query simpler. For example, if we change the condition to women younger than 30 and select the name and age, the query is:

subset(student, Gender == "F" & Age < 30, select = c("Name", "Age"))
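
To compare subset with plain indexing, the sketch below runs both forms on a copy of the data with fixed Age values. The ages are assumed values chosen for illustration, so the result does not depend on the current date:

```r
student <- data.frame(
  Name = c("Devin", "Edward", "Wenli"),
  Gender = c("M", "M", "F"),
  Age = c(30, 31, 28),  # assumed ages, fixed for the example
  stringsAsFactors = FALSE
)

# subset() evaluates the condition and 'select' inside the data frame.
young_women <- subset(student, Gender == "F" & Age < 30, select = c(Name, Age))

# The equivalent query with explicit indexing.
same <- student[student$Gender == "F" & student$Age < 30, c("Name", "Age")]

young_women
```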

Using SQL to query a data frame

For those of us who have used SQL for years, it would be nice to write SQL statements directly against a data frame, and there is indeed a package for that: sqldf.

For the same requirement as before, the corresponding statement is:

library(sqldf)
result <- sqldf("select Name, Age from student where Gender = 'F' and Age < 30")


Joining and merging

In a database, join queries across multiple tables are routine. You can likewise join multiple data frames in R, using the merge function.

For example, in addition to the student object defined earlier, we define a score variable that records each student's courses and grades:

score <- data.frame(SID = c(11, 11, 12, 12, 13), Course = c("Math", "English", "Math", "Chinese", "Math"), Score = c(90, 80, 80, 95, 96))

Let's look at the contents of the table:

  SID  Course Score
1  11    Math    90
2  11 English    80
3  12    Math    80
4  12 Chinese    95
5  13    Math    96

The SID here is the ID in student, the equivalent of a foreign key. To perform an inner join on this ID, the corresponding R statement is:

result <- merge(student, score, by.x = "ID", by.y = "SID")

Let's look at the results after the merge:

  ID   Name Gender  Birthdate Age  Course Score
1 11  Devin      M 1984-12-29  30    Math    90
2 11  Devin      M 1984-12-29  30 English    80
3 12 Edward      M 1983-05-06  31    Math    80
4 12 Edward      M 1983-05-06  31 Chinese    95
5 13  Wenli      F 1986-08-08  28    Math    96

The two data frames are joined as we expected.
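
merge performs an inner join by default. The sketch below contrasts that with a left join through the all.x argument, using a reduced, hypothetical score table in which student 13 has no rows:

```r
student <- data.frame(
  ID = c(11, 12, 13),
  Name = c("Devin", "Edward", "Wenli"),
  stringsAsFactors = FALSE
)

# Hypothetical scores: no entry for student 13.
score <- data.frame(
  SID = c(11, 11, 12),
  Course = c("Math", "English", "Math"),
  Score = c(90, 80, 80),
  stringsAsFactors = FALSE
)

# Inner join: students without scores are dropped.
inner <- merge(student, score, by.x = "ID", by.y = "SID")

# Left join: every student is kept, missing scores become NA.
left <- merge(student, score, by.x = "ID", by.y = "SID", all.x = TRUE)
```

Setting all.y = TRUE or all = TRUE would give a right or full outer join in the same way.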

Besides joins, another common database operation is the union. How do we union two data frames that have the same columns in R? Although R has a union function, it does not have the meaning of SQL's UNION; we need the rbind function to get that behavior.

The two data frames passed to rbind must have the same columns. For example, we define student2 and rbind the two variables:

student2 <- data.frame(ID = c(21, 22), Name = c("Yan", "Peng"), Gender = c("F", "M"), Birthdate = c("1982-2-9", "1983-1-16"), Age = c(32, 31))

rbind(student, student2)
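
Note that rbind keeps duplicate rows, so it behaves like SQL's UNION ALL. One way to get the de-duplicating behavior of a plain UNION is to apply unique to the result. A minimal sketch with two small hypothetical data frames, a and b, that share one row:

```r
a <- data.frame(ID = c(11, 12), Name = c("Devin", "Edward"),
                stringsAsFactors = FALSE)
b <- data.frame(ID = c(12, 21), Name = c("Edward", "Yan"),
                stringsAsFactors = FALSE)

union_all <- rbind(a, b)              # keeps the duplicated Edward row
union_distinct <- unique(union_all)   # drops duplicate rows

nrow(union_all)
nrow(union_distinct)
```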
