Creating a dataset: notes on "R in Action" (Chapter 2, data structures)

Source: Internet
Author: User
Tags: scalar

2.1 Dataset concepts

Concept: a dataset is usually a rectangular array of data, with rows representing observations and columns representing variables.

Different fields use different names for the rows and columns of a dataset:

Field                                        Rows            Columns
Statisticians                                Observations    Variables
Database analysts                            Records         Fields
Data mining and machine learning researchers Examples        Attributes

Data types (modes) that R can handle: numeric, character, logical, complex, and raw (bytes)

Structures for storing data: scalars, vectors, arrays, data frames, and lists

Case identifiers are called rownames (row names); categorical variables are called factors

2.2 Data structures

This section describes several data structures: vectors, matrices, arrays, and data frames. The first three are one-dimensional, two-dimensional, and more than two-dimensional respectively, and each of them can hold only one mode of data. A data frame can hold several modes.

Some definitions

Object: anything that can be assigned to a variable, including constants, data structures, functions, and graphs

Mode: describes how an object is stored

Data frame: a structure that stores data (columns represent variables, rows represent observations); a data frame can store variables of different types (such as numeric and character)
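As a small illustration of these definitions, the sketch below inspects the mode of a few objects with mode(). The variable names (x_num, x_chr, and so on) are made up for this example and do not come from the book.

```r
# Inspect how a few objects are stored.
x_num <- 3.14          # numeric
x_chr <- "hello"       # character
x_lgl <- TRUE          # logical
x_cpx <- 2 + 3i        # complex
x_raw <- as.raw(255)   # raw (bytes)

modes <- c(mode(x_num), mode(x_chr), mode(x_lgl), mode(x_cpx), mode(x_raw))
print(modes)

# A function is an object too: it can be assigned to a variable.
f <- mean
print(mode(f))

# mtcars is a data frame that ships with base R.
print(class(mtcars))
```

Note that mode() describes storage, while class() is what functions such as print() and summary() look at when deciding how to handle an object.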

2.2.1 Vectors (one-dimensional; numeric, character, or logical)
a <- c(1, 2, 3)                  # numeric (illustrative values)
b <- c("one", "two", "three")    # character
c <- c(TRUE, TRUE, FALSE)        # logical

Note: 1. Elements of a character vector must be enclosed in "" or ''; numeric and logical elements need no quotes.

2. A single vector can hold data of only one mode.

3. A scalar is a vector that contains only one element.

# A scalar is a vector with only one element
f <- 1
g <- "US"
h <- TRUE
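The "one mode per vector" rule can be seen directly: when values of different modes are combined with c(), R converts them to a common mode instead of raising an error. The following is a minimal sketch; the names mixed and num_lgl are invented for this example.

```r
# Mixing modes: everything is coerced to character.
mixed <- c(1, "two", TRUE)
print(mixed)
print(mode(mixed))

# Mixing numeric and logical: TRUE becomes 1 and FALSE becomes 0.
num_lgl <- c(1, TRUE, FALSE)
print(num_lgl)

# A scalar really is a vector of length 1.
f <- 1
print(is.vector(f))
print(length(f))
```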

Square brackets give the position of an element. The following code shows how to access the elements of a vector:

> a <- c("k", "j", "h", "a", "c", "m")   # create a vector
> a[3]                                   # third element of a
[1] "h"
> a[c(1, 3, 5)]                          # 1st, 3rd, and 5th elements of a
[1] "k" "h" "c"
> a[2:6]                                 # 2:6 generates a numeric sequence: elements 2 through 6, same as a[c(2, 3, 4, 5, 6)]
[1] "j" "h" "a" "c" "m"
# Both of the following create the same vector a
> a <- c(2:6)
> a
[1] 2 3 4 5 6
> a <- c(2, 3, 4, 5, 6)
> a
[1] 2 3 4 5 6
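Beyond positive positions, square brackets also accept negative positions and logical conditions. These forms are not covered in the notes above, so the sketch below is offered only as a supplement; the result names are invented for this example.

```r
a <- c("k", "j", "h", "a", "c", "m")

# A negative position drops that element instead of selecting it.
all_but_first <- a[-1]
print(all_but_first)

# A logical vector keeps the elements where the condition is TRUE.
not_a <- a[a != "a"]
print(not_a)

# A position past the end of the vector gives NA rather than an error.
past_end <- a[10]
print(past_end)
```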
2.2.2 Matrices (two-dimensional; numeric, character, or logical)

Note: a matrix can contain only one data type

Function: matrix()

Purpose: create a matrix

Format: mymatrix <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns, byrow=logical_value, dimnames=list(char_vector_rownames, char_vector_colnames))

Here vector contains the elements of the matrix; nrow and ncol give the number of rows and columns; dimnames is optional and holds the row and column names as character vectors; byrow says whether the matrix is filled by row (byrow=TRUE) or by column (byrow=FALSE). The default is by column.

Matrix usage examples

Eg1. Create a 5*4 matrix holding the elements 1 through 20, filled by column by default.

> y <- matrix(1:20, nrow=5, ncol=4)
> y
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

Eg2.

> cells <- c(1, 26, 24, 68)
> rnames <- c("R1", "R2")
> cnames <- c("C1", "C2")
# Fill by column (also the default)
> mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=FALSE, dimnames=list(rnames, cnames))
> mymatrix
   C1 C2
R1  1 24
R2 26 68
# Fill by row
> mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE, dimnames=list(rnames, cnames))
> mymatrix
   C1 C2
R1  1 26
R2 24 68
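One way to check one's understanding of byrow is to compare a row-filled matrix with the transpose of a column-filled one. The sketch below does this for a non-square case; the names v, by_row, and by_col_t are invented for this example.

```r
v <- 1:6

# Fill a 2 x 3 matrix row by row.
by_row <- matrix(v, nrow = 2, ncol = 3, byrow = TRUE)
print(by_row)

# Fill a 3 x 2 matrix column by column, then transpose it.
by_col_t <- t(matrix(v, nrow = 3, ncol = 2))

# Both routes place every element in the same cell.
same <- all(by_row == by_col_t)
print(same)

# Once dimnames are set, cells can also be addressed by name.
m <- matrix(c(1, 26, 24, 68), nrow = 2, byrow = TRUE,
            dimnames = list(c("R1", "R2"), c("C1", "C2")))
print(m["R2", "C1"])
```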

Selecting elements of a matrix:

X[i,]: row i of the matrix; X[,j]: column j of the matrix; X[i,j]: the element in row i, column j

To select several rows or columns, the subscripts i and j can be numeric vectors

Example:

> x <- matrix(1:10, nrow=2)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> x[2,]
[1]  2  4  6  8 10
> x[,2]
[1] 3 4
> x[1,4]          # element in row 1, column 4
[1] 7
> x[1, c(4,5)]    # elements in row 1, columns 4 and 5
[1] 7 9
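A detail worth knowing when selecting a single row or column: the result is simplified to a plain vector unless drop=FALSE is given. This is not mentioned in the notes above, so the sketch below is only a supplementary illustration with invented result names.

```r
x <- matrix(1:10, nrow = 2)

# Selecting one row returns a plain vector (no dim attribute).
row2_vec <- x[2, ]
print(dim(row2_vec))

# drop = FALSE keeps the result as a 1 x 5 matrix.
row2_mat <- x[2, , drop = FALSE]
print(dim(row2_mat))

# Selecting several columns keeps the matrix structure automatically.
sub <- x[, c(2, 4)]
print(sub)
```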

2.2.3 Arrays (the number of dimensions can be greater than 2)

Note: the data in an array can have only one mode

How to create: array()

myarray <- array(vector, dimensions, dimnames)

Here vector contains the data of the array; dimensions is a numeric vector giving the maximum index of each dimension; dimnames is an optional list of dimension labels.

Eg. Create a three-dimensional (2*3*4) numeric array

> dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2", "C3", "C4")
> z <- array(1:24, c(2,3,4), list(dim1, dim2, dim3))
> z
, , C1

   B1 B2 B3
A1  1  3  5
A2  2  4  6

, , C2

   B1 B2 B3
A1  7  9 11
A2  8 10 12

, , C3

   B1 B2 B3
A1 13 15 17
A2 14 16 18

, , C4

   B1 B2 B3
A1 19 21 23
A2 20 22 24

# Passing the labels by name, dimnames=list(...), prints exactly the same array
> z <- array(1:24, c(2,3,4), dimnames=list(dim1, dim2, dim3))

Elements are selected in the same way as for matrices. For example, z[1,2,3] is 15.
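The selection rule can be checked with a short sketch. It rebuilds the same array and also selects by dimension label, which works because dimnames were supplied; the name slice is invented for this example.

```r
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")
z <- array(1:24, c(2, 3, 4), dimnames = list(dim1, dim2, dim3))

# Select one element by position and by label.
by_pos  <- z[1, 2, 3]
by_name <- z["A1", "B2", "C3"]
print(c(by_pos, by_name))

# Leaving the first two subscripts empty returns a whole 2 x 3 layer.
slice <- z[, , "C2"]
print(slice)
```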

2.2.4 Data frames (can contain data of different modes: numeric, character, and so on)

Note: a data frame can hold data of several modes, but each column must have a single mode; different columns can have different modes

Creation function: data.frame()

mydata <- data.frame(col1, col2, col3)

where the column vectors col1, col2, col3 can be of any type (such as character, numeric, or logical)

> patientid <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type2")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientid, age, diabetes, status)
> patientdata
  patientid age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type2      Poor

Selecting elements of a data frame: 1. Use subscript notation. 2. Specify column names directly. 3. Use $ to select a specific variable from a given data frame

> patientdata[1:2]
  patientid age
1         1  25
2         2  34
3         3  28
4         4  52
> patientdata[c(1:3)]
  patientid age diabetes
1         1  25    Type1
2         2  34    Type2
3         3  28    Type1
4         4  52    Type2
> patientdata[c("diabetes", "status")]
  diabetes    status
1    Type1      Poor
2    Type2  Improved
3    Type1 Excellent
4    Type2      Poor
> patientdata$age
[1] 25 34 28 52

To generate a contingency table of diabetes by status:

> table(patientdata$diabetes, patientdata$status)

        Excellent Improved Poor
  Type1         1        0    1
  Type2         0        1    1
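The selections above all pick columns. Rows can be picked with the same bracket notation by putting a condition before the comma. The sketch below rebuilds the example data frame and shows this, together with the contingency table; the names older, cell, and tab are invented for this example.

```r
patientdata <- data.frame(
  patientid = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type2"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)

# Rows where age is over 30, all columns.
older <- patientdata[patientdata$age > 30, ]
print(older)

# One cell: row 2 of the status column.
cell <- as.character(patientdata[2, "status"])
print(cell)

# Cross-tabulate diabetes type by status.
tab <- table(patientdata$diabetes, patientdata$status)
print(tab)
```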

Simpler alternatives to the dataframe$variable notation are attach() and detach(), and with().

1. attach(), detach(), and with()

attach() adds a data frame to R's search path. After that, variables in the data frame can be referred to by name alone, without telling R which data frame they belong to.

detach() removes the data frame from the search path.

attach() and detach() come as a pair. detach() does nothing to the data itself, so it can be omitted, although calling it is good practice.

summary(mtcars$mpg)
plot(mtcars$mpg, mtcars$disp)
plot(mtcars$mpg, mtcars$wt)

is equivalent to

attach(mtcars)
summary(mpg)
plot(mpg, disp)
plot(mpg, wt)
detach(mtcars)

Limitation: attach() runs into problems when more than one object has the same name. The object that already exists takes precedence, and the variable of the same name in the data frame attached later is masked.
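The masking problem can be reproduced in a few lines. This sketch assumes a session in which mtcars is the built-in dataset; the short vector mpg is invented here purely to create the name clash.

```r
# An object named mpg already exists before the data frame is attached.
mpg <- c(25, 36, 47)

# R reports that mpg is masked when the data frame is attached.
attach(mtcars)

# The plain name refers to the existing 3-element vector, not mtcars$mpg.
len_seen <- length(mpg)
len_df   <- length(mtcars$mpg)
print(c(len_seen, len_df))

# Remove the data frame from the search path again.
detach(mtcars)
```

A call such as plot(mpg, wt) at the point marked above would fail, because mpg has 3 values while wt (taken from mtcars) has 32.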

with() gives the same result as the code above:

with(mtcars, {
  print(summary(mpg))
  plot(mpg, disp)
  plot(mpg, wt)
})

Note that the statements inside the braces are not separated by commas; each statement goes on its own line (or is separated by a semicolon). When I ran it myself, it did not work without the line breaks. The statements inside the braces are evaluated with reference to the data frame mtcars. If there is only one statement, the braces can be omitted.

Limitation: assignments exist only inside the braces of the with() call.

Workaround: use the special assignment operator <<- instead of <-. It saves the object to the global environment, outside the with() call.

> with(mtcars, {
+   nokeepstates <- summary(mpg)
+   keepstates <<- summary(mpg)
+ })
> nokeepstates
Error: object 'nokeepstates' not found
> keepstates
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90 

The result speaks for itself: keepstates was saved to the global environment outside with(), and nokeepstates was not, so only keepstates still exists after with() returns.
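An alternative to <<- that the notes do not cover: with() returns the value of the last expression it evaluates, so the result can simply be assigned with the ordinary <- operator. A minimal sketch, with invented names stats and ratio:

```r
# Capture the value returned by with() instead of assigning inside it.
stats <- with(mtcars, summary(mpg))
print(stats)

# The same pattern works for any expression over the data frame's columns.
ratio <- with(mtcars, mean(mpg / wt))
print(ratio)
```

This keeps the assignment visible at the top level, which some readers may find easier to follow than an assignment hidden inside the braces.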

2. Case identifiers

A case identifier is specified with row.names=variable. My understanding is that it plays the same role as a student number or an employee number.

patientdata <- data.frame(patientid, age, diabetes, status, row.names=patientid)

This specifies patientid as the variable used to label cases in the various printouts and graphs produced by R (these are the book's exact words). As I understand it, patientid is the one variable in the data frame that can identify a case: each instance, or observation, has a unique value.
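To see what the case identifier does in practice, the sketch below sorts the data frame and then looks a patient up by row name. The names by_age and age_of_3 are invented for this example.

```r
patientid <- c(1, 2, 3, 4)
age       <- c(25, 34, 28, 52)
diabetes  <- c("Type1", "Type2", "Type1", "Type2")
status    <- c("Poor", "Improved", "Excellent", "Poor")
patientdata <- data.frame(patientid, age, diabetes, status,
                          row.names = patientid)

# Sort by age: the row names travel with their cases.
by_age <- patientdata[order(patientdata$age), ]
print(rownames(by_age))

# Look up a case by its identifier (a character string), not by its position.
age_of_3 <- by_age["3", "age"]
print(age_of_3)
```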

2.2.5 Factors (determine how data is analyzed and how it is presented visually)

Variable type   Description                                   Stored as
Nominal         Categorical, with no order                    Factor
Ordinal         Ordered, but with no quantity relationship    Factor
Continuous      Has both order and quantity                   ----

Function: factor()

Purpose: stores the categorical values as a vector of integers starting from 1, and maps an internal vector of character strings (the original values) to those integers.

How the original values are converted:

  • Nominal variable ---> stored as an integer vector
    > diabetes <- c("Type1", "Type2", "Type1", "Type2")
    > diabetes <- factor(diabetes)   # stores the vector diabetes as (1, 2, 1, 2)
    > diabetes                       # association: 1 = Type1, 2 = Type2 (assigned in alphabetical order)
    [1] Type1 Type2 Type1 Type2
    Levels: Type1 Type2
    > str(diabetes)
     Factor w/ 2 levels "Type1","Type2": 1 2 1 2

    Note: any analysis of diabetes will treat it as a nominal variable and automatically select statistical methods suited to that level of measurement.

  • Ordinal variable ---> stored as an integer vector (add the parameter ordered=TRUE to the factor() function)
    > status <- c("Poor", "Improved", "Excellent", "Poor")
    > status
    [1] "Poor"      "Improved"  "Excellent" "Poor"
    > status <- factor(status, ordered=TRUE, levels=c("Poor", "Improved", "Excellent"))   # levels overrides the default ordering
    > str(status)
     Ord.factor w/ 3 levels "Poor"<"Improved"<..: 1 2 3 1

    Note: any analysis of this variable will treat it as an ordinal variable and automatically select the appropriate statistical methods.

  • Numeric variables (the parameters levels and labels are required)
    Suppose men are coded 1 and women are coded 2.
    > sex <- c(1, 1, 2)
    > sex <- factor(sex, levels=c(1, 2), labels=c("Male", "Female"))
    > str(sex)
     Factor w/ 2 levels "Male","Female": 1 1 2

    Note: the order of the labels, labels=c("Male", "Female"), must match the order of the levels, levels=c(1, 2).
    The labels "Male" and "Female" replace 1 and 2 in the output, and any sex value other than 1 or 2 is treated as missing.

    > sex <- c(1, 2, 3)   # illustrative: the third value is neither 1 nor 2
    > sex <- factor(sex, levels=c(1, 2), labels=c("Male", "Female"))
    > str(sex)
     Factor w/ 2 levels "Male","Female": 1 2 NA
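The three cases above can be checked with a few helper functions. The sketch below is only an illustration: the result names (codes, better, n_missing, counts) are invented, and the sex value 3 is an assumed out-of-range code.

```r
# Ordered factor with an explicit level order.
status <- factor(c("Poor", "Improved", "Excellent", "Poor"),
                 ordered = TRUE,
                 levels  = c("Poor", "Improved", "Excellent"))

# The integer codes stored internally.
codes <- as.integer(status)
print(codes)

# Ordered factors support comparisons against a level.
better <- status > "Poor"
print(better)

# A code that is not listed in levels becomes NA.
sex <- factor(c(1, 2, 3), levels = c(1, 2), labels = c("Male", "Female"))
n_missing <- sum(is.na(sex))
print(n_missing)

# table() counts each label and skips the missing value by default.
counts <- table(sex)
print(counts)
```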

The following shows how ordinary and ordered factors affect data analysis.

# Enter the data as vectors
patientid <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- c("Type1", "Type2", "Type1", "Type2")
status <- c("Poor", "Improved", "Excellent", "Poor")
# Make diabetes an ordinary factor
diabetes <- factor(diabetes)
# Make status an ordered factor
status <- factor(status, ordered=TRUE)
# Combine the data into a data frame
patientdata <- data.frame(patientid, age, diabetes, status)
# str(object) shows the structure of an R object (here, a data frame)
str(patientdata)
# summary() treats each variable differently and shows a statistical summary of the object
summary(patientdata)

After running:

> str(patientdata)
'data.frame':	4 obs. of  4 variables:
 $ patientid: num  1 2 3 4
 $ age      : num  25 34 28 52
 $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 2
 $ status   : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3
> summary(patientdata)
   patientid         age         diabetes       status 
 Min.   :1.00   Min.   :25.00   Type1:2   Excellent:1  
 1st Qu.:1.75   1st Qu.:27.25   Type2:2   Improved :1  
 Median :2.50   Median :31.00             Poor     :2  
 Mean   :2.50   Mean   :34.75                          
 3rd Qu.:3.25   3rd Qu.:38.50                          
 Max.   :4.00   Max.   :52.00                          

The str() output shows clearly that diabetes is a factor and status is an ordered factor, along with how the data frame is coded internally.

The summary() output treats the variables differently: for the continuous variables (such as age) it shows the minimum, maximum, mean, and quartiles, and for the two factors, diabetes and status, it shows the frequency of each level.
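The same facts can be queried programmatically. One thing the output above hints at: without a levels argument, the ordered factor uses alphabetical order, so "Poor" ends up as the highest level. The sketch below makes that visible; the names is_fac, is_ord, and top are invented for this example.

```r
patientid <- c(1, 2, 3, 4)
age       <- c(25, 34, 28, 52)
diabetes  <- factor(c("Type1", "Type2", "Type1", "Type2"))
status    <- factor(c("Poor", "Improved", "Excellent", "Poor"), ordered = TRUE)
patientdata <- data.frame(patientid, age, diabetes, status)

# Which columns are factors, and which of those are ordered?
is_fac <- sapply(patientdata, is.factor)
is_ord <- sapply(patientdata, is.ordered)
print(is_fac)
print(is_ord)

# Default level order is alphabetical: Excellent < Improved < Poor.
print(levels(patientdata$status))

# So the "largest" status under this ordering is Poor.
top <- as.character(max(patientdata$status))
print(top)
```

If Poor < Improved < Excellent is the intended order, it has to be given explicitly through the levels argument.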

2.2.6 List

Definition: A collection of objects (or components). Allows the consolidation of several (possibly unrelated) objects to a single object name.

A list can therefore be a combination of several vectors, matrices, data frames, and even other lists.

Function for creating lists: list()

mylist <- list(object1, object2, ...)

Naming the objects in a list: mylist <- list(name1=object1, name2=object2, ...)

g <- "My First List"                # character string
h <- c(25, 26, 18, 39)              # numeric vector
j <- matrix(1:10, nrow=5)           # 5*2 matrix
k <- c("one", "two", "three")       # character vector
mylist <- list(title=g, h, j, k)    # create the list; the first object is named title

> mylist
$title
[1] "My First List"

[[2]]
[1] 25 26 18 39

[[3]]
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

[[4]]
[1] "one"   "two"   "three"

Accessing elements of a list: 1. By component number, inside double brackets. 2. By name.

> mylist[[1]]
[1] "My First List"
> mylist[["title"]]
[1] "My First List"
> mylist$title    # works only for named components
[1] "My First List"
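Single brackets also work on lists, but they return a shorter list and not the component itself. The notes above do not discuss this difference, so the sketch below is offered only as a supplement; the result names are invented for this example.

```r
g <- "My First List"
h <- c(25, 26, 18, 39)
j <- matrix(1:10, nrow = 5)
k <- c("one", "two", "three")
mylist <- list(title = g, h, j, k)

# Single brackets return a list of length 1.
as_list <- mylist[1]
print(class(as_list))

# Double brackets return the component itself.
as_item <- mylist[[1]]
print(class(as_item))

# Unnamed components have an empty name.
print(names(mylist))

# Subscripts can be chained: row 2, column 2 of the matrix component.
cell <- mylist[[3]][2, 2]
print(cell)
```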

These are the basic data structures.
