2.1 n*m Dataset
In a dataset of the form n*m, n is the number of rows, that is, the number of observations, and m is the number of columns, that is, the number of variables; n*m is the dimension of the data.
In general, the first thing to do with a new dataset is to look at the number of observations, the number of variables, and what each variable actually means, so as to get a sense of the size of the dataset and the relative importance of each variable. This is an important precursor to choosing a data mining algorithm and, before that, to deciding how many and which variables and samples to extract.
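As a minimal sketch of this first look, the snippet below builds a small made-up data frame (the name `toy` and its columns are hypothetical, not from the book) and reads off n, m, and the type of each variable.

```r
# Hypothetical toy dataset: 4 observations (rows) and 3 variables (columns).
toy <- data.frame(
  id     = 1:4,
  height = c(1.62, 1.75, 1.80, 1.68),
  group  = c("a", "b", "a", "b")
)

n <- nrow(toy)     # number of observations
m <- ncol(toy)     # number of variables
dims <- dim(toy)   # the pair (n, m)

str(toy)           # prints the type and first values of each variable
```

`str()` is a convenient complement to `dim()` because it shows the size and the variable types in one call.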
2.2 Classification of data
2.2.1 General data classification
Quantitative data: continuous data and discrete data
Qualitative data: nominal (fixed-class) data, ordinal (fixed-order) data, interval (fixed-distance) data and ratio (fixed-ratio) data, with information content increasing in that order.
2.2.2 Data classification in R
numeric - numeric type: a quantitative variable
integer - integer type: contains only integers
logical - logical type: TRUE and FALSE
character - character type: each element is a character string; a categorical variable
factor - factor type: qualitative data in a quantitative shell, that is, categorical data stored as numeric codes; it is essentially qualitative data.
> sex <- factor(c(1, 1, 0, 0, 1), levels = c(0, 1), labels = c("male", "female"))
> sex
[1] female female male   male   female
Levels: male female
The levels and labels correspond one to one. When neither levels nor labels is set, the values are sorted alphabetically and numbered from 1 in that order.
> num <- factor(c("B", "A", "D", "C"))
> as.numeric(num)
[1] 2 1 4 3
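For ordinal (fixed-order) data, the alphabetical default is usually not the order you want. The sketch below, using made-up age bands, assumes you pass `levels` explicitly and set `ordered = TRUE`; the variable names are illustrative only.

```r
# Hypothetical age bands, listed in their natural order.
bands <- c("<25", "25-29", "30-35", ">35")
age <- factor(c("25-29", "<25", ">35", "30-35", "<25"),
              levels = bands, ordered = TRUE)

codes <- as.integer(age)     # codes follow the order of `levels`, not the alphabet
is_young <- age < "30-35"    # comparisons are meaningful for ordered factors
```

Here `codes` is 2 1 4 3 1, and `is_young` is TRUE only for the two lowest bands.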
2.2.3 Simple processing of data with R
1. Basic information
> install.packages("MASS")
> library(MASS)  # load the MASS package, which contains the dataset
> data("Insurance")  # load the Insurance dataset
> dim(Insurance)  # get the dimensions of the dataset
[1] 64  5
> dim(Insurance)[1]  # dim(Insurance) returns a vector with two elements; [] accesses one of them
[1] 64
> dim(Insurance)[2]
[1] 5
> names(Insurance)  # get the column (variable) names of the data frame
[1] "District" "Group"    "Age"      "Holders"  "Claims"
> head(names(Insurance), n = 2)  # the first two elements
[1] "District" "Group"
> tail(names(Insurance))  # returns the last six by default, but there are only five
[1] "District" "Group"    "Age"      "Holders"  "Claims"
> tail(names(Insurance), n = 3)
[1] "Age"     "Holders" "Claims"
> head(Insurance$Age)
[1] <25   25-29 30-35 >35   <25   25-29
Levels: <25 < 25-29 < 30-35 < >35
2. Variable type
The class() function identifies the type of a variable.
The levels() function shows the levels of a factor and can also be used to modify them.
Data type checks: is.numeric(), is.integer(), is.logical(), is.character(), is.factor()
Forced data type conversions: as.numeric(), as.integer(), as.logical(), as.character(), as.factor()
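The following sketch exercises these functions on made-up vectors (all names are illustrative). It also shows a common pitfall: calling as.numeric() directly on a factor returns the internal codes, not the values printed on screen.

```r
x <- c("10", "20", "30")               # character vector
x_class <- class(x)                    # "character"
x_num <- as.numeric(x)                 # numeric values 10 20 30

f <- factor(c("10", "30", "20"))
lev <- levels(f)                       # "10" "20" "30"
wrong <- as.numeric(f)                 # internal codes: 1 3 2
right <- as.numeric(as.character(f))   # the values themselves: 10 30 20

# levels() can also modify a factor: rename the level "10" to "ten".
levels(f)[levels(f) == "10"] <- "ten"
```

Converting through as.character() first is the usual way to recover the numbers stored as factor labels.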
2.3 Data sampling and R implementation
2.3.1 Simple random sampling
sample(x, size, replace = FALSE, prob = NULL)
x is the object to sample from, usually a vector;
size is a non-negative integer giving the number of samples to draw;
replace indicates whether sampling is done with replacement; the default is without replacement;
prob sets the sampling probability of each element; the default is NULL, that is, equal-probability sampling.
1. Random sampling with replacement (replace = TRUE)
2. Random sampling without replacement (replace = FALSE)
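A minimal sketch of both modes, plus unequal probabilities, on a made-up vector. The seed and the weights are arbitrary choices for illustration.

```r
set.seed(1)   # arbitrary seed, only so that the run is reproducible
x <- 1:10

without_repl <- sample(x, size = 5)                   # no value can appear twice
with_repl    <- sample(x, size = 15, replace = TRUE)  # size may exceed length(x)

# Unequal probabilities: the value 10 gets nine times the weight of each other value.
w <- c(rep(1, 9), 9)
weighted <- sample(x, size = 5, replace = TRUE, prob = w)
```

Without replacement, size cannot exceed length(x); with replacement it can, and repeated values are expected.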
2.3.2 Stratified Sampling
install.packages("sampling")
library(sampling)  # this package must be loaded first
strata(data, stratanames = NULL, size, method = c("srswor", "srswr", "poisson", "systematic"), pik, description = FALSE)
data is the dataset to be sampled;
stratanames gives the names of the variables used for stratification;
size gives the number of observations to draw from each stratum, in the order in which the strata appear in the dataset. Note: before using this function, the dataset should be sorted in ascending order by the stratification variable;
method is the sampling method: without replacement, with replacement, Poisson, or systematic sampling;
pik sets the sampling probability of the samples in each stratum;
description chooses whether to output the basic information of each stratum.
sub <- strata(Insurance, stratanames = "District", size = c(1, 2, 3, 4), method = "systematic", pik = Insurance$Claims)
The data is stratified by District, and 1, 2, 3 and 4 elements are drawn from the four strata respectively. Within each stratum the probability of drawing an element follows Claims: the larger the Claims value, the higher the probability of being drawn.
getdata(Insurance, sub)  # output the sampled rows
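To make the idea of stratified sampling concrete without depending on the sampling package, here is a base-R sketch on made-up data. It is not what strata() does internally; it only mimics equal-probability sampling without replacement inside each stratum, and every name in it is hypothetical.

```r
set.seed(1)   # arbitrary seed for reproducibility
# Hypothetical data: 3 strata ("a", "b", "c") with 4, 6 and 10 rows.
df <- data.frame(
  stratum = rep(c("a", "b", "c"), times = c(4, 6, 10)),
  value   = 1:20
)
sizes <- c(a = 1, b = 2, c = 3)   # rows to draw from each stratum

# Row numbers grouped by stratum.
rows_by_stratum <- split(seq_len(nrow(df)), df$stratum)

# Within each stratum, draw the requested number of rows without replacement.
picked <- unlist(lapply(names(sizes), function(s) {
  idx <- rows_by_stratum[[s]]
  idx[sample.int(length(idx), sizes[[s]])]
}))
strat_sample <- df[picked, ]
```

Indexing with sample.int() rather than calling sample() on the row numbers avoids the surprise that sample() treats a single number k as the range 1:k.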
2.3.3 Cluster sampling
install.packages("sampling")
library(sampling)  # this package must be loaded first
cluster(data, clustername, size, method = c("srswor", "srswr", "poisson", "systematic"), pik, description = FALSE)
clustername is the name of the variable used to divide the data into clusters;
size is the number of clusters to draw.
Cluster sampling generally requires that each cluster represents the overall data well, that is, the differences between samples within a cluster are large and the differences between clusters are small.
When the differences between clusters are large, cluster sampling tends to yield a sample that covers the population narrowly and represents it relatively poorly.
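The mechanics of cluster sampling can be sketched in base R on made-up data: draw whole clusters at random, then keep every row that belongs to them. This is an illustration of the idea, not the implementation of cluster(), and the names are hypothetical.

```r
set.seed(1)   # arbitrary seed for reproducibility
# Hypothetical data: 5 clusters ("g1" to "g5") of 4 rows each.
df <- data.frame(
  cluster = rep(paste0("g", 1:5), each = 4),
  value   = 1:20
)

chosen <- sample(unique(df$cluster), size = 2)   # draw 2 whole clusters
cluster_sample <- df[df$cluster %in% chosen, ]   # keep every row of the chosen clusters
```

Because whole clusters are kept or dropped together, the sample is only as representative as the chosen clusters are of the population.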
2.4 Training set and test set
The training set is used to build the model, and the test set is used to evaluate it.
Training and test sets are usually split at a ratio of about 3:1, so that the training set has enough samples to fit the model while the evaluation on the test set remains credible.
> train_sub <- sample(nrow(Insurance), 3/4 * nrow(Insurance))  # train_sub is a vector of row numbers
> train_data <- Insurance[train_sub, ]
> test_data <- Insurance[-train_sub, ]
> dim(train_data); dim(test_data)
[1] 48 5
[1] 16 5
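The same split can be sketched on a made-up 64-row data frame, with a fixed seed so that the partition is reproducible and a check that the two sets do not overlap. The data, the seed and the names are assumptions for illustration.

```r
set.seed(42)   # arbitrary seed so the same split is produced on every run
df <- data.frame(id = 1:64, y = rnorm(64))   # hypothetical stand-in with 64 rows

n_train   <- floor(0.75 * nrow(df))      # 3:1 split, rounded down
train_idx <- sample(nrow(df), n_train)   # row numbers of the training set
train     <- df[train_idx, ]
test      <- df[-train_idx, ]            # negative index drops the training rows

# Every row lands in exactly one of the two sets.
stopifnot(nrow(train) + nrow(test) == nrow(df),
          length(intersect(train$id, test$id)) == 0)
```

Using floor() makes the training size an explicit integer when the row count is not divisible by four.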
"Data mining: R Language Combat" chapter II Data overview