"Data mining: R Language Combat" chapter II Data overview

Source: Internet
Author: User

2.1 n*m Data Set

In a dataset of the form n*m, n is the number of rows, that is, the number of observations, and m is the number of columns, that is, the number of variables; n*m is the dimension of the data.

In general, the first thing to do with a new dataset is to look at the number of observations, the number of variables, and the actual meaning of each variable, so that you know the size of the dataset and the relative importance of each variable. This is an important first step before choosing a data mining algorithm, and before deciding how many and which variables and samples to extract.

2.2 Classification of data

2.2.1 General data classification

Quantitative data: continuous data and discrete data

Qualitative data: nominal (fixed-class) data, ordinal (fixed-order) data, interval (fixed-distance) data and ratio (fixed-ratio) data, with information content increasing in that order.

2.2.2 Data classification in R

numeric: numeric type, for quantitative variables

integer: integer type, contains only integers

logical: logical type, TRUE and FALSE

character: character type; each element is a character string, typically a categorical variable

factor: factor type; qualitative data in a quantitative shell, that is, character data stored as numeric codes. It is essentially qualitative data.

> sex <- factor(c(1, 1, 0, 0, 1), levels = c(0, 1), labels = c("male", "female"))
> sex
[1] female female male   male   female
Levels: male female

Levels and labels correspond one to one. When neither levels nor labels is set, the levels are sorted alphabetically and assigned integer codes starting at 1.

> num <- factor(c("B", "A", "D", "C"))
> as.numeric(num)
[1] 2 1 4 3
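The coding rule above can be checked with a small sketch. The vector `grade` and the custom level order are hypothetical and used only for illustration; they show that the integer codes follow the order of the levels, which is alphabetical unless `levels` is given explicitly.

```r
# Hypothetical example data, not taken from the Insurance dataset
grade <- factor(c("B", "A", "D", "C"))
levels(grade)                  # default levels, sorted alphabetically
codes <- as.integer(grade)     # codes follow the level order: 2 1 4 3

# With an explicit level order, the same values get different codes
grade2 <- factor(c("B", "A", "D", "C"), levels = c("D", "C", "B", "A"))
codes2 <- as.integer(grade2)   # D=1, C=2, B=3, A=4, so: 3 4 1 2
```

Setting `levels` explicitly is worth considering whenever the codes are used later, since the default order depends only on alphabetical sorting.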

2.2.3 Simple processing of data with R

1. Basic information

> install.packages("MASS")

> library(MASS)      # load the MASS package, which contains the dataset
> data("Insurance")  # load the Insurance dataset
> dim(Insurance)     # get the dimensions of the dataset
[1] 64  5

> dim(Insurance)[1]  # dim() returns a vector of two elements; [] accesses an element
[1] 64
> dim(Insurance)[2]
[1] 5

> names(Insurance)   # get the column (variable) names of the data frame
[1] "District" "Group"    "Age"      "Holders"  "Claims"

> head(names(Insurance), n = 2)  # the first two elements
[1] "District" "Group"
> tail(names(Insurance))         # the last six elements by default, but there are only five
[1] "District" "Group"    "Age"      "Holders"  "Claims"
> tail(names(Insurance), n = 3)
[1] "Age"     "Holders" "Claims"
> head(Insurance$Age)
[1] <25   25-29 30-35 >35   <25   25-29
Levels: <25 < 25-29 < 30-35 < >35

2. Variable type

The class() function identifies the type of a variable.

levels() shows the levels of a factor and can also be used to modify them.

Type checks: is.numeric(), is.integer(), is.logical(), is.character(), is.factor()

Type coercion: as.numeric(), as.integer(), as.logical(), as.character(), as.factor()
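A minimal sketch of these functions in use, on hypothetical vectors `x` and `f` that are not part of the Insurance data. It also shows one common pitfall: calling as.numeric() on a factor returns the level codes, not the numbers the labels spell out.

```r
x <- c(1.5, 2.9, 3)
x_class <- class(x)        # "numeric"
x_is_num <- is.numeric(x)  # TRUE
x_is_int <- is.integer(x)  # FALSE: the values are stored as doubles
x_int <- as.integer(x)     # fractional parts are dropped: 1 2 3

# A factor whose labels look like numbers
f <- factor(c("10", "20", "10"))
f_codes <- as.numeric(f)                 # level codes: 1 2 1
f_values <- as.numeric(as.character(f))  # the labelled values: 10 20 10

# levels() can also rename the levels in place
levels(f) <- c("low", "high")
f_labels <- as.character(f)              # "low" "high" "low"
```

Converting through as.character() first is the usual way to recover numeric values from a factor.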

2.3 Data sampling and R implementation

2.3.1 Simple random sampling

sample(x, size, replace = FALSE, prob = NULL)

x is the object to sample from, generally given as a vector;

size is a non-negative integer giving the number of samples to draw;

replace indicates whether sampling is done with replacement; the default is without replacement;

prob sets the sampling probability of each element; the default is NULL, that is, equal-probability sampling.

1. Random sampling with replacement (replace = TRUE)

2. Random sampling without replacement (replace = FALSE)
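The two modes, and the prob argument, can be sketched as follows. The seed, the vector `x`, and the probabilities are arbitrary values chosen for illustration.

```r
set.seed(1)  # arbitrary seed so the draws are reproducible
x <- 1:10

# Without replacement: each element can be drawn at most once
s_without <- sample(x, size = 5)

# With replacement: elements can repeat, so size may exceed length(x)
s_with <- sample(x, size = 15, replace = TRUE)

# Unequal probabilities: "a" is drawn about 90% of the time
s_weighted <- sample(c("a", "b"), size = 1000, replace = TRUE,
                     prob = c(0.9, 0.1))
share_a <- mean(s_weighted == "a")
```

Without replacement, size cannot be larger than the number of elements in x; sample() raises an error in that case.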

2.3.2 Stratified Sampling

install.packages("sampling")

library(sampling)  # this package must be loaded first

strata(data, stratanames = NULL, size, method = c("srswor", "srswr", "poisson", "systematic"), pik, description = FALSE)

data is the dataset to be sampled;

stratanames gives the names of the variables used for stratification;

size gives the number of observations to draw from each stratum, in the order in which the strata appear in the dataset. Note: before using this function, the dataset should be sorted in ascending order by the stratification variable;

method is the sampling method: without replacement (srswor), with replacement (srswr), Poisson (poisson), or systematic (systematic);

pik sets the sampling probability of the samples within each stratum;

description chooses whether to output basic information about each stratum.

sub <- strata(Insurance, stratanames = "District", size = c(1, 2, 3, 4), method = "systematic", pik = Insurance$Claims)

Stratifying by District, the numbers of elements drawn from the four strata are 1, 2, 3 and 4. Within each stratum, the sampling probability follows Claims: the larger the Claims value, the higher the probability of being drawn.

getdata(Insurance, sub)  # output the sampled rows
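To make the idea of stratified sampling concrete without the sampling package, here is a minimal base-R sketch. It is not a reimplementation of strata(): it uses simple random sampling without replacement within each stratum and ignores pik. The data frame `toy` and all names in it are hypothetical.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Hypothetical toy data: 4 districts with 5 rows each
toy <- data.frame(District = rep(1:4, each = 5), Claims = 1:20)
n_per_stratum <- c(1, 2, 3, 4)  # rows to draw from districts 1 to 4

# Split the row indices by stratum, then draw the requested number from each
rows_by_stratum <- split(seq_len(nrow(toy)), toy$District)
picked <- unlist(
  Map(function(rows, n) rows[sample.int(length(rows), n)],
      rows_by_stratum, n_per_stratum),
  use.names = FALSE
)
stratified <- toy[picked, ]
counts <- table(stratified$District)  # one count per district
```

Indexing with sample.int(length(rows), n) rather than sample(rows, n) avoids a known surprise: when rows has length one, sample() treats the single number as the upper end of a range.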

2.3.3 Cluster sampling

install.packages("sampling")

library(sampling)  # this package must be loaded first

cluster(data, clustername, size, method = c("srswor", "srswr", "poisson", "systematic"), pik, description = FALSE)

clustername is the name of the variable used to divide the data into clusters;

size is the number of clusters to draw.

Cluster sampling generally requires that each cluster represents the overall data well, that is, the differences among samples within a cluster are large and the differences between clusters are small.

When the differences between clusters are large, cluster sampling tends to produce a sample that covers too narrow a part of the population and represents it relatively poorly.
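The mechanics of cluster sampling can be sketched in base R as follows. This is an illustration only, not the cluster() function from the sampling package: clusters are drawn with equal probability and without replacement. The data frame `toy` is hypothetical.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Hypothetical toy data: 4 districts (clusters) with 5 rows each
toy <- data.frame(District = rep(1:4, each = 5), Claims = 1:20)

# Draw 2 whole clusters, then keep every row that belongs to them
chosen <- sample(unique(toy$District), size = 2)
cluster_sample <- toy[toy$District %in% chosen, ]
```

Unlike stratified sampling, only some of the groups appear in the result, but each selected group is included in full.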

2.4 Training set and test set

Training sets are used to build models, and test sets are used to evaluate models.

The ratio of training set to test set is generally kept at about 3:1, so that the training set has enough samples to build the model while the evaluation result on the test set remains credible.

> train_sub <- sample(nrow(Insurance), 3/4 * nrow(Insurance))  # train_sub is a vector of row numbers
> train_data <- Insurance[train_sub, ]
> test_data <- Insurance[-train_sub, ]
> dim(train_data); dim(test_data)
[1] 48  5
[1] 16  5
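The same split can be written as a reproducible sketch that also checks that the two sets do not overlap. The data frame `dat`, its columns, and the seed are hypothetical; floor() is added as an assumption so that the size stays a whole number when the row count is not divisible by 4.

```r
set.seed(123)  # arbitrary seed so the split is reproducible

# Hypothetical data with 64 rows, the same row count as Insurance
dat <- data.frame(id = 1:64, y = rnorm(64))

# Draw 3/4 of the row numbers for training; the rest form the test set
train_idx <- sample(nrow(dat), size = floor(0.75 * nrow(dat)))
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# No row should appear in both sets
overlap <- intersect(train$id, test$id)
```

Negative indexing with -train_idx selects exactly the rows that were not drawn, so every row ends up in one of the two sets.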
