"Data Mining: R Language Combat", Chapter 2: Data Overview

Last Update:2017-03-31 Source: Internet

Author: User


2.1 n*m Data Set

In a dataset of the form n*m, n is the number of rows, that is, the number of observations, and m is the number of columns, that is, the number of variables; n*m is the dimension of the data.

When you first receive a dataset, start by checking the number of observations, the number of variables, and what each variable actually means, so that you know the size of the dataset and the relative importance of each variable. This is an important first step before choosing a data mining algorithm, and before deciding how many and which variables and samples to extract.
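As a minimal sketch of this first look, the following builds a small made-up data frame and checks its size and variables. The object `obs` and its columns are hypothetical and do not come from the book.

```r
# Hypothetical toy dataset: 4 observations (rows) of 3 variables (columns)
obs <- data.frame(
  id     = 1:4,
  height = c(1.62, 1.75, 1.80, 1.68),
  group  = c("a", "b", "a", "b")
)

n <- nrow(obs)   # number of observations
m <- ncol(obs)   # number of variables
dim(obs)         # c(n, m), the dimension of the data
names(obs)       # variable names
str(obs)         # type and first values of every variable
```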

2.2 Classification of data

2.2.1 General data classification

Quantitative data: continuous data and discrete data

Qualitative data: nominal (fixed-class) data and ordinal (fixed-order) data. Together with interval (fixed-distance) data and ratio (fixed-ratio) data, these form four measurement scales whose information content increases in that order.

2.2.2 Data classification in R

numeric: numeric type, used for quantitative variables

integer: integer type, contains only integers

logical: logical type, TRUE and FALSE

character: character type; each element is a character string, used for categorical variables

factor: factor type; qualitative data in a quantitative shell, that is, character data stored as numeric codes. It is essentially qualitative data.

> sex <- factor(c(1, 1, 0, 0, 1), levels = c(0, 1), labels = c("male", "female"))
> sex
[1] female female male   male   female
Levels: male female

levels and labels correspond one to one. When neither levels nor labels is set, the distinct values are sorted alphabetically and coded as integers starting at 1.

> num <- factor(c("B", "A", "D", "C"))
> as.numeric(num)
[1] 2 1 4 3
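One consequence of this coding is worth a short sketch: when the labels of a factor look like numbers, converting the factor directly returns the internal codes rather than the labels. The vector `f` below is made up for illustration.

```r
# A factor whose labels happen to look like numbers (hypothetical data)
f <- factor(c("10", "5", "20", "5"))

levels(f)                             # levels are sorted as strings
codes <- as.integer(f)                # internal codes, not the original values
vals  <- as.numeric(as.character(f))  # recovers the original values
```

Going through as.character() first converts the labels themselves, which is usually what is wanted.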

2.2.3 Simple processing of data with R

1. Basic information

> install.packages("MASS")

> library(MASS)       # load the MASS package, which contains the dataset
> data("Insurance")   # load the Insurance dataset
> dim(Insurance)      # get the dimensions of the dataset
[1] 64  5

> dim(Insurance)[1]   # dim(Insurance) returns a vector of two elements; [] accesses one element
[1] 64
> dim(Insurance)[2]
[1] 5

> names(Insurance)   # the column (variable) names of the data frame
[1] "District" "Group"    "Age"      "Holders"  "Claims"

> head(names(Insurance), n = 2)   # the first two elements
[1] "District" "Group"
> tail(names(Insurance))   # returns the last six elements by default, but there are only five
[1] "District" "Group"    "Age"      "Holders"  "Claims"
> tail(names(Insurance), n = 3)
[1] "Age"     "Holders" "Claims"
> head(Insurance$Age)
[1] <25   25-29 30-35 >35   <25   25-29
Levels: <25 < 25-29 < 30-35 < >35

2. Variable type

The class() function identifies the type of a variable.

levels() shows the levels of a factor and can also be used to modify them.

Type checks: is.numeric(), is.integer(), is.logical(), is.character(), is.factor()

Type conversions: as.numeric(), as.integer(), as.logical(), as.character(), as.factor()
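The functions above can be tried on a small factor. This is only a sketch; the variable `age` and its values are invented for illustration.

```r
# Hypothetical variable, invented for illustration
age <- factor(c("young", "old", "old", "young"))

class(age)          # type of the variable
levels(age)         # current levels, alphabetical by default
is.factor(age)      # type check
is.character(age)   # a factor is not a character vector

# Rename the levels; the new names are matched to levels(age) by position
levels(age) <- c("senior", "junior")

# Convert to other types
age_chr <- as.character(age)   # the labels as strings
age_int <- as.integer(age)     # the underlying integer codes
```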

2.3 Data sampling and R implementation

2.3.1 Simple random sampling

sample(x, size, replace = FALSE, prob = NULL)

x is the object to sample from, generally given as a vector;

size is a non-negative integer giving the number of samples to draw;

replace indicates whether sampling is done with replacement; the default is without replacement;

prob sets the sampling probability of each element; the default is NULL, that is, equal-probability sampling.

1. Random sampling with replacement (replace = TRUE)

2. Random sampling without replacement (replace = FALSE)
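Both modes can be sketched with base R's sample(). The vector `x`, the sample sizes, and the seed below are arbitrary choices made for illustration.

```r
set.seed(1)  # fix the random seed so the draws are reproducible

x <- 1:10

# Without replacement: each element can appear at most once
s_without <- sample(x, size = 5, replace = FALSE)

# With replacement: the same element may be drawn repeatedly,
# so size may exceed length(x)
s_with <- sample(x, size = 15, replace = TRUE)

# Unequal probabilities: larger elements are more likely to be drawn
s_weighted <- sample(x, size = 5, prob = x / sum(x))
```

Without replacement, size cannot exceed length(x); with replacement there is no such limit.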

2.3.2 Stratified Sampling

install.packages("sampling")

library(sampling)   # this package must be loaded first

strata(data, stratanames = NULL, size, method = c("srswor", "srswr", "poisson", "systematic"), pik, description = FALSE)

data is the dataset to be sampled;

stratanames gives the name(s) of the variable(s) used for stratification;

size gives the number of observations to draw from each stratum, in the order in which the strata appear in the dataset. Note: before using this function, sort the dataset in ascending order by the stratification variable;

method is the sampling method: without replacement, with replacement, Poisson, or systematic sampling;

pik sets the sampling probability of the observations within each stratum;

description chooses whether to print a summary of basic information for each stratum.

sub <- strata(Insurance, stratanames = "District", size = c(1, 2, 3, 4), method = "systematic", pik = Insurance$Claims)

Stratifying by District, this draws 1, 2, 3, and 4 observations from the four strata. Within each stratum the sampling probability follows Claims: the larger the Claims value, the more likely the observation is to be drawn.

getdata(Insurance, sub)   # output the sampled rows
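To make the idea concrete without the sampling package, stratified sampling without replacement can be sketched in base R. This is not how strata() is implemented; the data frame `df`, its columns, and the per-stratum sizes are hypothetical.

```r
set.seed(1)

# Hypothetical data: 3 strata of 4 rows each
df <- data.frame(
  district = rep(c("A", "B", "C"), each = 4),
  value    = 1:12
)

# Sort by the stratification variable
df <- df[order(df$district), ]

# Number of rows to draw from each stratum, keyed by stratum name
n_per_stratum <- c(A = 1, B = 2, C = 3)

# Split the row indices by stratum and sample within each one
idx_by_stratum <- split(seq_len(nrow(df)), df$district)
picked <- unlist(lapply(names(idx_by_stratum), function(s) {
  rows <- idx_by_stratum[[s]]
  rows[sample.int(length(rows), n_per_stratum[[s]])]
}))

strat_sample <- df[picked, ]
table(strat_sample$district)   # rows drawn per stratum
```

Sampling positions with sample.int() avoids the surprise that sample(k, ...) treats a single number k as the range 1:k.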

2.3.3 Cluster sampling

install.packages("sampling")

library(sampling)   # this package must be loaded first

cluster(data, clustername, size, method = c("srswor", "srswr", "poisson", "systematic"), pik, description = FALSE)

clustername is the name of the variable used to divide the data into clusters;

size is the number of clusters to draw.

Cluster sampling generally requires that each cluster represent the overall data well, that is, the differences between samples within a cluster are large and the differences between clusters are small.

When the differences between clusters are large, cluster sampling tends to cover the population unevenly, and the resulting sample is relatively unrepresentative.
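The mechanics of cluster sampling can likewise be sketched in base R: draw whole clusters, then keep every row in them. This is an illustration only, not the implementation of cluster(); the data frame `df` and its columns are hypothetical.

```r
set.seed(1)

# Hypothetical data: 4 clusters of 3 rows each
df <- data.frame(
  district = rep(c("A", "B", "C", "D"), each = 3),
  value    = 1:12
)

# Draw 2 of the 4 clusters without replacement
chosen <- sample(unique(df$district), size = 2)

# Keep all rows belonging to the chosen clusters
cluster_sample <- df[df$district %in% chosen, ]
```

Unlike stratified sampling, the randomness here is only in which clusters are chosen; no sampling happens inside a cluster.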

2.4 Training set and test set

The training set is used to build the model, and the test set is used to evaluate it.

The ratio of training set to test set is generally kept at about 3:1, so that the model has enough training samples while the evaluation on the test set remains credible.

> train_sub <- sample(nrow(Insurance), 3/4 * nrow(Insurance))   # a vector of row numbers
> train_data <- Insurance[train_sub, ]
> test_data <- Insurance[-train_sub, ]
> dim(train_data); dim(test_data)
[1] 48  5
[1] 16  5
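The same split can be sketched on any data frame. The example below uses a hypothetical 64-row data frame `df`, fixes the seed so the split is reproducible, and rounds the training size down explicitly; none of these choices come from the book.

```r
set.seed(1)

# Hypothetical dataset with 64 rows, the same size as Insurance
df <- data.frame(id = 1:64, y = rnorm(64))

n         <- nrow(df)
n_train   <- floor(3 / 4 * n)      # training set size, rounded down
train_sub <- sample(n, n_train)    # row numbers, drawn without replacement

train_data <- df[train_sub, ]
test_data  <- df[-train_sub, ]     # every row not in the training set
```

Because the negative index drops exactly the sampled rows, the two sets are disjoint and together cover the whole dataset.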
