Linear discriminant analysis in R language

Source: Internet
Author: User

In the R language, linear discriminant analysis (LDA) is implemented by the function lda() in the MASS package. The function has three invocation formats:

1) When the object is a data frame (data.frame)

lda(x, grouping, prior = proportions, tol = 1.0e-4, method, CV = FALSE, nu, ...)

2) When the object is a formula

lda(formula, data, ..., subset, na.action)

3) When the object is a matrix

lda(x, grouping, ..., subset, na.action)
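The three invocation formats can be compared side by side. The following is a minimal sketch that assumes the built-in iris data (not the dataset used in this article); the object names fit_formula, fit_matrix and fit_df are illustrative only.

```r
# Minimal sketch on the built-in iris data (an assumption, not the article's dataset).
library(MASS)  # provides lda()

# Formula format: Species is the class variable, "." means all other columns.
fit_formula <- lda(Species ~ ., data = iris)

# Matrix format: predictors as a numeric matrix, classes passed via 'grouping'.
x <- as.matrix(iris[, 1:4])
fit_matrix <- lda(x, grouping = iris$Species)

# Data frame format: same arguments, but x is a data frame of predictors.
fit_df <- lda(iris[, 1:4], grouping = iris$Species)

# The three fits should estimate the same class means.
same_fm <- isTRUE(all.equal(unname(fit_formula$means), unname(fit_matrix$means)))
same_md <- isTRUE(all.equal(unname(fit_matrix$means), unname(fit_df$means)))
print(c(same_fm, same_md))
```

The formula format is usually the most convenient when the predictors and the class variable live in one data frame; the matrix format is handy when the predictors are already a numeric matrix.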

For the first case, x is the data frame of explanatory variables and grouping gives the class of each observed sample;

prior gives the prior probabilities of the classes; by default, the proportion of each class in the training set is used;

tol is a tolerance used to detect singularity: variables whose variance is less than tol^2 are rejected. It defaults to 0.0001.

For the second case, formula is the discriminant formula, such as y ~ x1 + x2 + x3 or y ~ x1 * x2;

data is the dataset that contains the variables used in the formula;

subset specifies which samples are used for training;

na.action specifies how missing values are handled. The documented default is that lda() cannot run if there are missing values in the sample; if it is set to na.omit, observations with missing values are deleted from the sample before the calculation.

For the third case, x is the matrix of explanatory variables; grouping, subset and na.action have the same meaning as described above.
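The effect of the arguments described above can be checked on a small example. This is a sketch under assumptions: it uses the built-in iris data with one artificially introduced missing value, passes na.action explicitly instead of relying on a default, and all object names are illustrative.

```r
# Sketch of the prior, subset and na.action arguments (iris is an assumed stand-in dataset).
library(MASS)

# prior: override the default (class proportions) with user-chosen probabilities.
fit_prior <- lda(Species ~ ., data = iris, prior = c(0.5, 0.25, 0.25))
print(fit_prior$prior)

# subset: train only on every second row (75 of the 150 observations).
fit_sub <- lda(Species ~ ., data = iris, subset = seq(1, 150, by = 2))
print(fit_sub$N)

# na.action: introduce one missing value, then handle it explicitly.
d <- iris
d$Sepal.Length[1] <- NA

# na.fail makes the call stop with an error because of the missing value.
res_fail <- try(lda(Species ~ ., data = d, na.action = na.fail), silent = TRUE)
print(inherits(res_fail, "try-error"))

# na.omit drops the incomplete observation and fits on the remaining 149 rows.
fit_omit <- lda(Species ~ ., data = d, na.action = na.omit)
print(fit_omit$N)
```

Passing na.action by name makes the handling of missing values explicit, so the result does not depend on session-level settings.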

Below, an example is given to illustrate linear discriminant analysis. We use the miete dataset in the kknn package for the demonstration. The miete dataset records a number of variables from the 1994 Munich rent standard, such as the size of the flat, whether there is a bathroom, whether central heating is available, whether hot water is supplied, and so on, all of which affect the rent.

1. Data overview

First, take a quick look at the Miete dataset.

> library(kknn)
> data(miete)
> head(miete)

       nm wfl     bj bad0 zh ww0 badkach fenster kueche mvdauer bjkat wflkat
1  693.29  50 1971.5    0  1   0       0       0      0       2     4      1
2  736.60  70 1971.5    0  1   0       0       0      0      26     4      2
3  732.23  50 1971.5    0  1   0       0       0      0       1     4      1
4 1295.14  55 1893.0    0  1   0       0       0      0       0     1      2
5  394.97  46 1957.0    0  0   1       0       0      0      27     3      1
6 1285.64  94 1971.5    0  1   0       1       0      0       2     4      3
       nmqm rooms nmkat adr wohn
1 13.865800     1     3   2    2
2 10.522857     3     3   2    2
3 14.644600     1     3   2    2
4 23.548000     3     5   2    2
5  8.586304     3     1   2    2
6 13.677021     4     5   2    2

> dim(miete)

[1] 1082   17

As we can see, the dataset contains 1082 samples and 17 variables. Below, we use summary() to view the distribution of each variable.

> summary(miete)
       nm              wfl               bj       bad0     zh       ww0
 Min.   : 127.1   Min.   : 20.00   Min.   :1800   0:1051   0:202   0:1022
 1st Qu.: 543.6   1st Qu.: 50.25   1st Qu.:1934   1:  31   1:880   1:  60
 Median : 746.0   Median : 67.00   Median :1957
 Mean   : 830.3   Mean   : 69.13   Mean   :1947
 3rd Qu.:1030.0   3rd Qu.: 84.00   3rd Qu.:1972
 Max.   :3130.0   Max.   :250.00   Max.   :1992
 badkach  fenster  kueche      mvdauer      bjkat    wflkat       nmqm
 0:446    0:1024   0:980   Min.   : 0.00   1:218   1:271   Min.   : 1.573
 1:636    1:  58   1:102   1st Qu.: 2.00   2:154   2:513   1st Qu.: 8.864
                           Median : 6.00   3:341   3:298   Median :12.041
                           Mean   :10.63   4:226           Mean   :12.647
                           3rd Qu.:17.00   5: 79           3rd Qu.:16.135
                           Max.   :82.00   6: 64           Max.   :35.245
     rooms        nmkat    adr      wohn
 Min.   :1.000   1:219   1:  25   1: 90
 1st Qu.:2.000   2:230   2:1035   2:673
 Median :3.000   3:210   3:  22   3:319
 Mean   :2.635   4:208
 3rd Qu.:3.000   5:215
 Max.   :9.000

We choose nmkat (the net rent category) as the variable to be discriminated, for two reasons. First, by its meaning this variable is naturally influenced by the other variables, so it is suitable as the response variable. Second, nmkat has 5 levels with sample sizes of 219, 230, 210, 208 and 215; that is, each class contains about 200 samples and the distribution is fairly uniform.

2. Data preprocessing

Below, we split the miete dataset into a training set and a test set.

To improve the discriminant effect, we use stratified sampling. Because the 5 levels of the class variable nmkat are fairly evenly distributed in the miete dataset, we draw the same number of samples from each of the 5 classes. (If the distribution were uneven, the samples would be drawn proportionally.) The details are as follows:

> library(sampling)
> n = round(2/3 * nrow(miete) / 5)
> n
[1] 144

As you can see, the training set accounts for about 2/3 of the total sample and the test set for about 1/3; the number of samples drawn from each level of the nmkat variable for the training set is 144.
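The idea of drawing the same number of samples from each class can also be sketched in base R. The example below is an illustration under assumptions, not the method used in this article: it builds a synthetic data frame (toy) with 1082 rows and a 5-level class column, and the seed and all names are arbitrary.

```r
# Base-R sketch of equal-size stratified sampling without replacement.
# 'toy' is synthetic data standing in for a dataset with 5 roughly balanced classes.
set.seed(1)  # arbitrary seed, only for reproducibility
toy <- data.frame(id = 1:1082,
                  class = factor(rep(1:5, length.out = 1082)))

# Per-class training size: 2/3 of all rows, split evenly over the 5 classes.
n <- round(2/3 * nrow(toy) / 5)

# For each class, draw n row indices without replacement.
rows_by_class <- split(seq_len(nrow(toy)), toy$class)
train_idx <- unlist(lapply(rows_by_class, sample, size = n))

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

print(table(train$class))  # number of training rows per class
print(dim(train))
print(dim(test))
```

Sampling within each class separately guarantees that every class contributes exactly n training rows, which simple random sampling over the whole dataset does not.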

> # Stratify by the 5 levels of the nmkat variable and perform stratified sampling
> sub_train = strata(miete, stratanames = "nmkat", size = rep(n, 5), method = "srswor")
> head(sub_train)

# Shows the sampling result for the training set, including the value of the nmkat variable, the index of the sample in the dataset, the probability of being drawn, and the stratum.

   nmkat ID_unit      Prob Stratum
1      3       1 0.6857143       1
2      3       2 0.6857143       1
3      3       3 0.6857143       1
16     3      16 0.6857143       1
20     3      20 0.6857143       1
22     3      22 0.6857143       1

> # Build the training set from the samples with the ID_unit values above, and delete variables 1, 3 and 12
> data_train = getdata(miete[, c(-1, -3, -12)], sub_train$ID_unit)
> data_test = getdata(miete[, c(-1, -3, -12)], -sub_train$ID_unit)
> dim(data_train); dim(data_test)  # show the dimensions of the training set and the test set

[1] 720 14
[1] 362 14

> head(data_test)

   wfl bad0 zh ww0 badkach fenster kueche mvdauer bjkat      nmqm rooms nmkat
7   28    0  1   0       0       1      1       9     4 17.011071     1     1
8   36    0  1   0       0       0      1       3     4 19.710278     1     3
9   33    0  1   0       0       0      0       1     4 25.840606     1     4
10  57    0  1   0       1       0      1       9     6 11.534035     2     2
11  75    0  1   0       1       0      1       3     6 16.504533     3     5
17  79    0  1   0       0       0      0      20     4  7.507215     3     2
   adr wohn
7    2    2
8    2    2
9    2    2
10   2    2
11   2    2
17   2    2

At this point, data understanding and data preprocessing are finished, and the training set data_train and the test set data_test can be used directly.

3. Linear discriminant
The formula format is used here. First load the MASS package; then, with nmkat as the class variable to be discriminated and all other variables as features, run the lda() function on the training set data_train with the formula nmkat ~ . (if the class variable is y, the formula is y ~ .).

The prior probabilities used by the fitted model are:

  1   2   3   4   5
0.2 0.2 0.2 0.2 0.2

We can see that the prior probabilities of the 5 levels are all 0.2 and sum to 1. They are equal because the same number of samples was drawn from each level for the training set.
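The link between the class sizes in the training set and the default prior can be reproduced on a small example. The sketch below assumes the built-in iris data rather than the training set of this article; the row selections and object names are arbitrary and only serve to create a balanced and an unbalanced training set.

```r
# Sketch: the default prior of lda() equals the class proportions in the training data.
# iris is an assumed stand-in dataset; the row selections are arbitrary.
library(MASS)

# Balanced training set: 30 rows from each of the 3 species.
bal <- iris[c(1:30, 51:80, 101:130), ]
fit_bal <- lda(Species ~ ., data = bal)
print(fit_bal$prior)   # equal priors, one third each

# Unbalanced training set: 50, 30 and 20 rows per species.
unb <- iris[c(1:50, 51:80, 101:120), ]
fit_unb <- lda(Species ~ ., data = unb)
print(fit_unb$prior)   # priors follow the class proportions

# Class proportions computed directly, for comparison.
props <- prop.table(table(unb$Species))
print(props)
```

With equal-size stratified sampling the priors are therefore equal by construction; to use different priors, the prior argument has to be set explicitly.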

> fit_lda1$means

       wfl       bad01       zh1        ww01  badkach1   fenster1    kueche1
1 54.87500 0.055555556 0.6041667 0.138888889 0.3888889 0.06944444 0.04166667
2 60.59722 0.013888889 0.8125000 0.027777778 0.5486111 0.08333333 0.04166667
3 66.76389 0.013888889 0.8194444 0.041666667 0.5208333 0.06944444 0.07638889
4 74.70833 0.013888889 0.8750000 0.041666667 0.6458333 0.03472222 0.07638889
5 90.10417 0.006944444 0.9375000 0.006944444 0.7708333 0.04166667 0.20833333
    mvdauer     bjkat.L      bjkat.Q      bjkat.C    bjkat^4     bjkat^5
1 14.444444 -0.21580517 -0.104562739  0.031056500 0.17323372 -0.17585847
2 11.923611 -0.12450298 -0.211398581 -0.002588042 0.20604313 -0.12642562
3 11.847222 -0.12782306 -0.145478593  0.049690399 0.16273470 -0.07349309
4 10.347222 -0.08964215 -0.127293769 -0.035197366 0.12861291  0.01137393
5  5.333333 -0.04482107 -0.009092412  0.018633900 0.02624753 -0.01574852
       nmqm    rooms        adr.L      adr.Q     wohn.L     wohn.Q
1  8.231574 2.173611 -0.019641855 -0.7654655 0.03437325 -0.3997431
2 10.988627 2.416667 -0.034373246 -0.7569604 0.08838835 -0.4252586
3 12.495436 2.597222 -0.019641855 -0.7654655 0.11294067 -0.3487121
4 14.107342 2.861111  0.004910464 -0.7909811 0.16695577 -0.3912379
5 17.108865 3.250000  0.019641855 -0.7484552 0.27498597 -0.2041241

From the above results, we can see characteristics of the data that reflect the real situation. For example, the mean living area wfl clearly increases as the rent level nmkat rises, from about 54.9 at level 1 to about 90.1 at level 5. This is consistent with the common-sense notion that the larger the house, the more expensive the rent.
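The means component holds, for each class, the mean of every column of the model matrix. For a 0/1 indicator column such as bad01 this mean is simply the share of samples in the class with that property, and the .L, .Q and .C columns come from the polynomial contrasts that R uses for ordered factors. The following sketch checks the class-mean interpretation on the built-in iris data (an assumed stand-in with numeric predictors only; object names are illustrative).

```r
# Sketch: the 'means' component of an lda fit equals the per-class column means.
# iris is an assumed stand-in dataset with numeric predictors only.
library(MASS)

fit <- lda(Species ~ ., data = iris)

# Per-class means computed by hand: split the predictors by class, average each column.
manual <- t(sapply(split(iris[, 1:4], iris$Species), colMeans))

print(fit$means)
print(manual)

# Compare the values only (dimnames are dropped to avoid attribute mismatches).
means_match <- isTRUE(all.equal(unname(fit$means), unname(manual)))
print(means_match)
```

Reading the means table column by column, as done for wfl above, is a quick way to see which variables separate the classes before looking at the discriminant coefficients.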
