Mahalanobis distance is a standardized distance measure in statistics. It is a unitless measure introduced by P. C. Mahalanobis in 1936. Here I use R code and one example multivariate data set to show how to find the Mahalanobis distance.
Mahalanobis Distance Formula: $D^{2} = (x-\mu)' \Sigma^{-1} (x-\mu)$
where,
x - Observation vector (one row of the data)
μ - Mean vector (column means of the data)
Σ - Covariance Matrix
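To see what the formula does, here is a small sketch that computes D² by hand for one row and compares it with R's built-in mahalanobis() function. The toy matrix X and the names mu, Sigma, d, D2_manual and D2_builtin are made up for illustration only; they are not part of the tobacco example.

```r
# Toy data: 4 observations, 2 variables (hypothetical values, illustration only)
X <- matrix(c(2, 4, 6, 8,    # first variable
              1, 3, 2, 6),   # second variable
            ncol = 2)

mu    <- colMeans(X)   # mean vector
Sigma <- cov(X)        # sample covariance matrix

# D^2 for the first observation, written out as in the formula:
# (x - mu)' Sigma^{-1} (x - mu)
d <- X[1, ] - mu
D2_manual <- as.numeric(t(d) %*% solve(Sigma) %*% d)

# The built-in function returns D^2 for every row; take the first one
D2_builtin <- mahalanobis(X, mu, Sigma)[1]

# Should be TRUE if the hand calculation matches the built-in function
all.equal(D2_manual, D2_builtin)
```

Note that mahalanobis() returns the squared distance D², not D, so take sqrt() if you need the distance itself.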
Now let's go to the example program.
First step:
Open R and create a new script.
Second step:
Import your data set. (If your data is in xls format, I suggest using Save As to convert it to csv format, because comma-separated files are easy to read into R. This is only my suggestion; otherwise use any format that R can read.)
Now import the data using the code below.
> inputname <- read.csv(file="C:/filename.csv", header=TRUE, sep=",")
(I saved my data file in the C:/ directory, so I used the path above. If your file is in another directory, use that file path with filename.csv instead. Replace inputname with any valid R name; R object names cannot contain spaces.)
> inputname
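If you want to try the import step without a file of your own, the sketch below writes a tiny csv to a temporary file and reads it back. The three data rows are the first three rows of the tobacco data used in this post; the names csv_path and dat and the use of tempfile() are my assumptions for illustration.

```r
# Write a small csv to a temporary file so the example runs on any machine
csv_path <- tempfile(fileext = ".csv")
writeLines(c("BurnRate,PercentSugar,PercentNicotine",
             "1.55,20.05,1.38",
             "1.63,12.58,2.64",
             "1.66,18.56,1.56"),
           csv_path)

# Read it back the same way as in the import step
dat <- read.csv(file = csv_path, header = TRUE, sep = ",")

str(dat)                       # shows column names and types
all(sapply(dat, is.numeric))   # every column should be numeric before computing distances
```

Checking with str() right after the import is a quick way to catch a wrong separator or a text column before going further.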
Example:
Here I use a tobacco data set for testing.
> tobacco <- read.csv(file="C:/tobacco.csv",head=TRUE,sep=",")
> tobacco
BurnRate PercentSugar PercentNicotine
1 1.55 20.05 1.38
2 1.63 12.58 2.64
3 1.66 18.56 1.56
4 1.52 18.56 2.22
5 1.70 14.02 2.85
6 1.68 15.64 1.24
7 1.78 14.52 2.86
8 1.57 18.52 2.18
9 1.60 17.84 1.65
10 1.52 13.38 3.28
11 1.68 17.55 1.56
12 1.74 17.97 2.00
13 1.93 14.66 2.88
14 1.77 17.31 1.36
15 1.94 14.32 2.66
16 1.83 15.05 2.43
17 2.09 15.47 2.42
18 1.72 16.85 2.16
19 1.49 17.42 2.12
20 1.52 18.55 1.87
21 1.64 18.74 2.10
22 1.40 14.79 2.21
23 1.78 18.86 2.00
24 1.93 15.62 2.26
25 1.53 18.56 2.14
> mean<-colMeans(tobacco)
> mean
BurnRate PercentSugar PercentNicotine
1.6880 16.6156 2.1612
> cm<-cov(tobacco)
> cm
BurnRate PercentSugar PercentNicotine
BurnRate 0.02787500 -0.1098050 0.01886083
PercentSugar -0.10980500 4.2276840 -0.75646533
PercentNicotine 0.01886083 -0.7564653 0.27466933
> D2<-mahalanobis(tobacco,mean,cm)
> D2
[1] 3.08827463 5.35466197 1.37251420 2.61209613 2.07211223 8.90626020
[7] 1.85354309 1.96263411 1.10087851 7.04624993 1.56621848 0.78813845
[13] 3.37468305 3.77347055 2.78904427 0.99063959 5.87881205 0.08359811
[19] 1.47435780 1.45810005 1.80081271 5.88148893 2.52555955 2.13920930
[25] 2.10664213
Now you have the Mahalanobis distance values (mahalanobis() returns the squared distance, D²) for further analysis. That's all.
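One common next step, shown here only as a sketch, is to compare the D2 values with a chi-square quantile to flag unusual rows. The D2 values are copied from the output above. The 0.975 level and the names p, cutoff and flagged are my assumptions, and the chi-square comparison is only approximate because the mean and covariance were estimated from the same 25 rows.

```r
# D2 values copied from the mahalanobis() output above
D2 <- c(3.08827463, 5.35466197, 1.37251420, 2.61209613, 2.07211223, 8.90626020,
        1.85354309, 1.96263411, 1.10087851, 7.04624993, 1.56621848, 0.78813845,
        3.37468305, 3.77347055, 2.78904427, 0.99063959, 5.87881205, 0.08359811,
        1.47435780, 1.45810005, 1.80081271, 5.88148893, 2.52555955, 2.13920930,
        2.10664213)

p <- 3                             # number of variables in the tobacco data
cutoff <- qchisq(0.975, df = p)    # assumed level; choose your own
flagged <- which(D2 > cutoff)      # row numbers whose D2 exceeds the cutoff

cutoff
flagged
```

Rows that appear in flagged are candidates for a closer look, not automatic outliers.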