May 2014 ~ Research Mining

Saturday 3 May 2014

Univariate Outlier Detection Based On Normal Distribution

Labels: Normal Distribution, Outlier Detection, Outliers, Univariate Outlier

Detection of Univariate Outlier Based On Normal Distribution
Data involving only one attribute or variable are called univariate data. For simplicity, we often assume that such data are generated from a normal distribution. We can then estimate the parameters of the normal distribution from the input data and identify the points with low probability as outliers.
Univariate outlier detection using maximum likelihood:

Suppose a city’s average temperature values in July in the last 10 years are, in value-ascending order, 24.0°C, 28.9°C, 28.9°C, 29.0°C, 29.1°C, 29.1°C, 29.2°C, 29.2°C, 29.3°C and 29.4°C. Let’s assume that the average temperature follows a normal distribution, which is determined by two parameters: the mean, μ, and the standard deviation, σ.
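We can use the maximum likelihood method to estimate the parameters μ and σ. That is, we maximize the log-likelihood function

$\ln \mathcal{L}(\mu,\sigma^{2}) = \sum_{i=1}^{n} \ln f(x_i \mid \mu,\sigma^{2}) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^{2} - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu)^{2}$

where n is the total number of samples, which is 10 in this example.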
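Taking derivatives with respect to μ and σ² and solving the resulting system of first-order conditions leads to the following maximum likelihood estimates:

$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^{2} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{2}$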
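For readers who want to see the intermediate step, the first-order conditions can be written out as below. This is the standard textbook derivation for independent normal samples, with $\ln \mathcal{L}$ denoting the log-likelihood; it is a sketch added for illustration, not part of the original worked example.

```latex
\begin{aligned}
\frac{\partial \ln \mathcal{L}}{\partial \mu}
  &= \frac{1}{\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu) = 0
  &&\Longrightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i \\[4pt]
\frac{\partial \ln \mathcal{L}}{\partial \sigma^{2}}
  &= -\frac{n}{2\sigma^{2}} + \frac{1}{2\sigma^{4}}\sum_{i=1}^{n}(x_i-\mu)^{2} = 0
  &&\Longrightarrow\quad \hat{\sigma}^{2} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^{2}
\end{aligned}
```

Note that the divisor in the variance estimate is n, not n - 1, so it differs slightly from the usual sample variance.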

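In this example, we have

$\hat{\mu} = \frac{24.0 + 28.9 + 28.9 + 29.0 + 29.1 + 29.1 + 29.2 + 29.2 + 29.3 + 29.4}{10} = 28.61$

$\hat{\sigma}^{2} = \frac{(24.0-28.61)^{2} + (28.9-28.61)^{2} + \cdots + (29.4-28.61)^{2}}{10} \approx 2.38$

Accordingly, we have $\hat{\sigma} = \sqrt{2.38} \approx 1.54$.

The most deviating value, 24.0°C, is 4.61°C away from the estimated mean. We know that the μ ± 3σ region contains 99.7% of the data under the assumption of a normal distribution. Because

$\frac{4.61}{1.54} \approx 2.99$

the value lies almost exactly three standard deviations below the mean. The probability that the normal distribution generates a value this low is less than 0.15%, and thus 24.0°C can be identified as an outlier.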
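The arithmetic of this example can be checked with a few lines of R. This is only a minimal sketch: the variable names (`temps`, `mu_hat`, `sigma_hat`, `z`, `p_low`) are my own choices, and the temperatures are the ten values listed above.

```r
# July average temperatures (deg C) from the example
temps <- c(24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4)
n <- length(temps)

# Maximum likelihood estimates (divisor n, unlike var(), which uses n - 1)
mu_hat <- mean(temps)
sigma_hat <- sqrt(sum((temps - mu_hat)^2) / n)

# Distance of each value from the mean, in units of sigma_hat
z <- (temps - mu_hat) / sigma_hat

# Lower-tail probability of the smallest value under the fitted normal model
p_low <- pnorm(min(temps), mean = mu_hat, sd = sigma_hat)

print(round(c(mu_hat = mu_hat, sigma_hat = sigma_hat), 2))
print(round(z, 2))
print(p_low)
```

Only the first value has a large |z|; the other nine lie within about half a standard deviation of the mean.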

Thursday 1 May 2014

Mahalanobis Distance using R code

Labels: Mahalanobis distance, R code, R-Code Script

Mahalanobis distance is a standardized distance measure in statistics. It is a unitless distance measure introduced by P. C. Mahalanobis in 1936. Here I use R code and an example multivariate data set to compute the Mahalanobis distance.

Mahalanobis Distance Formula: $D^{2} = (x-\mu)' \Sigma^{-1} (x-\mu)$

where,

x - Observation vector (the values of all variables for one observation)

μ - Mean vector of the variables

Σ - Covariance Matrix
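To make the formula concrete, here is a small sketch that evaluates it by hand for one observation and compares the result with R's built-in `mahalanobis()` function. The matrix `X` is hypothetical toy data invented for this illustration, not data from this post.

```r
# Hypothetical toy data: 5 observations of 2 variables (not from the post)
X <- matrix(c(2.0, 3.1,
              2.5, 3.9,
              3.0, 5.2,
              3.5, 5.8,
              4.0, 7.1), ncol = 2, byrow = TRUE)

mu <- colMeans(X)   # mean vector
S  <- cov(X)        # sample covariance matrix

# D^2 for the first observation, written as in the formula:
# (x - mu)' * inverse(S) * (x - mu)
d <- X[1, ] - mu
D2_manual <- as.numeric(t(d) %*% solve(S) %*% d)

# Same quantity from the built-in function, which returns squared distances
D2_builtin <- unname(mahalanobis(X, mu, S)[1])

print(c(manual = D2_manual, builtin = D2_builtin))
```

Both numbers should agree up to floating-point rounding, which confirms that `mahalanobis()` returns D² rather than D.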

Now let us go through an example program.

First step:

Open R and create a new script.

Second step:

Import your data set. If your data are in xls format, use Save As to convert them to csv format; csv files are comma-separated, which suits R's data input. (This is only my suggestion; any other format can be used.)

Now import the data using the code below:

> input_name <- read.csv(file="C:/filename.csv", header=TRUE, sep=",")
(I saved my data file in the C:/ directory, so I use the path above. If your file is in another directory, use that file path together with filename.csv.)

> input_name
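If you want to try the import step without a file on C:/, the sketch below writes a tiny CSV to a temporary file and reads it back. The use of `tempfile()` and the name `dat` are my own assumptions for illustration; the three data rows are the first three rows of the tobacco data used in the example that follows.

```r
# Write a small CSV to a temporary location so the example runs anywhere
# (tempfile() is a stand-in for a real path such as C:/filename.csv)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("BurnRate,PercentSugar,PercentNicotine",
             "1.55,20.05,1.38",
             "1.63,12.58,2.64",
             "1.66,18.56,1.56"), csv_path)

# header = TRUE treats the first line as column names; sep = "," is
# already the default for read.csv but is spelled out here
dat <- read.csv(file = csv_path, header = TRUE, sep = ",")

str(dat)    # check column names and types
head(dat)   # look at the first rows
```

Checking the result with `str()` before going further is a cheap way to catch columns that were read as text instead of numbers.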

Example:

Here I use a tobacco data set for testing purposes.

> tobacco <- read.csv(file="C:/tobacco.csv", header=TRUE, sep=",")
> tobacco
   BurnRate PercentSugar PercentNicotine
1      1.55        20.05            1.38
2      1.63        12.58            2.64
3      1.66        18.56            1.56
4      1.52        18.56            2.22
5      1.70        14.02            2.85
6      1.68        15.64            1.24
7      1.78        14.52            2.86
8      1.57        18.52            2.18
9      1.60        17.84            1.65
10     1.52        13.38            3.28
11     1.68        17.55            1.56
12     1.74        17.97            2.00
13     1.93        14.66            2.88
14     1.77        17.31            1.36
15     1.94        14.32            2.66
16     1.83        15.05            2.43
17     2.09        15.47            2.42
18     1.72        16.85            2.16
19     1.49        17.42            2.12
20     1.52        18.55            1.87
21     1.64        18.74            2.10
22     1.40        14.79            2.21
23     1.78        18.86            2.00
24     1.93        15.62            2.26
25     1.53        18.56            2.14
> mu <- colMeans(tobacco)
> mu
       BurnRate    PercentSugar PercentNicotine
         1.6880         16.6156          2.1612
> cm <- cov(tobacco)
> cm
                   BurnRate PercentSugar PercentNicotine
BurnRate         0.02787500   -0.1098050      0.01886083
PercentSugar    -0.10980500    4.2276840     -0.75646533
PercentNicotine  0.01886083   -0.7564653      0.27466933
> D2 <- mahalanobis(tobacco, mu, cm)
> D2
 [1] 3.08827463 5.35466197 1.37251420 2.61209613 2.07211223 8.90626020
 [7] 1.85354309 1.96263411 1.10087851 7.04624993 1.56621848 0.78813845
[13] 3.37468305 3.77347055 2.78904427 0.99063959 5.87881205 0.08359811
[19] 1.47435780 1.45810005 1.80081271 5.88148893 2.52555955 2.13920930
[25] 2.10664213

That's all. D2 now holds the squared Mahalanobis distance of each of the 25 observations, ready for further analysis.
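One common follow-up is to compare the squared distances with a chi-square quantile to flag unusual observations. The sketch below is only an illustration under assumptions of my own: the data are taken to be roughly multivariate normal, the 0.95 quantile is an arbitrary choice, and the names `cutoff` and `flagged` are hypothetical. The distances are copied from the output above.

```r
# Squared Mahalanobis distances copied from the output above
D2 <- c(3.08827463, 5.35466197, 1.37251420, 2.61209613, 2.07211223, 8.90626020,
        1.85354309, 1.96263411, 1.10087851, 7.04624993, 1.56621848, 0.78813845,
        3.37468305, 3.77347055, 2.78904427, 0.99063959, 5.87881205, 0.08359811,
        1.47435780, 1.45810005, 1.80081271, 5.88148893, 2.52555955, 2.13920930,
        2.10664213)

# Assumed rule: under approximate multivariate normality, D^2 is compared
# with a chi-square quantile whose degrees of freedom equal the number of
# variables (3 here: BurnRate, PercentSugar, PercentNicotine)
p <- 3
cutoff <- qchisq(0.95, df = p)

# Row numbers of observations whose squared distance exceeds the cutoff
flagged <- which(D2 > cutoff)

print(cutoff)
print(flagged)
```

A stricter or looser quantile changes which rows are flagged, so treat the result as a prompt for closer inspection rather than a verdict.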
