The Research Mining Technology

Tuesday 2 October 2012

Introduction and Applications of Outliers



Introduction:
        An outlier is an observation (or measurement) that differs markedly from the other values in a given dataset. Outliers can have several causes: the measurement may have been incorrectly observed, recorded or entered into the process computer, or the observed datum may come from a different population than the normal one, in which case it is correctly measured but represents a rare event. The literature offers several definitions of an outlier; the most commonly cited are listed below.
Definitions:
“An outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” (Hawkins, 1980).
“An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of the dataset” (Barnett & Lewis, 1994).
“An outlier is an observation that lies outside the overall pattern of a distribution” (Moore and McCabe, 1999).
“Outliers are those data records that do not follow any pattern in an application” (Chen et al., 2002).
“An outlier in a set of data is an observation or a point that is considerably dissimilar or inconsistent with the remainder of the data” (Ramaswamy et al., 2000).
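To make these definitions concrete, here is a minimal Python sketch of one common rule of thumb: flag any value that lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The rule, the function name `iqr_outliers`, the multiplier `k` and the sample readings are all assumptions chosen for illustration; none of the definitions above prescribes this particular test.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    Illustrative rule of thumb only; k=1.5 is a conventional default,
    not something taken from the definitions quoted above.
    """
    # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sensor readings with one suspicious value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]
print(iqr_outliers(readings))
```

On this made-up sample only the reading 25.0 falls outside the fences. Whether it is a recording error or a correctly measured rare event is exactly the question the definitions leave to the analyst.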
Many data mining algorithms try to minimize the influence of outliers, for instance on the final model being developed, or to eliminate them in the data pre-processing phase. However, a data miner should be careful when automatically detecting and eliminating outliers because, if the data are correct, their elimination can cause the loss of important hidden information (Kantardzic, 2003). In some data mining applications outlier detection is the focus, and the outliers themselves are the essential result of the data analysis (Sane & Ghatol, 2006).
Outlier detection techniques find applications in credit card fraud, network robustness analysis, network intrusion detection, financial applications and marketing (Han & Kamber, 2001).
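The influence of a single outlier on a simple model can be seen in a small Python sketch. It compares how far the mean and the median move when one extreme value is added to a sample. The data are invented for illustration, and the mean and median stand in for "a model" only in the loosest sense.

```python
import statistics

# Hypothetical measurements, then the same sample with one extreme value.
clean = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
with_outlier = clean + [25.0]

# The mean is pulled strongly toward the extreme value...
mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
# ...while the median, a robust estimator, barely moves.
median_shift = statistics.median(with_outlier) - statistics.median(clean)

print(f"mean shift:   {mean_shift:.3f}")
print(f"median shift: {median_shift:.3f}")
```

A robust estimator limits the damage without discarding the point, which matters when the extreme value is correct and carries information.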
A more exhaustive list of applications that exploit outlier detection is provided below (Hodge, 2004):
  • Fraud detection: fraudulent applications for credit cards, state benefits or fraudulent usage of credit cards or mobile phones.
  • Loan application processing: fraudulent applications or potentially problematic customers.
  • Intrusion detection, such as unauthorized access in computer networks.
  • Activity monitoring: for instance, detecting mobile phone fraud by monitoring phone activity, or detecting suspicious trades in the equity markets.
  • Network performance: monitoring of the performance of computer networks, for example to detect network bottlenecks.
  • Fault diagnosis: process monitoring to detect faults, for instance in motors, generators, and pipelines.
  • Structural defect detection, such as monitoring of manufacturing lines to detect faulty production runs.
  • Satellite image analysis: identification of novel features or misclassified features.
  • Detecting novelties in images (for robot neotaxis or surveillance systems).
  • Motion segmentation: detecting the features of moving images independently of the background.
  • Time-series monitoring: monitoring of safety critical applications such as drilling or high-speed milling.
  • Medical condition monitoring (such as heart rate monitors).
  • Pharmaceutical research (identifying novel molecular structures).
  • Detecting novelty in text: to detect the onset of news stories, for topic detection and tracking, or for traders to pinpoint equity, commodities.
  • Detecting unexpected entries in databases (in data mining applications, with the aim of detecting errors, fraud or valid but unexpected entries).
  • Detecting mislabeled data in a training data set.
How an outlier detection system deals with an outlier depends on the application area. A system that models data with naturally occurring outlier points should use a classification algorithm that is robust to outliers. In any case, the system must detect outliers in real time and alert the system administrator. Once the situation has been handled, the anomalous reading may be stored separately for comparison with any new case, but would probably not be stored with the main system data, because these techniques tend to model normality and use outliers to detect anomalies (Hodge, 2004).
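The handling described above (model normality, flag anomalous readings as they arrive, and keep them apart from the main data) can be sketched in Python as follows. The class name `StreamMonitor`, the three-standard-deviation threshold and the sample values are hypothetical choices for illustration; the survey does not prescribe this design, and a real monitoring system would need a far more careful model of normal behaviour.

```python
import statistics

class StreamMonitor:
    """Toy monitor: models normality from past readings, flags deviations."""

    def __init__(self, baseline, threshold=3.0):
        # baseline needs at least two values so a spread can be estimated.
        self.normal = list(baseline)   # main data: readings judged normal
        self.anomalies = []            # anomalous readings, stored separately
        self.threshold = threshold     # assumed cutoff, in standard deviations

    def observe(self, value):
        """Return True if value is flagged as an outlier (caller should alert)."""
        mean = statistics.mean(self.normal)
        spread = statistics.stdev(self.normal)
        is_outlier = spread > 0 and abs(value - mean) > self.threshold * spread
        if is_outlier:
            # Kept for comparison with later cases, not mixed into the model.
            self.anomalies.append(value)
        else:
            self.normal.append(value)
        return is_outlier

monitor = StreamMonitor([10.1, 9.8, 10.3, 10.0, 9.9, 10.2])
for reading in [10.05, 25.0]:
    if monitor.observe(reading):
        print(f"ALERT: anomalous reading {reading}")
```

Because flagged readings never enter `normal`, a burst of anomalies cannot gradually redefine what the monitor treats as normal.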
References:
  1. Barnett, V. & Lewis, T. (1994). Outliers in Statistical Data. John Wiley, Chichester. ISBN 0-471-93094-6.
  2. Chen, Z.; Fu, A. & Tang, J. (2002). Detection of Outliered Patterns. Dept. of CSE, Chinese University of Hong Kong.
  3. Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
  4. Hawkins, D. (1980). Identification of Outliers. Chapman and Hall, London.
  5. Hodge, V.J. (2004). A survey of outlier detection methodologies. Kluwer Academic Publishers, Netherlands, January 2004.
  6. Kantardzic, M. (2003). Data Mining: Concepts, Models, Methods, and Algorithms. IEEE Transactions on Neural Networks, Vol. 14, No. 2, March 2003.
  7. Moore, D.S. & McCabe, G.P. (1999). Introduction to the Practice of Statistics. Freeman & Company.
  8. Ramaswamy, S.; Rastogi, R. & Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 427-438, ISBN 1-58113-217-4, Dallas, Texas, United States.
  9. Sane, S. & Ghatol, A. (2006). Use of Instance Typicality for Efficient Detection of Outliers with Neural Network Classifiers. Proceedings of the 9th International Conference on Information Technology, ISBN 0-7695-2635-7.