Introduction:
An outlier is an observation (or measurement) that differs markedly from the
other values contained in a given dataset. Outliers can arise from several
causes. The measurement may have been incorrectly observed, recorded or entered
into the process computer. Alternatively, the observed datum may come from a
different population than the normal situation, in which case it is correctly
measured but represents a rare event. Several definitions of outlier exist in
the literature; the most commonly cited are reported below:
Definitions:
“An outlier is an observation that deviates so much from other observations as
to arouse suspicions that it was generated by a different mechanism” (Hawkins,
1980).
“An outlier is an observation (or subset of observations) which appears to be
inconsistent with the remainder of the dataset” (Barnett & Lewis, 1994).
“An outlier is an observation that lies outside the overall pattern of a
distribution” (Moore and McCabe, 1999).
“Outliers are those data records that do not follow any pattern in an
application” (Chen et al., 2002).
“An outlier in a set of data is an observation or a point that is considerably
dissimilar or inconsistent with the remainder of the data” (Ramaswamy et al.,
2000).
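None of the definitions above prescribes a numerical test. As one possible
illustration of "deviates so much from other observations", the following is a
minimal sketch based on Tukey's interquartile-range rule. That rule is a common
convention and is an assumption here, not a method taken from the cited
sources. The function name `iqr_outliers`, the multiplier `k=1.5` and the sample
readings are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: Tukey's fences with k=1.5; other rules are equally valid.
    """
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sensor readings with one suspicious value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]
print(iqr_outliers(readings))
```

A rule of this kind only marks candidates. Whether a flagged value is a
recording error or a correctly measured rare event still has to be decided from
knowledge of the application.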
Many data mining algorithms try to minimize the influence of outliers on the
final model, or to eliminate them in the data pre-processing phase. However, a
data miner should be careful when automatically detecting and eliminating
outliers because, if the data are correct, their elimination can cause the loss
of important hidden information (Kantardzic, 2003). In some data mining
applications outlier detection is itself the goal, and the outliers are the
essential result of the data analysis (Sane & Ghatol, 2006).
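To make "minimizing the influence of outliers" concrete, the sketch below
compares how a single extreme value shifts the mean and the median of a small
sample. It is only an illustration under assumed data: the readings are
hypothetical, and the median stands in for the broader family of robust
estimators rather than for any specific algorithm named in the sources.

```python
import statistics

# Hypothetical readings, with and without one extreme value.
clean = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
with_outlier = clean + [25.0]

# How far each summary statistic moves when the extreme value is included.
mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)

print(f"mean shift:   {mean_shift:.3f}")
print(f"median shift: {median_shift:.3f}")
```

The median moves far less than the mean, so the extreme value can stay in the
dataset without dominating the summary. This is one way to limit the influence
of an outlier without discarding information that may turn out to be correct.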
Outlier detection techniques find applications in credit card fraud detection,
network robustness analysis, network intrusion detection, financial
applications and marketing (Han & Kamber, 2001).
A more exhaustive list of applications that exploit outlier detection is
provided below (Hodge, 2004):
- Fraud detection: fraudulent applications for credit cards, state benefits or fraudulent usage of credit cards or mobile phones.
- Loan application processing: fraudulent applications or potentially problematic customers.
- Intrusion detection, such as unauthorized access in computer networks.
- Activity monitoring: for instance the detection of mobile phone fraud by monitoring phone activity or suspicious trades in the equity markets.
- Network performance: monitoring of the performance of computer networks, for example to detect network bottlenecks.
- Fault diagnosis: process monitoring to detect faults, for instance in motors, generators, and pipelines.
- Structural defect detection, such as monitoring of manufacturing lines to detect faulty production runs.
- Satellite image analysis: identification of novel features or misclassified features.
- Detecting novelties in images (for robot neotaxis or surveillance systems).
- Motion segmentation, such as detection of the features of moving images independently of the background.
- Time-series monitoring: monitoring of safety-critical applications such as drilling or high-speed milling.
- Medical condition monitoring (such as heart rate monitors).
- Pharmaceutical research (identifying novel molecular structures).
- Detecting novelty in text: to detect the onset of news stories, for topic detection and tracking, or for traders to pinpoint equity, commodities.
- Detecting unexpected entries in databases (in data mining applications, with the aim of detecting errors, fraud or valid but unexpected entries).
- Detecting mislabeled data in a training data set.
How the outlier detection system deals with an outlier depends on the
application area. To model data with naturally occurring outlier points, a
system should use a classification algorithm that is robust to outliers. In any
case the system must detect the outlier in real time and alert the system
administrator. Once the situation has been handled, the anomalous reading may
be stored separately for comparison with any new case, but would probably not
be stored with the main system data, as these techniques tend to model
normality and use that model to detect anomalies (Hodge, 2004).
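The workflow described above (model normality, check each new reading as it
arrives, alert, and keep anomalies apart from the main data) can be sketched as
follows. This is a minimal illustration, not the design of any system described
in the sources: the class name `NormalityMonitor`, the three-standard-deviation
threshold and the `alert` callback are all hypothetical, and a mean and standard
deviation are assumed to be an adequate model of "normal" readings.

```python
import statistics


class NormalityMonitor:
    """Hypothetical monitor that models normal data and flags deviations."""

    def __init__(self, baseline, threshold=3.0, alert=print):
        # The baseline needs at least two readings for a standard deviation.
        self.main_data = list(baseline)  # readings accepted as normal
        self.anomalies = []              # anomalous readings, kept separately
        self.threshold = threshold       # assumed cut-off, in standard deviations
        self.alert = alert               # stand-in for notifying an administrator

    def observe(self, value):
        """Check one new reading and return True if it was flagged."""
        mean = statistics.mean(self.main_data)
        spread = statistics.stdev(self.main_data)
        # If all normal readings are identical (spread == 0), nothing is flagged.
        if spread > 0 and abs(value - mean) / spread > self.threshold:
            self.anomalies.append(value)  # not mixed into the main data
            self.alert(f"anomalous reading: {value}")
            return True
        self.main_data.append(value)      # normal readings refine the model
        return False


monitor = NormalityMonitor([10.1, 9.8, 10.3, 10.0, 9.9, 10.2])
monitor.observe(10.05)  # close to the baseline, accepted as normal
monitor.observe(25.0)   # far from the baseline, flagged and stored separately
print(monitor.anomalies)
```

Keeping flagged readings out of `main_data` means the model of normality is not
distorted by the anomalies it is meant to detect, while the separate
`anomalies` list remains available for comparison with later cases. The sketch
recomputes the statistics on every reading for clarity; a real-time system
would more likely update them incrementally.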
References:
- Barnett, V. & Lewis, T. (1994). Outliers in Statistical Data. John Wiley, Chichester. ISBN 0-471-93094-6.
- Chen, Z.; Fu, A. & Tang, J. (2002). Detection of Outliered Patterns. Dept. of CSE, Chinese University of Hong Kong.
- Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
- Hawkins, D. (1980). Identification of Outliers. Chapman and Hall, London.
- Hodge, V.J. (2004). A survey of outlier detection methodologies. Kluwer Academic Publishers, Netherlands, January 2004.
- Kantardzic, M. (2003). Data Mining: Concepts, Models, Methods and Algorithms. IEEE Transactions on Neural Networks, Vol. 14, No. 2, March 2003.
- Moore, D.S. & McCabe, G.P. (1999). Introduction to the Practice of Statistics. Freeman & Company.
- Ramaswamy, S.; Rastogi, R. & Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 427-438, ISBN 1-58113-217-4, Dallas, Texas, United States.
- Sane, S. & Ghatol, A. (2006). Use of Instance Typicality for Efficient Detection of Outliers with Neural Network Classifiers. Proceedings of the 9th International Conference on Information Technology, ISBN 0-7695-2635-7.