The Research Mining Technology

Wednesday, 3 October 2012

Sample applications of outlier detection

Fraud detection
  • The purchasing behavior of a credit card owner usually changes when the card is stolen.
  • Abnormal buying patterns can therefore characterize credit card abuse.
Medicine
  • Unusual symptoms or test results may indicate potential health problems in a patient.
  • Whether a particular test result is abnormal may depend on other characteristics of the patient (e.g. gender, age, …).
Public health
  • The occurrence of a particular disease, e.g. tetanus, scattered across various hospitals of a city may indicate problems with the corresponding vaccination program in that city.
  • Whether an occurrence is abnormal depends on different aspects such as frequency, spatial correlation, etc.
Sports statistics
  • In many sports, various parameters are recorded for players in order to evaluate their performance.
  • Outstanding players (in a positive as well as a negative sense) may be identified as having abnormal parameter values.
  • Sometimes, players show abnormal values only on a subset or a special combination of the recorded parameters.
Detecting measurement errors
  • Data derived from sensors (e.g. in a scientific experiment) may contain measurement errors.
  • Abnormal values could provide an indication of a measurement error.
  • Removing such errors can be important in other data mining and data analysis tasks (see the sketch below).
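As a minimal illustrative sketch of that last point (the sensor trace, threshold, and function name are invented for this example, not taken from any particular system), readings can be screened by their distance from the sample mean in standard deviations:

```python
from statistics import mean, stdev

def flag_measurement_errors(readings, threshold=2.0):
    """Return readings more than `threshold` standard deviations from the mean.

    Note: the outlier itself inflates the standard deviation, so for small
    samples a modest threshold (or a robust scale such as the MAD) works better.
    """
    mu, sigma = mean(readings), stdev(readings)
    return [x for x in readings if abs(x - mu) > threshold * sigma]

# Hypothetical sensor trace: one value recorded with a slipped decimal point.
trace = [20.1, 19.8, 20.3, 20.0, 201.0, 19.9, 20.2]
print(flag_measurement_errors(trace))  # [201.0]
```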
“One person's noise could be another person's signal.”

Tuesday, 2 October 2012

Types of Outliers

An important aspect of an outlier detection technique is the nature of the desired outlier. Outliers can be classified into the following three categories:
  • Point Outliers
  • Contextual Outliers
  • Collective Outliers
Point Outliers:
If an individual data instance can be considered anomalous with respect to the rest of the data, it is termed a point outlier. This is the simplest type of outlier and is the focus of the majority of research on outlier detection. For example, in Figure 1, points o1 and o2 as well as points in region O3 lie outside the boundary of the normal regions, and hence are point outliers since they differ from the normal data points. As a real-life example, consider credit card fraud detection with a data set of an individual's credit card transactions described by a single feature: the amount spent. A transaction for which the amount spent is very high compared to that person's normal range of expenditure will be a point outlier, as in the sketch below.
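A minimal sketch of point-outlier detection on such one-feature data, using Tukey's interquartile-range fences (the spend amounts and the crude quartile computation are illustrative assumptions):

```python
def iqr_outliers(amounts, k=1.5):
    """Tukey's fences: values beyond k * IQR outside the quartiles are point outliers."""
    xs = sorted(amounts)
    n = len(xs)
    q1, q3 = xs[n // 4], xs[(3 * n) // 4]  # crude quartiles, adequate for a sketch
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in amounts if x < lo or x > hi]

# Hypothetical transaction amounts; the $2,500 purchase is the point outlier.
spend = [95, 110, 87, 120, 105, 99, 2500, 101, 93]
print(iqr_outliers(spend))  # [2500]
```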
Contextual Outliers:
If a data instance is anomalous in a specific context (but not otherwise), it is termed a contextual outlier (also referred to as a conditional outlier [1]). The notion of a context is induced by the structure in the data set and has to be specified as part of the problem formulation. Each data instance is defined using two sets of attributes:
Contextual attributes. The contextual attributes are used to determine the context (or neighborhood) for that instance. For example, in spatial data sets, the longitude and latitude of a location are the contextual attributes. In time series data, time is a contextual attribute which determines the position of an instance on the entire sequence.
Behavioral attributes. The behavioral attributes define the non-contextual characteristics of an instance. For example, in a spatial data set describing the average rainfall of the entire world, the amount of rainfall at any location is a behavioral attribute.
The anomalous behavior is determined using the values of the behavioral attributes within a specific context. A data instance might be a contextual outlier in a given context, but an identical data instance (in terms of behavioral attributes) could be considered normal in a different context. This property is key in identifying contextual and behavioral attributes for a contextual outlier detection technique.

Figure 3: Contextual outlier t2 in a temperature time series. The temperature at time t1 is the same as that at time t2, but occurs in a different context and hence is not considered an outlier.
Contextual outliers have been most commonly explored in time-series data [2] and spatial data [3]. Figure 3 shows one such example: a time series of the monthly temperature of an area over the last few years. A temperature of 35F might be normal during the winter (at time t1) at that place, but the same value during the summer (at time t2) would be an outlier. Similarly, a six-foot-tall adult may be perfectly normal, but viewed in the context of age, a six-foot-tall child would definitely be an outlier.
A similar example can be found in credit card fraud detection, with the context defined as the time of purchase. Suppose an individual usually has a weekly shopping bill of $100 except during the Christmas week, when it reaches $1000. A new purchase of $1000 in a week in July will be considered a contextual outlier, since it does not conform to the individual's normal behavior in the context of time (even though the same amount spent during Christmas week would be considered normal).
The choice of applying a contextual outlier detection technique is determined by the meaningfulness of contextual outliers in the target application domain. Applying such a technique makes sense when contextual attributes are readily available, so that defining a context is straightforward; it becomes difficult when a meaningful context cannot easily be defined. A minimal sketch of context-conditioned scoring follows.
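In this sketch (the temperature history and threshold are invented), each reading is scored only against other readings from the same month, so the same value can be normal in one context and anomalous in another:

```python
from collections import defaultdict
from statistics import mean, stdev

def contextual_outliers(readings, threshold=2.0):
    """readings: (month, temperature) pairs; month is the contextual attribute."""
    by_month = defaultdict(list)
    for month, temp in readings:
        by_month[month].append(temp)
    flagged = []
    for month, temp in readings:
        temps = by_month[month]
        if len(temps) < 2:
            continue  # not enough context to judge this reading
        mu, sigma = mean(temps), stdev(temps)
        if sigma > 0 and abs(temp - mu) > threshold * sigma:
            flagged.append((month, temp))
    return flagged

# Hypothetical history: 35F is unremarkable in January but anomalous in July.
data = [("Jan", 33), ("Jan", 35), ("Jan", 31), ("Jan", 36),
        ("Jul", 88), ("Jul", 91), ("Jul", 86), ("Jul", 90),
        ("Jul", 89), ("Jul", 92), ("Jul", 87), ("Jul", 35)]
print(contextual_outliers(data))  # [('Jul', 35)]
```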
Collective Outliers:
If a collection of related data instances is anomalous with respect to the entire data set, it is termed a collective outlier. The individual data instances in a collective outlier may not be outliers by themselves, but their occurrence together as a collection is anomalous. Figure 4 illustrates an example showing a human electrocardiogram output [4]. The highlighted region denotes an outlier because the same low value persists for an abnormally long time (corresponding to an Atrial Premature Contraction). Note that the low value by itself is not an outlier; it is its sustained occurrence over a long period that is anomalous.

Figure 4: Collective outlier in a human ECG output corresponding to an Atrial Premature Contraction.
As another illustrative example, consider a sequence of actions occurring in a computer, as shown below: ……http-web, buffer-overflow, http-web, http-web, smtp-mail, ftp, http-web, ssh, smtp-mail, http-web, ssh, buffer-overflow, ftp, http-web, ftp, smtp-mail, http-web…… The highlighted sequence of events (buffer-overflow, ssh, ftp) corresponds to a typical web-based attack by a remote machine, followed by copying of data from the host computer to a remote destination via ftp. Note that this collection of events is an outlier even though the individual events are not outliers when they occur elsewhere in the sequence.
Collective outliers have been explored for sequence data [5,6], graph data [7], and spatial data [8]. Note that while point outliers can occur in any data set, collective outliers can occur only in data sets in which data instances are related. The occurrence of contextual outliers, in contrast, depends on the availability of contextual attributes in the data. A point outlier or a collective outlier can also be a contextual outlier if analyzed with respect to a context; thus a point or collective outlier detection problem can be transformed into a contextual outlier detection problem by incorporating the context information. A run-length sketch in the spirit of the ECG example follows.
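In this minimal sketch (the trace, the "low" level, and the run cutoff are all invented), isolated low values pass, but a run of low values longer than expected is reported as a collective outlier:

```python
def collective_low_runs(signal, low=0, max_run=3):
    """Return (start_index, length) of runs of values <= low longer than max_run."""
    runs, start = [], None
    for i, v in enumerate(signal + [float("inf")]):  # sentinel flushes a trailing run
        if v <= low:
            if start is None:
                start = i
        elif start is not None:
            if i - start > max_run:
                runs.append((start, i - start))
            start = None
    return runs

# Hypothetical ECG-like trace: single low values are normal beats,
# but the sustained low stretch starting at index 6 is a collective outlier.
trace = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1]
print(collective_low_runs(trace))  # [(6, 5)]
```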
Reference:
Forrest, S., Warrender, C., and Pearlmutter, B. 1999. Detecting intrusions using system calls: Alternate data models. In Proceedings of the 1999 IEEE ISRSP. IEEE Computer Society, Washington, DC, USA, 133 - 145.
Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. C., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., and Stanley, H. E. 2000. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 101, 23, e215 - e220. Circulation Electronic Pages: http://circ.ahajournals.org/cgi/content/full/101/23/e215.
Kou, Y., Lu, C.-T., and Chen, D. 2006. Spatial weighted outlier detection. In Proceedings of SIAM Conference on Data Mining.
Noble, C. C. and Cook, D. J. 2003. Graph-based outlier detection. In Proceedings of the 9th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, 631 - 636.
Sekar, R., Bendre, M., Dhurjati, D., and Bollineni, P. 2001. A fast automaton-based method for detecting anomalous program behaviors. In Proceedings of the IEEE Symposium on Security and Privacy. IEEE Computer Society, 144.
Song, X., Wu, M., Jermaine, C., and Ranka, S. (2007). Conditional outlier detection. IEEE Transactions on Knowledge and Data Engineering 19, 5, 631-645.
Sun, P., Chawla, S., and Arunasalam, B. 2006. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining.
Weigend, A. S., Mangeas, M., and Srivastava, A. N. (1995). Nonlinear gated experts for time-series – discovering regimes and avoiding overfitting. International Journal of Neural Systems 6, 4, 373-399.

Introduction and applications of outliers



Introduction:
An outlier is an observation (or measurement) that differs markedly from the other values in a given dataset. Outliers can be due to several causes: the measurement can be incorrectly observed, recorded, or entered into the process computer, or the observed datum can come from a different population than the normal one, in which case it is correctly measured but represents a rare event. Different definitions of outlier exist in the literature; the most commonly cited are reported below:
Definitions:
“An outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” (Hawkins, 1980).
“An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of the dataset” (Barnett & Lewis, 1994).
“An outlier is an observation that lies outside the overall pattern of a distribution” (Moore and McCabe, 1999).
“Outliers are those data records that do not follow any pattern in an application” (Chen et al., 2002).
“An outlier in a set of data is an observation or a point that is considerably dissimilar or inconsistent with the remainder of the data” (Ramaswamy et al., 2000).
Many data mining algorithms try to minimize the influence of outliers on the final model, or to eliminate them in the data pre-processing phase. However, a data miner should be careful when automatically detecting and eliminating outliers because, if the data are correct, their elimination can cause the loss of important hidden information (Kantardzic, 2003). Some data mining applications are focused on outlier detection, and there the outliers themselves are the essential result of the data analysis (Sane & Ghatol, 2006).
The outlier detection techniques find applications in credit card fraud, network robustness analysis, network intrusion detection, financial applications and marketing (Han & Kamber, 2001).
A more exhaustive list of applications that exploit outlier detection is provided below (Hodge, 2004):
  • Fraud detection: fraudulent applications for credit cards, state benefits or fraudulent usage of credit cards or mobile phones.
  • Loan application processing: fraudulent applications or potentially problematical customers.
  • Intrusion detection, such as unauthorized access in computer networks.
  • Activity monitoring: for instance the detection of mobile phone fraud by monitoring phone activity or suspicious trades in the equity markets.
  • Network performance: monitoring of the performance of computer networks, for example to detect network bottlenecks.
  • Fault diagnosis: processes monitoring to detect faults for instance in motors, generators, and pipelines.
  • Structural defect detection, such as monitoring of manufacturing lines to detect faulty production runs.
  • Satellite image analysis: identification of novel features or misclassified features.
  • Detecting novelties in images (for robot neotaxis or surveillance systems).
  • Motion segmentation: for instance, detecting image features that move independently of the background.
  • Time-series monitoring: monitoring of safety critical applications such as drilling or high-speed milling.
  • Medical condition monitoring (such as heart rate monitors).
  • Pharmaceutical research (identifying novel molecular structures).
  • Detecting novelty in text: to detect the onset of news stories, for topic detection and tracking, or for traders to pinpoint equity and commodities trading stories.
  • Detecting unexpected entries in databases (in data mining application, to the aim of detecting errors, frauds or valid but unexpected entries).
  • Detecting mislabeled data in a training data set.
How the outlier detection system deals with outliers depends on the application area. To model data with naturally occurring outlier points, a system should use a classification algorithm that is robust to outliers. In any case, the system must detect outliers in real time and alert the system administrator. Once the situation has been handled, the anomalous reading may be stored separately for comparison with new cases, but it would probably not be stored with the main system data, as these techniques tend to model normality and use outliers to detect anomalies (Hodge, 2004).
Reference:
  1. Barnett, V. & Lewis, T. (1994), Outliers in Statistical Data, John Wiley, ISBN 0-471-93094-6, Chichester.
  2. Chen, Z.; Fu, A. & Tang, J. (2002), Detection of Outliered Patterns, Dept. of CSE, Chinese University of Hong Kong.
  3. Han, J. & Kamber, M. (2001), Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.
  4. Hawkins, D. (1980), Identification of Outliers, Chapman and Hall, London.
  5. Hodge, V.J. (2004), A Survey of Outlier Detection Methodologies, Kluwer Academic Publishers, Netherlands, January 2004.
  6. Kantardzic, M. (2003), Data Mining: Concepts, Models, Methods, and Algorithms, Wiley-IEEE Press.
  7. Moore, D.S. & McCabe, G.P. (1999), Introduction to the Practice of Statistics, Freeman & Company.
  8. Ramaswamy, S.; Rastogi, R. & Shim, K. (2000), Efficient algorithms for mining outliers from large data sets, Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 427-438, ISBN 1-58113-217-4, Dallas, Texas, United States.
  9. Sane, S. & Ghatol, A. (2006), Use of Instance Typicality for Efficient Detection of Outliers with Neural Network Classifiers, Proceedings of the 9th International Conference on Information Technology, ISBN 0-7695-2635-7.

Monday, 1 October 2012

Glossary of data mining

Glossary of data mining terms 

Accuracy
Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied to models, accuracy refers to the degree of fit between the model and the data. This measures how error-free the model’s predictions are. Since accuracy does not include cost information, it is possible for a less accurate model to be more cost-effective. Also see precision.
Activation function
A function used by a node in a neural net to transform input data from any domain of values into a finite range of values. The original idea was to approximate the way neurons fired, and the activation function took on the value 0 until the input became large and the value jumped to 1. The discontinuity of this 0-or-1 function caused mathematical problems, and sigmoid-shaped functions (e.g., the logistic function) are now used.
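For instance, the logistic function maps any real input into the open interval (0, 1); a minimal illustrative sketch:

```python
import math

def logistic(x):
    """Smooth 0-to-1 activation that replaced the discontinuous step function."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-6, -1, 0, 1, 6):
    print(x, round(logistic(x), 3))  # 0.002, 0.269, 0.5, 0.731, 0.998
```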
Antecedent
When an association between two variables is defined, the first item (or left-hand side) is called the antecedent. For example, in the relationship “When a prospector buys a pick, he buys a shovel 14% of the time,” “buys a pick” is the antecedent.
API
An application program interface. When a software system features an API, it provides a means by which programs written outside of the system can interface with the system to perform additional functions. For example, a data mining software system may have an API which permits user-written programs to perform such tasks as extract data, perform additional statistical analysis, create specialized charts, generate a model, or make a prediction from a model.
Associations
An association algorithm creates rules that describe how often events have occurred together. For example, “When prospectors buy picks, they also buy shovels 14% of the time.” Such relationships are typically expressed with a confidence value (see confidence).
Back propagation
A training method used to calculate the weights in a neural net from the data.
Bias
In a neural network, bias refers to the constant terms in the model. (Note that bias has a different meaning to most data analysts.) Also see precision.
Binning
A data preparation activity that converts continuous data to discrete data by replacing a value from a continuous range with a bin identifier, where each bin represents a range of values. For example, age could be converted to bins such as 20 or under, 21-40, 41-65 and over 65.
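A minimal sketch of exactly that age binning (only the helper name is invented):

```python
from bisect import bisect_left

EDGES = [20, 40, 65]  # a value at or below an edge falls in that edge's bin
LABELS = ["20 or under", "21-40", "41-65", "over 65"]

def bin_age(age):
    """Replace a continuous (integer) age with its discrete bin label."""
    return LABELS[bisect_left(EDGES, age)]

for age in (18, 20, 21, 40, 41, 65, 70):
    print(age, "->", bin_age(age))  # 18 and 20 -> "20 or under", 21 -> "21-40", ...
```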
Bootstrapping
Training data sets are created by re-sampling with replacement from the original training set, so data records may occur more than once. In other words, this method treats a sample as if it were the entire population. Usually, final estimates are obtained by taking the average of the estimates from each of the bootstrap test sets.
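A minimal sketch of bootstrapping an estimate of the mean (the sample and the number of resamples are arbitrary):

```python
import random
from statistics import mean

random.seed(0)  # reproducible sketch
sample = [4, 8, 15, 16, 23, 42]

# Each bootstrap set is drawn with replacement from the original sample,
# so individual records can appear more than once in a set.
boot_means = [mean(random.choices(sample, k=len(sample))) for _ in range(1000)]
print(round(mean(boot_means), 2))  # final estimate: the average over all bootstrap sets
```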
CART
Classification And Regression Trees. CART is a method of splitting the independent variables into small groups and fitting a constant function to the small data sets. In categorical trees, the constant function is one that takes on a finite small set of values (e.g., Y or N, low or medium or high). In regression trees, the mean value of the response is fit to small connected data sets.
Categorical data
Categorical data fits into a small number of discrete categories (as opposed to continuous). Categorical data is either non-ordered (nominal) such as gender or city, or ordered (ordinal) such as high, medium, or low temperatures.
CHAID
An algorithm for fitting categorical trees. It relies on the chi-squared statistic to split the data into small connected data sets.
Chi-squared
A statistic that assesses how well a model fits the data. In data mining, it is most commonly used to find homogeneous subsets for fitting categorical trees as in CHAID.
Classification
Refers to the data mining problem of attempting to predict the category of categorical data by building a model based on some predictor variables.
Classification tree
A decision tree that places categorical variables into classes.
Cleaning (cleansing)
Refers to a step in preparing data for a data mining activity. Obvious data errors are detected and corrected (e.g., improbable dates) and missing data is replaced.
Clustering
Clustering algorithms find groups of items that are similar. For example, clustering could be used by an insurance company to group customers according to income, age, types of policies purchased and prior claims experience. It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other. Since the categories are unspecified, this is sometimes referred to as unsupervised learning.
Confidence
Confidence of rule “B given A” is a measure of how much more likely it is that B occurs when A has occurred. It is expressed as a percentage, with 100% meaning B always occurs if A has occurred. Statisticians refer to this as the conditional probability of B given A. When used with association rules, the term confidence is observational rather than predictive. (Statisticians also use this term in an unrelated way. There are ways to estimate an interval and the probability that the interval contains the true value of a parameter is called the interval confidence. So a 95% confidence interval for the mean has a probability of .95 of covering the true value of the mean.)
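A minimal sketch computing the confidence of “shovel given pick” over a toy set of transactions (the baskets are invented):

```python
# Hypothetical market baskets.
baskets = [
    {"pick", "shovel"}, {"pick"}, {"pick", "shovel", "rope"},
    {"lantern"}, {"pick", "shovel"}, {"pick", "rope"},
]

def confidence(transactions, a, b):
    """Confidence of "b given a": fraction of transactions with a that also contain b."""
    with_a = [t for t in transactions if a in t]
    return sum(1 for t in with_a if b in t) / len(with_a)

print(confidence(baskets, "pick", "shovel"))  # 3 of the 5 pick baskets have a shovel: 0.6
```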
Confusion matrix
A confusion matrix shows the counts of the actual versus predicted class values. It shows not only how well the model predicts, but also presents the details needed to see exactly where things may have gone wrong.
Consequent
When an association between two variables is defined, the second item (or right-hand side) is called the consequent. For example, in the relationship “When a prospector buys a pick, he buys a shovel 14% of the time,” “buys a shovel” is the consequent.
Continuous
Continuous data can have any value in an interval of real numbers. That is, the value does not have to be an integer. Continuous is the opposite of discrete or categorical.
Cross validation
A method of estimating the accuracy of a classification or regression model. The data set is divided into several parts, with each part in turn used to test a model fitted to the remaining parts.
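A minimal sketch of the k-fold split itself (the model fitting and scoring steps are left as placeholders):

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists; each part serves as the test set once."""
    part_of = [i % k for i in range(n)]  # assign records to k roughly equal parts
    for part in range(k):
        test = [i for i in range(n) if part_of[i] == part]
        train = [i for i in range(n) if part_of[i] != part]
        yield train, test

for train, test in k_fold_indices(10, 5):
    # A real run would fit the model on `train` and score it on `test` here.
    print("test fold:", test)
```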
Data
Values collected through record keeping or by polling, observing, or measuring, typically organized for analysis or decision making. More simply, data is facts, transactions and figures.
Data format
Data items can exist in many formats such as text, integer and floating-point decimal. Data format refers to the form of the data in the database.
Data mining
An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.
Data mining method
Procedures and algorithms designed to analyze the data in databases.
DBMS
Database management systems.
Decision tree
A tree-like way of representing a collection of hierarchical rules that lead to a class or value.
Deduction
Deduction infers information that is a logical consequence of the data.
Degree of fit
A measure of how closely the model fits the training data. A common measure is r-square.
Dependent variable
The dependent variables (outputs or responses) of a model are the variables predicted by the equation or rules of the model using the independent variables (inputs or predictors).
Deployment
After the model is trained and validated, it is used to analyze new data and make predictions. This use of the model is called deployment.
Dimension
Each attribute of a case or occurrence in the data being mined. Stored as a field in a flat-file record or a column of a relational database table.
Discrete
A data item that has a finite set of values. Discrete is the opposite of continuous.
Discriminant analysis
A statistical method based on maximum likelihood for determining boundaries that separate the data into categories.
Entropy
A way to measure variability other than the variance statistic. Some decision trees split the data into groups based on minimum entropy.
Exploratory analysis
Looking at data to discover relationships not previously detected. Exploratory analysis tools typically assist the user in creating tables and graphical displays.
External data
Data not collected by the organization, such as data available from a reference book, a government source or a proprietary database.
Feed-forward
A neural net in which the signals only flow in one direction, from the inputs to the outputs.
Fuzzy logic
Fuzzy logic is applied to fuzzy sets, where membership in a fuzzy set is a degree between 0 and 1, not necessarily 0 or 1. Non-fuzzy logic manipulates outcomes that are either true or false. Fuzzy logic needs to be able to manipulate degrees of “maybe” in addition to true and false.
Genetic algorithms
A computer-based method of generating and testing combinations of possible input parameters to find the optimal output. It uses processes based on natural evolution concepts such as genetic combination, mutation and natural selection.
GUI
Graphical User Interface.
Hidden nodes
The nodes in the hidden layers in a neural net. Unlike input and output nodes, the number of hidden nodes is not predetermined. The accuracy of the resulting model is affected by the number of hidden nodes. Since the number of hidden nodes directly affects the number of parameters in the model, a neural net needs a sufficient number of hidden nodes to enable it to properly model the underlying behavior. On the other hand, a net with too many hidden nodes will overfit the data. Some neural net products include algorithms that search over a number of alternative neural nets by varying the number of hidden nodes, in the end choosing the model that gets the best results without overfitting.
Independent variable
The independent variables (inputs or predictors) of a model are the variables used in the equation or rules of the model to predict the output (dependent) variable.
Induction
A technique that infers generalizations from the information in the data.
Interaction
Two independent variables interact when changes in the value of one change the effect on the dependent variable of the other.
Internal data
Data collected by an organization such as operating and customer data.
K-nearest neighbor
A classification method that classifies a point by calculating the distances between the point and points in the training data set. Then it assigns the point to the class that is most common among its k-nearest neighbors (where k is an integer).
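A minimal sketch of the method in pure Python (the toy points and k = 3 are illustrative):

```python
import math
from collections import Counter

def knn_classify(point, training, k=3):
    """training: (features, label) pairs. Vote among the k closest points."""
    nearest = sorted(training, key=lambda t: math.dist(point, t[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify((2, 2), train))  # "A": the three nearest neighbors are all class A
```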
Kohonen feature map
A type of neural network that uses unsupervised learning to find patterns in data. In data mining it is employed for cluster analysis.
Layer
Nodes in a neural net are usually grouped into layers, with each layer described as input, output or hidden. There are as many input nodes as there are input (independent) variables and as many output nodes as there are output (dependent) variables. Typically, there are one or two hidden layers.
Leaf
A node not further split — the terminal grouping — in a classification or decision tree.
Learning
Training models (estimating their parameters) based on existing data.
Least squares
The most common method of training (estimating) the weights (parameters) of a model by choosing the weights that minimize the sum of the squared deviation of the predicted values of the model from the observed values of the data.
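A minimal sketch for the simplest case, a straight-line fit, using the closed-form least-squares slope and intercept (the data points are invented):

```python
from statistics import mean

def fit_line(xs, ys):
    """Choose the slope and intercept that minimize the sum of squared deviations."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs, ys = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
print(fit_line(xs, ys))  # approximately slope 1.94, intercept 0.15
```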
Left-hand side
When an association between two variables is defined, the first item is called the left-hand side (or antecedent). For example, in the relationship “When a prospector buys a pick, he buys a shovel 14% of the time”, “buys a pick” is the left-hand side.
Logistic regression (logistic discriminant analysis)
A generalization of linear regression. It is used for predicting a binary variable (with values such as yes/no or 0/1). An example of its use is modeling the odds that a borrower will default on a loan based on the borrower’s income, debt and age.
MARS
Multivariate Adaptive Regression Splines. MARS is a generalization of a decision tree.
Maximum likelihood
Another training or estimation method. The maximum likelihood estimate of a parameter is the value of a parameter that maximizes the probability that the data came from the population defined by the parameter.
Mean
The arithmetic average value of a collection of numeric data.
Median
The value in the middle of a collection of ordered data. In other words, the value with the same number of items above and below it.
Missing data
Data values can be missing because they were not measured, not answered, were unknown or were lost. Data mining methods vary in the way they treat missing values. Typically, they ignore the missing values, or omit any records containing missing values, or replace missing values with the mode or mean, or infer missing values from existing values.
Mode
The most common value in a data set. If more than one value occurs the same number of times, the data is multi-modal.
Model
An important function of data mining is the production of a model. A model can be descriptive or predictive. A descriptive model helps in understanding underlying processes or behavior. For example, an association model describes consumer behavior. A predictive model is an equation or set of rules that makes it possible to predict an unseen or unmeasured value (the dependent variable or output) from other, known values (independent variables or input). The form of the equation or rules is suggested by mining data collected from the process under study. Some training or estimation technique is used to estimate the parameters of the equation or rules.
MPP
Massively parallel processing, a computer configuration that is able to use hundreds or thousands of CPUs simultaneously. In MPP each node may be a single CPU or a collection of SMP CPUs. An MPP collection of SMP nodes is sometimes called an SMP cluster. Each node has its own copy of the operating system, memory, and disk storage, and there is a data or process exchange mechanism so that each computer can work on a different part of a problem. Software must be written specifically to take advantage of this architecture.
Neural network
A complex nonlinear modeling technique based on a model of a human neuron. A neural net is used to predict outputs (dependent variables) from a set of inputs (independent variables) by taking linear combinations of the inputs and then making nonlinear transformations of the linear combinations using an activation function. It can be shown theoretically that such combinations and transformations can approximate virtually any type of response function. Thus, neural nets use large numbers of parameters to approximate any model. Neural nets are often applied to predict future outcome based on prior experience. For example, a neural net application could be used to predict who will respond to a direct mailing.
Node
A decision point in a classification (i.e., decision) tree. Also, a point in a neural net that combines input from other nodes and produces an output through application of an activation function.
Noise
The difference between a model and its predictions. Sometimes data is referred to as noisy when it contains errors such as many missing or incorrect values or when there are extraneous columns.
Non-applicable data
Missing values that would be logically impossible (e.g., pregnant males) or are obviously not relevant.
Normalize
A collection of numeric data is normalized by subtracting the minimum value from all values and dividing by the range of the data. This yields data with a similarly shaped histogram but with all values between 0 and 1. It is useful to do this for all inputs into neural nets and also for inputs into other regression models. (Also see standardize.)
OLAP
On-Line Analytical Processing tools give the user the capability to perform multi-dimensional analysis of the data.
Optimization criterion
A positive function of the difference between predictions and data; a model's parameter estimates are chosen so as to optimize this function or criterion. Least squares and maximum likelihood are examples.
Outliers
Technically, outliers are data items that did not (or are thought not to have) come from the assumed population of data — for example, a non-numeric when you are expecting only numeric values. A more casual usage refers to data items that fall outside the boundaries that enclose most other data items in the data set.
Overfitting
A tendency of some modeling techniques to assign importance to random variations in the data by declaring them important patterns.
Overlay
Data not collected by the organization, such as data from a proprietary database, that is combined with the organization’s own data.
Parallel processing
Several computers or CPUs linked together so that each can be computing simultaneously.
Pattern
Analysts and statisticians spend much of their time looking for patterns in data. A pattern can be a relationship between two variables. Data mining techniques include automatic pattern discovery that makes it possible to detect complicated non-linear relationships in data. Patterns are not the same as causality.
Precision
The precision of an estimate of a parameter in a model is a measure of how variable the estimate would be over other similar data sets. A very precise estimate would be one that did not vary much over different data sets. Precision does not measure accuracy. Accuracy is a measure of how close the estimate is to the real value of the parameter. Accuracy is measured by the average distance over different data sets of the estimate from the real value. Estimates can be accurate but not precise, or precise but not accurate. A precise but inaccurate estimate is usually biased, with the bias equal to the average distance from the real value of the parameter.
Predictability
Some data mining vendors use predictability of associations or sequences to mean the same as confidence.
Prevalence
The measure of how often the collection of items in an association occur together as a percentage of all the transactions. For example, “In 2% of the purchases at the hardware store, both a pick and a shovel were bought.”
Pruning
Eliminating lower level splits or entire sub-trees in a decision tree. This term is also used to describe algorithms that adjust the topology of a neural net by removing (i.e., pruning) hidden nodes.
Range
The range of the data is the difference between the maximum value and the minimum value. Alternatively, range can include the minimum and maximum, as in “The value ranges from 2 to 8.”
RDBMS
Relational Database Management System.
Regression tree
A decision tree that predicts values of continuous variables.
Resubstitution error
The estimate of error based on the differences between the predicted values of a trained model and the observed values in the training set.
Right-hand side
When an association between two variables is defined, the second item is called the right-hand side (or consequent). For example, in the relationship “When a prospector buys a pick, he buys a shovel 14% of the time,” “buys a shovel” is the right-hand side.
R-squared
A number between 0 and 1 that measures how well a model fits its training data. One is a perfect fit; zero implies the model has no predictive ability. It is computed as the square of the correlation between the predicted and observed values (their covariance divided by the product of their standard deviations, then squared).
Sampling
Creating a subset of data from the whole. Random sampling attempts to represent the whole by choosing the sample through a random mechanism.
Sensitivity analysis
Varying the parameters of a model to assess the change in its output.
Sequence discovery
The same as association, except that the time sequence of events is also considered. For example, “Twenty percent of the people who buy a VCR buy a camcorder within four months.”
Significance
A probability measure of how strongly the data support a certain result (usually of a statistical test). If the significance of a result is said to be .05, it means that there is only a .05 probability that the result could have happened by chance alone. Very low significance (less than .05) is usually taken as evidence that the result should be accepted, since events with very low probability seldom occur by chance. So if the estimate of a parameter in a model showed a significance of .01, that would be evidence that the parameter belongs in the model.
SMP
Symmetric multi-processing is a computer configuration where many CPUs share a common operating system, main memory and disks. They can work on different parts of a problem at the same time.
Standardize
A collection of numeric data is standardized by subtracting a measure of central location (such as the mean or median) and by dividing by some measure of spread (such as the standard deviation, interquartile range or range). This yields data with a similarly shaped histogram with values centered around 0. It is sometimes useful to do this with inputs into neural nets and also inputs into other regression models. (Also see normalize.)
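Minimal sketches of both rescalings side by side on toy data:

```python
from statistics import mean, stdev

data = [2.0, 4.0, 6.0, 10.0]

# Normalize: shift and scale into [0, 1] using the minimum and the range.
lo, rng = min(data), max(data) - min(data)
print([(x - lo) / rng for x in data])        # [0.0, 0.25, 0.5, 1.0]

# Standardize: center on 0 using the mean and standard deviation.
mu, sigma = mean(data), stdev(data)
print([round((x - mu) / sigma, 2) for x in data])  # [-1.02, -0.44, 0.15, 1.32]
```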
Supervised learning
The collection of techniques where analysis uses a well-defined (known) dependent variable. All regression and classification techniques are supervised.
Support
The measure of how often the collection of items in an association occur together as a percentage of all the transactions. For example, “In 2% of the purchases at the hardware store, both a pick and a shovel were bought.”
Test data
A data set independent of the training data set, used to fine-tune the estimates of the model parameters (i.e., weights).
Test error
The estimate of error based on the difference between the predictions of a model on a test data set and the observed values in the test data set when the test data set was not used to train the model.
Time series
A series of measurements taken at consecutive points in time. Data mining products which handle time series incorporate time-related operators such as moving average. (Also see windowing.)
Time series model
A model that forecasts future values of a time series based on past values. The model form and training of the model usually take into consideration the correlation between values as a function of their separation in time.
Topology
For a neural net, topology refers to the number of layers and the number of nodes in each layer.
Training
Another term for estimating a model’s parameters based on the data set at hand.
Training data
A data set used to estimate or train a model.
Transformation
A re-expression of the data such as aggregating it, normalizing it, changing its unit of measure, or taking the logarithm of each data item.
Unsupervised learning
This term refers to the collection of techniques where groupings of the data are defined without the use of a dependent variable. Cluster analysis is an example.
Validation
The process of testing the models with a data set different from the training data set.
Variance
The most commonly used statistical measure of dispersion. The first step is to square the deviation of each data item from the mean value; the average of these squared deviations is then calculated to obtain an overall measure of variability.
Visualization
Visualization tools graphically display data to facilitate better understanding of its meaning. Graphical capabilities range from simple scatter plots to complex multi-dimensional representations.
Windowing
Used when training a model with time series data. A window is the period of time used for each training case. For example, if we have weekly stock price data that covers fifty weeks, and we set the window to five weeks, then the first training case uses weeks one through five and compares its prediction to week six. The second case uses weeks two through six to predict week seven, and so on.
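A minimal sketch of building such windowed training cases (the price series is synthetic):

```python
def windows(series, width):
    """Each case pairs `width` consecutive values with the next value as the target."""
    return [(series[i:i + width], series[i + width])
            for i in range(len(series) - width)]

prices = [100 + week for week in range(50)]  # synthetic weekly prices, weeks 1-50
cases = windows(prices, 5)
print(cases[0])  # weeks 1-5 predict week 6: ([100, 101, 102, 103, 104], 105)
print(cases[1])  # weeks 2-6 predict week 7: ([101, 102, 103, 104, 105], 106)
```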