Glossary of data mining terms
Accuracy
Accuracy is an important factor in assessing the success
of data mining. When applied to data, accuracy refers to the rate of correct
values in the data. When applied to models, accuracy refers to the degree of
fit between the model and the data. This measures how error-free the model’s
predictions are. Since accuracy does not include cost information, it is
possible for a less accurate model to be more cost-effective. Also see
precision.
Activation function
A function used by a node in a neural net to transform
input data from any domain of values into a finite range of values. The
original idea was to approximate the way neurons fired, and the activation
function took on the value 0 until the input became large, at which point the value
jumped to 1. The discontinuity of this 0-or-1 function caused mathematical problems,
and sigmoid-shaped functions (e.g., the logistic function) are now used.
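For illustration, here is a minimal Python sketch of the logistic activation
function described above; the sample inputs are arbitrary:

    import math

    def logistic(x):
        # Squashes any real-valued input into the range (0, 1).
        return 1.0 / (1.0 + math.exp(-x))

    print(logistic(-6.0))  # close to 0
    print(logistic(0.0))   # exactly 0.5
    print(logistic(6.0))   # close to 1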
Antecedent
When an association between two variables is defined,
the first item (or left-hand side) is called the antecedent. For example, in
the relationship “When a prospector buys a pick, he buys a shovel 14% of the
time,” “buys a pick” is the antecedent.
API
An application program interface. When a software
system features an API, it provides a means by which programs written outside
of the system can interface with the system to perform additional functions.
For example, a data mining software system may have an API which permits
user-written programs to perform such tasks as extract data, perform additional
statistical analysis, create specialized charts, generate a model, or make a
prediction from a model.
Associations
An association algorithm creates rules that describe
how often events have occurred together. For example, “When prospectors buy
picks, they also buy shovels 14% of the time.” Such relationships are typically
expressed with a confidence value. (Also see confidence.)
Backpropagation
A training method used to calculate the weights in a
neural net from the data.
Bias
In a neural network, bias refers to the constant terms
in the model. (Note that bias has a different meaning, systematic error in an estimate, for most data analysts.)
Also see precision.
Binning
A data preparation activity that converts continuous
data to discrete data by replacing a value from a continuous range with a bin
identifier, where each bin represents a range of values. For example, age could
be converted to bins such as 20 or under, 21-40, 41-65 and over 65.
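As a sketch, the age example above could be implemented in Python as follows;
the bin boundaries are the hypothetical ones from the example:

    def age_bin(age):
        # Replace a continuous age with a bin identifier.
        if age <= 20:
            return "20 or under"
        elif age <= 40:
            return "21-40"
        elif age <= 65:
            return "41-65"
        return "over 65"

    print([age_bin(a) for a in (18, 35, 50, 70)])
    # ['20 or under', '21-40', '41-65', 'over 65']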
Bootstrapping
Training data sets are created by re-sampling with
replacement from the original training set, so data records may occur more than
once. In other words, this method treats a sample as if it were the entire
population. Usually, final estimates are obtained by taking the average of the
estimates from each of the bootstrap test sets.
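A minimal Python sketch of the idea, assuming a small hypothetical training set
and the mean as the estimate of interest:

    import random

    data = [5, 7, 8, 12, 4, 9]  # hypothetical training values

    estimates = []
    for _ in range(1000):
        # Re-sample with replacement: records may occur more than once.
        resample = [random.choice(data) for _ in data]
        estimates.append(sum(resample) / len(resample))

    # Final estimate: the average over all bootstrap samples.
    print(sum(estimates) / len(estimates))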
CART
Classification And Regression Trees. CART is a method
of splitting the data into small groups based on the values of the independent
variables and fitting a constant function to each group. In categorical trees, the constant function is
one that takes on a finite small set of values (e.g., Y or N, low or medium or
high). In regression trees, the mean value of the response is fit to small
connected data sets.
Categorical data
Categorical data fits into a small number of discrete
categories (as opposed to continuous). Categorical data is either non-ordered
(nominal) such as gender or city, or ordered (ordinal) such as high, medium, or
low temperatures.
CHAID
An algorithm (Chi-squared Automatic Interaction Detection) for fitting
categorical trees. It relies on the chi-squared statistic to split the data
into small connected data sets.
Chi-squared
A statistic that assesses how well a model fits the data.
In data mining, it is most commonly used to find homogeneous subsets for
fitting categorical trees as in CHAID.
Classification
Refers to the data mining problem of attempting to
predict the category of categorical data by building a model based on some
predictor variables.
Classification tree
A decision tree that places categorical variables into
classes.
Cleaning (cleansing)
Refers to a step in preparing data for a data mining
activity. Obvious data errors are detected and corrected (e.g., improbable
dates) and missing data is replaced.
Clustering
Clustering algorithms find groups of items that are
similar. For example, clustering could be used by an insurance company to group
customers according to income, age, types of policies purchased and prior
claims experience. It divides a data set so that records with similar content
are in the same group, and groups are as different as possible from each other.
Since the categories are unspecified, this is sometimes referred to as
unsupervised learning.
Confidence
Confidence of the rule “B given A” is a measure of how
likely B is to occur when A has occurred. It is expressed as a
percentage, with 100% meaning B always occurs if A has occurred. Statisticians
refer to this as the conditional probability of B given A. When used with
association rules, the term confidence is observational rather than predictive.
(Statisticians also use this term in an unrelated way. There are ways to
estimate an interval and the probability that the interval contains the true
value of a parameter is called the interval confidence. So a 95% confidence
interval for the mean has a probability of .95 of covering the true value of
the mean.)
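A minimal Python sketch of computing the confidence of “shovel given pick”
from hypothetical market-basket transactions:

    # Hypothetical transactions; each is the set of items bought together.
    transactions = [
        {"pick", "shovel"},
        {"pick"},
        {"shovel", "rope"},
        {"pick", "shovel", "rope"},
    ]

    pick_count = sum(1 for t in transactions if "pick" in t)
    both_count = sum(1 for t in transactions if {"pick", "shovel"} <= t)

    # Confidence of "shovel given pick": conditional frequency of B given A.
    print(both_count / pick_count)         # 2/3, about 67%
    # Support (prevalence): how often both occur among all transactions.
    print(both_count / len(transactions))  # 2/4, or 50%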
Confusion matrix
A confusion matrix shows the counts of the actual
versus predicted class values. It shows not only how well the model predicts,
but also presents the details needed to see exactly where things may have gone
wrong.
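A minimal Python sketch that tallies a confusion matrix from hypothetical
actual and predicted class labels:

    from collections import Counter

    # Hypothetical actual vs. predicted class values.
    actual    = ["yes", "no", "yes", "no", "yes", "no"]
    predicted = ["yes", "no", "no", "no", "yes", "yes"]

    counts = Counter(zip(actual, predicted))
    for klass in ("yes", "no"):
        # Row: actual class; columns: predicted "yes" then "no".
        row = [counts[(klass, pred)] for pred in ("yes", "no")]
        print(klass, row)

The off-diagonal counts show exactly which classes the model confuses.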
Consequent
When an association between two variables is defined,
the second item (or right-hand side) is called the consequent. For example, in
the relationship “When a prospector buys a pick, he buys a shovel 14% of the
time,” “buys a shovel” is the consequent.
Continuous
Continuous data can have any value in an interval of
real numbers. That is, the value does not have to be an integer. Continuous is
the opposite of discrete or categorical.
Cross validation
A method of estimating the accuracy of a
classification or regression model. The data set is divided into several parts,
with each part in turn used to test a model fitted to the remaining parts.
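A minimal Python sketch of splitting records into k parts, with each part used
in turn as the test set; the record contents and the value of k are hypothetical:

    def k_fold_splits(records, k):
        # Deal the records into k roughly equal folds.
        folds = [records[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [r for j, fold in enumerate(folds) if j != i for r in fold]
            yield train, test

    for train, test in k_fold_splits(list(range(10)), k=5):
        print(len(train), "training records,", len(test), "test records")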
Data
Values collected through record keeping or by polling,
observing, or measuring, typically organized for analysis or decision making.
More simply, data is facts, transactions and figures.
Data format
Data items can exist in many formats such as text,
integer and floating-point decimal. Data format refers to the form of the data
in the database.
Data mining
An information extraction activity whose goal is to
discover hidden facts contained in databases. Using a combination of machine
learning, statistical analysis, modeling techniques and database technology,
data mining finds patterns and subtle relationships in data and infers rules
that allow the prediction of future results. Typical applications include
market segmentation, customer profiling, fraud detection, evaluation of retail
promotions, and credit risk analysis.
Data mining method
Procedures and algorithms designed to analyze the data
in databases.
DBMS
Database management systems.
Decision tree
A tree-like way of representing a collection of
hierarchical rules that lead to a class or value.
Deduction
Deduction infers information that is a logical
consequence of the data.
Degree of fit
A measure of how closely the model fits the training
data. A common measure is r-squared.
Dependent variable
The dependent variables (outputs or responses) of a
model are the variables predicted by the equation or rules of the model using
the independent variables (inputs or predictors).
Deployment
After the model is trained and validated, it is used
to analyze new data and make predictions. This use of the model is called
deployment.
Dimension
Each attribute of a case or occurrence in the data being mined. A dimension is
stored as a field in a flat-file record or a column of a relational database
table.
Discrete
A data item that has a finite set of values. Discrete
is the opposite of continuous.
Discriminant analysis
A statistical method based on maximum likelihood for
determining boundaries that separate the data into categories.
Entropy
A way to measure variability other than the variance
statistic. Some decision trees split the data into groups based on minimum
entropy.
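A minimal Python sketch of computing the entropy of a group of class labels;
a homogeneous (pure) group has entropy 0:

    import math

    def entropy(labels):
        n = len(labels)
        probabilities = [labels.count(value) / n for value in set(labels)]
        return -sum(p * math.log2(p) for p in probabilities)

    print(entropy(["Y", "Y", "Y", "Y"]))  # 0.0 (may print -0.0): pure group
    print(entropy(["Y", "N", "Y", "N"]))  # 1.0: maximally mixed group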
Exploratory analysis
Looking at data to discover relationships not
previously detected. Exploratory analysis tools typically assist the user in
creating tables and graphical displays.
External data
Data not collected by the organization, such as data
available from a reference book, a government source or a proprietary database.
Feed-forward
A neural net in which the signals only flow in one
direction, from the inputs to the outputs.
Fuzzy logic
Fuzzy logic is applied to fuzzy sets, where membership in a fuzzy set is a
matter of degree, taking any value between 0 and 1 rather than strictly 0 or 1.
Non-fuzzy logic manipulates outcomes that are either true or false; fuzzy logic
must also be able to manipulate degrees of “maybe” in addition to true and
false.
Genetic algorithms
A computer-based method of generating and testing
combinations of possible input parameters to find the optimal output. It uses
processes based on natural evolution concepts such as genetic combination,
mutation and natural selection.
GUI
Graphical User Interface.
Hidden nodes
The nodes in the hidden layers of a neural net. Unlike
input and output nodes, the number of hidden nodes is not predetermined. The
accuracy of the resulting model is affected by the number of hidden nodes.
Since the number of hidden nodes directly affects the number of parameters in
the model, a neural net needs a sufficient number of hidden nodes to enable it
to properly model the underlying behavior. On the other hand, a net with too
many hidden nodes will overfit the data. Some neural net products include
algorithms that search over a number of alternative neural nets by varying the
number of hidden nodes, in the end choosing the model that gets the best
results without overfitting.
Independent variable
The independent variables (inputs or predictors) of a
model are the variables used in the equation or rules of the model to predict
the output (dependent) variable.
Induction
A technique that infers generalizations from the
information in the data.
Interaction
Two independent variables interact when the effect of one on the dependent
variable depends on the value of the other.
Internal data
Data collected by an organization such as operating
and customer data.
K-nearest neighbor
A classification method that classifies a point by
calculating the distances between the point and points in the training data
set. Then it assigns the point to the class that is most common among its
k-nearest neighbors (where k is an integer).
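A minimal Python sketch of the method, using hypothetical two-dimensional
training points and squared Euclidean distance:

    from collections import Counter

    def knn_classify(point, training, k):
        # training is a list of ((x, y), label) pairs.
        nearest = sorted(
            training,
            key=lambda item: (item[0][0] - point[0]) ** 2
                             + (item[0][1] - point[1]) ** 2,
        )[:k]
        # Assign the class most common among the k nearest neighbors.
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    train = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
    print(knn_classify((1, 1), train, k=3))  # "A"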
Kohonen feature map
A type of neural network that uses unsupervised
learning to find patterns in data. In data mining it is employed for cluster
analysis.
Layer
Nodes in a neural net are usually grouped into layers,
with each layer described as input, output or hidden. There are as many input
nodes as there are input (independent) variables and as many output nodes as
there are output (dependent) variables. Typically, there are one or two hidden
layers.
Leaf
A node not further split — the terminal grouping — in
a classification or decision tree.
Learning
Training models (estimating their parameters) based on
existing data.
Least squares
The most common method of training (estimating) the
weights (parameters) of a model by choosing the weights that minimize the sum
of the squared deviations of the predicted values of the model from the observed
values of the data.
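As a sketch, the closed-form least-squares fit of a line y = a + b*x to
hypothetical data in Python:

    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.1, 3.9, 6.2, 8.0, 9.8]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # The slope and intercept below minimize the sum of squared deviations.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    print(a, b)  # roughly 0.15 and 1.95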
Left-hand side
When an association between two variables is defined,
the first item is called the left-hand side (or antecedent). For example, in
the relationship “When a prospector buys a pick, he buys a shovel 14% of the
time”, “buys a pick” is the left-hand side.
Logistic regression (logistic discriminant analysis)
A generalization of linear regression. It is used for
predicting a binary variable (with values such as yes/no or 0/1). An example of
its use is modeling the odds that a borrower will default on a loan based on
the borrower’s income, debt and age.
MARS
Multivariate Adaptive Regression Splines. MARS is a
generalization of a decision tree.
Maximum likelihood
Another training or estimation method. The maximum
likelihood estimate of a parameter is the value of a parameter that maximizes
the probability that the data came from the population defined by the
parameter.
Mean
The arithmetic average value of a collection of
numeric data.
Median
The value in the middle of a collection of ordered
data. In other words, the value with the same number of items above and below
it.
Missing data
Data values can be missing because they were not
measured, not answered, were unknown or were lost. Data mining methods vary in
the way they treat missing values. Typically, they ignore the missing values,
or omit any records containing missing values, or replace missing values with
the mode or mean, or infer missing values from existing values.
Mode
The most common value in a data set. If more than one
value ties for the highest frequency, the data is multi-modal.
Model
An important function of data mining is the production
of a model. A model can be descriptive or predictive. A descriptive model helps
in understanding underlying processes or behavior. For example, an association
model describes consumer behavior. A predictive model is an equation or set of
rules that makes it possible to predict an unseen or unmeasured value (the
dependent variable or output) from other, known values (independent variables
or input). The form of the equation or rules is suggested by mining data
collected from the process under study. Some training or estimation technique
is used to estimate the parameters of the equation or rules.
MPP
Massively parallel processing, a computer
configuration that is able to use hundreds or thousands of CPUs simultaneously.
In MPP each node may be a single CPU or a collection of SMP CPUs. An MPP
collection of SMP nodes is sometimes called an SMP cluster. Each node has its
own copy of the operating system, memory, and disk storage, and there is a data
or process exchange mechanism so that each computer can work on a different
part of a problem. Software must be written specifically to take advantage of
this architecture.
Neural network
A complex nonlinear modeling technique based on a
model of a human neuron. A neural net is used to predict outputs (dependent
variables) from a set of inputs (independent variables) by taking linear
combinations of the inputs and then making nonlinear transformations of the
linear combinations using an activation function. It can be shown theoretically
that such combinations and transformations can approximate virtually any type
of response function. Thus, neural nets use large numbers of parameters to
approximate any model. Neural nets are often applied to predict future outcome
based on prior experience. For example, a neural net application could be used
to predict who will respond to a direct mailing.
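A minimal Python sketch of a feed-forward pass through one hidden layer, with
hypothetical (already trained) weights and the logistic activation function:

    import math

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    def forward(inputs, hidden_layer, output_node):
        # Each node takes a linear combination of its inputs (plus a
        # bias term), then applies the activation function.
        hidden = [logistic(sum(w * x for w, x in zip(weights, inputs)) + bias)
                  for weights, bias in hidden_layer]
        weights, bias = output_node
        return logistic(sum(w * h for w, h in zip(weights, hidden)) + bias)

    # Hypothetical weights: each node is ([input weights], bias term).
    hidden_layer = [([0.5, -0.4], 0.1), ([0.3, 0.8], -0.2)]
    output_node = ([1.2, -0.7], 0.05)
    print(forward([1.0, 2.0], hidden_layer, output_node))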
Node
A decision point in a classification (i.e., decision)
tree. Also, a point in a neural net that combines input from other nodes and
produces an output through application of an activation function.
Noise
The difference between the observed data and a model’s predictions.
Sometimes data is referred to as noisy when it contains errors such as many
missing or incorrect values or when there are extraneous columns.
Non-applicable data
Missing values that would be logically impossible
(e.g., pregnant males) or are obviously not relevant.
Normalize
A collection of numeric data is normalized by subtracting
the minimum value from all values and dividing by the range of the data. This
yields data with a similarly shaped histogram but with all values between 0 and
1. It is useful to do this for all inputs into neural nets and also for inputs
into other regression models. (Also see standardize.)
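A minimal Python sketch of normalizing a hypothetical list of values:

    values = [12.0, 20.0, 35.0, 50.0]  # hypothetical input data

    lo, hi = min(values), max(values)
    normalized = [(v - lo) / (hi - lo) for v in values]
    print(normalized)  # every value now lies between 0 and 1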
OLAP
On-Line Analytical Processing tools give the user the
capability to perform multi-dimensional analysis of the data.
Optimization criterion
A positive function of the difference between
predictions and data; estimates are chosen so as to optimize the function
or criterion. Least squares and maximum likelihood are examples.
Outliers
Technically, outliers are data items that did not come (or are thought not to
have come) from the assumed population of data; for example, a non-numeric
value when you are expecting only numeric values. A more
casual usage refers to data items that fall outside the boundaries that enclose
most other data items in the data set.
Overfitting
A tendency of some modeling techniques to assign
importance to random variations in the data by declaring them important
patterns.
Overlay
Data not collected by the organization, such as data
from a proprietary database, that is combined with the organization’s own data.
Parallel processing
Several computers or CPUs linked together so that each
can be computing simultaneously.
Pattern
Analysts and statisticians spend much of their time
looking for patterns in data. A pattern can be a relationship between two
variables. Data mining techniques include automatic pattern discovery that
makes it possible to detect complicated non-linear relationships in data.
Patterns are not the same as causality.
Precision
The precision of an estimate of a parameter in a model
is a measure of how variable the estimate would be over other similar data
sets. A very precise estimate would be one that did not vary much over
different data sets. Precision does not measure accuracy. Accuracy is a measure
of how close the estimate is to the real value of the parameter. Accuracy is
measured by the average distance, over different data sets, of the estimate from
the real value. Estimates can be accurate but not precise, or precise but not
accurate. A precise but inaccurate estimate is usually biased, with the bias
equal to the average distance from the real value of the parameter.
Predictability
Some data mining vendors use predictability of
associations or sequences to mean the same as confidence.
Prevalence
The measure of how often the collection of items in an
association occur together as a percentage of all the transactions. For
example, “In 2% of the purchases at the hardware store, both a pick and a
shovel were bought.” (Also see support.)
Pruning
Eliminating lower level splits or entire sub-trees in
a decision tree. This term is also used to describe algorithms that adjust the
topology of a neural net by removing (i.e., pruning) hidden nodes.
Range
The range of the data is the difference between the
maximum value and the minimum value. Alternatively, range can include the
minimum and maximum, as in “The value ranges from 2 to 8.”
RDBMS
Relational Database Management System.
Regression tree
A decision tree that predicts values of continuous
variables.
Resubstitution error
The estimate of error based on the differences between
the predicted values of a trained model and the observed values in the training
set.
Right-hand side
When an association between two variables is defined,
the second item is called the right-hand side (or consequent). For example, in
the relationship “When a prospector buys a pick, he buys a shovel 14% of the
time,” “buys a shovel” is the right-hand side.
R-squared
A number between 0 and 1 that measures how well a
model fits its training data. A value of 1 indicates a perfect fit, while 0
means the model has no predictive ability. It is computed as the square of the
correlation between the predicted and observed values, i.e., the square of
their covariance divided by the product of their standard deviations.
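A minimal Python sketch of the computation just described, on hypothetical
observed and predicted values:

    import math

    observed  = [2.0, 4.0, 6.0, 8.0]
    predicted = [2.2, 3.8, 6.1, 7.9]  # hypothetical model output

    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p)
              for o, p in zip(observed, predicted)) / n
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed) / n)
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)

    r_squared = (cov / (sd_o * sd_p)) ** 2
    print(r_squared)  # close to 1 for a model that fits well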
Sampling
Creating a subset of data from the whole. Random
sampling attempts to represent the whole by choosing the sample through a
random mechanism.
Sensitivity analysis
Varying the parameters of a model to assess the change
in its output.
Sequence discovery
The same as association, except that the time sequence
of events is also considered. For example, “Twenty percent of the people who
buy a VCR buy a camcorder within four months.”
Significance
A probability measure of how strongly the data support
a certain result (usually of a statistical test). If the significance of a
result is said to be .05, it means that there is only a .05 probability that a
result at least this strong could have happened by chance alone. A very low
significance (less than .05) is usually taken as evidence that the result is
real rather than a chance fluctuation, since events with very low probability
seldom occur. So if the estimate of a parameter in a model showed a
significance of .01, that would be evidence that the parameter belongs in the
model.
SMP
Symmetric multi-processing is a computer configuration
where many CPUs share a common operating system, main memory and disks. They
can work on different parts of a problem at the same time.
Standardize
A collection of numeric data is standardized by
subtracting a measure of central location (such as the mean or median) and by
dividing by some measure of spread (such as the standard deviation,
interquartile range or range). This yields data with a similarly shaped
histogram with values centered around 0. It is sometimes useful to do this with
inputs into neural nets and also inputs into other regression models. (Also see
normalize.)
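A minimal Python sketch, standardizing a hypothetical list of values by the
mean and standard deviation:

    import math

    values = [4.0, 8.0, 6.0, 2.0]  # hypothetical input data

    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    standardized = [(v - mean) / sd for v in values]
    print(standardized)  # values are now centered around 0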
Supervised learning
The collection of techniques where analysis uses a
well-defined (known) dependent variable. All regression and classification
techniques are supervised.
Support
The measure of how often the collection of items in an
association occur together as a percentage of all the transactions. For
example, “In 2% of the purchases at the hardware store, both a pick and a
shovel were bought.”
Test data
A data set independent of the training data set, used
to fine-tune the estimates of the model parameters (i.e., weights).
Test error
The estimate of error based on the difference between
the predictions of a model on a test data set and the observed values in the
test data set when the test data set was not used to train the model.
Time series
A series of measurements taken at consecutive points
in time. Data mining products that handle time series incorporate time-related
operators such as moving average. (Also see windowing.)
Time series model
A model that forecasts future values of a time series
based on past values. The model form and training of the model usually take
into consideration the correlation between values as a function of their
separation in time.
Topology
For a neural net, topology refers to the number of
layers and the number of nodes in each layer.
Training
Another term for estimating a model’s parameters based
on the data set at hand.
Training data
A data set used to estimate or train a model.
Transformation
A re-expression of the data such as aggregating it,
normalizing it, changing its unit of measure, or taking the logarithm of each
data item.
Unsupervised learning
This term refers to the collection of techniques where
groupings of the data are defined without the use of a dependent variable.
Cluster analysis is an example.
Validation
The process of testing the models with a data set
different from the training data set.
Variance
The most commonly used statistical measure of
dispersion. The first step is to square the deviations of the data items from
their average value. Then the average of the squared deviations is calculated to
obtain an overall measure of variability.
Visualization
Visualization tools graphically display data to
facilitate better understanding of its meaning. Graphical capabilities range
from simple scatter plots to complex multi-dimensional representations.
Windowing
Used when training a model with time series data. A
window is the period of time used for each training case. For example, if we
have weekly stock price data that covers fifty weeks, and we set the window to
five weeks, then the first training case uses weeks one through five and
compares its prediction to week six. The second case uses weeks two through six
to predict week seven, and so on.
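A minimal Python sketch of building training cases with a five-week window
from hypothetical weekly prices, as in the example above:

    prices = [float(week) for week in range(1, 51)]  # 50 hypothetical weekly prices
    window = 5

    cases = []
    for start in range(len(prices) - window):
        inputs = prices[start:start + window]  # e.g., weeks 1 through 5
        target = prices[start + window]        # e.g., week 6
        cases.append((inputs, target))

    print(cases[0])  # weeks 1-5 predict week 6
    print(cases[1])  # weeks 2-6 predict week 7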