CLASSIFICATION AND CLUSTERING
Classification and Prediction,
Issues, Decision Tree Induction, Bayesian Classification, Association Rule-Based
Classification, Other Classification Methods, Prediction, Classifier Accuracy, Cluster
Analysis, Types of data, Categorisation of methods, Partitioning methods,
Outlier Analysis.
4.1. Classification and Prediction:
Classification and prediction are two forms of data
analysis that can be used to extract models describing important data classes
or to predict future data trends. Such analysis can help provide us with a
better understanding of the data at large. Whereas classification predicts
categorical (discrete, unordered) labels, prediction models continuous-valued
functions.
A model or
classifier is constructed to predict categorical labels, such as
“safe” or “risky” for the loan application data; “yes” or “no” for the
marketing data; or “treatment A,” “treatment B,” or “treatment C” for the
medical data. These categories can be represented by discrete values, where the
ordering among values has no meaning.
A predictor is a model that predicts a continuous-valued function, or ordered
value, as opposed to a categorical label.
Classification and numeric prediction are the two
major types of prediction problems.
DATA CLASSIFICATION:
Data classification is a two-step process. In the first
step, a classifier is built describing a predetermined set of data classes
or concepts. This is the learning step
(or training phase), where a
classification algorithm builds the classifier by analyzing or “learning from”
a training set made up of database tuples and their associated class labels.
A tuple, X, is represented by an n-dimensional attribute vector,
X = (x1, x2, ..., xn), depicting n measurements made on the tuple
from n database attributes, respectively, A1, A2, ..., An.
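For instance, a minimal sketch in Python of one training tuple for the loan-application example; the attribute names and values here are illustrative assumptions, not data from the text:

# A hypothetical training tuple: an n-dimensional attribute vector X
# paired with its class label y. The attributes (age, income,
# credit_rating) are invented for illustration.
X = (35, 42000.0, "fair")   # measurements on attributes A1, A2, A3
y = "risky"                 # value of the class label attribute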
The figure shows the data classification process: (a) Learning: Training data are
analyzed by a classification algorithm.
Here, the class label attribute is loan decision, and the learned model
or classifier is represented in the form of classification rules. (b) Classification: Test data are
used to estimate the accuracy of the classification rules. If the accuracy is
considered acceptable, the rules can be applied to the classification of new
data tuples.
Each tuple, X, is assumed to belong to
a predefined class as determined by another database attribute called the class label attribute.
The individual tuples making up the training set are
referred to as training tuples and
are selected from the database under analysis.
This first step is also known as supervised learning, because the learning of
the classifier is "supervised" in that it is told to which class each training
tuple belongs.
It contrasts with unsupervised learning (or clustering), in which the class label of
each training tuple is not known, and the number or set of classes to be
learned may not be known in advance.
This first step of the classification process can
also be viewed as the learning of a mapping or function, y = f (X),
that can predict the associated class label y of a given tuple X.
This mapping is represented in the form of
classification rules, decision trees, or mathematical formulae.
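For example, a classifier represented as classification rules can be written down directly as a function y = f(X). A minimal sketch for the loan-decision example follows; the rules and threshold values are invented for illustration, not learned from data:

def classify_loan(applicant):
    # A toy rule-based classifier f(X) -> y for the loan-decision example.
    # The attributes and cutoff values are assumptions.
    if applicant["income"] >= 40000 and applicant["years_employed"] >= 2:
        return "safe"
    return "risky"

print(classify_loan({"income": 30000, "years_employed": 1}))  # -> risky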
In the second step, the model is used for classification.
First, the predictive accuracy of the classifier is estimated. If we were to use the training set to measure the
accuracy of the classifier, this estimate would likely be optimistic, because
the classifier tends to overfit the data (i.e., during learning it may
incorporate some particular anomalies of the training data that are not present
in the general data set overall). Therefore, a test set is used, made up of
test tuples and their associated class labels. These tuples are randomly
selected from the general data set.
The accuracy of a classifier on a given test set is
the percentage of test set tuples that are correctly classified by the
classifier. The associated class label of each test tuple is compared with the
learned classifier’s class prediction for that tuple.
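A minimal sketch of this accuracy computation in Python, reusing the classify_loan sketch above as a stand-in for any learned classifier; the test tuples are invented:

def accuracy(classifier, test_set):
    # Percentage of test tuples whose predicted class label matches
    # the known class label.
    correct = sum(1 for X, y in test_set if classifier(X) == y)
    return 100.0 * correct / len(test_set)

test_set = [({"income": 30000, "years_employed": 1}, "risky"),
            ({"income": 80000, "years_employed": 5}, "safe")]
print(accuracy(classify_loan, test_set))  # -> 100.0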
DATA PREDICTION:
Data prediction is a two-step process, similar to that of data classification.
However, for prediction, we lose the terminology of
“class label attribute” because the attribute for which values are being
predicted is continuous-valued (ordered) rather than categorical
(discrete-valued and unordered). The attribute can be referred to simply as the
predicted attribute.
Note that prediction can also be viewed as a mapping or function, y = f(X),
where X is the input (e.g., a tuple describing a loan applicant),
and the output y is a continuous or ordered value (such as the predicted
amount that the bank can safely loan the applicant). That is, we wish to learn
a mapping or function that models the relationship between X and y.
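A minimal numeric-prediction sketch, assuming a single input attribute and an ordinary least-squares fit of y = w*x + b; the incomes and loan amounts below are invented:

# Predict a continuous value (safe loan amount) from income.
xs = [30000.0, 45000.0, 60000.0, 80000.0]   # income (assumed data)
ys = [5000.0, 9000.0, 13000.0, 19000.0]     # loan amount (assumed data)

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
b = my - w * mx
print(b + w * 50000.0)   # predicted safe loan amount for a new applicant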
4.2. ISSUES REGARDING CLASSIFICATION AND PREDICTION:
4.2.1.
Preparing the Data for Classification and Prediction
The following preprocessing steps may be applied to
the data to help improve the accuracy, efficiency, and scalability of the
classification or prediction process.
Data
cleaning: This refers to the preprocessing of data in order
to remove or reduce noise (by applying smoothing techniques, for
example) and the treatment of missing values (e.g., by replacing a
missing value with the most commonly occurring value for that attribute, or
with the most probable value based on statistics).
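A sketch of one such treatment in Python, replacing missing values (marked None here) with the most commonly occurring value for the attribute:

from collections import Counter

def impute_mode(values):
    # Fill missing entries with the attribute's most frequent
    # observed value -- one simple data-cleaning strategy.
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

print(impute_mode(["fair", None, "fair", "excellent", None]))
# -> ['fair', 'fair', 'fair', 'excellent', 'fair']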
Relevance
analysis: Many of the attributes in the data may be redundant.
Correlation
analysis
can be used to identify whether any two given attributes are statistically
related. Attribute subset selection
can be used in these cases to find a reduced set of attributes such that the
resulting probability distribution of the data classes is as close as possible
to the original distribution obtained using all attributes.
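As a sketch of correlation analysis, the Pearson correlation coefficient between two numeric attributes can flag redundancy; a value near +1 or -1 suggests one attribute adds little beyond the other:

def pearson(a, b):
    # Pearson correlation between two numeric attributes.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # -> 1.0 (redundant attributes)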
Data
transformation and reduction: The data may be
transformed by normalization, particularly when neural networks or methods
involving distance measurements are used in the learning step. Normalization
involves scaling all values for a given attribute so that they fall within a
small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
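A minimal sketch of min-max normalization, rescaling an attribute's values into a specified range such as [0.0, 1.0]:

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Scale all values of an attribute into [new_min, new_max],
    # e.g. [0.0, 1.0] or [-1.0, 1.0].
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

print(min_max_normalize([30000, 45000, 60000, 80000]))
# -> [0.0, 0.3, 0.6, 1.0]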
The data can also be transformed by generalizing it
to higher-level concepts.
Data can also be reduced by applying many other
methods, ranging from wavelet transformation and principal components analysis
to discretization techniques, such as binning, histogram analysis, and
clustering.
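For instance, a minimal sketch of discretization by equal-width binning, which maps each numeric value to one of k interval labels:

def equal_width_bins(values, k):
    # Assign each value a bin index 0..k-1 over k equal-width intervals.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

print(equal_width_bins([1, 4, 7, 12, 15, 20], 3))  # -> [0, 0, 0, 1, 2, 2]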
4.2.2. Comparing Classification and Prediction Methods:
Classification and prediction methods can be compared and evaluated according
to the following criteria:
Accuracy: The accuracy of a classifier refers to the ability of a given
classifier to correctly predict the class label of new or previously unseen
data (i.e., tuples without class label information). Similarly, the accuracy
of a predictor refers to how well a given predictor can guess the value of
the predicted attribute for new or previously unseen data.
Speed: This refers to the computational costs involved in generating and
using the given classifier or predictor.
Robustness: This is the ability of the classifier or predictor to make
correct predictions given noisy data or data with missing values.
Scalability: This refers to the ability to construct the classifier or
predictor efficiently given large amounts of data.
Interpretability: This refers to the level of understanding and insight that
is provided by the classifier or predictor.