Data Mining Functionalities:
Data mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data in databases. While data mining and knowledge discovery in databases (or KDD) are frequently treated as synonyms, data mining is actually part of the knowledge discovery process. The following figure (Figure 1.1) shows data mining as a step in an iterative knowledge discovery process.
The Knowledge Discovery in Databases process comprises a number of steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:
- Data cleaning: also known as data cleansing, it is a phase in which noisy data and irrelevant data are removed from the collection.
- Data integration: at this stage, multiple data sources, often heterogeneous, may be combined into a common source.
- Data selection: at this step, the data relevant to the analysis is
decided on and retrieved from the data collection.
- Data transformation: also known as data consolidation, it is a phase in
which the selected data is transformed into forms appropriate for the
mining procedure.
- Data mining: it is the crucial step in which clever techniques are applied to extract potentially useful patterns.
- Pattern evaluation: in this step, only the strictly interesting patterns representing knowledge are identified, based on given interestingness measures.
- Knowledge representation: it is the final phase in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
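As a rough illustration of these stages, the sketch below strings them together with pandas on a tiny, made-up pair of sources; the table names, columns, and cleaning rules are assumptions for the example, not part of any standard KDD tooling.

```python
import pandas as pd

# Hypothetical raw sources; column names and values are illustrative only.
sales = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "region": ["N", "N", "S", "S", None],
    "amount": [120.0, -5.0, 80.0, 200.0, 60.0],
})
web_logs = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "page_views": [14, 3, 22],
})

# Data cleaning: remove rows with missing values and obvious noise (negative amounts).
sales = sales.dropna()
sales = sales[sales["amount"] >= 0]

# Data integration: combine the heterogeneous sources on a common key.
data = sales.merge(web_logs, on="customer_id", how="left")

# Data selection: keep only the attributes relevant to the analysis.
data = data[["customer_id", "amount", "page_views"]]

# Data transformation (consolidation): one row per customer, in a form suitable for mining.
features = data.groupby("customer_id").agg(
    total_amount=("amount", "sum"),
    mean_views=("page_views", "mean"),
)

# The mining, pattern-evaluation and knowledge-representation steps would follow,
# e.g. clustering `features` and visualizing the resulting segments.
print(features)
```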
Data mining methods may be
classified by the function they perform or according to the class of
application they can be used in. Some of the main techniques used in data
mining are described in this section.
Data mining tools have to infer a model from the database, and in the case of supervised learning this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple; these are known as predicted attributes, whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class.
When learning classification rules, the system has to find the rules that predict the class from the predicting attributes. First the user has to define conditions for each class; the data mining system then constructs descriptions for the classes. Essentially, given a case or tuple with certain known attribute values, the system should be able to predict which class this case belongs to.
Once classes are defined, the system should infer the rules that govern the classification; in other words, it should be able to find the description of each class. The descriptions should refer only to the predicting attributes of the training set, so that the positive examples satisfy the description and none of the negative examples do. A rule is said to be correct if its description covers all the positive examples and none of the negative examples of a class.
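As a concrete sketch of this idea, the snippet below trains a small decision tree with scikit-learn: income and age play the role of predicting attributes, a risk label is the predicted attribute, and the learned tree stands in for the class descriptions inferred from positive and negative examples. The attribute names, values, and library choice are assumptions made purely for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up training set: predicting attributes (income, age) and
# a predicted attribute (risk class) whose values define the classes.
X = [[25_000, 23], [48_000, 35], [62_000, 41], [30_000, 52], [90_000, 38]]
y = ["high", "low", "low", "high", "low"]

# Infer a description of each class from the training examples.
model = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The tree can be read as classification rules over the predicting attributes.
print(export_text(model, feature_names=["income", "age"]))

# Given a new tuple with known attribute values, predict which class it belongs to.
print(model.predict([[40_000, 30]]))
```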
A rule is generally presented as: if the left hand side (LHS) then the right hand side (RHS), so that in all instances where the LHS is true, the RHS is also true or at least very probable. The categories of rules are:
- exact rule - permits no
exceptions so each object of LHS must be an element of RHS
- strong rule - allows some
exceptions, but the exceptions have a given limit
- probabilistic rule - relates
the conditional probability P(RHS|LHS) to the probability P(RHS)
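These categories can be checked numerically for a given rule. The sketch below, using invented records and an arbitrary 5% exception limit for a strong rule, estimates P(RHS|LHS) and P(RHS) by simple counting:

```python
# Each record is a set of items; the records and the 5% exception
# threshold for a "strong" rule are illustrative assumptions.
records = [
    {"A", "B", "D"},
    {"A", "B", "D"},
    {"A", "B"},
    {"B", "D"},
    {"C"},
]
lhs, rhs = {"A", "B"}, {"D"}

lhs_hits = [r for r in records if lhs <= r]
both_hits = [r for r in lhs_hits if rhs <= r]

p_rhs = sum(rhs <= r for r in records) / len(records)   # P(RHS)
p_rhs_given_lhs = len(both_hits) / len(lhs_hits)         # P(RHS|LHS)

if p_rhs_given_lhs == 1.0:
    category = "exact rule"          # no exceptions at all
elif p_rhs_given_lhs >= 0.95:
    category = "strong rule"         # exceptions stay within the given limit
else:
    category = "probabilistic rule"  # compare P(RHS|LHS) against P(RHS)

print(category, p_rhs_given_lhs, p_rhs)
```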
Other
types of rules are classification rules where LHS is a sufficient condition to
classify objects as belonging to the concept referred to in the RHS.
Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on the opposite side of the rule to D and E. Associations can involve any number of items on either side of the rule.
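A minimal way to compute such a confidence factor, assuming each record is represented as a Python set of items (the transactions below are invented), is to count how often the right-hand items appear in the records that contain the left-hand items:

```python
# Hypothetical point-of-sale records; each record is the set of items
# bought in one transaction.
records = [
    {"toaster", "gloves", "cover"},
    {"toaster", "gloves"},
    {"toaster", "bread"},
    {"gloves", "cover"},
    {"toaster", "gloves", "cover", "bread"},
]

def confidence(lhs, rhs, records):
    """Percentage of records containing all of `lhs` that also contain all of `rhs`."""
    lhs, rhs = set(lhs), set(rhs)
    containing_lhs = [r for r in records if lhs <= r]
    if not containing_lhs:
        return 0.0
    return 100.0 * sum(rhs <= r for r in containing_lhs) / len(containing_lhs)

# "X% of the records that contain a toaster also contain gloves and a cover."
print(confidence({"toaster"}, {"gloves", "cover"}, records))
```

A full association-mining tool would enumerate all sufficiently frequent itemsets rather than evaluating a single handwritten rule; the sketch only shows the confidence calculation itself.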
A typical application identified by IBM that can be built using an association function is Market Basket Analysis. This is where a retailer runs an association operator over the point-of-sale transaction log, which contains, among other information, transaction identifiers and product identifiers. The set of product identifiers listed under the same transaction identifier constitutes a record. The output of the association function is, in this case, a list of product affinities. Thus, by invoking an association function, the market basket analysis application can determine affinities such as "20% of the time that a specific brand of toaster is sold, customers also buy a set of kitchen gloves and matching cover sets."
Another
example of the use of associations is the analysis of the claim forms submitted
by patients to a medical insurance company. Every claim form contains a set of
medical procedures that were performed on a given patient during one visit. By
defining the set of items to be the collection of all medical procedures that
can be performed on a patient and the records to correspond to each claim form,
the application can find, using the association function, relationships among
medical procedures that are often performed together.
Sequential/temporal pattern functions analyze a collection of records over a period of time, for example to identify trends. Where the identity of the customer who made a purchase is known, an analysis can be made of the collection of related records of the same structure (i.e. consisting of a number of items drawn from a given collection of items). The records are related by the identity of the customer who made the repeated purchases. Such a situation is typical of a direct mail application where, for example, a catalogue merchant has, for each customer, the sets of products that the customer buys in every purchase order. A sequential pattern function will analyze such collections of related records and will detect frequently occurring patterns of products bought over time. A sequential pattern operator could also be used to discover, for example, the set of purchases that frequently precedes the purchase of a microwave oven.
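A heavily reduced version of this idea, assuming each customer's purchases are already ordered in time and using invented purchase histories, is to count how often one product is later followed by another across customers:

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: for each customer, the products bought
# in successive purchase orders (already in time order).
histories = {
    "c1": ["kettle", "toaster", "microwave"],
    "c2": ["kettle", "microwave"],
    "c3": ["toaster", "kettle", "microwave"],
    "c4": ["kettle", "toaster"],
}

# Count ordered pairs (a, b) where a customer bought a and, in a later order, b.
pair_counts = Counter()
for orders in histories.values():
    for a, b in combinations(orders, 2):   # combinations preserve the time order
        pair_counts[(a, b)] += 1

# Purchases that frequently precede the purchase of a microwave oven.
print([(pair, n) for pair, n in pair_counts.items() if pair[1] == "microwave"])
```

Real sequential pattern miners (GSP- or PrefixSpan-style algorithms) handle arbitrary-length sequences and minimum-support thresholds; the sketch only counts ordered pairs.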
Sequential pattern mining functions are quite powerful and can be used to detect the set of customers associated with some frequent buying patterns. Applying these functions to, for example, a set of insurance claims can lead to the identification of frequently occurring sequences of medical procedures applied to patients, which can help identify good medical practices as well as potentially detect some medical insurance fraud.
Clustering
and segmentation are the processes of creating a partition so that all the
members of each set of the partition are similar according to some metric. A
cluster is a set of objects grouped together because of their similarity or
proximity. Objects are often decomposed into an exhaustive and/or mutually
exclusive set of clusters.
Clustering according to similarity is a very powerful technique, the key to it being to translate some intuitive measure of similarity into a quantitative measure. When learning is unsupervised, the system has to discover its own classes, i.e. the system clusters the data in the database. It has to discover subsets of related objects in the training set and then find descriptions that describe each of these subsets.
There
are a number of approaches for forming clusters. One approach is to form rules
which dictate membership in the same group based on the level of similarity
between members. Another approach is to build set functions that measure some
property of partitions as functions of some parameter of the partition.
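K-means is a common concrete instance of the second approach: it optimizes a set function, the within-cluster sum of squared distances, over candidate partitions. The sketch below runs it with scikit-learn on made-up basket records; the data, the number of clusters, and the library choice are assumptions for illustration.

```python
from sklearn.cluster import KMeans

# Made-up observations, e.g. (total revenue, number of items) per market basket.
X = [
    [12.0, 2], [15.0, 3], [14.0, 2],      # small baskets
    [80.0, 12], [95.0, 15], [88.0, 14],   # large baskets
]

# Partition the untagged records into k groups; k-means minimizes the
# within-cluster sum of squared distances (a set function on the partition).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster membership of each basket
print(kmeans.cluster_centers_)  # description (centroid) of each cluster
```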
IBM have used segmentation techniques in their Market Basket Analysis on POS transactions, separating a set of untagged input records into reasonable groups according to product revenue by market basket, i.e. the market baskets were segmented based on the number and type of products in the individual baskets.
Each segment reports total revenue and number of baskets. Using a neural network, 275,000 transaction records were divided into 16 segments. The following types of analysis were also available: revenue by segment, baskets by segment, average revenue by segment, etc.