Constraint-Based Association Mining
Constraint-based rule miners find all rules in a given dataset meeting
user-specified constraints such as minimum support and confidence. We describe
a new algorithm that directly exploits all user-specified constraints including
minimum support, minimum confidence, and a new constraint that ensures every
mined rule offers a predictive advantage over any of its simplifications. Our
algorithm maintains efficiency even at low supports on data that is dense (e.g.
relational data). Previous approaches such as Apriori and its variants exploit
only the minimum support constraint, and as a result are ineffective on dense
data due to a combinatorial explosion of “frequent item sets”.
Mining rules
from data is a problem that has attracted considerable interest because a rule
provides a concise statement of potentially useful information that is easily understood
by end users. In the database literature, the focus has been on developing
association rule algorithms that identify all conjunctive rules meeting
user-specified constraints such as minimum support (a statement of generality) and
minimum confidence (a statement of predictive ability). These algorithms were
initially developed to tackle data-sets primarily from the domain of
market-basket analysis, though there has been recent interest in applying these
algorithms to other domains including telecommunications data analysis, census
data analysis, and classification and predictive modeling tasks in general.
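For reference, the measures that appear throughout this discussion can be written down in a few lines. The sketch below is illustrative only (toy data and function names of our own choosing, not the paper's implementation); lift and conviction are included here because they come up shortly:

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """sup(A u C) / sup(A): how often the consequent holds when the antecedent does."""
    joint = set(antecedent) | set(consequent)
    return support(joint, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's baseline frequency; > 1 means a predictive advantage."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

def conviction(antecedent, consequent, transactions):
    """(1 - sup(C)) / (1 - conf): expected rate of the rule being wrong under
    independence, divided by the rate at which it is actually wrong."""
    c = confidence(antecedent, consequent, transactions)
    s = support(consequent, transactions)
    return float("inf") if c == 1.0 else (1.0 - s) / (1.0 - c)

# Toy data, purely for illustration.
transactions = [frozenset(t) for t in (
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"cereal", "milk"},
    {"cereal", "eggs", "milk"},
    {"milk"},
)]
print(confidence({"bread", "butter"}, {"milk"}, transactions))  # 0.5 on this toy data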
Unlike data from market-basket analysis, these data-sets tend to be dense in that they have any or all of
the following properties:
• Many frequently occurring items (e.g. sex=male);
• Strong correlations between several items;
• Many items in each record.
These data-sets can cause an exponential blow-up in the resource consumption of standard association rule mining algorithms, including Apriori and its many variants. The combinatorial explosion results from the fact that these algorithms effectively mine all rules that satisfy only the minimum support constraint, and the number of such rules is exorbitant: a single record with k items already contains 2^k - 1 nonempty item sets, so dense records generate enormous numbers of frequent candidates. Though other rule constraints are
specifiable, they are typically enforced solely during a post-processing filter
step. Our approach to mining on dense data-sets is to instead directly enforce
all user-specified rule constraints during mining. For example, most
association rule miners allow users to set a minimum on the predictive ability
of any mined rule specified as either a minimum confidence or an alternative
measure such as lift (also known as interest) or conviction. Our algorithm can
exploit such minimums on predictive ability during mining for vastly improved
efficiency. Even given strong minimums on support and predictive ability, the
rules satisfying these constraints in a dense dataset are often too numerous to
be mined efficiently or comprehended by the end user. To remedy this problem,
our algorithm exploits another constraint that eliminates rules that are
uninteresting because they contain conditions that do not (strongly) contribute
to the predictive ability of the rule. To illustrate this useful concept, first
consider the rule below:
Bread & Butter → Milk (Confidence = 80%)
This rule has a
confidence of 80%, which says that 80% of the people who purchase bread and
butter also purchase the item in the consequent
of the rule, which is milk. Because of its high confidence, one might be
inclined to believe that this rule is an interesting finding if the goal is to,
say, understand the population of likely milk buyers in order to make better
stocking and discounting decisions. However, if 85% of the population under
examination purchased milk, this rule is actually quite uninteresting for this purpose
since it characterizes a population that is even less likely to buy milk than
the average shopper. This point has motivated additional measures for
identifying interesting rules including lift and conviction. Both lift and
conviction represent the predictive advantage a rule offers over simply guessing
based on the frequency of the consequent. But both measures still exhibit
another closely related problem illustrated by the next rule.
Eggs & Cereal → Milk (Confidence = 95%)
Because the
confidence of this rule (95%) is significantly higher than the frequency with
which milk is purchased (85%), this rule will have lift and conviction values
that could imply to the end-user that it is interesting for understanding likely
milk buyers. But suppose the purchase of cereal alone implies that milk is
purchased with 99% confidence. The rule above is then actually undesirable: it offers less predictive ability than the more concise rule Cereal → Milk, which is also more broadly applicable (because there are more people who buy cereal than people who buy both cereal and eggs). To
address these problems, our algorithm allows the user to specify a minimum improvement constraint. The idea
is to mine only those rules whose confidence is at least minimp greater than the confidence of any of their proper sub-rules, where
a proper sub-rule is a simplification of the rule formed by removing one or
more conditions from its antecedent. Any positive setting of minimp would prevent the undesirable rules in the examples above from being generated by
our algorithm. More generally, the minimum improvement constraint remedies the
rule explosion problem resulting from the fact that in dense data-sets, the
confidence of many rules can often be marginally improved upon in an
overwhelming number of ways by adding additional conditions. For example, given
the rule stating that cereal implies milk with 99% confidence, there may be
hundreds of rules of the form below with a confidence between 99% and 99.1%.
Cereal & i1 & i2 & … & ik → Milk
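The improvement check itself is easy to state in code. The sketch below reuses confidence() from the earlier sketch and is only an illustrative post-hoc check; the actual algorithm enforces the constraint during mining, which is where the efficiency gains come from:

from itertools import combinations

def improvement(antecedent, consequent, transactions):
    """Rule confidence minus the best confidence over all proper sub-rules
    (formed by dropping one or more antecedent conditions). The empty
    antecedent is included; its confidence is simply sup(consequent).
    Assumes a non-empty antecedent."""
    items = list(antecedent)
    best_sub = max(
        confidence(sub, consequent, transactions)  # confidence() as defined earlier
        for k in range(len(items))                 # keep 0 .. len(items)-1 conditions
        for sub in combinations(items, k)
    )
    return confidence(items, consequent, transactions) - best_sub

On the examples above, Eggs & Cereal → Milk has improvement of at most 95% - 99% = -4%, and Bread & Butter → Milk has improvement of at most 80% - 85% = -5% (its empty sub-rule already achieves 85%), so any positive minimp eliminates both.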
By specifying a
small positive value for minimp, one can trade away such marginal benefits in
predictive ability for a far more concise set of rules, with the added property
that every returned rule consists entirely of items that are strong contributors
to its predictive ability. We feel this is a worthwhile trade-off in most
situations where the mined rules are used for end-user understanding.
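Putting the sketches together on the toy data, a post-hoc minimp filter looks like the following (again illustrative only; candidate_rules is a hypothetical, hand-picked list rather than a miner's output):

candidate_rules = [
    ({"bread", "butter"}, {"milk"}),
    ({"eggs", "cereal"}, {"milk"}),
    ({"cereal"}, {"milk"}),
]
minimp = 0.01  # require at least a one-point confidence gain over every sub-rule
kept = [(a, c) for a, c in candidate_rules
        if improvement(a, c, transactions) >= minimp]
print(kept)  # only ({'cereal'}, {'milk'}) survives on the toy data

Note how the longer Eggs & Cereal rule is dropped in favor of the simpler Cereal rule, mirroring the discussion above.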