Support Vector Machines

The Case When the Data Are Linearly Separable
To explain the mystery of SVMs, let's first look at the simplest case: a two-class problem where the classes are linearly separable. Let the data set D be given as (X1, y1), (X2, y2), ..., (X|D|, y|D|), where Xi is the set of training tuples with associated class labels yi. Each yi can take one of two values, either +1 or -1 (i.e., yi ∈ {+1, -1}), corresponding to the classes buys_computer = yes and buys_computer = no, respectively. To aid in visualization, let's consider an example based on two input attributes, A1 and A2. Plotted in 2-D, the data are linearly separable (or "linear," for short) because a straight line can be drawn to separate all the tuples of class +1 from all the tuples of class -1. There are an infinite number of separating lines that could be drawn. We want to find the "best" one, that is, one that (we hope) will have the minimum classification error on previously unseen tuples. How can we find this best line? Note that if our data were 3-D (i.e., with three attributes), we would want to find the best separating plane. Generalizing to n dimensions, we want to find the best hyperplane. We will use the term "hyperplane" to refer to the decision boundary that we are seeking, regardless of the number of input attributes.
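To make this concrete, here is a small sketch in Python (the data points, labels, and candidate lines below are made-up values for illustration, not taken from the text). It shows that two different straight lines can both separate the same toy data perfectly, which is exactly why we need a criterion for choosing the "best" one:

import numpy as np

# Hypothetical 2-D training tuples (attributes A1, A2) with labels in {+1, -1}.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],   # class +1
              [6.0, 1.0], [7.0, 2.0], [8.0, 0.5]])  # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

def separates(w0, w1, w2):
    # True if the line w0 + w1*x1 + w2*x2 = 0 puts every tuple on the
    # side that matches its class label.
    return bool(np.all(np.sign(w0 + w1 * X[:, 0] + w2 * X[:, 1]) == y))

# Two different lines, both of which separate the classes without error:
print(separates(9.0, -2.0, 0.0))    # the vertical line x1 = 4.5 -> True
print(separates(10.0, -2.0, -0.5))  # a slanted line             -> True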
A separating hyperplane can be written as

W · X + b = 0,

where W is a weight vector, namely, W = {w1, w2, ..., wn}; n is the number of attributes; and b is a scalar, often referred to as a bias. To aid in visualization, let's again consider the two input attributes, A1 and A2. Training tuples are 2-D, e.g., X = (x1, x2), where x1 and x2 are the values of attributes A1 and A2, respectively, for X. If we think of b as an additional weight, w0, we can rewrite the separating hyperplane as

w0 + w1x1 + w2x2 = 0.
Thus, any point that lies above the separating hyperplane satisfies

w0 + w1x1 + w2x2 > 0.

Similarly, any point that lies below the separating hyperplane satisfies

w0 + w1x1 + w2x2 < 0.
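As a quick illustration of this sign test (the weights below are arbitrary illustrative values, not weights learned from any data):

import numpy as np

# Illustrative separating hyperplane w0 + w1*x1 + w2*x2 = 0.
w0, w1, w2 = 9.0, -2.0, 0.0

def side(x1, x2):
    # +1 if the point lies above the hyperplane, -1 if below, 0 if exactly on it.
    return int(np.sign(w0 + w1 * x1 + w2 * x2))

print(side(2.0, 3.0))  # +1 : w0 + w1*x1 + w2*x2 > 0, above the hyperplane
print(side(7.0, 2.0))  # -1 : w0 + w1*x1 + w2*x2 < 0, below the hyperplane
print(side(4.5, 1.0))  #  0 : the point lies on the hyperplane itself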
The weights can be adjusted so that the hyperplanes defining the "sides" of the margin can be written as

H1: w0 + w1x1 + w2x2 ≥ 1 for yi = +1, and
H2: w0 + w1x1 + w2x2 ≤ -1 for yi = -1.

That is, any tuple that falls on or above H1 belongs to class +1, and any tuple that falls on or below H2 belongs to class -1. Combining the two inequalities, we obtain

yi(w0 + w1x1 + w2x2) ≥ 1, for all i.
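The following sketch checks this combined condition on the same toy data used above (the weights here were scaled by hand so that the closest tuple of each class lies exactly on H1 or H2; they are illustrative, not the result of training):

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 1.0], [7.0, 2.0], [8.0, 0.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

# Hand-scaled weights: the closest tuples of each class (x1 = 3 and x1 = 6)
# fall exactly on H1 and H2, respectively.
w0, w1, w2 = 3.0, -2.0 / 3.0, 0.0

margins = y * (w0 + w1 * X[:, 0] + w2 * X[:, 1])
print(margins)                       # each value is (up to rounding) at least 1
print(np.all(margins >= 1 - 1e-9))   # True: yi(w0 + w1*x1 + w2*x2) >= 1 for all i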
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the "sides" defining the margin) satisfy this condition with equality and are called support vectors. That is, they are equally close to the (separating) MMH, the maximum marginal hyperplane. In the original figure, the support vectors are shown encircled with a thicker border. Essentially, the support vectors are the most difficult tuples to classify and give the most information regarding classification. From the above, we can obtain a formula for the size of the maximal margin. The distance from the separating hyperplane to any point on H1 is 1/||W||, where ||W|| is the Euclidean norm of W, that is, √(W · W). By definition, this is equal to the distance from any point on H2 to the separating hyperplane. Therefore, the maximal margin is 2/||W||.
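For readers who want to see these quantities computed rather than derived, here is a sketch using scikit-learn (the library choice and the large-C setting are assumptions of this sketch; the text itself does not prescribe any implementation). With a linear kernel and a very large C, SVC approximates the hard-margin, linearly separable case and exposes W, b, and the support vectors directly:

import numpy as np
from sklearn.svm import SVC

# The same made-up, linearly separable toy data as before.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 1.0], [7.0, 2.0], [8.0, 0.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

W = clf.coef_[0]        # weight vector W = (w1, w2)
b = clf.intercept_[0]   # bias b (w0 in the notation above)

print("support vectors:\n", clf.support_vectors_)
print("maximal margin = 2/||W|| =", 2.0 / np.linalg.norm(W))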
"So, how does an SVM find the MMH and the support vectors?" Using some "fancy math tricks," the constraints above can be rewritten as what is known as a constrained (convex) quadratic optimization problem.