215p
5.1 Introduction
5.2 Linear Discriminant Functions and Decision Surfaces
weight vector
bias / threshold weight
5.2.1 The Two-Category Case
"The effective input is the sum of each input feauture value multiplied by its corresponding weight."
http://en.wikipedia.org/wiki/Hyperplane
"The weight vector is normal to any vector lying in the hyperplane."
A linear discriminant function divides the feature space by a hyperplane decision surface.
The orientation of the surface is determined by the normal vector w.
The location of the surface is determined by the bias w_0.
The discriminant function g(x) is proportional to the signed distance from x to the hyperplane.
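A minimal NumPy sketch of these facts; the values of w, w_0, and x are made up for illustration:

    import numpy as np

    # Hypothetical weight vector and bias (illustrative values only).
    w = np.array([3.0, 4.0])   # normal to the decision hyperplane
    w0 = -5.0                  # bias / threshold weight

    def g(x):
        """Linear discriminant g(x) = w^T x + w_0."""
        return w @ x + w0

    x = np.array([2.0, 1.0])
    # Signed distance from x to the hyperplane g(x) = 0 is g(x) / ||w||.
    r = g(x) / np.linalg.norm(w)
    print(g(x), r)   # positive => x lies on the side that w points toward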
218p
5.2.2 The Multicategory Case
A linear machine divides the feature space into c decision regions.
If decision regions R_i and R_j are contiguous, the boundary between them is a portion of the hyperplane H_ij defined by g_i(x) = g_j(x).
With a linear machine, this boundary is determined by the difference of the weight vectors, w_i - w_j.
There are c(c-1)/2 pairs of regions.
Linear machines are most suitable for problems in which the conditional densities p(x|w_i) are unimodal.
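A minimal sketch of a linear machine, assuming illustrative weight vectors and biases (not from the text):

    import numpy as np

    # Hypothetical linear machine with c = 3 classes in 2-D (illustrative weights).
    W = np.array([[ 1.0,  0.0],
                  [-1.0,  1.0],
                  [ 0.0, -1.0]])      # row i is the weight vector w_i
    w0 = np.array([0.0, 0.5, -0.5])   # bias w_i0 for each class

    def classify(x):
        """Assign x to the class with the largest discriminant g_i(x) = w_i^T x + w_i0."""
        g = W @ x + w0
        return int(np.argmax(g))

    # The boundary between contiguous regions R_i and R_j is part of the
    # hyperplane g_i(x) = g_j(x), i.e. (w_i - w_j)^T x + (w_i0 - w_j0) = 0.
    print(classify(np.array([2.0, 0.0])))   # -> 0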
219p
5.3 Generalized Linear Discriminant Functions
http://en.wikipedia.org/wiki/Scaling_(geometry)
if the matrix W = [w_ij] of the quadratic term is
i) a positive multiple of the identity matrix
=> hypersphere
ii) positive definite
=> hyperellipsoid
iii) indefinite (some of its eigenvalues positive, others negative)
=> various hyperhyperboloids
==> the kinds of separating surfaces that arise in the general multivariate Gaussian case
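A small sketch that classifies the separating surface from the eigenvalues of an assumed symmetric matrix W, following the three cases above:

    import numpy as np

    def surface_type(W, tol=1e-9):
        """Classify the quadratic decision surface by the eigenvalues of the
        symmetric matrix W of the quadratic term (cases (i)-(iii) above)."""
        eig = np.linalg.eigvalsh(W)
        if np.allclose(W, W[0, 0] * np.eye(len(W))) and W[0, 0] > 0:
            return "hypersphere"          # positive multiple of the identity
        if np.all(eig > tol):
            return "hyperellipsoid"       # positive definite
        if np.any(eig > tol) and np.any(eig < -tol):
            return "hyperhyperboloid"     # mixed-sign eigenvalues
        return "none of the three cases above"

    print(surface_type(2.0 * np.eye(3)))              # hypersphere
    print(surface_type(np.diag([1.0, 2.0, 3.0])))     # hyperellipsoid
    print(surface_type(np.diag([1.0, -2.0, 3.0])))    # hyperhyperboloid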
polynomial discriminant function
-> generalized linear discriminant function
: not linear in the feature vector x, but linear in arbitrary functions y_i(x) of x (the phi functions)
The mapping from x-space to y-space reduces the problem to finding a homogeneous linear discriminant function g(x) = a^t y.
Even with relatively simple functions y(x), decision surfaces induced in an x-space can be fairly complex.
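A sketch of this with an assumed one-dimensional example, y(x) = (1, x, x^2)^T and illustrative coefficients a:

    import numpy as np

    # Hypothetical phi functions for a scalar x: y(x) = (1, x, x^2)^T.
    def phi(x):
        return np.array([1.0, x, x * x])

    # A homogeneous linear discriminant in y-space, g(x) = a^T y(x),
    # with illustrative coefficients a (not from the text).
    a = np.array([-1.0, 0.0, 1.0])     # g(x) = x^2 - 1

    def g(x):
        return a @ phi(x)

    # Linear in y, yet the induced decision regions in x-space are nonlinear:
    # g(x) > 0 for |x| > 1 and g(x) < 0 for |x| < 1 (a disconnected positive region).
    print(g(0.5), g(2.0))   # -0.75, 3.0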
5.4 The Two-Category Linearly Separable Case
separating vector / solution vector (a weight vector that classifies all samples correctly; the samples are first "normalized" by replacing those of the second class with their negatives, so a solution satisfies a^t y_i > 0 for all i)
http://en.wikipedia.org/wiki/Weight_(representation_theory)
5.4.1 Geometry and Terminology
The equation a^t y_i = 0 defines a hyperplane through the origin of weight space having y_i as a normal vector.
The solution vector must be on the positive side of every hyperplane, i.e., it must lie in the intersection of n half-spaces. -> solution region
-> To find a solution vector closer to the "middle" of the solution region
-> margin
224p
5.4.2 Gradient Descent Procedures
To define a criterion function J(a) that is minimized if a is a solution vector
(1) BGD = Basic Gradient Descent
The next weight vector is obtained by moving some distance from the former one in the direction of steepest descent, along the negative of the gradient of a criterion function.
(2) Newton's algorithm
- To minimize the second-order expansion of the criterion function (exact for quadratic error functions)
- Leads to greater improvement per step
- computational burden of inverting the Hessian matrix
learning rate (to set the step size)
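A sketch of the two update rules above, with a made-up quadratic criterion J(a) = ||a - 1||^2 standing in for the real criterion function:

    import numpy as np

    # Generic descent updates for a differentiable criterion J(a); grad and hess
    # are assumed to be user-supplied callables (illustrative only).
    def gradient_descent_step(a, grad, eta):
        """a(k+1) = a(k) - eta(k) * grad J(a(k))."""
        return a - eta * grad(a)

    def newton_step(a, grad, hess):
        """a(k+1) = a(k) - H^{-1} grad J(a(k)); exact in one step for quadratic J,
        at the cost of inverting (here: solving with) the Hessian."""
        return a - np.linalg.solve(hess(a), grad(a))

    # Example on the made-up quadratic J(a) = ||a - 1||^2.
    grad = lambda a: 2.0 * (a - 1.0)
    hess = lambda a: 2.0 * np.eye(len(a))
    a = np.zeros(3)
    print(gradient_descent_step(a, grad, eta=0.1))   # small move toward the minimum
    print(newton_step(a, grad, hess))                # lands on the minimum [1, 1, 1]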
http://en.wikipedia.org/wiki/Gradient_descent
The (negative) gradient at a point is orthogonal to the contour line going through that point.
http://en.wikipedia.org/wiki/Hessian_matrix
http://en.wikipedia.org/wiki/Taylor_series
The Taylor series is a representation of a function as an infinite sum of terms calculated from the values of its derivatives at a single point.
(As the degree of the Taylor polynomial rises, it approaches the correct function.)
http://en.wikipedia.org/wiki/Newton%27s_method
http://en.wikipedia.org/wiki/Quasi-Newton_method
227p
5.5 Minimizing the Perceptron Criterion Function
5.5.1 The Perceptron Criterion Function
http://en.wikipedia.org/wiki/Perceptron
Perceptron criterion function
- never negative, being zero only if a is a solution vector or if a is on the decision boundary.
- proportional to the sum of the distances from the misclassified samples to the decision boundary
Perceptron criterion is piecewise linear and acceptable for gradient descent.
batch training
batch Perceptron algorithm:
The next weight vector is obtained by adding some multiple of the sum of the misclassified samples to the present weight vector.
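A sketch of the batch Perceptron update on a tiny, made-up set of normalized samples Y (class-2 samples negated):

    import numpy as np

    def batch_perceptron(Y, eta=1.0, max_iter=1000):
        """Batch Perceptron (sketch): Y holds augmented, 'normalized' samples
        (class-2 samples negated), so a solution satisfies a^T y > 0 for all y.
        Each step adds eta times the sum of the misclassified samples to a."""
        a = np.zeros(Y.shape[1])
        for _ in range(max_iter):
            misclassified = Y[Y @ a <= 0]          # samples with a^T y <= 0
            if len(misclassified) == 0:
                break                              # a is a solution vector
            a = a + eta * misclassified.sum(axis=0)
        return a

    # Tiny linearly separable example (augmented with a leading 1, class 2 negated).
    Y = np.array([[ 1.0,  2.0,  1.0],
                  [ 1.0,  1.0,  2.0],
                  [-1.0,  1.0, -1.0],
                  [-1.0, -2.0,  1.0]])
    a = batch_perceptron(Y)
    print(a, Y @ a > 0)                  # all True once converged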
229p
5.5.2 Convergence Proof for Single-Sample Correction
http://en.wikipedia.org/wiki/Convergence_of_random_variables
We shall modify the weight vector whenever it misclassifies a single sample.
For the purposes of the convergence proof, repeat the samples cyclically.
: We must store and revisit all of the training patterns.
: We shall only change the weight vector when there is an error
Fixed-Increment Single-Sample Perceptron
Each correction moves the weight vector in a good direction; for linearly separable samples, the corrections eventually stop.
Fig.5.13
A correction which is proportional to the pattern vector is added to the weight vector.
Each correction brings the weight vector closer to the solution region.
Theorem 5.1. Perceptron Convergence
If training samples are linearly separable, then the sequence of weight vectors given by Fixed-Increment Single-Sample Perceptron algorithm will terminate at a solution vector.
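A sketch of the fixed-increment single-sample rule, reusing the toy normalized samples from the batch sketch:

    import numpy as np

    def fixed_increment_perceptron(Y, max_epochs=100):
        """Fixed-Increment Single-Sample Perceptron (sketch): cycle through the
        normalized samples and add any misclassified sample to the weight vector.
        Theorem 5.1 guarantees termination if the samples are linearly separable."""
        a = np.zeros(Y.shape[1])
        for _ in range(max_epochs):
            errors = 0
            for y in Y:                  # repeat the samples cyclically
                if a @ y <= 0:           # change a only when y is misclassified
                    a = a + y            # correction proportional to the pattern vector
                    errors += 1
            if errors == 0:              # a full pass without error: a is a solution
                return a
        return a

    # Toy normalized samples (same as in the batch sketch).
    Y = np.array([[ 1.0,  2.0,  1.0],
                  [ 1.0,  1.0,  2.0],
                  [-1.0,  1.0, -1.0],
                  [-1.0, -2.0,  1.0]])
    print(fixed_increment_perceptron(Y))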
232p
5.5.3 Some Direct Generalizations
Algorithm 5: Variable-Increment Perceptron with Margin
correction with a variable increment and a margin (see the sketch after this list)
Algorithm 6: Batch Variable Increment Perceptron
the trajectory of the weight vector is smoothed
Algorithm 7: Balanced Winnow
for separable data, the gap determined by the two constituent weight vectors can never increase in size -> a convergence proof
http://en.wikipedia.org/wiki/Winnow_(algorithm)
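A sketch of Algorithm 5 (Variable-Increment Perceptron with Margin), with an assumed schedule eta(k) = 1/k and margin b = 0.5 (both illustrative choices):

    import numpy as np

    def variable_increment_perceptron(Y, b=0.5, max_epochs=100):
        """Variable-Increment Perceptron with Margin (sketch): correct whenever
        a^T y fails to exceed the margin b, using a decreasing increment
        eta(k) = 1/k for the k-th correction (an illustrative schedule that
        satisfies the usual convergence conditions)."""
        a = np.zeros(Y.shape[1])
        k = 0
        for _ in range(max_epochs):
            done = True
            for y in Y:
                if a @ y <= b:                 # misclassified or inside the margin
                    k += 1
                    a = a + (1.0 / k) * y      # variable increment eta(k) = 1/k
                    done = False
            if done:                           # a full pass with a^T y > b for all y
                return a
        return a

    # Toy normalized samples (same as in the earlier sketches).
    Y = np.array([[ 1.0,  2.0,  1.0],
                  [ 1.0,  1.0,  2.0],
                  [-1.0,  1.0, -1.0],
                  [-1.0, -2.0,  1.0]])
    print(variable_increment_perceptron(Y))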
235p
5.6 Relaxation Procedures
5.6.1 The Descent Algorithm
a squared error function:
- The gradient is continuous (whereas the gradient of the Perceptron criterion function is not), so the descent searches a smoother surface
- but the criterion is so smooth near the boundary of the solution region that the sequence of weight vectors can converge to a point on the boundary, notably the useless point a = 0
- and its value can be dominated by the longest sample vectors
a squared error with margin:
- never negative; zero iff a^t y >= b for all of the training samples (no sample is misclassified or falls within the margin)
Algorithm 8: Batch Relaxation with Margin
Algorithm 9: Single-Sample Relaxation with Margin (see the sketch below)
relaxation
under-relaxation
over-relaxation
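A sketch of Algorithm 9; the margin b and relaxation factor eta are illustrative values, with eta chosen in (0, 2) as the convergence proof requires:

    import numpy as np

    def single_sample_relaxation(Y, b=1.0, eta=1.5, max_epochs=100):
        """Single-Sample Relaxation with Margin (sketch): whenever a^T y <= b,
        move the weight vector a fraction eta of the distance toward the
        hyperplane a^T y = b. eta < 1 is under-relaxation, 1 < eta < 2 is
        over-relaxation; the sequence may only approach the solution region
        asymptotically, so we simply stop after max_epochs if corrections
        never cease."""
        a = np.zeros(Y.shape[1])
        for _ in range(max_epochs):
            corrected = False
            for y in Y:
                if a @ y <= b:
                    a = a + eta * (b - a @ y) / (y @ y) * y
                    corrected = True
            if not corrected:
                return a
        return a

    # Toy normalized samples (same as in the Perceptron sketches).
    Y = np.array([[ 1.0,  2.0,  1.0],
                  [ 1.0,  1.0,  2.0],
                  [-1.0,  1.0, -1.0],
                  [-1.0, -2.0,  1.0]])
    print(single_sample_relaxation(Y))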
237p
5.6.2 Convergence Proof
238p
5.7 Nonseparable Behavior
error-correcting procedures: the weight vector is modified when and only when an error is encountered
The lengths of the weight vectors produced by the fixed-increment rule are bounded, fluctuating near some limiting value.
If the components of the samples are integer-valued, the fixed-increment procedure yields a finite-state process.
Averaging the weight vectors can reduce the risk of obtaining a bad solution.
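One simple way to realize the averaging idea (a sketch, not a specific algorithm from the text); the toy nonseparable set is made up:

    import numpy as np

    def averaged_perceptron(Y, max_epochs=50):
        """Sketch of the averaging idea: run fixed-increment corrections, but
        return the average of all intermediate weight vectors instead of the
        final one, which keeps fluctuating when the data are not separable."""
        a = np.zeros(Y.shape[1])
        a_sum = np.zeros_like(a)
        count = 0
        for _ in range(max_epochs):
            for y in Y:
                if a @ y <= 0:
                    a = a + y            # fixed-increment correction on error
                a_sum += a
                count += 1
        return a_sum / count             # averaged weight vector

    # A made-up nonseparable pair: no a can satisfy a^T y > 0 for both samples,
    # so the raw weight vector oscillates while its average settles down.
    Y = np.array([[ 1.0,  1.0],
                  [-1.0, -1.0]])
    print(averaged_perceptron(Y))        # -> [0.5 0.5]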
5.8 Minimum Squared-Error Procedures