블로그 이미지
Leeway is... the freedom that someone has to take the action they want to or to change their plans.
maetel

Notice

Recent Post

Recent Comment

Recent Trackback

Archive

calendar

1 2 3 4
5 6 7 8 9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29 30 31
  • total
  • today
  • yesterday

Category

215p
5.1 Introduction

http://en.wikipedia.org/wiki/Statistical_classification

http://en.wikipedia.org/wiki/Linear_classifier

http://en.wikipedia.org/wiki/Linear_discriminant_analysis


linear discriminant functions
1) linear in the components of a feature vector
2) linear in some given set of functions of a feature vector

finding a linear discriminant function
-> minimizing a criterion fuction - sample risk / training error
: "A small training error does not guarantee a small test error."
=> to derive the minimum-risk linear discriminant

-> "convergence properties & computational complexities" of gradient descent procedures for minimizing criterion functions


216p
5.2 Linear Discriminant Functions and Decision Surfaces

weight vector
bias / threshold weight

5.2.1 The Two-Category Case

"The effective input is the sum of each input feauture value multiplied by its corresponding weight."

http://en.wikipedia.org/wiki/Hyperplane

"The weight vector is normal to any vector lying in the hyperplane."

A linear discriminant function divides the feature space by a hyperplane decision surface.

The orientation of the surface is determined by the normal vector w.
The location of the surface is determined by the bias w_0.

The discriminant function g(x) is proportional to the signed distance from x to the hyperplane.


218p
5.2.2 The Multicategory Case

A linear machine divides the feature space into c decision regions.

If the decision regions are contiguous, the boundary between them is a portion of the hyperplane.

With the linear machine, the decision boundary is determined by the differences of the weight vectors.

There are c(c-1)/2 paris of regions.


The linear machines are most suitable for problems for which the conditional densities are unimodal.


219p
5.3 Generalized Linear Discriminant Functions
 

http://en.wikipedia.org/wiki/Scaling_(geometry)
if the scaled matrix is
i) the positive multiple of the identity matrix
    => hypershere
ii) positive definete
    => hyperellipsoid
iii) some of eigenvalues of it are positive and others negative
    => various hyperhyperboloids
==> general multivariate Gaussian case

 
polynomial discriminant function
-> generalized linear discriminant function
: not linear in the feature vector x but linear in arbitrary functions y (phi function) of x

The mapping from x-space to y-space comes to find a homogeneous linear discriminant function.

Even with relatively simple functions y(x), decision surfaces induced in an x-space can be fairly complex.

the curse of dimensionality
-> the number of samples should be not less than the number of degrees of freedom. (-> ch.9)

http://en.wikipedia.org/wiki/Big_O_notation


http://en.wikipedia.org/wiki/Margin_(machine_learning)
The margin of a single data point is defined to be the distance from the data point to a decision boundary.
cf. VC dimension Tutorial Slides by Andrew Moore


http://en.wikipedia.org/wiki/Artificial_neural_network

http://en.wikipedia.org/wiki/Multilayer_perceptron


augmented feature vector, y
augmented weight vector, a

The hyperplane desicion surface in y-space passes through the origin in y-space.

The mapping from d-dim x-space to (d+1)-dim y-space preserves all distance relationships among samples.
-> The distance from y to the transformed hyperplane is less than or equal to the distance from x to the original hyperplane.



223p
5.4 The Two-Category Linearly Separable Case


separating vector / solution vector (to normalize samples in the feature space)

http://en.wikipedia.org/wiki/Weight_(representation_theory)

5.4.1 Geometry and Terminology


The linear discriminant equation defines a hyperplane through the origin of weight space having y_i as a normal vector.

The solution vector must be on the positivie side of every hyperplane and lie in the intersection of n half-spaces. -> solution region

-> To find a solution vector closer to the "middle" of the solution region
-> margin


224p
5.4.2 Gradient Descent Procedures

To define a criterion function J(a) that is minimized if a is a solution vector

(1) BGD = Basic Gradient Descent
The next weight vector is obtained by moving some distance from the former one in the direction of steepest descent, along the negative of the gradient of a criterion function.

(2) Newton's algorithm
- To minimize the second-order expansion for the quadratic error functions
- To leads to greater improvement per step
- computational burden of inverting the Hessian matrix

learning rate (to set the step size)

http://en.wikipedia.org/wiki/Gradient_descent
The (negative) gradient at a point is orthogonal to the contour line going through that point.

http://en.wikipedia.org/wiki/Hessian_matrix

http://en.wikipedia.org/wiki/Taylor_series
The Taylor series is a representation of a function as an infinite sum of terms calculated from the values of its derivatives at a single point.
(As the degree of the Taylor polynomial rises, it approaches the correct function.

http://en.wikipedia.org/wiki/Newton%27s_method

http://en.wikipedia.org/wiki/Quasi-Newton_method



227p
5.5 Minimizing The Perception Criterion Function

5.5.1 The Perceptron Criterion Function

http://en.wikipedia.org/wiki/Perceptron

Perceptron criterion function
- never negative, being zero only if a is a solution vector or if a is on the decision boundary.
- propotional to the sum of the distances from the misclassified samples to the decision boundary

Perceptron criterion is piecewise linear and acceptable for gradient descent.


batch training
batch Perceptron algorithm:
The next weight vector is obtained by adding some multiple of the sum of the misclassified samples to the present weight vector.


229p
5.5.2 Convergence Proof for Single-Sample Correction

http://en.wikipedia.org/wiki/Convergence_of_random_variables

We shall modify the weight vector whenever it misclassifies a single sample.
For the purposes of the convergence proof, repeat the samples cyclically.
: We must store and revisit all of the training patterns.
: We shall only change the weight vector when there is an error

Fixed-Increment Single-Sample Perceptron
The correction is moving the weight vector in a good direction, until the samples are linearly separable.

Fig.5.13
A correction which is proportional to the pattern vector is added to the weight vector.
Each correction brings the weight vector closer to the solution region.

Theorem 5.1. Perceptron Convergence
If training samples are linearly separable, then the sequence of weight vectors given by Fixed-Increment Single-Sample Perceptron algorithm will terminate at a solution vector.


232p
5.5.3 Some Direct Generalizations

Algorithm 5: Variable-Increment Percceptron with Margin
correction with a variable increment and a margin

Algorithm 6: Batch Variable Increment Perceptron
the trajectory of the weight vector is smoothed

Algorithm 7: Balanced Winnow
for separable data the gap determined by the two contituent weight vectors can naver increase in size. -> a convergence proof
http://en.wikipedia.org/wiki/Winnow_(algorithm)


235p
5.6 Relaxation Procedures

5.6.1 The Descent Algorithm

a squared error function:
- The gradient is countinuous(, whereas the gradient of a Perceptron criterion function is not.)
- to search a smoother surface
- so smoother near the boundary of the solution region (that the sequence of weight vectors can converge to a point on the boundary)
- the gradient reaches the boundary point when the weight vector is zero vector
- dominated by the longest sample vectors

a squared error with margin:
- never negative, (zero iff weighted samples misclassified by the weight is equal to or more than the margin)

Algorithm 8: Batch Relaxation with Margin

Algorithm9: Single-Sample Relaxation with Margin

relaxation
under-relaxation
over-relaxation


237p
5.6.2 Convergence Proof


238p
5.7 Nonseparable Behavior


error-correcting procedures to call for a modification of the weight vector when and only when an error is encountered

The length of the weight vectors produced by the fixed-increment rule are bounded (, fluctuating near some limiting value).

If the components of the samples are integer-valued, the fixed-increment procedure yields a finite-state process.

Averaging the weight vectors can reduce the risk of obtaining a bad solution.



5.8 Minimum Squared-error Procedures


5.8.4 The Widrow-Hoff or LMS Procedure

http://en.wikipedia.org/wiki/Least_mean_squares_filter


5.8.5 Stochastic Approximation Methods

http://en.wikipedia.org/wiki/Stochastic_gradient_descent


5.9 The Ho-Koshyap Procedures


5.11 Support Vector Machines


5.12 Multicategory Generalizations

5.12.1 Kesler's construction

to convert many multicategory error-correction procedures to two-category procedures for the purpose of obtaining a convergence proof

5.12.2 Convergence of the Fixed-Increment Rule

5.12.3 Generalizaions for MSE Procedures


posted by maetel