2008. 11. 2. 16:51
@GSMC/김경환: Pattern Recognition
215p
5.1 Introduction
http://en.wikipedia.org/wiki/Statistical_classification
http://en.wikipedia.org/wiki/Linear_classifier
http://en.wikipedia.org/wiki/Linear_discriminant_analysis
linear discriminant functions
1) linear in the components of a feature vector
2) linear in some given set of functions of a feature vector
finding a linear discriminant function
-> minimizing a criterion function - sample risk / training error
: "A small training error does not guarantee a small test error."
=> to derive the minimum-risk linear discriminant
-> "convergence properties & computational complexities" of gradient descent procedures for minimizing criterion functions
216p
5.2 Linear Discriminant Functions and Decision Surfaces
weight vector
bias / threshold weight
5.2.1 The Two-Category Case
"The effective input is the sum of each input feauture value multiplied by its corresponding weight."
http://en.wikipedia.org/wiki/Hyperplane
"The weight vector is normal to any vector lying in the hyperplane."
218p
5.2.2 The Multicategory Case
A linear machine divides the feature space into c decision regions.
The linear machines are most suitable for problems for which the conditional densities are unimodal.
219p
"The effective input is the sum of each input feauture value multiplied by its corresponding weight."
http://en.wikipedia.org/wiki/Hyperplane
"The weight vector is normal to any vector lying in the hyperplane."
A linear discriminant function divides the feature space by a hyperplane decision surface.
The orientation of the surface is determined by the normal vector w.
The location of the surface is determined by the bias w_0.
The discriminant function g(x) is proportional to the signed distance from x to the hyperplane.
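A minimal sketch of these relations; the weight vector, bias, and sample point below are made-up values:

```python
import numpy as np

# Hypothetical weight vector and bias for a 2-D feature space.
w = np.array([3.0, 4.0])    # normal vector of the hyperplane
w0 = -5.0                   # bias / threshold weight

def g(x):
    """Linear discriminant g(x) = w^T x + w0."""
    return w @ x + w0

x = np.array([2.0, 1.0])
label = 1 if g(x) > 0 else 2                 # decide omega_1 if g(x) > 0, else omega_2
signed_distance = g(x) / np.linalg.norm(w)   # g(x) is proportional to this signed distance
print(label, signed_distance)
```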
218p
5.2.2 The Multicategory Case
A linear machine divides the feature space into c decision regions.
If the decision regions are contiguous, the boundary between them is a portion of a hyperplane.
With the linear machine, the decision boundary is determined by the differences of the weight vectors.
There are c(c-1)/2 pairs of regions.
The linear machines are most suitable for problems for which the conditional densities are unimodal.
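A small illustrative linear machine for c = 3 categories; the weights and biases are invented for the example:

```python
import numpy as np

# g_i(x) = w_i^T x + w_i0; decide the category whose discriminant is largest.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])      # one weight vector per row
w0 = np.array([0.0, 0.0, 0.5])    # one bias per category

def classify(x):
    return int(np.argmax(W @ x + w0))

# The boundary between regions i and j is where g_i(x) = g_j(x), i.e.
# (w_i - w_j)^T x + (w_i0 - w_j0) = 0 -- a piece of a hyperplane determined
# by the difference of the weight vectors.
print(classify(np.array([2.0, -1.0])))
```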
219p
5.3 Generalized Linear Discriminant Functions
quadratic discriminant function
2008/11/05 - [Method/Nature] - Jonathan Richard Shewchuk - An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
http://en.wikipedia.org/wiki/Quadratic_classifier
http://en.wikipedia.org/wiki/Quadric
The separating surface defined by the quadratic discriminant function is a second-degree or hyperquadric surface.
http://en.wikipedia.org/wiki/Hypersphere
http://en.wikipedia.org/wiki/Hyperboloid
http://en.wikipedia.org/wiki/Scaling_(geometry)
if the scaled matrix is
i) a positive multiple of the identity matrix
=> hypersphere
ii) positive definite
=> hyperellipsoid
iii) such that some of its eigenvalues are positive and others negative
=> various hyperhyperboloids
==> general multivariate Gaussian case
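A rough numerical check of the surface type from the eigenvalues of the quadratic-term matrix; the helper and the test matrices are illustrative, not from the book:

```python
import numpy as np

# Quadratic discriminant g(x) = x^T W x + w^T x + w0; the separating
# hyperquadric's type follows from the (symmetric) matrix W.
def quadric_type(W):
    eig = np.linalg.eigvalsh(W)
    if W[0, 0] > 0 and np.allclose(W, W[0, 0] * np.eye(len(W))):
        return "hypersphere"          # positive multiple of the identity
    if np.all(eig > 0):
        return "hyperellipsoid"       # positive definite
    if np.any(eig > 0) and np.any(eig < 0):
        return "hyperhyperboloid"     # mixed-sign eigenvalues
    return "degenerate case"

print(quadric_type(np.array([[2.0, 0.0], [0.0, 2.0]])))   # hypersphere
print(quadric_type(np.array([[2.0, 0.0], [0.0, -1.0]])))  # hyperhyperboloid
```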
polynomial discriminant function
-> generalized linear discriminant function
: not linear in the feature vector x but linear in arbitrary functions y (phi functions) of x
The mapping from x-space to y-space reduces the problem to finding a homogeneous linear discriminant function.
Even with relatively simple functions y(x), the decision surfaces induced in x-space can be fairly complex.
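A toy phi mapping that illustrates the idea; the mapping and the weight vector are invented, not the book's example:

```python
import numpy as np

# For a scalar feature x, map y(x) = (1, x, x^2)^T; a quadratic discriminant
# in x becomes a homogeneous linear discriminant g(x) = a^T y in y-space.
def phi(x):
    return np.array([1.0, x, x * x])

a = np.array([-1.0, 1.0, 2.0])   # hypothetical augmented weight vector

def g(x):
    return a @ phi(x)            # linear in y, quadratic (nonlinear) in x

print([g(x) for x in (-1.0, 0.0, 1.0)])
```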
the curse of dimensionality
-> the number of samples should be no less than the number of degrees of freedom (-> ch. 9)
http://en.wikipedia.org/wiki/Big_O_notation
http://en.wikipedia.org/wiki/Margin_(machine_learning)
The margin of a single data point is defined to be the distance from the data point to a decision boundary.
cf. VC dimension Tutorial Slides by Andrew Moore
http://en.wikipedia.org/wiki/Artificial_neural_network
http://en.wikipedia.org/wiki/Multilayer_perceptron
augmented feature vector, y
augmented weight vector, a
The hyperplane decision surface in y-space passes through the origin.
The mapping from d-dim x-space to (d+1)-dim y-space preserves all distance relationships among samples.
-> The distance from y to the transformed hyperplane is less than or equal to the distance from x to the original hyperplane.
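A minimal sketch of the augmentation; the numbers are made-up:

```python
import numpy as np

# Augment: map d-dimensional x to (d+1)-dimensional y = (1, x_1, ..., x_d)^T and
# absorb the bias into a = (w_0, w_1, ..., w_d)^T, so g(x) = w^T x + w_0 = a^T y.
def augment(x):
    return np.concatenate(([1.0], x))

w, w0 = np.array([3.0, 4.0]), -5.0
a = np.concatenate(([w0], w))

x = np.array([2.0, 1.0])
assert np.isclose(w @ x + w0, a @ augment(x))   # the homogeneous form gives the same value
```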
223p
5.4 The Two-Category Linearly Separable Case
separating vector / solution vector ('normalizing' the samples: replace every sample of the second class by its negative, so that a solution satisfies a^T y_i > 0 for all samples)
http://en.wikipedia.org/wiki/Weight_(representation_theory)
5.4.1 Geometry and Terminology
The linear discriminant equation defines a hyperplane through the origin of weight space having y_i as a normal vector.
The solution vector must be on the positive side of every hyperplane and lie in the intersection of n half-spaces. -> solution region
-> To find a solution vector closer to the "middle" of the solution region
-> margin
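A sketch of the 'normalization' and the solution-region test described above; the samples, labels, and candidate vector are invented:

```python
import numpy as np

# Replace every augmented sample of class omega_2 by its negative, so that a
# separating vector a must satisfy a^T y_i > 0 for every normalized sample.
def normalize(Y, labels):
    Y = Y.copy()
    Y[labels == 2] *= -1.0
    return Y

def in_solution_region(a, Y_norm, b=0.0):
    # a must lie on the positive side of every hyperplane a^T y_i = 0,
    # i.e. in the intersection of the half-spaces (optionally by a margin b).
    return bool(np.all(Y_norm @ a > b))

Y = np.array([[1.0, 2.0, 1.0], [1.0, -1.0, 0.5]])   # hypothetical augmented samples
labels = np.array([1, 2])
print(in_solution_region(np.array([0.0, 1.0, 1.0]), normalize(Y, labels)))
```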
224p
5.4.2 Gradient Descent Procedures
To define a criterion function J(a) that is minimized if a is a solution vector
(1) BGD = Basic Gradient Descent
The next weight vector is obtained by moving some distance from the former one in the direction of steepest descent, along the negative of the gradient of a criterion function.
(2) Newton's algorithm
- minimizes the second-order expansion of the criterion function (exact for quadratic error functions)
- leads to greater improvement per step
- computational burden of inverting the Hessian matrix
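A sketch of both update rules; the criterion, gradient, Hessian, step size, and iteration counts below are illustrative placeholders:

```python
import numpy as np

def basic_gradient_descent(a, grad_J, eta=0.1, steps=200):
    for _ in range(steps):
        a = a - eta * grad_J(a)                        # step along the negative gradient
    return a

def newton_descent(a, grad_J, hess_J, steps=10):
    for _ in range(steps):
        a = a - np.linalg.solve(hess_J(a), grad_J(a))  # a <- a - H^{-1} grad J
    return a

# Example: quadratic criterion J(a) = 1/2 a^T Q a - b^T a, minimized at Q^{-1} b.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_J = lambda a: Q @ a - b
hess_J = lambda a: Q
print(basic_gradient_descent(np.zeros(2), grad_J))           # approaches [0.2, 0.4]
print(newton_descent(np.zeros(2), grad_J, hess_J, steps=1))  # exact in one step for a quadratic
```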
learning rate (to set the step size)
http://en.wikipedia.org/wiki/Gradient_descent
The (negative) gradient at a point is orthogonal to the contour line going through that point.
http://en.wikipedia.org/wiki/Hessian_matrix
http://en.wikipedia.org/wiki/Taylor_series
The Taylor series is a representation of a function as an infinite sum of terms calculated from the values of its derivatives at a single point.
As the degree of the Taylor polynomial rises, it approaches the correct function.
http://en.wikipedia.org/wiki/Newton%27s_method
http://en.wikipedia.org/wiki/Quasi-Newton_method
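Tying the Taylor-series, Hessian, and Newton's-method references together (a standard second-order argument, not a quotation from the book):

```latex
% Second-order Taylor expansion of the criterion J(a) about the current weight a_k
% (H is the Hessian of J evaluated at a_k):
\[
  J(a) \;\simeq\; J(a_k) + \nabla J(a_k)^{T}(a - a_k)
        + \tfrac{1}{2}(a - a_k)^{T} H (a - a_k).
\]
% Minimizing the right-hand side over a gives Newton's update; keeping only the
% first-order term with a step size \eta_k gives basic gradient descent:
\[
  a_{k+1} = a_k - H^{-1}\nabla J(a_k),
  \qquad
  a_{k+1} = a_k - \eta_k \nabla J(a_k).
\]
```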
227p
5.5 Minimizing the Perceptron Criterion Function
5.5.1 The Perceptron Criterion Function
http://en.wikipedia.org/wiki/Perceptron
Perceptron criterion function
- never negative, being zero only if a is a solution vector or if a is on the decision boundary
- proportional to the sum of the distances from the misclassified samples to the decision boundary
The Perceptron criterion is piecewise linear and acceptable for gradient descent.
batch training
batch Perceptron algorithm:
The next weight vector is obtained by adding some multiple of the sum of the misclassified samples to the present weight vector.
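A sketch of the batch Perceptron rule, assuming the samples are already augmented and 'normalized' (class-omega_2 samples negated) and linearly separable; the data are invented:

```python
import numpy as np

# J_p(a) = sum over misclassified y of (-a^T y); its gradient is -(sum of those y),
# so each batch step adds a multiple of the sum of the misclassified samples.
def batch_perceptron(Y, eta=1.0, max_epochs=1000):
    a = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        misclassified = Y[Y @ a <= 0]
        if len(misclassified) == 0:
            break                                   # a is a separating vector
        a = a + eta * misclassified.sum(axis=0)
    return a

Y = np.array([[1.0, 2.0, 1.0], [-1.0, 1.0, -0.5]])  # hypothetical normalized samples
print(batch_perceptron(Y))
```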
229p
5.5.2 Convergence Proof for Single-Sample Correction
http://en.wikipedia.org/wiki/Convergence_of_random_variables
We shall modify the weight vector whenever it misclassifies a single sample.
For the purposes of the convergence proof, repeat the samples cyclically.
: We must store and revisit all of the training patterns.
: We shall only change the weight vector when there is an error
Fixed-Increment Single-Sample Perceptron
Each correction moves the weight vector in a good direction; for linearly separable samples the corrections eventually terminate at a solution.
Fig. 5.13
A correction proportional to the pattern vector is added to the weight vector.
Each correction brings the weight vector closer to the solution region.
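A sketch of the fixed-increment single-sample rule under the same assumptions (augmented, 'normalized' samples, cycled repeatedly):

```python
import numpy as np

def fixed_increment_perceptron(Y, max_passes=1000):
    a = np.zeros(Y.shape[1])
    for _ in range(max_passes):
        errors = 0
        for y in Y:              # repeat the samples cyclically
            if a @ y <= 0:       # y is misclassified by the current a
                a = a + y        # correction proportional to the pattern vector
                errors += 1
        if errors == 0:          # a full pass with no error: a is a solution vector
            return a
    return a
```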
Theorem 5.1. Perceptron Convergence
If the training samples are linearly separable, then the sequence of weight vectors given by the Fixed-Increment Single-Sample Perceptron algorithm will terminate at a solution vector.
232p
5.5.3 Some Direct Generalizations
Algorithm 5: Variable-Increment Perceptron with Margin
correction with a variable increment and a margin (see the sketch after this list)
Algorithm 6: Batch Variable Increment Perceptron
the trajectory of the weight vector is smoothed
Algorithm 7: Balanced Winnow
for separable data the gap determined by the two constituent weight vectors can never increase in size -> a convergence proof
http://en.wikipedia.org/wiki/Winnow_(algorithm)
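A sketch of Algorithm 5 (variable-increment Perceptron with margin); the margin value, the eta_k schedule, and the pass limit are illustrative choices, with eta_k only assumed to satisfy the usual convergence conditions on the increments:

```python
import numpy as np

def variable_increment_perceptron(Y, b=0.1, max_passes=1000,
                                  eta=lambda k: 1.0 / (1 + k)):
    a = np.zeros(Y.shape[1])
    k = 0
    for _ in range(max_passes):
        corrected = False
        for y in Y:
            if a @ y <= b:           # the margin b is not met
                a = a + eta(k) * y   # variable increment
                k += 1
                corrected = True
        if not corrected:
            return a
    return a
```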
235p
5.6 Relaxation Procedures
5.6.1 The Descent Algorithm
a squared error function:
- The gradient is continuous (whereas the gradient of the Perceptron criterion function is not).
- searches a smoother surface
- but it is so smooth near the boundary of the solution region that the sequence of weight vectors can converge to a point on the boundary
- in particular, the gradient vanishes at a = 0, a boundary point, so descent can converge to the useless solution a = 0
- dominated by the longest sample vectors
a squared error with margin:
- never negative; zero iff a^T y >= b (the margin) for all of the training samples
Algorithm 8: Batch Relaxation with Margin
Algorithm 9: Single-Sample Relaxation with Margin (see the sketch below)
relaxation
under-relaxation
over-relaxation
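A sketch of Algorithm 9 (single-sample relaxation with margin); the margin and eta values are illustrative, with eta < 1 giving under-relaxation and 1 < eta < 2 over-relaxation:

```python
import numpy as np

def single_sample_relaxation(Y, b=0.5, eta=1.5, max_passes=1000):
    a = np.zeros(Y.shape[1])
    for _ in range(max_passes):
        corrected = False
        for y in Y:
            if a @ y <= b:
                # move a toward the hyperplane a^T y = b, scaled by eta
                a = a + eta * (b - a @ y) / (y @ y) * y
                corrected = True
        if not corrected:
            return a
    return a
```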
237p
5.6.2 Convergence Proof
238p
5.7 Nonseparable Behavior
error-correcting procedures call for a modification of the weight vector when and only when an error is encountered
The lengths of the weight vectors produced by the fixed-increment rule are bounded (fluctuating near some limiting value).
If the components of the samples are integer-valued, the fixed-increment procedure yields a finite-state process.
Averaging the weight vectors can reduce the risk of obtaining a bad solution.
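One way to read the averaging remark, sketched as a running average of the fixed-increment weights (a heuristic illustration, not the book's exact recipe):

```python
import numpy as np

def averaged_perceptron(Y, passes=20):
    a = np.zeros(Y.shape[1])
    a_sum = np.zeros_like(a)
    count = 0
    for _ in range(passes):
        for y in Y:
            if a @ y <= 0:
                a = a + y        # ordinary fixed-increment correction
            a_sum += a
            count += 1
    return a_sum / count         # averaged weight vector, less sensitive to final fluctuations
```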
5.8 Minimum Squared-Error Procedures
5.8.4 The Widrow-Hoff or LMS Procedure
http://en.wikipedia.org/wiki/Least_mean_squares_filter
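A sketch of the Widrow-Hoff (LMS) rule on augmented, 'normalized' samples with a target-margin vector b; the sample data and learning-rate schedule are invented:

```python
import numpy as np

# Sequential descent on the squared error (a^T y_k - b_k)^2.
def widrow_hoff(Y, b, eta0=0.1, passes=100):
    a = np.zeros(Y.shape[1])
    k = 0
    for _ in range(passes):
        for y, b_k in zip(Y, b):
            k += 1
            a = a + (eta0 / k) * (b_k - a @ y) * y   # move a^T y_k toward b_k
    return a

Y = np.array([[1.0, 2.0, 1.0], [-1.0, 1.0, -0.5]])   # hypothetical normalized samples
print(widrow_hoff(Y, b=np.ones(len(Y))))
```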
5.8.5 Stochastic Approximation Methods
http://en.wikipedia.org/wiki/Stochastic_gradient_descent
5.9 The Ho-Kashyap Procedures
5.11 Support Vector Machines
5.12 Multicategory Generalizations
5.12.1 Kesler's construction
to convert many multicategory error-correction procedures to two-category procedures for the purpose of obtaining a convergence proof
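A sketch of the construction itself; the dimensions and the example call are illustrative:

```python
import numpy as np

# A sample y (dimension d_hat) from class i is expanded into c-1 vectors of
# dimension c*d_hat, with y in block i and -y in block j; the multicategory
# condition a_i^T y > a_j^T y (for all j != i) then reads alpha^T eta_ij > 0
# for the stacked weight vector alpha = (a_1, ..., a_c).
def kesler_expand(y, i, c):
    d = len(y)
    expanded = []
    for j in range(c):
        if j == i:
            continue
        eta_ij = np.zeros(c * d)
        eta_ij[i * d:(i + 1) * d] = y
        eta_ij[j * d:(j + 1) * d] = -y
        expanded.append(eta_ij)
    return np.array(expanded)    # feed these to any two-category procedure

print(kesler_expand(np.array([1.0, 2.0]), i=0, c=3))
```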
5.12.2 Convergence of the Fixed-Increment Rule
5.12.3 Generalizations for MSE Procedures