Leeway is... the freedom that someone has to take the action they want to or to change their plans.
maetel

350p
7.1 Introduction

search for finding acceptable model parameters

stochastic methods relied on randomness
(1) Boltzmann learning (from statistical mechanics in physics)
(2) genetic algorithms (from the mathematical theory of evolution in biology)
: preferable in complex problems
: high computational burden

351p
7.2 Stochastic Search

If each pair of magnets has an associated interaction energy,
then optimizing the energy of the full system means finding the configuration of magnet states with the lowest total energy.
As the temperature is lowered slowly, the system becomes increasingly likely to settle into the optimum configuration.

successful in a wide range of energy functions or energy landscapes,
unlikely so in cases such as the "golf course landscape".


352p
7.2.2 The Boltzmann Factor

Boltzmann factor
http://en.wikipedia.org/wiki/Boltzmann_factor

partition function for a normalization constant
http://en.wikipedia.org/wiki/Partition_function_(statistical_mechanics)

http://en.wikipedia.org/wiki/Boltzmann_distribution


the # of configurations = 2^N

Fig.7.1 - A Boltzmann network, indicating the state of each node
: The optimization problem is to find a configuration (i.e., assignment of all s_i) that minimizes the energy.

Fig.7.2
: Simulated annealing method uses randomness, governed by a control parameter or "temperature" T.

The number of states declines exponentially with increasing energy.

Because of the statistical independence of the magnets, for large N the probability of finding the state in energy E also decays exponentially.

The dependence of the probability upon T in the Boltzmann factor:
At high T, the probability is distributed roughly evenly among all configurations, while at low T it is concentrated at the lowest-energy configurations.
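
The temperature dependence can be checked numerically. The following is a minimal sketch, assuming three made-up configuration energies and a helper named `boltzmann_probs` (both are illustrative and not from the book); the Boltzmann constant is folded into T.

```python
import math

def boltzmann_probs(energies, T):
    """Return P(E_i) = exp(-E_i / T) / Z for each energy in the list."""
    weights = [math.exp(-e / T) for e in energies]
    Z = sum(weights)  # partition function, used as the normalization constant
    return [w / Z for w in weights]

energies = [0.0, 1.0, 2.0]                 # illustrative energies (assumption)
hot = boltzmann_probs(energies, T=100.0)   # high temperature
cold = boltzmann_probs(energies, T=0.1)    # low temperature
print(hot)    # roughly even across the three configurations
print(cold)   # almost all probability on the lowest energy
```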

In the case of large N, the number of configurations decays exponentially with the energy of the configuration.


> Simulated Annealing Algorithm
1) initialize the states randomly & select a high initial temperature T
(in the simulation, T is a control parameter that controls the randomness)
2) randomly choose a node i; suppose its current state is s_i = +1
3) calculate the system energy in this configuration
4) recalculate the energy for the candidate new state s_i = -1
5) accept the change of state if the candidate state has a lower energy; if its energy is higher, accept the change with a probability given by the Boltzmann factor
6) poll (select and test) the nodes randomly several times and set their states
7) lower the temperature and repeat the polling
8) simulated annealing terminates when the temperature is very low (near zero)
(if the cooling has been sufficiently slow, the system has a high probability of being in a low-energy state)

The occasional acceptance of a state that is energetically less favorable allows the system to jump out of unacceptable local energy minima.
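
The eight steps can be sketched in Python. This is a rough illustration and not the book's Algorithm 1: it assumes the energy E = -1/2 * sum_ij w_ij s_i s_j, a geometric cooling schedule, and hypothetical names and parameter values (`simulated_annealing`, `cooling`, `polls`, the 4-node chain).

```python
import math
import random

def energy(w, s):
    """E = -1/2 * sum_ij w[i][j] * s[i] * s[j], with each s[i] in {-1, +1}."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def simulated_annealing(w, T0=10.0, cooling=0.9, T_min=0.01, polls=20, seed=0):
    rng = random.Random(seed)
    n = len(w)
    s = [rng.choice([-1, 1]) for _ in range(n)]   # step 1: random states, high T
    T = T0
    while T > T_min:                              # step 8: stop near zero temperature
        for _ in range(polls * n):                # step 6: poll nodes several times
            i = rng.randrange(n)                  # step 2: pick a node at random
            e_old = energy(w, s)                  # step 3: current energy
            s[i] = -s[i]                          # step 4: candidate flipped state
            delta = energy(w, s) - e_old
            # step 5: keep the flip if energy does not rise; otherwise keep it
            # only with probability exp(-delta / T) (the Boltzmann factor)
            if delta > 0 and rng.random() >= math.exp(-delta / T):
                s[i] = -s[i]                      # rejected: undo the flip
        T *= cooling                              # step 7: lower the temperature
    return s

# Illustrative 4-node chain with equal positive couplings between neighbours.
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
s = simulated_annealing(w)
print(s, energy(w, s))
```

Recomputing the full energy at every poll is wasteful but mirrors steps 3 and 4 directly; only the terms involving node i actually change.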

 
Algorithm 1. Stochastic Simulated Annealing





posted by maetel

SNNS (Stuttgart Neural Network Simulator)
is a software simulator for neural networks on Unix workstations developed at the Institute for Parallel and Distributed High Performance Systems (IPVR) at the University of Stuttgart. The goal of the SNNS project is to create an efficient and flexible simulation environment for research on and application of neural nets.

http://en.wikipedia.org/wiki/SNNS

Petron, E. 1999. Stuttgart Neural Network Simulator: Exploring connectionism and machine learning with SNNS. Linux J. 1999, 63es (Jul. 1999), 2.

Neural Network Toolbox™ extends MATLAB®
with tools for designing, implementing, visualizing, and simulating neural networks.



282p

6.1 Introduction

LMS (Least mean squares) algorithms

multilayer neural networks / multilayer Perceptrons
The parameters governing the nonlinear mapping are learned at the same time as those governing the linear discriminant.

http://en.wikipedia.org/wiki/Multilayer_perceptron

http://en.wikipedia.org/wiki/Neural_network

the number of layers of units
(e.g., a "two-layer network" under one counting convention is a "single-layer network" under another)

Multilayer neural networks implement linear discriminants, but in a space where the inputs have been mapped nonlinearly.


backpropagation algorithm / generalized delta rule
: a natural extension of the LMS algorithm
- the intuitive graphical representation & the simplicity of design of models
- the conceptual and algorithmic simplicity

network architecture / topology
- neural net classification
- heuristic model selection (through choices in the number of hidden layers, units, feedback connections, and so on)

regularization
= complexity adjustment



284p

6.2 Feedforward Operation and Classification

 

http://en.wikipedia.org/wiki/Artificial_neural_network

bias unit
the function of units : "neurons"
the input units : the components of a feature vector
signals emitted by output units : the values of the discriminant functions used for classification
hidden units : the weighted sum of its inputs

-> net activation : the inner product of the inputs with the weights at the hidden unit

the input-to-hidden layer weights : "synapses" -> "synaptic weights"

activation function : "nonlinearity" of a unit



"Each hidden unit emits an output that is a nonlinear function of its activation."

"Each output unit computes its net activation based on the hidden unit signals."


http://en.wikipedia.org/wiki/Feedforward_neural_network
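
The feedforward operation described above can be sketched as follows. This is a minimal illustration assuming one hidden layer, a sign nonlinearity, and hand-picked weights that happen to realize XOR on inputs in {-1, +1}; the names `forward`, `sign`, `W_hidden`, and `W_out` are hypothetical.

```python
def sign(net):
    """A simple threshold nonlinearity (assumed for this sketch)."""
    return 1.0 if net >= 0 else -1.0

def forward(x, W_hidden, W_out, f=sign):
    """Three-layer feedforward pass; each weight row begins with the bias weight."""
    x_aug = [1.0] + list(x)                       # bias unit emits a constant 1
    # net activation: inner product of the inputs with the weights at each hidden unit
    net_h = [sum(w * xi for w, xi in zip(row, x_aug)) for row in W_hidden]
    y = [f(n) for n in net_h]                     # hidden outputs: nonlinear function of net
    y_aug = [1.0] + y
    net_k = [sum(w * yi for w, yi in zip(row, y_aug)) for row in W_out]
    return [f(n) for n in net_k]                  # output signals = discriminant values

W_hidden = [[0.5, 1.0, 1.0],     # illustrative weights (assumption)
            [-1.5, 1.0, 1.0]]
W_out = [[-1.0, 0.7, -0.4]]

for x in ([1, 1], [1, -1], [-1, 1], [-1, -1]):
    print(x, forward(x, W_hidden, W_out))
```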


286p
6.2.1 General Feedforward Operation

"Given sufficient number of hidden units of a general type, any function can be so represented." -> expressive power

287p
6.2.2 Expressive Power of Multilayer Networks

"Any desired function can be implemented by a three-layer network."
-> but, the problems of designing and training neural networks


288p
6.3 Backpropagation Algorithm

http://en.wikipedia.org/wiki/Backpropagation

- the problem of setting the weights (based on training patterns and the desired output)
- supervised training of multilayer neural networks
- the natural extension of the LMS algorithm for linear systems
- based on gradient descent

credit assignment problem
: no explicit teacher to state what the hidden unit's output should be

=> The power of backpropagation to calculate an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights

sigmoid


6.3.1 Network Learning

The final signals emitted by the network are used as discriminant functions for classification. During network training, these output signals are compared with a teaching or target vector t, and any difference is used in training the weights throughout the network.

training error -> LMS algorithm
gradient descent -> the weights are changed in a direction that will reduce the error
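
A single backpropagation update might look like the sketch below. It assumes a tanh activation, the squared training error J = 1/2 * ||t - z||^2, and pattern-by-pattern updates; the function name `backprop_step`, the learning rate, and the starting weights are all illustrative choices, not taken from the book.

```python
import math

def backprop_step(x, t, W_h, W_o, eta=0.1):
    """One gradient-descent update in place; returns the error before the update."""
    x_aug = [1.0] + list(x)
    y = [math.tanh(sum(w * xi for w, xi in zip(row, x_aug))) for row in W_h]
    y_aug = [1.0] + y
    z = [math.tanh(sum(w * yi for w, yi in zip(row, y_aug))) for row in W_o]
    J = 0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, z))
    # output units: delta_k = (t_k - z_k) * f'(net_k), with f'(net) = 1 - tanh(net)^2
    delta_o = [(tk - zk) * (1 - zk ** 2) for tk, zk in zip(t, z)]
    # hidden units: effective error propagated back through the hidden-to-output weights
    delta_h = [(1 - y[j] ** 2) * sum(delta_o[k] * W_o[k][j + 1] for k in range(len(W_o)))
               for j in range(len(y))]
    for k, row in enumerate(W_o):                 # hidden-to-output weights
        for j in range(len(row)):
            row[j] += eta * delta_o[k] * y_aug[j]
    for j, row in enumerate(W_h):                 # input-to-hidden weights
        for i in range(len(row)):
            row[i] += eta * delta_h[j] * x_aug[i]
    return J

W_h = [[0.1, 0.2, -0.1], [-0.2, 0.1, 0.3]]        # illustrative starting weights
W_o = [[0.05, -0.1, 0.2]]
errors = [backprop_step([1.0, -1.0], [0.5], W_h, W_o) for _ in range(200)]
print(errors[0], errors[-1])                      # the error should shrink
```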


6.3.2 Training Protocol


6.3.3 Learning Curves

6.4 Error Surfaces




6.8.1 Activation Function

posted by maetel
215p
5.1 Introduction

http://en.wikipedia.org/wiki/Statistical_classification

http://en.wikipedia.org/wiki/Linear_classifier

http://en.wikipedia.org/wiki/Linear_discriminant_analysis


linear discriminant functions
1) linear in the components of a feature vector
2) linear in some given set of functions of a feature vector

finding a linear discriminant function
-> minimizing a criterion function - sample risk / training error
: "A small training error does not guarantee a small test error."
=> to derive the minimum-risk linear discriminant

-> "convergence properties & computational complexities" of gradient descent procedures for minimizing criterion functions


216p
5.2 Linear Discriminant Functions and Decision Surfaces

weight vector
bias / threshold weight

5.2.1 The Two-Category Case

"The effective input is the sum of each input feature value multiplied by its corresponding weight."

http://en.wikipedia.org/wiki/Hyperplane

"The weight vector is normal to any vector lying in the hyperplane."

A linear discriminant function divides the feature space by a hyperplane decision surface.

The orientation of the surface is determined by the normal vector w.
The location of the surface is determined by the bias w_0.

The discriminant function g(x) is proportional to the signed distance from x to the hyperplane.
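
As a small sketch of this relationship, with g(x) = w.x + w_0 the signed distance from x to the hyperplane is r = g(x) / ||w||. The weights and points below are made up for illustration, and the function names are hypothetical.

```python
import math

def g(x, w, w0):
    """Linear discriminant g(x) = w . x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def signed_distance(x, w, w0):
    """r = g(x) / ||w||: positive on the side that w points to."""
    return g(x, w, w0) / math.sqrt(sum(wi * wi for wi in w))

w, w0 = [3.0, 4.0], -5.0                      # illustrative values (assumption)
print(signed_distance([3.0, 4.0], w, w0))     # point on the positive side
print(signed_distance([0.0, 0.0], w, w0))     # origin: distance is w0 / ||w||
```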


218p
5.2.2 The Multicategory Case

A linear machine divides the feature space into c decision regions.

If the decision regions are contiguous, the boundary between them is a portion of the hyperplane.

With the linear machine, the decision boundary is determined by the differences of the weight vectors.
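
A linear machine can be sketched in a few lines: compute c linear discriminants g_i(x) = w_i.x + w_i0 and pick the largest. The three weight vectors below are arbitrary illustrative values, and `linear_machine` is a hypothetical name.

```python
def linear_machine(x, W, w0):
    """Return the index i that maximizes g_i(x) = W[i] . x + w0[i]."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in zip(W, w0)]
    return max(range(len(scores)), key=scores.__getitem__)

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]    # one weight vector per category
w0 = [0.0, 0.0, 0.0]
print(linear_machine([2.0, 1.0], W, w0))
print(linear_machine([0.0, 3.0], W, w0))
print(linear_machine([-2.0, -2.0], W, w0))
```

The boundary between regions i and j lies where g_i(x) = g_j(x), which depends only on the differences w_i - w_j and w_i0 - w_j0.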

There are c(c-1)/2 pairs of regions.


The linear machines are most suitable for problems for which the conditional densities are unimodal.


219p
5.3 Generalized Linear Discriminant Functions
 

http://en.wikipedia.org/wiki/Scaling_(geometry)
if the scaled matrix is
i) the positive multiple of the identity matrix
    => hypersphere
ii) positive definite
    => hyperellipsoid
iii) some of eigenvalues of it are positive and others negative
    => various hyperhyperboloids
==> general multivariate Gaussian case

 
polynomial discriminant function
-> generalized linear discriminant function
: not linear in the feature vector x but linear in arbitrary functions y (phi function) of x

The mapping from x-space to y-space reduces the problem to finding a homogeneous linear discriminant function.

Even with relatively simple functions y(x), decision surfaces induced in an x-space can be fairly complex.
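
A one-dimensional sketch of this idea, assuming the mapping y(x) = (1, x, x^2) and an illustrative weight vector a = (-1, 1, 2): the discriminant g(x) = a.y is linear in y, yet the positive region in x-space splits into two disconnected pieces. The function names are hypothetical.

```python
def phi(x):
    """Map a scalar x to y = (1, x, x^2)."""
    return [1.0, x, x * x]

def g(x, a):
    """Generalized linear discriminant: linear in y = phi(x), quadratic in x."""
    return sum(ai * yi for ai, yi in zip(a, phi(x)))

a = [-1.0, 1.0, 2.0]   # g(x) = 2x^2 + x - 1 = (2x - 1)(x + 1)
for x in (-2.0, 0.0, 1.0):
    print(x, g(x, a))  # positive for x < -1 or x > 0.5, negative in between
```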

the curse of dimensionality
-> the number of samples should be not less than the number of degrees of freedom. (-> ch.9)

http://en.wikipedia.org/wiki/Big_O_notation


http://en.wikipedia.org/wiki/Margin_(machine_learning)
The margin of a single data point is defined to be the distance from the data point to a decision boundary.
cf. VC dimension Tutorial Slides by Andrew Moore


http://en.wikipedia.org/wiki/Artificial_neural_network

http://en.wikipedia.org/wiki/Multilayer_perceptron


augmented feature vector, y
augmented weight vector, a

The hyperplane decision surface in y-space passes through the origin in y-space.

The mapping from d-dim x-space to (d+1)-dim y-space preserves all distance relationships among samples.
-> The distance from y to the transformed hyperplane is less than or equal to the distance from x to the original hyperplane.



223p
5.4 The Two-Category Linearly Separable Case


separating vector / solution vector (to normalize samples in the feature space)

http://en.wikipedia.org/wiki/Weight_(representation_theory)

5.4.1 Geometry and Terminology


The linear discriminant equation defines a hyperplane through the origin of weight space having y_i as a normal vector.

The solution vector must be on the positive side of every hyperplane and lie in the intersection of n half-spaces. -> solution region

-> To find a solution vector closer to the "middle" of the solution region
-> margin


224p
5.4.2 Gradient Descent Procedures

To define a criterion function J(a) that is minimized if a is a solution vector

(1) BGD = Basic Gradient Descent
The next weight vector is obtained by moving some distance from the former one in the direction of steepest descent, along the negative of the gradient of a criterion function.

(2) Newton's algorithm
- minimizes the second-order expansion of the criterion function (exact for quadratic error functions)
- leads to greater improvement per step
- computational burden of inverting the Hessian matrix

learning rate (to set the step size)
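
Basic gradient descent, a(k+1) = a(k) - eta * grad J(a(k)), can be sketched as below. The quadratic criterion, the fixed learning rate, and the stopping threshold `theta` are assumptions chosen for illustration.

```python
def gradient_descent(grad, a, eta=0.1, theta=1e-8, max_iter=10000):
    """Repeat a <- a - eta * grad(a) until the step length falls below theta."""
    for _ in range(max_iter):
        step = [eta * gi for gi in grad(a)]
        a = [ai - si for ai, si in zip(a, step)]
        if sum(si * si for si in step) ** 0.5 < theta:
            break
    return a

# Illustrative criterion J(a) = (a0 - 1)^2 + 2 * (a1 + 3)^2, minimized at (1, -3).
grad_J = lambda a: [2.0 * (a[0] - 1.0), 4.0 * (a[1] + 3.0)]
a_min = gradient_descent(grad_J, [0.0, 0.0])
print(a_min)
```

A learning rate that is too large makes the iteration overshoot or diverge; one that is too small makes convergence needlessly slow.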

http://en.wikipedia.org/wiki/Gradient_descent
The (negative) gradient at a point is orthogonal to the contour line going through that point.

http://en.wikipedia.org/wiki/Hessian_matrix

http://en.wikipedia.org/wiki/Taylor_series
The Taylor series is a representation of a function as an infinite sum of terms calculated from the values of its derivatives at a single point.
(As the degree of the Taylor polynomial rises, it approaches the correct function.)

http://en.wikipedia.org/wiki/Newton%27s_method

http://en.wikipedia.org/wiki/Quasi-Newton_method



227p
5.5 Minimizing the Perceptron Criterion Function

5.5.1 The Perceptron Criterion Function

http://en.wikipedia.org/wiki/Perceptron

Perceptron criterion function
- never negative, being zero only if a is a solution vector or if a is on the decision boundary.
- proportional to the sum of the distances from the misclassified samples to the decision boundary

Perceptron criterion is piecewise linear and acceptable for gradient descent.


batch training
batch Perceptron algorithm:
The next weight vector is obtained by adding some multiple of the sum of the misclassified samples to the present weight vector.
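
The batch update can be sketched as follows, assuming "normalized" augmented samples (class-2 samples negated, so a solution satisfies a.y > 0 for every y), a fixed learning rate, and a tiny made-up data set. The name `batch_perceptron` is illustrative.

```python
def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Add eta times the sum of the misclassified samples until none remain."""
    a = [0.0] * len(Y[0])
    for _ in range(max_iter):
        mis = [y for y in Y if sum(ai * yi for ai, yi in zip(a, y)) <= 0]
        if not mis:
            break                                  # a is a solution vector
        a = [ai + eta * sum(y[i] for y in mis) for i, ai in enumerate(a)]
    return a

# Class 1: (2,2), (3,1); class 2: (-1,-1), (-2,0). Augmented with a leading 1,
# then the class-2 samples are negated.
Y = [[1, 2, 2], [1, 3, 1], [-1, 1, 1], [-1, 2, 0]]
a = batch_perceptron(Y)
print(a)
```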


229p
5.5.2 Convergence Proof for Single-Sample Correction

http://en.wikipedia.org/wiki/Convergence_of_random_variables

We shall modify the weight vector whenever it misclassifies a single sample.
For the purposes of the convergence proof, repeat the samples cyclically.
: We must store and revisit all of the training patterns.
: We shall only change the weight vector when there is an error

Fixed-Increment Single-Sample Perceptron
Each correction moves the weight vector in a good direction; if the samples are linearly separable, the corrections eventually stop.

Fig.5.13
A correction which is proportional to the pattern vector is added to the weight vector.
Each correction brings the weight vector closer to the solution region.

Theorem 5.1. Perceptron Convergence
If training samples are linearly separable, then the sequence of weight vectors given by Fixed-Increment Single-Sample Perceptron algorithm will terminate at a solution vector.
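
A sketch of the fixed-increment single-sample rule, under the same assumptions as before: normalized augmented samples, a zero starting vector, and a made-up separable data set. The epoch cap is a safeguard added here, since the procedure does not terminate on nonseparable data.

```python
def fixed_increment_perceptron(Y, max_epochs=1000):
    """Cycle through the samples; add y to a whenever a . y <= 0."""
    a = [0.0] * len(Y[0])
    for _ in range(max_epochs):
        corrections = 0
        for y in Y:                                # samples are revisited cyclically
            if sum(ai * yi for ai, yi in zip(a, y)) <= 0:
                a = [ai + yi for ai, yi in zip(a, y)]   # change a only on an error
                corrections += 1
        if corrections == 0:
            break                                  # a full pass with no errors
    return a

Y = [[1, 2, 2], [1, 3, 1], [-1, 1, 1], [-1, 2, 0]]   # illustrative normalized samples
a = fixed_increment_perceptron(Y)
print(a)
```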


232p
5.5.3 Some Direct Generalizations

Algorithm 5: Variable-Increment Perceptron with Margin
correction with a variable increment and a margin

Algorithm 6: Batch Variable Increment Perceptron
the trajectory of the weight vector is smoothed

Algorithm 7: Balanced Winnow
for separable data the gap determined by the two constituent weight vectors can never increase in size. -> a convergence proof
http://en.wikipedia.org/wiki/Winnow_(algorithm)


235p
5.6 Relaxation Procedures

5.6.1 The Descent Algorithm

a squared error function:
- The gradient is continuous (whereas the gradient of the Perceptron criterion function is not).
- to search a smoother surface
- so smooth near the boundary of the solution region that the sequence of weight vectors can converge to a point on the boundary
- in particular, gradient descent may merely reach the boundary point where the weight vector is the zero vector
- dominated by the longest sample vectors

a squared error with margin:
- never negative (zero iff a^t y >= b for every sample y, i.e., no sample falls short of the margin b)

Algorithm 8: Batch Relaxation with Margin

Algorithm 9: Single-Sample Relaxation with Margin

relaxation
under-relaxation
over-relaxation


237p
5.6.2 Convergence Proof


238p
5.7 Nonseparable Behavior


Error-correcting procedures call for a modification of the weight vector when and only when an error is encountered.

The lengths of the weight vectors produced by the fixed-increment rule are bounded (fluctuating near some limiting value).

If the components of the samples are integer-valued, the fixed-increment procedure yields a finite-state process.

Averaging the weight vectors can reduce the risk of obtaining a bad solution.



5.8 Minimum Squared-error Procedures


5.8.4 The Widrow-Hoff or LMS Procedure

http://en.wikipedia.org/wiki/Least_mean_squares_filter


5.8.5 Stochastic Approximation Methods

http://en.wikipedia.org/wiki/Stochastic_gradient_descent


5.9 The Ho-Kashyap Procedures


5.11 Support Vector Machines


5.12 Multicategory Generalizations

5.12.1 Kesler's construction

to convert many multicategory error-correction procedures to two-category procedures for the purpose of obtaining a convergence proof

5.12.2 Convergence of the Fixed-Increment Rule

5.12.3 Generalizations for MSE Procedures


posted by maetel

161p

4.1 Introduction

nonparametric procedures (with arbitrary distribution and without the assumption that the forms of the underlying densities are known)

1) estimating the density functions from sample patterns -> designing the classifier
2) directly estimating the a posteriori probability -> the nearest-neighbor rule -> decision functions


4.2 Density Estimation



http://en.wikipedia.org/wiki/Density_estimation
the construction of an estimate, based on observed data, of an unobservable underlying probability density function. The unobservable density function is thought of as the density according to which a large population is distributed; the data are usually thought of as a random sample from that population


i.i.d. = Independent and identically-distributed random variables
http://en.wikipedia.org/wiki/Iid
In probability theory, a sequence or other collection of random variables is independent and identically distributed (i.i.d.) if each has the same probability distribution as the others and all are mutually independent.


http://en.wikipedia.org/wiki/Binomial_coefficient
\tbinom nk is the number of k-element subsets (the k-combinations) of an n-element set; that is, the number of ways that k things can be 'chosen' from a set of n things.


http://en.wikipedia.org/wiki/Probability_density_function
a function that represents a probability distribution in terms of integrals

kernel density estimation; Parzen window method
http://en.wikipedia.org/wiki/Parzen_window
a non-parametric way of estimating the probability density function of a random variable.
Given some data about a sample of a population, kernel density estimation makes it possible to extrapolate the data to the entire population.

http://en.wikipedia.org/wiki/Kernel_(statistics)
A kernel is a weighting function used in non-parametric estimation techniques. Kernels are used in kernel density estimation to estimate random variables' density functions, or in kernel regression to estimate the conditional expectation of a random variable.
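
A one-dimensional Parzen-window estimate can be sketched as below. The Gaussian kernel, the window width h, and the sample values are assumptions for illustration; `parzen_estimate` is a hypothetical name.

```python
import math

def parzen_estimate(x, samples, h):
    """p_n(x) = (1 / (n * h)) * sum_i K((x - x_i) / h) with a Gaussian kernel K."""
    kernel = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return sum(kernel((x - xi) / h) for xi in samples) / (len(samples) * h)

samples = [0.8, 1.0, 1.2, 3.0]        # illustrative data (assumption)
h = 0.5                               # window width (assumption)
print(parzen_estimate(1.0, samples, h))   # near the cluster of samples
print(parzen_estimate(5.0, samples, h))   # far from every sample

# The estimate is itself a density: a Riemann sum over a wide range is close to 1.
total = sum(parzen_estimate(-5.0 + 0.01 * k, samples, h) * 0.01 for k in range(1400))
print(total)
```

A smaller h gives a spikier estimate that follows individual samples; a larger h smooths them together.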

http://en.wikipedia.org/wiki/Hypercube

http://en.wikipedia.org/wiki/Neural_networks

posted by maetel



3.2 Maximum-Likelihood Estimation

maximum-likelihood

http://en.wikipedia.org/wiki/Maximum_likelihood

a popular statistical method used for fitting a mathematical model to some data. The modeling of real world data using estimation by maximum likelihood offers a way of tuning the free parameters of the model to provide a good fit.
The method was pioneered by geneticist and statistician Sir R. A. Fisher between 1912 and 1922.

For a fixed set of data and underlying probability model, maximum likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them.
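
As a concrete sketch, assume the underlying model is a univariate Gaussian (an assumption made here for illustration). The maximum-likelihood estimates are then the sample mean and the sample variance with divisor n, and any other parameter values give a lower log-likelihood on the same data.

```python
import math

def gaussian_mle(samples):
    """Maximum-likelihood estimates (mu, var) for a univariate Gaussian."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n   # divides by n, not n - 1
    return mu, var

def log_likelihood(samples, mu, var):
    """Log of the product of Gaussian densities over all samples."""
    return sum(-0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)
               for x in samples)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]     # illustrative data
mu, var = gaussian_mle(data)
print(mu, var)
print(log_likelihood(data, mu, var), log_likelihood(data, mu + 0.5, var))
```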

 

http://www.aistudy.com/math/likelihood.htm
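The likelihood of a hypothesis H is, given the outcome (evidence) E of some trial, the degree to which that outcome E would be expected if the hypothesis H were true. In other words, once the outcome E has been observed, the likelihood is a measure for evaluating the various hypotheses that could have produced it.

In Bayes' theorem, which is commonly used to evaluate uncertainty in expert systems, new evidence is combined with a prior probability to obtain a posterior probability. It is hard to avoid arbitrariness when assigning the prior probability, but using the likelihood makes it possible to escape that arbitrariness and compute the prior probability much more easily (전영삼 1993).

If the likelihood of a hypothesis is interpreted as the degree to which the given data support that hypothesis, it is natural to prefer, among several hypotheses, the one whose likelihood is greatest. That is, if the hypothesis concerns a population parameter, the corresponding estimate can be preferred as the most appropriate estimate for that population. Fisher called this the "Principle of Maximum Likelihood", and the method of obtaining the most appropriate estimate of a parameter according to this principle is called the "Method of Maximum Likelihood" (전영삼 1990).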



likelihood function

http://en.wikipedia.org/wiki/Likelihood_function

Informally, if "probability" allows us to predict unknown outcomes based on known parameters, then "likelihood" allows us to estimate unknown parameters based on known outcomes.
In a sense, likelihood works backwards from probability: given parameter B, we use the conditional probability P(A|B) to reason about outcome A, and given outcome A, we use the likelihood function L(B|A) to reason about parameter B. This mode of reasoning is formalized in Bayes' theorem:


probability density function
http://en.wikipedia.org/wiki/Probability_density_function
a function that represents a probability distribution in terms of integrals.


maximum a posteriori (MAP, posterior mode)

http://en.wikipedia.org/wiki/Maximum_a_posteriori
The method to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to Fisher's method of maximum likelihood (ML), but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation.



covariance matrix

http://en.wikipedia.org/wiki/Covariance_matrix

http://mathworld.wolfram.com/Covariance.html
Covariance provides a measure of the strength of the correlation between two or more sets of random variates.


http://en.wikipedia.org/wiki/Estimation_of_covariance_matrices



3.3 Bayesian Estimation


Bayesian Estimator

http://en.wikipedia.org/wiki/Bayesian_estimation
a Bayes estimator is an estimator or decision rule that maximizes the posterior expected value of a utility function or minimizes the posterior expected value of a loss function (also called posterior expected loss).

i) Parameter vector is considered to be a random variable.
ii) Training data allow us to convert a distribution on this variable into a posterior probability density.



Monte-Carlo simulation
http://en.wikipedia.org/wiki/Monte_Carlo_method#Monte_Carlo_Simulation_versus_.E2.80.9CWhat_If.E2.80.9D_Scenarios


Dirac delta function
http://en.wikipedia.org/wiki/Dirac_delta_function


expectation-maximization (EM)


http://en.wikipedia.org/wiki/Expectation-maximization_algorithm




3.10 Hidden Markov Model



http://en.wikipedia.org/wiki/Hidden_Markov_model


posted by maetel
20p
2.1 Introduction

state of nature

prior (probability)

http://en.wikipedia.org/wiki/Prior_probability
a marginal probability, interpreted as a description of what is known about a variable in the absence of some evidence
(The posterior probability is then the conditional probability of the variable taking the evidence into account. The posterior probability is computed from the prior and the likelihood function via Bayes' theorem.)

decision rule

probability mass function = pmf

http://en.wikipedia.org/wiki/Probability_mass_function
a function that gives the probability that a discrete random variable is exactly equal to some value
(A pmf differs from a probability density function (abbreviated pdf) in that the values of a pdf, defined only for continuous random variables, are not probabilities as such. Instead, the integral of a pdf over a range of possible values (a, b] gives the probability of the random variable falling within that range.)

probability density function = pdf

http://en.wikipedia.org/wiki/Probability_density_function
a function that represents a probability distribution in terms of integrals

class-conditional probability density function = state-conditional probability density
: the probability density function for x given that a state of nature is w

http://en.wikipedia.org/wiki/Conditional_probability
the probability of some event A, given the occurrence of some other event B
(Conditional probability is written P(A|B), and is read "the probability of A, given B".)


Bayes formula:
posterior = likelihood * prior / evidence

P(w_j) -- (x) --> P(w_j|x)
: By observing the value of x, we can convert the prior probability P(w_j) to the a posteriori probability (or posterior) P(w_j|x), which measures the probability of the state of nature being w_j given that feature value x has been observed

likelihood

evidence
: scale factor (to guarantee the posterior probabilities sum to one)
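
A small numeric sketch of the Bayes formula for two states of nature. The priors and class-conditional values are made-up numbers, and `posteriors` is a hypothetical helper name.

```python
def posteriors(likelihoods, priors):
    """P(w_j|x) = p(x|w_j) * P(w_j) / p(x), where p(x) = sum_j p(x|w_j) * P(w_j)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)                 # scale factor: makes the posteriors sum to one
    return [j / evidence for j in joint]

priors = [2.0 / 3.0, 1.0 / 3.0]           # P(w_1), P(w_2): illustrative values
likelihoods = [0.2, 0.6]                  # p(x|w_1), p(x|w_2) at the observed x
post = posteriors(likelihoods, priors)
print(post)                               # w_2 becomes more probable despite its smaller prior
```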

http://en.wikipedia.org/wiki/Bayes%27_Theorem

http://www.aistudy.com/pattern/parametric_gose.htm#_bookmark_3c54af0


Bayesian decision rule (for minimizing the probability of error)



24p
2.2 Bayesian Decision Theory - Continuous Features

feature vector

feature space

http://en.wikipedia.org/wiki/Feature_space
an abstract space where each pattern sample is represented as a point in n-dimensional space
(Its dimension is determined by the number of features used to describe the patterns. Similar samples are grouped together, which allows the use of density estimation for finding patterns.)

loss function (for an action)
cost function (for classification mistakes)

a probability determination -- loss function --> decision

risk: an expected loss

conditional risk

decision function

The decision rule specifies the action.

Bayes decision procedure -> optimal performance
Bayes decision rule:
to minimize the overall risk, select the action for
the minimum conditional risk = R* : Bayes risk
-> the best performance


25p
2.2.1 Two-Category Classification

The loss incurred for making an error is greater than the loss incurred for being correct.


likelihood ratio

The Bayes decision rule can be interpreted as calling for deciding w_1 if the likelihood ratio exceeds a threshold value that is independent of the observation x.
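
Written out, the rule takes the following form. The notation is an assumption, since the notes do not define it: \lambda_{ij} denotes the loss incurred for taking action \alpha_i when the state of nature is \omega_j, and \lambda_{21} > \lambda_{11} is assumed.

```latex
\text{decide } \omega_1 \text{ if} \quad
\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)}
\;>\;
\frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}}
\cdot \frac{P(\omega_2)}{P(\omega_1)}
```

The right-hand side contains only losses and priors, so it does not depend on x. Under a zero-one loss the threshold reduces to P(\omega_2) / P(\omega_1).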


26p
2.3 Minimum-error-rate Classification

to seek a decision rule that minimizes the probability of error, the error rate

symmetrical / zero-one loss function

for minimum error rate,
decide w_i if P(w_i|x) > P(w_j|x) for all j ≠ i

2.3.1 Minimax Criterion
2.3.2 Neyman-Pearson Criterion


29p
2.4 Classifiers, Discriminant Functions, and Decision Surfaces

2.4.1 The Multicategory Case

Fig. 2.5 The functional structure of a general statistical pattern classifier
input x -> discriminant functions g(x) + costs -> action (classification)

classifier
: a network or machine that computes c discriminant functions and selects the category corresponding to the largest discriminant

Bayes classifier
i) the maximum discriminant fn. <=> the minimum conditional risk
ii) for the minimum-error-rate,
the maximum discriminant fn. <=> the maximum posterior probability
iii) replacing the discriminant fn. by a monotonically increasing fn.

(28)

Decision rule divides the feature space into c decision regions which are separated by decision boundaries, surfaces in feature space where ties occur among the largest discriminant functions.


2.4.2 The Two-Category Case

dichotomizer


31p
2.5 Normal Density

the multivariate normal / Gaussian density

expected value

2.5.1 Univariate Density

expected value of x (: an average over the feature space)
(35)

expected squared deviation = variance
(36)


The entropy measures the fundamental uncertainty in the values of points selected randomly from a distribution.

The normal distribution has the maximum entropy of all distributions having a given mean and variance.

http://en.wikipedia.org/wiki/Central_limit_theorem
The central limit theorem (CLT) states that the sum of a sufficiently large number of identically distributed independent random variables each with finite mean and variance will be approximately normally distributed (Rice 1995). Formally, a central limit theorem is any of a set of weak-convergence results in probability theory. They all express the fact that any sum of many independent identically distributed random variables will tend to be distributed according to a particular "attractor distribution".

33p
2.5.2 Multivariate Density

covariance matrix
The covariance matrix allows us to calculate the dispersion of the data in any direction, or in any subspace.

http://en.wikipedia.org/wiki/Covariance_matrix

http://en.wikipedia.org/wiki/Covariance
covariance is a measure of how much two variables change together (the variance is a special case of the covariance when the two variables are identical).
If two variables tend to vary together (that is, when one of them is above its expected value, then the other variable tends to be above its expected value too), then the covariance between the two variables will be positive. On the other hand, when one of them is above its expected value the other variable tends to be below its expected value, then the covariance between the two variables will be negative.
 
the center of the cluster - the mean vector
the shape of the cluster - the covariance matrix


Whitening Transform
making the spectrum of eigenvalues of the transformed distribution uniform

http://en.wikipedia.org/wiki/Whitening_transform
The whitening transformation is a decorrelation method that converts the covariance matrix S of a set of samples into the identity matrix I. This effectively creates new random variables that are uncorrelated and have unit variance. The method is called the whitening transform because it transforms the input matrix closer towards white noise.
This can be expressed as  A_w = \Phi \Lambda^{-\frac{1}{2}}
where Φ is the matrix with the eigenvectors of "S" as its columns and Λ is the diagonal matrix of non-increasing eigenvalues.
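
A two-dimensional sketch of A_w = Phi * Lambda^(-1/2), using the closed-form eigendecomposition of a symmetric 2x2 matrix. It assumes S is positive definite with a nonzero off-diagonal entry; the function names and the example matrix are illustrative.

```python
import math

def whitening_2x2(S):
    """Return A_w = Phi * Lambda^(-1/2) for symmetric positive definite S (S[0][1] != 0)."""
    a, b, c = S[0][0], S[0][1], S[1][1]
    mid, rad = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    cols = []
    for lam in (mid + rad, mid - rad):            # eigenvalues, non-increasing
        vx, vy = b, lam - a                       # eigenvector for lam (valid when b != 0)
        scale = 1.0 / (math.hypot(vx, vy) * math.sqrt(lam))
        cols.append((vx * scale, vy * scale))     # unit eigenvector divided by sqrt(lam)
    return [[cols[0][0], cols[1][0]],
            [cols[0][1], cols[1][1]]]

def transformed_covariance(A, S):
    """Compute A^T S A, the covariance after the transform y = A^T x."""
    SA = [[sum(S[i][k] * A[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
    return [[sum(A[k][i] * SA[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

S = [[2.0, 1.0], [1.0, 2.0]]                      # illustrative covariance matrix
A_w = whitening_2x2(S)
I_approx = transformed_covariance(A_w, S)
print(I_approx)                                   # close to the identity matrix
```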


hyperellipsoids
- principal axes

- Mahalanobis distance
http://en.wikipedia.org/wiki/Mahalanobis_distance

- volume
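
The squared Mahalanobis distance r^2 = (x - mu)^t S^(-1) (x - mu) can be sketched for the 2x2 case with an explicit matrix inverse. The function name and the numbers are illustrative assumptions.

```python
def mahalanobis_sq_2d(x, mu, S):
    """Squared Mahalanobis distance from x to mu under a 2x2 covariance matrix S."""
    a, b, c, d = S[0][0], S[0][1], S[1][0], S[1][1]
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # inverse of S is (1 / det) * [[d, -b], [-c, a]]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

S = [[4.0, 0.0], [0.0, 1.0]]              # variance 4 along x1, variance 1 along x2
print(mahalanobis_sq_2d([2.0, 1.0], [0.0, 0.0], S))
print(mahalanobis_sq_2d([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

Points with the same r lie on a hyperellipsoid of constant density; with an identity covariance the distance reduces to the Euclidean distance.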


36p
2.6 Discriminant Functions for the Normal Density

2.6.1 case 1: covariance matrix = a constant times the identity matrix

equal-size hyperspherical clusters

linear machine

The hyperplane is the perpendicular bisector of the line between the means.

minimum-distance classifier

template-matching -> the nearest-neighbor algorithm

2.6.2 case 2: covariance matrices = identical but arbitrary



2.6.3 case 3: covariance matrix = arbitrary



2.9 Bayes Decision Theory - Discrete Features

posted by maetel
Pattern Recognition
; the act of taking in raw data and making an action based on the "category" of the pattern
- evolving highly sophisticated neural and cognitive systems


Machine Perception
http://en.wikipedia.org/wiki/Machine_perception
: the ability of computing machines to sense and interpret images, sounds, or other contents of their environments, or of the contents of stored media
http://en.wikipedia.org/wiki/Machine_vision
machine vision most often requires also digital input/output devices and computer networks to control other manufacturing equipment such as robotic arms. Machine Vision is a subfield of engineering that encompasses computer science, optics, mechanical engineering, and industrial automation.

machine vision systems use digital cameras, smart cameras and image processing software to perform similar inspections.

Machine vision systems are programmed to perform narrowly defined tasks such as counting objects on a conveyor, reading serial numbers, and searching for surface defects.


http://en.wikipedia.org/wiki/Speech_recognition



Pattern Recognition

http://en.wikipedia.org/wiki/Pattern_recognition
Pattern recognition is a sub-topic of machine learning. It can be defined as

"the act of taking in raw data and taking an action based on the category of the data".

Most research in pattern recognition is about methods for supervised learning and unsupervised learning.

Pattern recognition aims to classify data (patterns) based on either a priori knowledge or on statistical information extracted from the patterns. The patterns to be classified are usually groups of measurements or observations, defining points in an appropriate multidimensional space. This is in contrast to pattern matching, where the pattern is rigidly specified.



optical character recognition

http://en.wikipedia.org/wiki/Optical_character_recognition
the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text



pilot (adj) experimental


models - descriptions - mathematical in form (*a mathematical description of the differences between the classes of objects to be classified)


pattern classification

1) to hypothesize the class of the models

2) to process the sensed data to eliminate noise (not due to the models)

3) to choose the model that corresponds best for any sensed pattern


preprocessing - to simplify subsequent operations without losing relevant information

feature extraction - to reduce the data by measuring certain "features" or "properties"

classification - to evaluate the evidence presented and make a final decision


http://en.wikipedia.org/wiki/Information_flow

 

training sample (*a subset of the full input data, selected and used to obtain measurements of a variable assumed to be a feature (e.g., length), in order to set a threshold for that variable)


http://en.wikipedia.org/wiki/Data


cost




posted by maetel