블로그 이미지
Leeway is... the freedom that someone has to take the action they want to or to change their plans.
maetel

Notice

Recent Post

Recent Comment

Recent Trackback

Archive

calendar

1 2 3 4
5 6 7 8 9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29 30 31
  • total
  • today
  • yesterday

Category

'pattern recognistion'에 해당되는 글 2건

  1. 2008.10.08 ch.3 Maximum-Likelihood and Bayesian Parameter Estimation
  2. 2008.09.25 ch.2 Bayesian Decision Theory



3.2 Maximum-Likelihood Estimation

maximum-likelihood

http://en.wikipedia.org/wiki/Maximum_likelihood

a popular statistical method used for fitting a mathematical model to some data. The modeling of real world data using estimation by maximum likelihood offers a way of tuning the free parameters of the model to provide a good fit.
The method was pioneered by geneticist and statistician Sir R. A. Fisher between 1912 and 1922.

For a fixed set of data and underlying probability model, maximum likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them.

 

http://www.aistudy.com/math/likelihood.htm
어떤 가설 (hypothesis) H 에 대한 우도 (尤度, likelihood) 란, 어떤 시행의 결과 (Evidence) E 가 주어졌다 할 때, 만일 주어진 가설 H 가 참이라면, 그러한 결과 E 가 나올 정도는 얼마나 되겠느냐 하는 것이다. 즉  결과 E 가 나온 경우, 그러한 결과가 나올 수 있는 여러 가능한 가설들을 평가할 수 있는 측도가 곧 우도인 셈이다.

전문가시스템의 불확실성 (Uncertainty) 을 평가하기 위해 흔히 사용하는 베이즈 정리 (Bayes' Theorem) 에서는 사전확률에 새로운 증거를 대입하여 사후확률을 얻게 되는데, 사전확률을 부여함에 있어 자의성을 배제하기 어렵지만, 우도를 사용하여 그 자의성을 벗어나 훨씬 용이하게 사전확률을 계산해 내는 것이 가능하다 (전영삼 1993).

만일 어떤 가설에 대한 우도를 주어진 데이터가 그 가설을지지하는 정도로 해석을 한다 하면, 여러 가설 중 그 우도가 최대가 되는 가설을 선호함은 자연스러운 일이다. 즉 만일 그 가설이 어떤 모집단의 모수 (population parameter) 에 관한 가설이라고 하면, 바로 그 추정치를 해당 모집단에 관한 가장 적절한 추정치로서 선호할 수 있다는 것이다. 피셔에 있어 이와같은 원리를 이른 바 "최대우도의 원리 (Principle of Maximum Likelihood)" 라 부르며, 이와같은 원리에 따라 어떤 모수에 관한 가장 적절한 추정치 (Estimate) 를 구하는 방법을 이른 바 "최대우도의 방법 (Method of Maximum Likelihood) 이라 부른다 (전영삼 1990).



likelihood function

http://en.wikipedia.org/wiki/Likelihood_function

Informally, if "probability" allows us to predict unknown outcomes based on known parameters, then "likelihood" allows us to estimate unknown parameters based on known outcomes.
In a sense, likelihood works backwards from probability: given parameter B, we use the conditional probability P(A|B) to reason about outcome A, and given outcome A, we use the likelihood function L(B|A) to reason about parameter B. This mode of reasoning is formalized in Bayes' theorem:


probability density function
http://en.wikipedia.org/wiki/Probability_density_function
a function that represents a probability distribution in terms of integrals.


maximum a posteriori (MAP, posterior mode)

http://en.wikipedia.org/wiki/Maximum_a_posteriori
The method to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to Fisher's method of maximum likelihood (ML), but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation.



covariance matrix

http://en.wikipedia.org/wiki/Covariance_matrix

http://mathworld.wolfram.com/Covariance.html
Covariance provides a measure of the strength of the correlation between two or more sets of random variates.


http://en.wikipedia.org/wiki/Estimation_of_covariance_matrices



3.3 Bayesian Estimation


Bayesian Estimator

http://en.wikipedia.org/wiki/Bayesian_estimation
a Bayes estimator is an estimator or decision rule that maximizes the posterior expected value of a utility function or minimizes the posterior expected value of a loss function (also called posterior expected loss).

i) Parameter vector is considered to be a random variable.
ii) Training data allow us to convert a distribution on this variable into a posterior probability density.



Monte-Carlo simulation
http://en.wikipedia.org/wiki/Monte_Carlo_method#Monte_Carlo_Simulation_versus_.E2.80.9CWhat_If.E2.80.9D_Scenarios


Dirac delta function
http://en.wikipedia.org/wiki/Dirac_delta_function


expectation-maximization (EM)


http://en.wikipedia.org/wiki/Expectation-maximization_algorithm




3.10 Hidden Markov Model



http://en.wikipedia.org/wiki/Hidden_Markov_model


posted by maetel
20p
2.1 Introduction

state of nature

prior (probability)

http://en.wikipedia.org/wiki/Prior_probability
a marginal probability, interpreted as a description of what is known about a variable in the absence of some evidence
(The posterior probability is then the conditional probability of the variable taking the evidence into account. The posterior probability is computed from the prior and the likelihood function via Bayes' theorem.)

decision rule

probability mass function = pmf

http://en.wikipedia.org/wiki/Probability_mass_function
a function that gives the probability that a discrete random variable is exactly equal to some value
(A pmf differs from a probability density function (abbreviated pdf) in that the values of a pdf, defined only for continuous random variables, are not probabilities as such. Instead, the integral of a pdf over a range of possible values (a, b] gives the probability of the random variable falling within that range.)

probability density function = pdf

http://en.wikipedia.org/wiki/Probability_density_function
a function that represents a probability distribution in terms of integrals

class-conditional probability density function = state-conditional probability density
: the probability density function for x given that a state of nature is w

http://en.wikipedia.org/wiki/Conditional_probability
the probability of some event A, given the occurrence of some other event B
(Conditional probability is written P(A|B), and is read "the probability of A, given B".)


Bayes formula:
posterior = likelihood * prior / evidence

P(w_j) -- (x) --> P(w_j|x)
: By observing the value of x when we can convert the prior probability P(w_j) to the a posterior probability (or posterior) P(w_j|x), to measure the probability of the state of nature being w_j given that feautre value x

likelihood

evidence
: scale factor (to guarantee the posterior probabilities sum to one)

http://en.wikipedia.org/wiki/Bayes%27_Theorem

http://www.aistudy.com/pattern/parametric_gose.htm#_bookmark_3c54af0


Bayesian decision rule (for minimizing the probability of error)



24p
2.2 Bayesian Decision Theory - Continuous Features

feature vector

feature space

http://en.wikipedia.org/wiki/Feature_space
an abstract space where each pattern sample is represented as a point in n-dimensional space
(Its dimension is determined by the number of features used to describe the patterns. Similar samples are grouped together, which allows the use of density estimation for finding patterns.)

loss function (for an action)
cost function (for classification mistakes)

a probability determination -- loss function --> decision

risk: an expected loss

conditional risk

decision function

The dicision rule specifies the action.

Bayes decision procedure -> optimal performance
Bayes decision rule:
to minimize the overall risk, select the action for
the minimum conditional risk = R* : Bayes risk
-> the best performance


25p
2.2.1 Two-Category Classification

The loss incurred for making an error is greater than the loss incurred for being correct.


likelihood ratio

The Bayes decision rule can be interpreted as calling for deciding w_1 if the likelihood ratio exceeds a threshold value that is independent of the observation x.


26p
2.3 Minimum-error-rate Classification

to seek a decision rule that minimizes the probability of error, the error rate

symmetrical / zero-one loss function

for minimum error rate,
decide w_i if P(W_1|x) > P(w_j|x)

2.3.1 Minimax Criterion
2.3.2 Neyman-Pearson Criterion


29p
2.4 Classifiers, Discriminant Functions, and Decision Surfaces

2.4.1 The Multicategory Case

Fig. 2.5 The fuctional structure of a general statistical pattern classfier
input x -> discriminant functions g(x) + costs -> action (classification)

classifier
: a network or machine that computes c discriminant functions and selects the category corresponding to the largest discriminant

Bayes classifier
i) the maximum discriminant fn. <=> the minimum conditional risk
ii) for the minimum-error-rate,
the maximum discriminant fn. <=> the maximum posterior probability
iii) replacing the disciminant fn. by a monotonically increasing fn.

(28)

Decision rule divides the feature space into c decision regions which are separated by decision boundaries, surfaces in feature space where ties occur among the largest discriminant functions.


2.4.2 The Two-Category Case

dichotomizer


31p
2.5 Normal Density

the multivariate normal / Gaussian density

expected value

2.5.1 Univariate Density

expected value of x (: an average over the feature space)
(35)

expected squared deviation = variance
(36)


The entropy measures the fundamental uncertainty in the values of points selected randomly from a distribution.

The normal distribution has the maximum entropy of all distributions having a given mean and variance.

http://en.wikipedia.org/wiki/Central_limit_theorem
The central limit theorem (CLT) states that the sum of a sufficiently large number of identically distributed independent random variables each with finite mean and variance will be approximately normally distributed (Rice 1995). Formally, a central limit theorem is any of a set of weak-convergence results in probability theory. They all express the fact that any sum of many independent identically distributed random variables will tend to be distributed according to a particular "attractor distribution".

33p
2.5.2 Multivariate Density

covaraince matrix
The covariance matrix allows us to calculate the dispersion of the data in any direction, or in any subspce.

http://en.wikipedia.org/wiki/Covariance_matrix

http://en.wikipedia.org/wiki/Covariance
covariance is a measure of how much two variables change together (the variance is a special case of the covariance when the two variables are identical).
If two variables tend to vary together (that is, when one of them is above its expected value, then the other variable tends to be above its expected value too), then the covariance between the two variables will be positive. On the other hand, when one of them is above its expected value the other variable tends to be below its expected value, then the covariance between the two variables will be negative.
 
the center of the cluster - the mean vector
the shape of the cluster - the covaraince matrix


Whitening Transform
making the spectrum of eigenvalues of the transformed distribution uniform

http://en.wikipedia.org/wiki/Whitening_transform
The whitening transformation is a decorrelation method that converts the covariance matrix S of a set of samples into the identity matrix I. This effectively creates new random variables that are uncorrelated and have the same variances as the original random variables. The method is called the whitening transform because it transforms the input matrix closer towards white noise.
This can be expressed as  A_w = \Phi \Lambda^{-\frac{1}{2}}
where Φ is the matrix with the eigenvectors of "S" as its columns and Λ is the diagonal matrix of non-increasing eigenvalues.


hyperellipsoids
- principal axes

- Mahalanobis distance
http://en.wikipedia.org/wiki/Mahalanobis_distance

- volume


36p
2.6 Discriminant Functions for the Normal Density

2.6.1 case 1: covariance matrix = a contant times the identity matrix

equal-size hypersherical clusters

linear machine

The hyper plane is the perpendicular bisector of the line between the means

minimum-distance classfier

template-matching -> the nearest-neighbor algorithm

2.6.2 case 2: covariance matrices = identical but arbitrary



2.6.3 case 3: covariance matrix = arbitrary



2.9 Bayes Decision Theory - Discrete Features

posted by maetel