Hidden Markov Model (HMM) Toolbox for Matlab
Written by Kevin Murphy, 1998.
Last updated: 7 June 2004.
The interface has changed!
This documentation refers to the version of 4 May 2003 or later.
This toolbox supports inference and learning for
HMMs with discrete outputs (dhmm's), Gaussian outputs (ghmm's),
or mixture-of-Gaussians outputs (mhmm's).
The Gaussians can be full, diagonal, or spherical (isotropic).
It also supports discrete inputs, as in a POMDP.
The inference routines support filtering, smoothing, and fixed-lag smoothing.
For more general models, please see my
Bayes Net Toolbox.
What is an HMM?
An HMM is a Markov chain, where each state generates an
observation. You only see the observations, and the goal is to infer
the hidden state sequence.
For example, the hidden states may represent words or phonemes,
and the observations represent the acoustic signal.
Please see recommended reading for details.
The script dhmm_em_demo.m gives an example of how to learn an HMM
with discrete outputs. Let there be Q=2 states and O=3 output
symbols. We create random stochastic matrices as follows.
O = 3;
Q = 2;
prior0 = normalise(rand(Q,1));
transmat0 = mk_stochastic(rand(Q,Q));
obsmat0 = mk_stochastic(rand(Q,O));
Now we sample nex=20 sequences of length T=10 each from this model, to
use as training data.
T=10;
nex=20;
data = dhmm_sample(prior0, transmat0, obsmat0, nex, T);
Now we make a random guess as to what the parameters are,
prior1 = normalise(rand(Q,1));
transmat1 = mk_stochastic(rand(Q,Q));
obsmat1 = mk_stochastic(rand(Q,O));
and improve our guess using 5 iterations of EM...
[LL, prior2, transmat2, obsmat2] = dhmm_em(data, prior1, transmat1, obsmat1, 'max_iter', 5);
LL(t) is the log-likelihood after iteration t, so we can plot the
learning curve.
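For example (a minimal sketch; the plotting commands are standard Matlab, not part of the toolbox):
plot(LL, 'o-')
xlabel('EM iteration')
ylabel('log-likelihood')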
To evaluate the log-likelihood of a trained model given test data,
proceed as follows:
loglik = dhmm_logprob(data, prior, transmat, obsmat)
Note: the discrete alphabet is assumed to be {1, 2, ..., O},
where O = size(obsmat, 2). Hence data cannot contain any 0s.
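If your symbols happen to be 0-based (i.e., {0,...,O-1}), shift them before calling the toolbox; for instance, if data is a numeric array of symbols:
data = data + 1;  % map {0,...,O-1} to {1,...,O}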
To classify a sequence into one of k classes, train up k HMMs, one per
class, and then compute the log-likelihood that each model gives to
the test sequence; if the i'th model is the most likely, then declare
the class of the sequence to be class i.
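For example, here is a rough sketch for the discrete-output case; k, train_data and test_seq are illustrative names, not part of the toolbox, and train_data{c} is assumed to be a cell array of training sequences for class c.
% train one HMM per class
for c=1:k
  prior0 = normalise(rand(Q,1));
  transmat0 = mk_stochastic(rand(Q,Q));
  obsmat0 = mk_stochastic(rand(Q,O));
  [LL, priors{c}, transmats{c}, obsmats{c}] = ...
      dhmm_em(train_data{c}, prior0, transmat0, obsmat0, 'max_iter', 10);
end
% classify a test sequence by the model with the highest log-likelihood
for c=1:k
  loglik(c) = dhmm_logprob(test_seq, priors{c}, transmats{c}, obsmats{c});
end
[junk, class] = max(loglik);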
To compute the most probable sequence of hidden states (Viterbi decoding), first you need to evaluate B(i,t) = P(y_t | Q_t=i) for all t,i:
B = multinomial_prob(data, obsmat);
Then you can use
[path, loglik] = viterbi_path(prior, transmat, B)
For HMMs with mixture-of-Gaussians outputs, the procedure is similar.
Let us generate nex=50 vector-valued sequences of length T=50; each
vector has size O=2.
O = 2;
T = 50;
nex = 50;
data = randn(O,T,nex);
Now let us fit a mixture of M=2 Gaussians for each of the Q=2 states
using K-means.
M = 2;
Q = 2;
left_right = 0;
prior0 = normalise(rand(Q,1));
transmat0 = mk_stochastic(rand(Q,Q));
cov_type = 'full'; % 'full', 'diag', or 'spherical'
[mu0, Sigma0] = mixgauss_init(Q*M, reshape(data, [O T*nex]), cov_type);
mu0 = reshape(mu0, [O Q M]);
Sigma0 = reshape(Sigma0, [O O Q M]);
mixmat0 = mk_stochastic(rand(Q,M));
Finally, let us improve these parameter estimates using EM.
[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
mhmm_em(data, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', 2);
Since EM only finds a local optimum, good initialisation is crucial.
The initialisation procedure illustrated above is very crude, and is probably
not adequate for real applications...
See the Bayes Net Toolbox (BNT) documentation for a real-world example of EM with mixtures of Gaussians.
What if the log-likelihood becomes positive?
It is possible for p(x) > 1 if p(x) is a probability density
function, such as a Gaussian. (The requirements for a density are
p(x) >= 0 for all x and int_x p(x) dx = 1.) In practice this usually means your
covariance is shrinking to a point/delta function, so you should
increase the width of the prior
(see below),
or constrain the matrix to be spherical
or diagonal, or clamp it to a large fixed constant (not learn it at all).
It is also very helpful to ensure the
components of the data vectors have small and comparable magnitudes
(use e.g., KPMstats/standardize).
This is a well-known pathology of maximum likelihood estimation for
Gaussian mixtures: the global optimum may place one mixture component
on a single data point, and give it 0 covariance, and hence infinite
likelihood. One usually relies on the fact that EM cannot find the
global optimum to avoid such pathologies.
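For example, to constrain the covariances to be diagonal during EM, reusing the set-up from the mixture example above (this assumes your copy of mhmm_em supports the 'cov_type' optional argument):
[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
    mhmm_em(data, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', 10, 'cov_type', 'diag');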
What if the log-likelihood decreases during EM?
Since I implicitly add a prior to every covariance matrix
(see below), what increases is loglik + log(prior),
but what I print is just loglik, which may occasionally decrease.
This suggests that one of your mixture components is not getting
enough data. Try a better initialization or fewer clusters (states).
What if the covariance matrix becomes singular?
Estimates of the covariance matrix often become singular if you have
too little data, or if too few points are assigned to a cluster center
due to a bad initialization of the means.
In this case, you should constrain the covariance to
be spherical or diagonal, or adjust the prior (see below), or try a
better initialization.
How do I add a prior to the covariance matrix?
Buried inside of KPMstats/mixgauss_Mstep you will see that cov_prior
is initialized to 0.01*I. This is added to the maximum likelihood
estimate after every M step.
To change this, you will need to modify the
mhmm_em function so it calls mixgauss_Mstep with a different value.
To classify a sequence (e.g., of speech) into one of k classes (e.g.,
the digits 0-9), proceed as in the DHMM case above,
but use the following procedure to compute likelihood:
loglik = mhmm_logprob(data, prior, transmat, mu, Sigma, mixmat);
To compute the most probable (Viterbi) path, first you need to evaluate B(i,t) = P(y_t | Q_t=i) for all t,i:
B = mixgauss_prob(data, mu, Sigma, mixmat);
Finally, use
[path, loglik] = viterbi_path(prior, transmat, B);
HMMs with a single Gaussian output per state (ghmm's) are handled just like
the mixture-of-Gaussians case, except that M=1 and hence there is no mixing matrix.
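Here is a sketch, reusing the set-up from the mixture example above and assuming mhmm_em accepts a Q-by-1 matrix of ones in place of the mixing matrix:
M = 1;
[mu0, Sigma0] = mixgauss_init(Q*M, reshape(data, [O T*nex]), cov_type);
mu0 = reshape(mu0, [O Q]);
Sigma0 = reshape(Sigma0, [O O Q]);
mixmat0 = ones(Q,1);   % one Gaussian per state, so no mixture weights to learn
[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
    mhmm_em(data, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', 5);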
For some applications (e.g., reinforcement learning/ adaptive
control), it is necessary to learn a model online.
The script dhmm_em_online_demo gives an example of how to do this.
Recommended reading
- L. Rabiner, "A tutorial on Hidden Markov Models and selected applications in speech recognition", Proc. IEEE 77(2):257--286, 1989.
- Jeff Bilmes, "What HMMs can do", U. Washington Tech Report, Feb 2002.
- Y. Bengio, "Markovian Models for Sequential Data", Neural Computing Surveys 2, 129--162, 1999.
- P. Smyth, D. Heckerman, and M. Jordan, "Probabilistic independence networks for hidden Markov models", Neural Computation, vol. 9, no. 2, 227--269, 1997.
- Notes from Dan Ellis' class on speech processing at Columbia (2002); I snagged the excellent notes for Lecture 10 and Lecture 11.
- Bourlard's EPFL Matlab lab manual on HMMs.
- Bibliography on HMMs (2001).
- Bookmarks on HMMs.
- Tom Dietterich, "Machine Learning for Sequential Data: A Review", in T. Caelli (Ed.), Lecture Notes in Computer Science, 2002.