# Gaussian Mixture Models
Gaussian Mixture Models (GMMs) are probabilistic models that assume all data points are generated from a mixture of a finite number of multivariate Gaussian distributions with unknown parameters:

```math
f_{\mathbf{X}}(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
```

where:

- ``\mathbf{x}`` is a ``d``-dimensional observation vector,
- ``\pi_k`` are the mixing coefficients (weights) satisfying ``\pi_k \geq 0`` and ``\sum_{k=1}^{K} \pi_k = 1``,
- ``\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)`` is the PDF of the ``k``-th multivariate Gaussian component with mean vector ``\boldsymbol{\mu}_k`` and covariance matrix ``\boldsymbol{\Sigma}_k``.
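To make the weighted-sum structure concrete, here is a minimal sketch using Distributions.jl (the package the implementation below builds on); all numerical values are illustrative:

```julia
using Distributions

# Toy two-component bivariate GMM: mixing weights and Gaussian components
w = [0.3, 0.7]
comps = [MvNormal([0.0, 0.0], [1.0 0.0; 0.0 1.0]),
         MvNormal([3.0, 3.0], [2.0 0.5; 0.5 1.0])]
mix = MixtureModel(comps, w)

# The mixture PDF is the weighted sum of the component PDFs
x = [1.0, 1.0]
sum(w[k] * pdf(comps[k], x) for k in 1:2) ≈ pdf(mix, x) # true
```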
In practice, GMMs are widely applied to multivariate density estimation, clustering, and dimensionality reduction [11]. For univariate density estimation, we refer to Kernel Density Estimation.
## Expectation-Maximization Algorithm for GMMs
One way to find the parameters of a GMM from a set of samples is the Expectation-Maximization (EM) algorithm. Here, we show the basic steps to fit a GMM to data using the EM algorithm, based on Ref. [11]. The EM algorithm iteratively refines the parameters of the GMM by alternating between two steps:

- **Expectation Step (E-step):** Calculate the expected value of the latent variables given the current parameters.
- **Maximization Step (M-step):** Update the parameters to maximize the expected log-likelihood found in the E-step.
Given a dataset ``\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}`` of ``N`` independent observations, the goal is to find the parameters ``\{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^{K}`` that maximize the log-likelihood

```math
\ln f(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right).
```

The EM algorithm introduces latent variables ``z_{nk} \in \{0, 1\}`` indicating whether observation ``\mathbf{x}_n`` was generated by component ``k``, and maximizes the log-likelihood by alternating between the following two steps.
### Expectation Step

Compute the posterior probabilities (responsibilities)

```math
\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)},
```

i.e. the probability that observation ``\mathbf{x}_n`` was generated by component ``k``, given the current parameters.
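A minimal sketch of this step, assuming Distributions.jl for the component PDFs (the function name and signature are illustrative, not the package's API):

```julia
using Distributions

# E-step sketch: responsibilities γ[n, k] for N samples and K components
function e_step(X::AbstractMatrix, w::AbstractVector, comps::AbstractVector)
    N, K = size(X, 1), length(comps)
    γ = [w[k] * pdf(comps[k], X[n, :]) for n in 1:N, k in 1:K]
    return γ ./ sum(γ; dims=2) # normalize each row over the K components
end
```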
### Maximization Step

Update the parameters using the computed responsibilities,

```math
\pi_k^{\text{new}} = \frac{N_k}{N}, \qquad
\boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, \mathbf{x}_n, \qquad
\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, (\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})^\top,
```

where ``N_k = \sum_{n=1}^{N} \gamma_{nk}`` is the effective number of observations assigned to component ``k``.
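Continuing the sketch from the E-step, the corresponding parameter updates could look as follows (again, names and signatures are illustrative):

```julia
# M-step sketch: update weights, means, and covariances from responsibilities
function m_step(X::AbstractMatrix, γ::AbstractMatrix)
    N, K = size(γ)
    Nk = vec(sum(γ; dims=1)) # effective number of observations per component
    w = Nk ./ N
    μ = [sum(γ[n, k] * X[n, :] for n in 1:N) / Nk[k] for k in 1:K]
    Σ = [sum(γ[n, k] * (X[n, :] - μ[k]) * (X[n, :] - μ[k])' for n in 1:N) / Nk[k]
         for k in 1:K]
    return w, μ, Σ
end
```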
### Algorithm Convergence
The algorithm terminates when the change in log-likelihood between iterations falls below a predefined threshold ``\epsilon``,

```math
\left| \ln f^{(t+1)} - \ln f^{(t)} \right| < \epsilon,
```

or when a maximum number of iterations is reached.
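Putting the two sketches together, the stopping criterion can be implemented as below; this reuses the illustrative `e_step` and `m_step` from above, with `tol` playing the role of ``\epsilon``:

```julia
using Distributions

# EM loop sketch with log-likelihood convergence check
function fit_gmm_sketch(X, w, comps; tol=1e-4, max_iter=100)
    ll_old = -Inf
    for _ in 1:max_iter
        γ = e_step(X, w, comps)  # E-step
        w, μ, Σ = m_step(X, γ)   # M-step
        # Symmetrize covariances to guard against floating-point asymmetry
        comps = [MvNormal(μ[k], (Σ[k] + Σ[k]') / 2) for k in eachindex(μ)]
        # Log-likelihood of the data under the updated parameters
        ll = sum(log(sum(w[k] * pdf(comps[k], X[n, :]) for k in eachindex(w)))
                 for n in 1:size(X, 1))
        abs(ll - ll_old) < tol && break # converged
        ll_old = ll
    end
    return w, comps
end
```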
## Implementation
In UncertaintyQuantification.jl, a GMM can be fitted to data using the `GaussianMixtureModel` function, which implements the EM algorithm described above. The function takes a `DataFrame` containing the samples and the number of components `k` as input. Optionally, one can set the maximum number of iterations and the tolerance. The GMM is constructed as:
```julia
using UncertaintyQuantification, DataFrames

# Generate two-dimensional sample data
df = DataFrame(x1=randn(100), x2=2 * randn(100))
k = 2
gmm = GaussianMixtureModel(df, k) # maximum_iterations = 100, tolerance = 1e-4
```
This returns a `MultivariateDistribution` object. The fitted mixture model, constructed using the EM algorithm, is stored as a `Distributions.MixtureModel` from Distributions.jl in the field `gmm.d`:
```julia
gmm.d
MixtureModel{FullNormal}(K = 2)
components[1] (prior = 0.2410): FullNormal(
    dim: 2
    μ: [-0.33478082938636566, 1.8957939141962368]
    Σ: [0.8357434100638619 0.04498446782817538; 0.04498446782817538 4.092943397667619]
)
components[2] (prior = 0.7590): FullNormal(
    dim: 2
    μ: [0.2521899250337857, -0.4880932059530853]
    Σ: [1.2425497523016813 0.7430781222970831; 0.7430781222970831 3.34328079023627]
)
```
Since the GMM is returned as a `MultivariateDistribution`, we can sample from it and evaluate its PDF in the same way as for other (multivariate) random variables. For a more detailed explanation, we refer to the Gaussian Mixture Model Example.
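For instance, since the fitted model from the example above is stored as a `Distributions.MixtureModel` in `gmm.d`, the standard Distributions.jl functions can be applied to that field directly:

```julia
using Distributions

X = rand(gmm.d, 1000)      # draw 1000 samples (returned as a 2×1000 matrix)
p = pdf(gmm.d, [0.0, 0.0]) # evaluate the mixture PDF at a point
```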
## Alternative mixture model construction
(Gaussian) Mixture models constructed with other packages can also be used to construct a `MultivariateDistribution`, as long as they return a `Distributions.MixtureModel`.
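As an illustration, a mixture built directly with Distributions.jl could be wrapped as follows; note that the exact `MultivariateDistribution` constructor signature (here assumed to take the mixture and the variable names) is an assumption and should be checked against the package documentation:

```julia
using Distributions, UncertaintyQuantification

# Build a two-component mixture directly with Distributions.jl
mix = MixtureModel([MvNormal([0.0, 0.0], [1.0 0.0; 0.0 1.0]),
                    MvNormal([3.0, 3.0], [1.0 0.0; 0.0 1.0])],
                   [0.5, 0.5])

# Assumed constructor signature (mixture plus variable names); verify
# against the UncertaintyQuantification.jl documentation
mvd = MultivariateDistribution(mix, [:x1, :x2])
```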