Symmetric positive definite matrices are a special class of matrix that arises frequently in practice. From a computational point of view this class is very attractive, because a symmetric positive definite matrix $A$ can be decomposed very efficiently into a single lower triangular matrix $G$ such that $A = GG^T$.
A matrix $A$ is positive definite if $x^T A x > 0$ for any nonzero $x \in \mathbb{R}^n$. This statement by itself is not terribly intuitive, so let's also look at an example of a $2 \times 2$ matrix
\[ A = \left(\begin{matrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{matrix}\right) \]
If $A$ is symmetric positive definite (SPD) then
\[ \begin{aligned} x &= (1, 0)^T \Rightarrow x^T A x = a_{11} > 0 \\ x &= (0, 1)^T \Rightarrow x^T A x = a_{22} > 0 \\ x &= (1, 1)^T \Rightarrow x^T A x = a_{11} + 2a_{12} + a_{22} > 0 \\ x &= (1,-1)^T \Rightarrow x^T A x = a_{11} - 2a_{12} + a_{22} > 0 \\ \end{aligned} \]
The first two equations show that the diagonal entries of $A$ must be positive, and combining the last two equations implies $|a_{12}| < (a_{11} + a_{22}) / 2$, that is, the matrix has much of its "mass" on the diagonal (note: this is not the same as the matrix being diagonally dominant, where $|a_{ii}| > \sum_{j \ne i} |a_{ij}|$). These two observations for our $2 \times 2$ matrix also hold for a general $n \times n$ SPD matrix. One very nice consequence of this "weighty" diagonal for SPD matrices is that it removes the need for pivoting.
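As a quick numerical aside (the $2 \times 2$ matrix below is just an illustrative choice), we can check positive definiteness directly: the eigenvalues of an SPD matrix are all strictly positive, and a Cholesky factorisation succeeds exactly when the matrix is positive definite.

import numpy as np

# An illustrative symmetric matrix with a "weighty" diagonal
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# All eigenvalues of an SPD matrix are strictly positive
print(np.linalg.eigvalsh(A))    # approximately [2.38, 4.62], both positive

# np.linalg.cholesky raises LinAlgError if A is not positive definite,
# so a successful factorisation is itself a practical SPD test
G = np.linalg.cholesky(A)
print(np.allclose(G @ G.T, A))  # True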
It can be shown that if $A$ is an SPD matrix, then the $LDL^T$ decomposition exists and that $D = \text{diag}(d_1, ..., d_n)$ has positive diagonal entries. Therefore, it is straightforward to see that $LDL^T = GG^T$, where $G = L \text{diag}(\sqrt{d_1}, ..., \sqrt{d_n})$. The decomposition $A = GG^T$ is known as the Cholesky decomposition and can be constructed efficiently in $n^3 / 3$ flops. There are a number of algorithms to construct this decomposition, and both the Wikipedia entry and Chapter 4.2 of the Matrix Computations textbook by Golub and Van Loan give a number of different variants.
Note that an $LDL^T$ decomposition can also be used to calculate a Cholesky decomposition, and this can be a more efficient approach since (a) the SPD structure means that we can neglect pivoting in the $LDL^T$ decomposition, and (b) the $LDL^T$ decomposition does not require taking the square root of the diagonal elements.
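As a rough sketch of this relationship, the Cholesky factor can be recovered from an $LDL^T$ factorisation. The snippet below uses `scipy.linalg.ldl` on a small randomly generated SPD matrix (the matrix and its size are arbitrary choices for illustration) and checks the result against NumPy's direct Cholesky routine.

import numpy as np
import scipy.linalg

# An illustrative SPD matrix: B @ B.T is positive semidefinite, and adding a
# multiple of the identity makes it safely positive definite
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)

# LDL^T factorisation; for an SPD matrix D is diagonal with positive entries
Lfac, D, perm = scipy.linalg.ldl(A, lower=True)

# Form G = L * diag(sqrt(d_1), ..., sqrt(d_n))
G = Lfac @ np.diag(np.sqrt(np.diag(D)))

# G reproduces A, and (when perm is the identity, i.e. no pivoting occurred)
# it coincides with the lower triangular Cholesky factor of A
print(np.allclose(G @ G.T, A))
print(np.allclose(G, np.linalg.cholesky(A)))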
Imagine that we want to sample an array of values $x_i$, for $i = 1...n$, where each value is drawn independently from a normal distribution with standard deviation $\sigma$
\[x_i \sim \mathcal{N}(0, \sigma)\]
This could be achieved, for example, by sampling from a normal distribution with unit standard deviation, using a routine that exists in virtually every programming language, and then multiplying by $\sigma$
\[x_i = \sigma \eta\]
where $\eta \sim \mathcal{N}(0, 1)$
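As a minimal sketch in Python (the array length `n` and the value of `sigma` here are arbitrary choices):

import numpy as np

n = 100
sigma = 2.0

# n independent draws from a unit normal, scaled so each x_i has standard deviation sigma
x = sigma * np.random.normal(size=n)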
Now imagine that instead of an independent normal distribution you wish to sample $\mathbf{x} = [x_1, x_2, ..., x_n]$ from a multivariate normal distribution with some covariance matrix $\Sigma$
\[\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \Sigma)\]
We can achieve this in practice by using the Cholesky decomposition. A covariance matrix is a symmetric positive semidefinite matrix (i.e. $x^T \Sigma x \ge 0$), and can therefore be decomposed into $\Sigma = LL^T$. We can then draw a sample from $\mathcal{N}(\mathbf{0}, \Sigma)$ by multiplying an independently generated random vector by $L$
\[\mathbf{x} = L \mathbf{\eta}\]
where each element of the vector $\eta$ is $\eta_i \sim \mathcal{N}(0, 1)$.
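In NumPy this recipe is only a few lines; here is a minimal sketch with a small covariance matrix chosen purely for illustration (the code later in this section constructs a more interesting $\Sigma$):

import numpy as np

# A small illustrative covariance matrix (symmetric positive definite)
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.5],
                  [0.2, 0.5, 1.0]])

L = np.linalg.cholesky(Sigma)                # lower triangular, Sigma = L L^T
eta = np.random.normal(size=Sigma.shape[0])  # independent N(0, 1) draws
x = L @ eta                                  # a single draw from N(0, Sigma)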
Write Python code to randomly sample an n-dimensional vector $\mathbf{x}$ from
1. an independent normal distribution with variance $\sigma_1^2$.
2. a multivariate normal distribution using a covariance matrix $\Sigma_{ij} = \sigma_1^2 \exp{(-(i- j)^2 / \sigma_2^2)}$. Try different values for the magnitude $\sigma_1$ and lengthscale $\sigma_2$ parameters and observe their effect on the sampled $\mathbf{x}$. Hint: while the expression for $\Sigma$ is guaranteed to be positive definite for all values of $\sigma_1$ and $\sigma_2$, numerical round-off can mean that the Cholesky decomposition fails. To guarantee a positive definite $\Sigma$, try adding a small amount (e.g. 1e-5) to the diagonal of $\Sigma$. This is equivalent to adding a very small amount of independent normal noise to $\mathbf{x}$.
import numpy as np
import matplotlib.pyplot as plt
import scipy.linalg

n = 100
xs = np.arange(0, n)

def construct_covariance(sigma1, sigma2):
    # squared-exponential covariance between every pair of points in xs
    K = sigma1**2 * np.exp(
        -(xs[:, np.newaxis] - xs[np.newaxis, :])**2 / sigma2**2
    )
    # small "jitter" on the diagonal keeps K numerically positive definite
    K += 1e-5 * np.eye(n)
    return K

def sample_zero_mean_random_field(covariance_matrix):
    # lower triangular Cholesky factor; cho_factor leaves arbitrary values in
    # the upper triangle, so np.tril is used to extract the factor itself
    L, _ = scipy.linalg.cho_factor(covariance_matrix, lower=True)
    # scale a vector of independent N(0, 1) samples by the Cholesky factor
    indep_random_sample = np.random.normal(size=n)
    return np.dot(np.tril(L), indep_random_sample)

sigma1 = 1.0
sigma2 = 10.0

# independent samples (diagonal covariance) versus correlated samples
K1 = sigma1**2 * np.eye(n)
K2 = construct_covariance(sigma1, sigma2)
plt.plot(sample_zero_mean_random_field(K1))
plt.plot(sample_zero_mean_random_field(K2))
plt.show()
Now imagine that we have a vector of measurements $\mathbf{x}$, and we assume that a suitable model for these measurements is that they are generated from a zero-mean multivariate normal distribution, i.e.
\[\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \Sigma)\]
We assume that the covariance matrix is of the following form, with two parameters $\mathbf{\theta} = (\sigma_1, \sigma_2)$.
\[\Sigma_{ij} = \sigma_1^2 \exp{(-(i- j)^2/ \sigma_2^2)}\]
We can write down the likelihood of the covariance parameters $\mathbf{\theta}$, given a dataset $\mathbf{x}$, by using the probability density function (PDF) of a zero-mean multivariate normal distribution
\[ P(\mathbf{\theta} | \mathbf{x}) = (2 \pi)^{-\frac{n}{2}} \text{ det}(\Sigma)^{-\frac{1}{2}} \exp{\left( -\frac{1}{2}\mathbf{x}^T \Sigma^{-1} \mathbf{x}\right)} \]
Typically we work with the logarithm of the likelihood for numerical reasons, which is
\[ \mathcal{L} = -\frac{1}{2} \left( \log(\det(\Sigma)) + \mathbf{x}^T \Sigma^{-1} \mathbf{x} + n \log(2\pi) \right) \]
Both terms are cheap to evaluate once the Cholesky factor $\Sigma = GG^T$ is available: the log-determinant is $2 \sum_i \log(G_{ii})$, and the quadratic form $\mathbf{x}^T \Sigma^{-1} \mathbf{x}$ only requires two triangular solves.
def log_likelihood(sigma1, sigma2, x):
    K = construct_covariance(sigma1, sigma2)
    # Cholesky factorisation of the covariance matrix
    L, lower = scipy.linalg.cho_factor(K, lower=True)
    # log(det(K)) is twice the sum of the logs of the Cholesky diagonal
    logdet = 2 * np.sum(np.log(np.diag(L)))
    # x^T K^{-1} x via triangular solves rather than an explicit inverse
    xinvSx = x.dot(scipy.linalg.cho_solve((L, lower), x))
    # the constant (n/2) log(2 pi) term is omitted
    return -0.5 * (logdet + xinvSx)
# grid of candidate covariance parameters
sigma1 = np.linspace(0.5, 1.5, 100)
sigma2 = np.linspace(5.0, 15.0, 100)
Sigma1, Sigma2 = np.meshgrid(sigma1, sigma2)

# synthetic dataset generated with the "true" parameters sigma1=1, sigma2=10
x = sample_zero_mean_random_field(K2)

# evaluate the log likelihood over the whole parameter grid
L = np.vectorize(log_likelihood, excluded=['x'])(Sigma1, Sigma2, x=x)

# log likelihood at the true parameter values, used to set the contour levels
max_log_likelihood = log_likelihood(1, 10, x)
levels = np.linspace(0.9 * max_log_likelihood, max_log_likelihood, 5)

plt.clf()
contours = plt.contour(Sigma1, Sigma2, L, levels=levels, colors='black')
plt.clabel(contours, inline=True, fontsize=8)
plt.imshow(L, extent=[0.5, 1.5, 5.0, 15.0], origin='lower',
           cmap='RdGy', alpha=0.5, vmin=levels[0], aspect='auto')
c = plt.colorbar()
c.set_label(r'$\mathcal{L}$')
plt.xlabel(r'$\sigma_1$')
plt.ylabel(r'$\sigma_2$')
plt.show()