In [1]:
import numpy as np
from scipy import stats, optimize
import matplotlib.pyplot as plt

Gaussian processes in numpy¶

Exercise 4.1: Sample from GP prior¶

In this exercise we will write the code needed to draw and plot samples of $f$ from a Gaussian process prior with squared exponential (or, equivalently, RBF) kernel \begin{equation*} f \sim \mathcal{GP}(m, \kappa), \qquad m(x) = 0 \text{ and } \kappa(x, x') = \sigma^2_f \exp{\Big(-\frac{1}{2\ell^2} {\|x - x'\|}^2\Big)}. \end{equation*} To implement this, we choose a vector of $m$ test input points $\mathbf{x}_*$. We choose $\mathbf{x}_*$ to contain sufficiently many points that the samples appear as continuous lines on the screen. We then evaluate the $m \times m$ covariance matrix $\kappa(\mathbf{x}_*, \mathbf{x}_*)$ and generate samples from the multivariate normal distribution \begin{equation*} f(\mathbf{x}_*) \sim \mathcal{N}\big(m(\mathbf{x}_*), \kappa(\mathbf{x}_*, \mathbf{x}_*)\big). \end{equation*}

(a)¶

Use numpy.linspace to construct a vector $\mathbf{x}_*$ with $m = 101$ elements equally spaced from -5 to 5.

In [2]:
# test input points
m = 101
xs = np.linspace(-5, 5, m)

(b)¶

Construct a mean vector $m(\mathbf{x}_∗)$ with 101 elements all equal to zero and the $101 \times 101$ covariance matrix $\kappa(\mathbf{x}_*, \mathbf{x}_∗)$. The expression for $\kappa(\cdot, \cdot)$ is given above. Let the hyperparameters be $\ell^2 = 2$ and $\sigma^2_f = 1$.

In [3]:
# mean vector
mxs = np.zeros(m)

# covariance matrix
l = np.sqrt(2)
sf2 = 1
Kxs = sf2 * np.exp(-1/(2 * l**2) * np.abs(xs[:, None] - xs[None, :])**2)

(c)¶

Use scipy.stats.multivariate_normal (you might need to use the option allow_singular=True) to draw 25 samples $f^{(1)}(\mathbf{x}_*), \ldots, f^{(25)}(\mathbf{x}_*)$ from the multivariate normal distribution $f(\mathbf{x}_∗) \sim \mathcal{N}\big(m(\mathbf{x}_*), \kappa(\mathbf{x}_*, \mathbf{x}_*)\big)$.

In [4]:
fs = stats.multivariate_normal.rvs(
    mean=mxs, cov=Kxs, size=25, random_state=1234
)
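
As an aside (not required by the exercise), the same kind of samples can be drawn without scipy by using a Cholesky factor of the covariance matrix; a minimal sketch, where a small jitter is added to the diagonal since $\kappa(\mathbf{x}_*, \mathbf{x}_*)$ is numerically close to singular:

# alternative sampling via a Cholesky factorization (jitter added for numerical stability)
rng = np.random.default_rng(1234)
L_chol = np.linalg.cholesky(Kxs + 1e-8 * np.eye(m))
fs_chol = mxs + rng.standard_normal((25, m)) @ L_chol.T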

(d)¶

Plot the samples $f^{(1)}(\mathbf{x}_∗), \ldots, f^{(25)}(\mathbf{x}_∗)$ versus the input vector $\mathbf{x}_*$.

In [5]:
plt.plot(xs, fs.T)
plt.show()

(e)¶

Try another value of $\ell$ and repeat (b)-(d). How do the two plots differ, and why?

In [6]:
# samples with l = 1/2
l = 1/2
Kxs = sf2 * np.exp(-1/(2 * l**2) * np.abs(xs[:, None] - xs[None, :])**2)
fs = stats.multivariate_normal.rvs(
    mean=mxs, cov=Kxs, size=25, random_state=1234
)
plt.plot(xs, fs.T)
plt.show()
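
With the shorter length scale $\ell = 1/2$ the samples fluctuate much more rapidly: the correlation between function values decays over a shorter distance, while the marginal variance at each input is unchanged ($\sigma_f^2 = 1$). For instance, the prior correlation between function values one unit apart drops from $\exp(-1/4) \approx 0.78$ for $\ell^2 = 2$ to $\exp(-2) \approx 0.14$ for $\ell = 1/2$.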

Exercise 4.2: GP posterior¶

In this exercise we will perform Gaussian process regression. That is, based on the $n$ observations $\mathcal{D} = \{x_i, f(x_i)\}^n_{i=1}$ and the prior belief $f \sim \mathcal{GP}\big(0, \kappa(x, x')\big)$, we want to find the posterior $p(f \mid \mathcal{D})$. (In the previous problem, we were only concerned with the prior $p(f)$, not conditioned on having observed the data $\mathcal{D}$.) We consider the same Gaussian process prior (same mean $m(x)$, covariance $\kappa(x, x')$, and hyperparameters) as in the previous exercise.

(a)¶

Construct two vectors $\mathbf{x} = [-4, -3, -1, 0, 2]^\mathsf{T}$ and $\mathbf{y} = [-2, 0, 1, 2, -1]^\mathsf{T}$, which will be our training data (that is, $n = 5$).

In [7]:
# observations
n = 5
x = np.array([-4, -3, -1, 0, 2])
y = np.array([-2, 0, 1, 2, -1])

(b)¶

Keep $\mathbf{x}_∗$ as in the previous problem. In addition to the $m \times m$ matrix $\kappa(\mathbf{x}_∗, \mathbf{x}_∗)$, now also compute the $n \times m$ matrix $\kappa(\mathbf{x}, \mathbf{x}_∗)$ and the $n \times n$ matrix $\kappa(\mathbf{x}, \mathbf{x})$.

Hint: You might find it useful to define a function that returns $\kappa(x, x')$, taking $x$ and $x'$ as arguments.

In [8]:
# function that computes the kernel matrix κ(x1, x2) for the squared exponential kernel
def k(x1, x2, l, sf2):
    return sf2 * np.exp(-1/(2 * l**2) * np.abs(x1[:, None] - x2[None, :])**2)

# set hyperparameters
l = np.sqrt(2)
sf2 = 1

# set test inputs
m = 101
xs = np.linspace(-5, 5, m)

# compute kernel matrices
Kx = k(x, x, l, sf2)
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)

(c)¶

Use the training data $\mathbf{x}$, $\mathbf{y}$ and the matrices constructed in (b) to compute the posterior mean $\boldsymbol{\mu}_{\mathrm{posterior}}$ and the posterior covariance $\mathbf{K}_{\mathrm{posterior}}$ for $\mathbf{x}_∗$, by using the equations for conditional multivariate normal distribution.

The posterior mean $\boldsymbol{\mu}_{\mathrm{posterior}}$ for $\mathbf{x}_*$ is given by \begin{equation*} \boldsymbol{\mu}_{\mathrm{posterior}}(\mathbf{x}_*) = \kappa(\mathbf{x}, \mathbf{x}_*)^\mathsf{T} \kappa(\mathbf{x}, \mathbf{x})^{-1} \mathbf{y} = {\Big(\kappa(\mathbf{x}, \mathbf{x})^{-1} \kappa(\mathbf{x}, \mathbf{x}_*)\Big)}^\mathsf{T} \mathbf{y}, \end{equation*} and the posterior covariance $\mathbf{K}_{\mathrm{posterior}}$ for $\mathbf{x}_*$ by \begin{equation*} \mathbf{K}_{\mathrm{posterior}}(\mathbf{x}_*, \mathbf{x}_*) = \kappa(\mathbf{x}_*, \mathbf{x}_*) - \kappa(\mathbf{x}, \mathbf{x}_*)^{\mathsf{T}}\kappa(\mathbf{x}, \mathbf{x})^{-1} \kappa(\mathbf{x}, \mathbf{x}_*). \end{equation*}

In [9]:
# A = κ(x, x)^{-1} κ(x, x*), computed by solving the linear system (avoids an explicit inverse)
A = np.linalg.solve(Kx, Kxxs)
# posterior mean and covariance for x*
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A

(d)¶

In a similar manner as in (c) and (d) of the previous problem, draw 25 samples from the multivariate normal distribution $f(\mathbf{x}_*) \sim \mathcal{N}(\boldsymbol{\mu}_{\mathrm{posterior}}, \mathbf{K}_{\mathrm{posterior}})$ and plot these samples ($f^{(j)}(\mathbf{x}_*)$ vs. $\mathbf{x}_*$) together with the posterior mean ($\boldsymbol{\mu}_{\mathrm{posterior}}$ vs. $\mathbf{x}_*$) and the actual measurements ($\mathbf{y}$ vs. $\mathbf{x}$). How do the samples in this plot differ from the prior samples in the previous problem?

In [10]:
# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
    mean=mu_post, cov=K_post, size=25, random_state=1234
)

# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title("l = sqrt(2), no noise")
plt.show()

All posterior samples pass through the observed data points (the prior samples do not necessarily do so). This is natural: the posterior distribution of $f(\mathbf{x}_*)$ is conditioned on the observations, and in this noise-free setting the posterior variance is zero at the training inputs, so every sample, and the posterior mean, must pass through them.
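
As a quick numerical check (an aside), the posterior standard deviation is indeed, up to rounding, zero at the test points that coincide with the training inputs:

# posterior standard deviation at the grid points closest to the training inputs (≈ 0)
idx = np.argmin(np.abs(xs[:, None] - x[None, :]), axis=0)
print(np.sqrt(np.clip(np.diag(K_post), 0, None))[idx])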

(e)¶

Instead of plotting samples, plot a credibility region. Here, a credibility region is based on the (marginal) posterior variance. The 68% credibility region, for example, is the area between $\boldsymbol{\mu}_{\mathrm{posterior}} - \sqrt{\mathbf{K}^d_{\mathrm{posterior}}}$ and $\boldsymbol{\mu}_{\mathrm{posterior}} + \sqrt{\mathbf{K}^d_{\mathrm{posterior}}}$, where $\mathbf{K}^d_{\mathrm{posterior}}$ is a vector with the diagonal elements of $\mathbf{K}_{\mathrm{posterior}}$. What is the connection between the credibility regions and the samples you drew previously?

At each test input, the 68% credibility region contains (on average) 68% of the posterior sample values. Note that this is a pointwise statement: an individual sample path may still leave the band somewhere, even though roughly 68% of all sampled values lie inside it (a quick empirical check follows the plot below).

In [11]:
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs,
        mu_post + i*np.sqrt(np.diag(K_post)),
        mu_post - i*np.sqrt(np.diag(K_post)),
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title("l = sqrt(2), no noise")
plt.show()
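
As a quick empirical check of this pointwise interpretation (an aside, using the samples fs drawn above), we can count how many of the sampled function values fall within one posterior standard deviation of the posterior mean:

# fraction of sampled values within ±1 posterior standard deviation of the mean
sd = np.sqrt(np.clip(np.diag(K_post), 0, None))
print(np.mean(np.abs(fs - mu_post) <= sd))  # should be roughly 0.68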

(f)¶

Now, consider the setting where the measurements are corrupted with noise, $y_i = f(\mathbf{x}_i) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. Use $\sigma^2 = 0.1$ and repeat (c)-(e) with this modification of the model. What is the difference in comparison to the previous plot? What is the interpretation?

Now the posterior mean $\boldsymbol{\mu}_{\mathrm{posterior}}$ for $\mathbf{x}_*$ is given by \begin{equation*} \boldsymbol{\mu}_{\mathrm{posterior}}(\mathbf{x}_*) = \kappa(\mathbf{x}, \mathbf{x}_*)^\mathsf{T} [\kappa(\mathbf{x}, \mathbf{x}) + \sigma^2 I]^{-1} \mathbf{y} = {\Big([\kappa(\mathbf{x}, \mathbf{x}) + \sigma^2 I]^{-1} \kappa(\mathbf{x}, \mathbf{x}_*)\Big)}^\mathsf{T} \mathbf{y}, \end{equation*} and the posterior covariance $\mathbf{K}_{\mathrm{posterior}}$ for $\mathbf{x}_*$ by \begin{equation*} \mathbf{K}_{\mathrm{posterior}}(\mathbf{x}_*, \mathbf{x}_*) = \kappa(\mathbf{x}_*, \mathbf{x}_*) - \kappa(\mathbf{x}, \mathbf{x}_*)^{\mathsf{T}}[\kappa(\mathbf{x}, \mathbf{x}) + \sigma^2 I]^{-1} \kappa(\mathbf{x}, \mathbf{x}_*). \end{equation*}

In [12]:
# compute kernel matrices, with noise
s2 = 0.1
Kx = k(x, x, l, sf2) + s2 * np.eye(n)
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)

# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A

# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
    mean=mu_post, cov=K_post, size=25, random_state=1234
)

# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title("l = sqrt(2), with noise")
plt.show()

plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs,
        mu_post + i*np.sqrt(np.diag(K_post)),
        mu_post - i*np.sqrt(np.diag(K_post)),
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title("l = sqrt(2), with noise")
plt.show()

The posterior no longer passes exactly through all observations; the discrepancies are instead "explained" as measurement noise. Accordingly, the credibility region has non-zero width even at the training inputs.
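
To make this concrete (an aside), the residuals between the observations and the posterior mean at the training inputs are no longer zero; they are attributed to the noise term:

# residuals y - μ_posterior at the grid points closest to the training inputs
idx = np.argmin(np.abs(xs[:, None] - x[None, :]), axis=0)
print(y - mu_post[idx])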

(g)¶

Explore what happens with another length scale $\ell$.

In [13]:
# use length scale ℓ = 1/2
l = 1/2

# compute kernel matrices, with noise
s2 = 0.1
Kx = k(x, x, l, sf2) + s2 * np.eye(n)
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)

# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A

# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
    mean=mu_post, cov=K_post, size=25, random_state=1234
)

# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title("l = 0.5, with noise")
plt.show()

plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs,
        mu_post + i*np.sqrt(np.diag(K_post)),
        mu_post - i*np.sqrt(np.diag(K_post)),
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title("l = 0.5, with noise")
plt.show()
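
With the shorter length scale $\ell = 1/2$ the posterior mean varies more rapidly, and both the mean and the credibility region revert to the prior (zero mean, unit marginal variance) within a short distance of the observations, since function values are only correlated over a short range.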

Exercise 4.3: Other covariance functions/kernels¶

The squared exponential kernel/covariance function gives samples that are smooth (infinitely differentiable). Other kernels make other assumptions. Now try the previous problems using the exponential kernel instead, \begin{equation*} \kappa(x, x') = \exp{\Big(-\frac{1}{\ell}\|x - x'\|\Big)}. \end{equation*} This is the covariance function of an Ornstein–Uhlenbeck process, whose sample paths are continuous but nowhere differentiable, so we should expect much rougher samples.

We change the function that computes the kernel matrix:

In [14]:
# function that computes the kernel matrix κ(x1, x2) for the exponential kernel
def k(x1, x2, l, sf2):
    return sf2 * np.exp(-1/l * np.abs(x1[:, None] - x2[None, :]))

We repeat the analysis above:

In [15]:
# compute kernel matrices, without noise
l = np.sqrt(2)
sf2 = 1
Kx = k(x, x, l, sf2)
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)

# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A

# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
    mean=mu_post, cov=K_post, size=25, random_state=1234
)

# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title("l = sqrt(2), without noise")
plt.show()

plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs,
        mu_post + i*np.sqrt(np.diag(K_post)),
        mu_post - i*np.sqrt(np.diag(K_post)),
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title("l = sqrt(2), without noise")
plt.show()
In [16]:
# compute kernel matrices, with noise
l = np.sqrt(2)
sf2 = 1
s2 = 0.1
Kx = k(x, x, l, sf2) + s2 * np.eye(n)
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)

# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A

# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
    mean=mu_post, cov=K_post, size=25, random_state=1234
)

# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title("l = sqrt(2), with noise")
plt.show()

plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs,
        mu_post + i*np.sqrt(np.diag(K_post)),
        mu_post - i*np.sqrt(np.diag(K_post)),
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title("l = sqrt(2), with noise")
plt.show()
In [17]:
# compute kernel matrices, with noise and ℓ = 1/2
l = 1/2
sf2 = 1
s2 = 0.1
Kx = k(x, x, l, sf2) + s2 * np.eye(n)
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)

# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A

# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
    mean=mu_post, cov=K_post, size=25, random_state=1234
)

# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title("l = 0.5, with noise")
plt.show()

plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs,
        mu_post + i*np.sqrt(np.diag(K_post)),
        mu_post - i*np.sqrt(np.diag(K_post)),
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title("l = 0.5, with noise")
plt.show()
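
As expected, the posterior under the exponential kernel looks quite different: the samples are much rougher (continuous but not smooth), the posterior mean has kinks at the observations, and the credibility region widens quickly away from the data.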

Exercise 4.4: Learning hyperparameters¶

Until now, we have performed GP regression with predefined hyperparameters, such as the length scale $\ell$ and the noise variance $\sigma^2$. In this exercise, we will estimate $\ell$ and $\sigma^2$ from the data by maximizing the marginal likelihood. The logarithm of the marginal likelihood for a Gaussian process observed with Gaussian noise is \begin{equation*} \log p(\mathbf{y} | \mathbf{x}; \ell, \sigma_f^2, \sigma^2) = \log \mathcal{N}(\mathbf{y} | 0, \mathbf{K}_y) = -\frac{1}{2} \mathbf{y}^\mathsf{T} \mathbf{K}_y^{-1} \mathbf{y} - \frac{1}{2} \log{|\mathbf{K}_y|} - \frac{n}{2}\log{(2\pi)} \end{equation*} where $\mathbf{K}_y = \kappa(\mathbf{x}, \mathbf{x}) + \sigma^2 \mathbf{I}$.

(a)¶

Write a function that takes $\mathbf{x}$, $\mathbf{y}$, $\ell$, $\sigma_f^2$, and $\sigma^2$ as inputs and produces the logarithm of the marginal likelihood as output for the squared exponential covariance function.

In [18]:
# function that computes the kernel matrix κ(x, y)
def k(x1, x2, l, sf2):
    return sf2 * np.exp(-1/(2 * l**2) * np.abs(x1[:, None] - x2[None, :])**2)

def log_marginal_likelihood(x, y, l, sf2, s2):
    n = len(x)
    Ky = k(x, x, l, sf2) + s2 * np.eye(n)
    _, logdet = np.linalg.slogdet(Ky)
    return -(y.T@(np.linalg.solve(Ky, y)) + logdet + n * np.log(2 * np.pi)) / 2
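
As a quick sanity check (an aside), the function can be evaluated for the hyperparameters used in Exercise 4.2 (f), $\ell = \sqrt{2}$, $\sigma_f^2 = 1$, $\sigma^2 = 0.1$:

# log marginal likelihood of the data from Exercise 4.2 for the hyperparameters used there
print(log_marginal_likelihood(x, y, np.sqrt(2), 1.0, 0.1))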

(b)¶

Consider the same data as before. Use $\sigma_f^2 = 1$ and $\sigma^2 = 0$, compute the logarithm of the marginal likelihood for values of $\ell$ between 0.1 and 1, and plot it. For which value of $\ell$ does the marginal likelihood appear to be maximal on this interval? Do GP regression based on this value of $\ell$.

In [19]:
# observations
n = 5
x = np.array([-4, -3, -1, 0, 2])
y = np.array([-2, 0, 1, 2, -1])

# fixed hyperparameters
sf2 = 1
s2 = 0

# compute values of the marginal likelihood
ls = np.linspace(0.1, 1, 101)
logliks = [log_marginal_likelihood(x, y, l, sf2, s2) for l in ls]

# plot values
plt.plot(ls, logliks)
plt.show()

The marginal likelihood seems to be maximized for a value of $\ell$ around 0.6. We can determine the maximum more precisely by minimizing the negative log marginal likelihood with scipy.optimize.minimize. Since the length scale $\ell$ must be positive, we optimize over $\log \ell$.

In [20]:
result = optimize.minimize(
    lambda log_l: -log_marginal_likelihood(x, y, np.exp(log_l), sf2, s2),
    np.log(np.array([0.6])),
)
result
Out[20]:
      fun: 9.29788863452336
 hess_inv: array([[0.21693984]])
      jac: array([0.])
  message: 'Optimization terminated successfully.'
     nfev: 12
      nit: 4
     njev: 6
   status: 0
  success: True
        x: array([-0.46100059])
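
As an aside, since this is a one-dimensional problem over a known interval, a bounded scalar optimizer could be used on $\ell$ directly instead of the log transformation; a minimal sketch:

# alternative: bounded scalar optimization over ℓ itself
result_scalar = optimize.minimize_scalar(
    lambda l_: -log_marginal_likelihood(x, y, l_, sf2, s2),
    bounds=(0.1, 1), method='bounded',
)
print(result_scalar.x)  # ℓ directly (not log ℓ)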

We use the optimal value of $\ell$ in our GP regression:

In [21]:
# compute kernel matrices, without noise
l = np.exp(result.x[0])
sf2 = 1
Kx = k(x, x, l, sf2)
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)

# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A

# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
    mean=mu_post, cov=K_post, size=25, random_state=1234
)

# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title(f"l = {l:.3f}, without noise")
plt.show()

plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs,
        mu_post + i*np.sqrt(np.diag(K_post)),
        mu_post - i*np.sqrt(np.diag(K_post)),
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title(f"l = {l:.3f}, without noise")
plt.show()

We can also optimize both $\ell$ and $\sigma^2$ jointly (we still fix $\sigma_f^2 = 1$). Again, we perform the optimization in log space to keep both parameters positive.

In [22]:
# optimize hyperparameters
result = optimize.minimize(
    lambda params: -log_marginal_likelihood(x, y, np.exp(params[0]), sf2, np.exp(params[1])),
    np.log(np.array([np.sqrt(2), 0.1])),
)
print(result.message)

# compute kernel matrices, with noise
l = np.exp(result.x[0])
sf2 = 1
s2 = np.exp(result.x[1])
Kx = k(x, x, l, sf2) + s2 * np.eye(len(x))
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)

# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A

# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
    mean=mu_post, cov=K_post, size=25, random_state=1234
)

# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title(f"l = {l:.3f}, with noise σ2 = {s2:.3f}")
plt.show()

plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs,
        mu_post + i*np.sqrt(np.diag(K_post)),
        mu_post - i*np.sqrt(np.diag(K_post)),
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title(f"l = {l:.3f}, with noise σ2 = {s2:.3f}")
plt.show()
Optimization terminated successfully.

Exercise 4.5: Learning hyperparameters II¶

In this exercise we investigate a setting where the marginal likelihood has multiple local maxima (equivalently, the negative log marginal likelihood that we minimize has multiple local minima).

(a)¶

Now, consider the following data \begin{equation*} \mathbf{x} = [-5, -3, 0, 0.1, 1, 4.9, 5]^\mathsf{T}, \qquad \mathbf{y} = [0, -0.5, 1, 0.7, 0, 1, 0.7]^\mathsf{T} \end{equation*} and compute the log marginal likelihood as a function of both $\ell$ and $\sigma^2$. Use a logarithmic 2D grid with values of $\ell$ spanning from $10^{-1}$ to $10^{2}$ and values of $\sigma^2$ spanning from $10^{-2}$ to $10^{0}$. Visualize the marginal likelihood on that grid with a contour plot.

In [23]:
# observations
n = 7
x = np.array([-5, -3, 0, 0.1, 1, 4.9, 5])
y = np.array([0, -0.5, 1, 0.7, 0, 1, 0.7])

# function that computes the kernel matrix κ(x, y)
def k(x1, x2, l, sf2):
    return sf2 * np.exp(-1/(2 * l**2) * np.abs(x1[:, None] - x2[None, :])**2)

def log_marginal_likelihood(x, y, l, sf2, s2):
    n = len(x)
    Ky = k(x, x, l, sf2) + s2 * np.eye(n)
    _, logdet = np.linalg.slogdet(Ky)
    return -(y.T@(np.linalg.solve(Ky, y)) + logdet + n * np.log(2 * np.pi)) / 2    

# fixed hyperparameter
sf2 = 1

# values of ℓ and σ2 for the plot (a zoomed-in part of the grid from the problem statement, containing the interesting structure)
ls = np.logspace(-1, 1.1, 101)
s2s = np.logspace(-2, -0.5, 101)
logliks = [[log_marginal_likelihood(x, y, l, sf2, s2) for l in ls] for s2 in s2s]

# plot log marginal likelihood values
plt.contour(ls, s2s, np.array(logliks), 500, vmin=-7)
plt.colorbar()
plt.show()
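
The contour plot reveals two local maxima of the marginal likelihood: one at a relatively long length scale with a larger noise variance, and one at a short length scale with a smaller noise variance. The two starting points used below are chosen so that the optimizer converges to one optimum each.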

(b)¶

Find the hyperparameters $\ell$ and $\sigma^2$ that correspond to the maximal marginal likelihood. Perform GP regression on the data using these hyperparameters.

In [24]:
# optimize hyperparameters
result = optimize.minimize(
    lambda params: -log_marginal_likelihood(x, y, np.exp(params[0]), sf2, np.exp(params[1])),
    np.log(np.array([10, 0.2])),
)
print(result.message)

# compute kernel matrices, with noise
l = np.exp(result.x[0])
sf2 = 1
s2 = np.exp(result.x[1])
Kx = k(x, x, l, sf2) + s2 * np.eye(len(x))
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)

# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A

# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
    mean=mu_post, cov=K_post, size=25, random_state=1234
)

# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title(f"l = {l:.3f}, with noise σ2 = {s2:.3f}")
plt.show()

plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs,
        mu_post + i*np.sqrt(np.diag(K_post)),
        mu_post - i*np.sqrt(np.diag(K_post)),
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title(f"l = {l:.3f}, with noise σ2 = {s2:.3f}")
plt.show()
Optimization terminated successfully.

(d)¶

Perform GP regression for the hyperparameters that correspond to other possible local optima of the marginal likelihood. What differences do you see in your posterior?

In [25]:
# optimize hyperparameters
result = optimize.minimize(
    lambda params: -log_marginal_likelihood(x, y, np.exp(params[0]), sf2, np.exp(params[1])),
    np.log(np.array([1, 0.05])),
)
print(result.message)

# compute kernel matrices, with noise
l = np.exp(result.x[0])
sf2 = 1
s2 = np.exp(result.x[1])
Kx = k(x, x, l, sf2) + s2 * np.eye(len(x))
Kxs = k(xs, xs, l, sf2)
Kxxs = k(x, xs, l, sf2)

# compute mean and covariance of posterior
A = np.linalg.solve(Kx, Kxxs)
mu_post = A.T@y
K_post = Kxs - Kxxs.T@A

# sample from the posterior for x*
fs = stats.multivariate_normal.rvs(
    mean=mu_post, cov=K_post, size=25, random_state=1234
)

# plot samples, posterior mean and observations
plt.plot(xs, fs.T, color='0.6')
plt.plot(xs, mu_post)
plt.scatter(x, y)
plt.title(f"l = {l:.3f}, with noise σ2 = {s2:.3f}")
plt.show()

plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs,
        mu_post + i*np.sqrt(np.diag(K_post)),
        mu_post - i*np.sqrt(np.diag(K_post)),
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title(f"l = {l:.3f}, with noise σ2 = {s2:.3f}")
plt.show()
Optimization terminated successfully.
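
Compared with (b), this local optimum has a much shorter length scale: the posterior mean varies rapidly and follows the individual observations closely (in particular the nearly coincident inputs at 0 and 0.1 and at 4.9 and 5), and the uncertainty grows quickly away from the data. The optimum in (b) instead explains the data as a smooth, slowly varying function, attributing more of the variation to noise.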