In [1]:
import numpy as np
from matplotlib import pyplot as plt

from sklearn import gaussian_process
from sklearn.gaussian_process import kernels

import pickle
from urllib.request import urlopen

Gaussian processes in scikit-learn

Exercise 5.1: GP posterior

Repeat exercise 4.2 using GaussianProcessRegressor from sklearn.gaussian_process, only plotting the credibility regions based on the posterior predictive variance (not drawing samples).

Some useful hints:

  • As with all supervised machine learning methods in sklearn, you first construct an object of the model class (in this case GaussianProcessRegressor) and then train it on data with its member function fit(). To obtain predictions, use the member function predict(). To obtain information about the posterior predictive variance, you have to pass either return_std=True or return_cov=True to the latter (see the sketch after these hints).
  • When you construct the model, you have to define a kernel. The kernels are available in sklearn.gaussian_process.kernels, where the squared exponential/RBF kernel is available as RBF.
  • The function fit() automatically optimizes the hyperparameters. To turn that feature off, you have to pass the argument optimizer=None to GaussianProcessRegressor.
  • To include the measurement noise, you can formulate it as part of the kernel by using the kernel WhiteKernel.
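A minimal sketch of this workflow on toy data (the names X_toy, y_toy, Xs_toy and their values are placeholders for illustration only, not part of the exercise):

# minimal sketch: construct, fit, and predict with a GP regression model
import numpy as np
from sklearn import gaussian_process
from sklearn.gaussian_process import kernels

X_toy = np.array([[0.0], [1.0], [2.0]])  # inputs as a column matrix
y_toy = np.array([0.0, 1.0, 0.5])        # corresponding observations

gp_toy = gaussian_process.GaussianProcessRegressor(
    kernel=kernels.RBF(length_scale=1.0), optimizer=None
)
gp_toy.fit(X_toy, y_toy)  # train the model on the data

Xs_toy = np.array([[0.5], [1.5]])
mean, std = gp_toy.predict(Xs_toy, return_std=True)  # pointwise standard deviations
mean, cov = gp_toy.predict(Xs_toy, return_cov=True)  # full posterior covariance matrix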

We start by defining the observations from exercise 4.2:

In [2]:
x = np.array([-4, -3, -1, 0, 2])
y = np.array([-2, 0, 1, 2, -1])

We define the test inputs as in exercise 4.2:

In [3]:
xs = np.linspace(-5, 5, 101)

We define an RBF kernel with lengthscale $\ell = \sqrt{2}$ (i.e., $\ell^2 = 2$). Note that RBF uses $\sigma_f^2 = 1$; if you want to scale the kernel, you have to multiply it by a ConstantKernel, as sketched below.
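For example, a kernel with $\sigma_f^2 = 2$ could be written as follows (the value 2 is only for illustration; this exercise keeps $\sigma_f^2 = 1$):

# illustration only: RBF kernel scaled by sigma_f^2 = 2 via a ConstantKernel
scaled_kernel = kernels.ConstantKernel(constant_value=2.0) * kernels.RBF(length_scale=np.sqrt(2))

Multiplying a kernel by a plain number, as done in exercise 5.3 below, is shorthand for the same construction.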

In [4]:
l = np.sqrt(2)
kernel = kernels.RBF(length_scale=l)

We construct a Gaussian process with this RBF kernel. We specify optimizer=None to avoid any hyperparameter tuning and random_state to ensure that the results are reproducible. Note that GaussianProcessRegressor always uses a Gaussian process with zero mean.

In [5]:
gp = gaussian_process.GaussianProcessRegressor(
    kernel=kernel, optimizer=None, random_state=100
)
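If a non-zero mean is desired, one option (not used in this exercise) is the normalize_y argument, which centers (and, in recent scikit-learn versions, also scales) the observed targets before fitting and transforms the predictions back; a small sketch:

# sketch: let the model center the targets instead of assuming a zero mean for y
gp_normalized = gaussian_process.GaussianProcessRegressor(
    kernel=kernel, optimizer=None, normalize_y=True, random_state=100
)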

We update the Gaussian process regression model by fitting it to the observations, i.e., we compute the posterior Gaussian process. fit() expects a two-dimensional array (matrix) of inputs, and therefore we reshape x into a column matrix.
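For illustration, both of the following produce a column matrix of shape (5, 1); only the first form is used below:

# two equivalent ways of turning the 1d array x into a column matrix
print(x[:, None].shape)        # (5, 1)
print(x.reshape(-1, 1).shape)  # (5, 1)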

In [6]:
gp.fit(x[:, None], y)
Out[6]:
GaussianProcessRegressor(kernel=RBF(length_scale=1.41), optimizer=None,
                         random_state=100)

We compute the mean and the standard deviation of the normally distributed function values for the test inputs.

In [7]:
mu_post, std_post = gp.predict(xs[:, None], return_std=True)

We plot the data and the posterior distribution for the test inputs.

In [8]:
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs, mu_post + i*std_post, mu_post - i*std_post,
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title("l = sqrt(2), no noise")
plt.show()

We can add Gaussian measurement noise with variance $\sigma^2 = 0.1$ by constructing the kernel as a sum of an RBF kernel and a WhiteKernel.

In [9]:
l = np.sqrt(2)
s2 = 0.1
kernel = kernels.RBF(length_scale=l) + kernels.WhiteKernel(noise_level=s2)
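An alternative (not used in this solution) is to pass a fixed noise variance via the alpha argument of GaussianProcessRegressor, which adds it to the diagonal of the kernel matrix during fitting. The WhiteKernel formulation has the advantage that the noise level can be treated as a hyperparameter and optimized by fit(). A sketch, assuming the same $\ell$ and $\sigma^2$ as above:

# sketch: fixed measurement noise via the alpha argument instead of a WhiteKernel
gp_alpha = gaussian_process.GaussianProcessRegressor(
    kernel=kernels.RBF(length_scale=l), alpha=s2, optimizer=None, random_state=100
)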

We perform the same analysis again:

In [10]:
# construct Gaussian process (without hyperparameter tuning)
gp = gaussian_process.GaussianProcessRegressor(
    kernel=kernel, optimizer=None, random_state=100
)

# fit model to observations
gp.fit(x[:, None], y)

# posterior mean and standard deviation of function values for test inputs
mu_post, std_post = gp.predict(xs[:, None], return_std=True)

# plot data, posterior mean and credibility regions
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs, mu_post + i*std_post, mu_post - i*std_post,
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title("l = sqrt(2), with noise")
plt.show()

We repeat the study with length scale $\ell = 1/2$.

In [11]:
# construct kernel
l = 1/2
s2 = 0.1
kernel = kernels.RBF(length_scale=l) + kernels.WhiteKernel(noise_level=s2)

# construct GP regression model (without hyperparameter tuning)
gp = gaussian_process.GaussianProcessRegressor(
    kernel=kernel, optimizer=None, random_state=100
)

# fit model to observations
gp.fit(x[:, None], y)

# posterior mean and standard deviation of function values for test inputs
mu_post, std_post = gp.predict(xs[:, None], return_std=True)

# plot data, posterior mean and credibility regions
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs, mu_post + i*std_post, mu_post - i*std_post,
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.title("l = 0.5, with noise")
plt.show()

Exercise 5.2: Learning hyperparameters

Until now, we have performed GP regression with predefined hyperparameters, such as the lengthscale $\ell$ and noise variance $\sigma^2$. In this exercise, we will estimate $\ell$ and $\sigma^2$ from the data by maximizing the marginal likelihood (cf. exercises 4.4 and 4.5). This is done automatically by the fit() function in scikit-learn. Use, as before, the RBF kernel together with measurement noise, this time with the data \begin{equation*} \mathbf{x} = \begin{bmatrix}-5 &-3 &0 &0.1 &1 &4.9 &5\end{bmatrix}^\mathsf{T} \,\text{ and }\, \mathbf{y} = \begin{bmatrix}0 &-0.5 &1 &0.7 &0 &1 &0.7\end{bmatrix}^\mathsf{T}. \end{equation*}

(a)

You still have to provide initial values for the hyperparameters. Try $\ell = 1$ and $\sigma^2 = 0.1$. Which hyperparameters do you get after optimization? Plot the corresponding mean and credibility regions.

We define the data

In [12]:
x = np.array([-5, -3, 0, 0.1, 1, 4.9, 5])
y = np.array([0, -0.5, 1, 0.7, 0, 1, 0.7])

and the GP regression model with an RBF kernel plus measurement noise, initialized with the hyperparameters $\ell = 1$ and $\sigma^2 = 0.1$:

In [13]:
kernel = kernels.RBF(length_scale=1) + kernels.WhiteKernel(noise_level=0.1)
gp = gaussian_process.GaussianProcessRegressor(
    kernel=kernel, random_state=100
)

When we fit the GP regression model to the observations with fit(), the hyperparameters $\ell$ and $\sigma^2$ are optimized automatically. We see that the optimized kernel gp.kernel_ of the model has a length scale $\ell = 1$ and a noise variance $\sigma^2 = 0.032$.

In [14]:
gp.fit(x[:, None], y)
print(gp.kernel_)
RBF(length_scale=1) + WhiteKernel(noise_level=0.032)
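The optimized values can also be read out programmatically; for a sum kernel such as this one, the two components are available as k1 and k2 (a small sketch):

# extract the optimized hyperparameters and the log marginal likelihood at the optimum
print(gp.kernel_.k1.length_scale)         # lengthscale of the RBF component
print(gp.kernel_.k2.noise_level)          # noise variance of the WhiteKernel component
print(gp.log_marginal_likelihood_value_)  # log marginal likelihood of the fitted model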

We plot the posterior mean and credibility regions of the function values for 101 equally spaced test inputs between -6 and 6.

In [15]:
# posterior mean and standard deviation of function values for test inputs
xs = np.linspace(-6, 6, 101)
mu_post, std_post = gp.predict(xs[:, None], return_std=True)

# plot data, posterior mean and credibility regions
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs, mu_post + i*std_post, mu_post - i*std_post,
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.show()

(b)

Try instead to initialize with $\ell = 10$ and $\sigma^2 = 1$. What do you get now?

We repeat the analysis in part (a) but initialize the kernel with hyperparameters $\ell = 10$ and $\sigma^2 = 1$.

In [16]:
# initialize kernel and GP regression model
kernel = kernels.RBF(length_scale=10) + kernels.WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
    kernel=kernel, random_state=100
)

# fit model to observations and display optimized hyperparameters
gp.fit(x[:, None], y)
print(gp.kernel_)

# posterior mean and standard deviation of function values for test inputs
xs = np.linspace(-6, 6, 101)
mu_post, std_post = gp.predict(xs[:, None], return_std=True)

# plot data, posterior mean and credibility regions
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs, mu_post + i*std_post, mu_post - i*std_post,
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y)
plt.show()
RBF(length_scale=10.7) + WhiteKernel(noise_level=0.222)
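The marginal likelihood is not a concave function of the hyperparameters, so different initializations can end up in different local optima. One way to make the optimization more robust (not required by the exercise) is to restart it from several random initial values via the n_restarts_optimizer argument; a minimal sketch:

# sketch: restart the hyperparameter optimization from 10 random initializations
kernel_restarts = kernels.RBF(length_scale=10) + kernels.WhiteKernel(noise_level=1)
gp_restarts = gaussian_process.GaussianProcessRegressor(
    kernel=kernel_restarts, n_restarts_optimizer=10, random_state=100
)
gp_restarts.fit(x[:, None], y)
print(gp_restarts.kernel_)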

(c)

Try to explain what happens by making a grid over different hyperparameter values and inspecting the marginal likelihood at each point of that grid. The GaussianProcessRegressor class has a member function log_marginal_likelihood() which you may use. (Do not forget to turn off the hyperparameter optimization!)

We implement a function that computes the marginal likelihood for a given lengthscale $\ell$ and noise variance $\sigma^2$.

In [17]:
def marginal_likelihood(l, s2):
    # construct kernel with measurement noise
    kernel = kernels.RBF(length_scale=l) + kernels.WhiteKernel(noise_level=s2)

    # construct GP regression model (without hyperparameter tuning!)
    gp = gaussian_process.GaussianProcessRegressor(kernel=kernel, optimizer=None)
    
    # compute posterior GP and evaluate marginal likelihood
    gp.fit(x[:, None], y)
    return np.exp(gp.log_marginal_likelihood())

Using this function we compute and plot the marginal likelihood for a logarithmically spaced grid of lengthscales $\ell$ between 0.1 and 100 and noise variances $\sigma^2$ between 0.01 and 1.

In [18]:
# compute marginal likelihood for a grid of lengthscales and noise variances
ls = np.logspace(-1, 2, 51)
s2s = np.logspace(-2, 0, 51)
marginal_likelihoods = [[marginal_likelihood(l, s2) for l in ls] for s2 in s2s]

# plot marginal likelihoods
plt.contour(ls, s2s, marginal_likelihoods, levels=20)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("l")
plt.ylabel("s2")
plt.show()
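To see which of the two solutions attains the higher marginal likelihood, we can also evaluate the function above at the (rounded) hyperparameters printed in parts (a) and (b):

# compare the marginal likelihood at the two local optima found above
print(marginal_likelihood(1.0, 0.032))   # optimum found in part (a)
print(marginal_likelihood(10.7, 0.222))  # optimum found in part (b)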

Exercise 5.3: Modeling CO₂ levels

The amount of carbon dioxide in the atmosphere has been measured continuously at the Mauna Loa observatory, Hawaii. In this problem, you should use a Gaussian process to model the data from 1958 to 2003, and see how well that model can be used for predicting the data from 2004 to 2019. The observatory presents its latest data on its homepage, but for your convenience you can use the data in the format available here. You can load the data with the following code snippet:

In [19]:
# download data
data = pickle.load(
    urlopen("https://github.com/gpschool/labs/raw/2019/.resources/mauna_loa")
)

# extract observations and test data
x = data['X'].flatten()
y = data['Y'].flatten()
xtest = data['Xtest'].flatten()
ytest = data['Ytest'].flatten()

Here, x and y contain your training data.

Start by exploring some simple kernels; thereafter you may have a look at pages 118-122 of the book Gaussian Processes for Machine Learning for some inspiration on how to design a more bespoke kernel for this problem.

We create a GP regression model with an RBF kernel and measurement noise, and fit it to the data with hyperparameter tuning enabled.

In [20]:
kernel = 100 * kernels.RBF(length_scale=100) + kernels.WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(kernel=kernel)
gp.fit(x[:, None], y)
print(gp.kernel_)
316**2 * RBF(length_scale=52.4) + WhiteKernel(noise_level=4.51)
.../sklearn/gaussian_process/kernels.py:411: ConvergenceWarning: The optimal value found for dimension 0 of parameter k1__k1__constant_value is close to the specified upper bound 100000.0. Increasing the bound and calling fit again may find a better value.
  warnings.warn("The optimal value found for "

We plot the posterior mean and credibility regions for the years 1957 to 2040.

In [21]:
# posterior mean and standard deviation of function values for test inputs
xs = np.linspace(1957, 2040, 200)
mu_post, std_post = gp.predict(xs[:, None], return_std=True)

# plot data, posterior mean and credibility regions
plt.plot(xs, mu_post)
for i in [3, 2, 1]:
    # plot credibility regions (1, 2, and 3 standard deviations)
    plt.fill_between(
        xs, mu_post + i*std_post, mu_post - i*std_post,
        color=str(0.4 + 0.15*i), # '0' => black, '1' => white
    )
plt.scatter(x, y, s=1)
plt.scatter(xtest, ytest, s=1)
plt.show()
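As a starting point for a more bespoke kernel, the structure discussed in the book combines a long-term trend, a quasi-periodic seasonal component, medium-term irregularities, and noise. A sketch of that structure in scikit-learn is given below; the numbers are rough initial guesses rather than fitted values, and normalize_y=True is an assumption made here to handle the non-zero mean level of the CO₂ data:

# sketch of a composite kernel for the CO2 data, loosely following the book
long_term = 50.0**2 * kernels.RBF(length_scale=50.0)
seasonal = 2.0**2 * kernels.RBF(length_scale=100.0) * kernels.ExpSineSquared(
    length_scale=1.0, periodicity=1.0
)
irregularities = 0.5**2 * kernels.RationalQuadratic(length_scale=1.0, alpha=1.0)
noise = 0.1**2 * kernels.RBF(length_scale=0.1) + kernels.WhiteKernel(noise_level=0.1)
co2_kernel = long_term + seasonal + irregularities + noise

gp_co2 = gaussian_process.GaussianProcessRegressor(kernel=co2_kernel, normalize_y=True)
# gp_co2.fit(x[:, None], y)  # fitting with hyperparameter tuning can take a while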

More elaborate models are implemented, e.g., here and in exercise 7 here (although using a different GP package).