May 4, 2019 · notes math ml

Rough Notes on Gaussian Processes

References Used

I'll start off by listing the references that were useful to me for understanding Gaussian Processes. If you want a deep understanding of how and why Gaussian Processes work, definitely check them out.

If you want to use them in practice, the scikit-learn implementation and the following guide should be enough.
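For the practical route, here's a minimal sketch of scikit-learn's GaussianProcessRegressor (the toy data and kernel settings below are made up for illustration):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy 1D regression data (made up for illustration)
X_train = np.array([[1.0], [3.0], [5.0], [6.0]])
y_train = np.sin(X_train).ravel()

# alpha adds noise to the covariance diagonal; the kernel's
# hyperparameters are optimized automatically during fit()
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gpr.fit(X_train, y_train)

# Predictive mean and standard deviation at new points
X_test = np.linspace(0.0, 7.0, 50).reshape(-1, 1)
y_mean, y_std = gpr.predict(X_test, return_std=True)
```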

Summary

  • Gaussian Processes place a probability distribution over the functions, or predictions, that could fit a regression task
  • Each unknown (test) data point is considered a dimension in a multivariate Gaussian distribution
    • Remember that each variable in a multivariate Gaussian distribution is correlated with the others to varying degrees, and these correlations are captured in a covariance matrix
  • The prior distribution has dimension \(|\text{training}| + |\text{test}|\)
  • The posterior distribution is the conditional distribution, obtained by conditioning on the observed training data, and therefore it has dimension \(|\text{test}|\)
  • The prior distribution is constructed (assuming \(\mu = 0\)) using a kernel that incorporates our knowledge. Examples include the RBF, Periodic, and Linear kernels
    • The mean \(\mu\) can safely be set to zero, since the data's mean can be subtracted before fitting and added back to predictions afterwards without affecting the distribution
    • The kernel only requires the \(X\) values from an \((X, y)\) training+test dataset to construct the covariance matrix
  • The posterior distribution can then be sampled from, or the best (posterior mean) function/prediction can be extracted
  • If the training data is noisy, an error term can be added to the diagonal of the covariance matrix to improve the model (see the sketch after this list)
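To make the workflow above concrete, here is a minimal from-scratch sketch in NumPy, assuming a zero-mean prior, an RBF kernel, and made-up toy data:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared Exponential (RBF) covariance between two sets of 1D inputs."""
    sq_dists = (a.reshape(-1, 1) - b.reshape(1, -1)) ** 2
    return np.exp(-0.5 * sq_dists / length_scale ** 2)

# Toy training data and test inputs (values made up for illustration)
X_train = np.array([1.0, 3.0, 5.0, 6.0])
y_train = np.sin(X_train)
X_test = np.linspace(0.0, 7.0, 50)

noise = 1e-2  # error term for noisy observations

# Blocks of the joint prior covariance over |training| + |test| dimensions
K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
K_s = rbf_kernel(X_train, X_test)   # training vs. test
K_ss = rbf_kernel(X_test, X_test)   # test vs. test

# Condition on the training data: the posterior is a Gaussian over the
# |test| dimensions only (zero prior mean assumed)
K_inv = np.linalg.inv(K)
mu_post = K_s.T @ K_inv @ y_train
cov_post = K_ss - K_s.T @ K_inv @ K_s

# Sample candidate functions from the posterior
samples = np.random.multivariate_normal(mu_post, cov_post, size=3)
```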

Benefits

  • Provides a range of functions/predictions within a confidence interval, rather than a single point estimate
  • Kernels can be chosen that are appropriate for your dataset, thus incorporating domain knowledge
  • The hyperparameters of a kernel can be optimized based on the training data
  • Kernels can be added and multiplied to create completely new kernels, in order to improve your prior distribution (see the composition example after this list)
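As a sketch of such composition using scikit-learn's kernel classes (the particular combination below is an arbitrary example):

```python
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, DotProduct

# Kernels compose with * and +: here, a locally periodic component
# (smooth * periodic) plus a linear trend
k_smooth = RBF(length_scale=10.0)
k_periodic = ExpSineSquared(length_scale=1.0, periodicity=1.0)
k_linear = DotProduct()

composite = k_smooth * k_periodic + k_linear
```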

Limitations

  • Gaussian Processes model functions with a single (scalar) output, i.e. "regression"
  • Not as efficient in very high-dimensional input spaces, and exact inference scales as \(O(n^3)\) in the number of training points, since the covariance matrix must be inverted

Extensions of Gaussian Processes

  • Can be used for classification by fitting a Gaussian Process to each class probability (trained "one-versus-rest"; see the example after this list)
  • Not necessarily an extension, but kernels used in Gaussian Processes are similar to those used in Support Vector Machines
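A minimal sketch of this, using scikit-learn's GaussianProcessClassifier (the dataset and kernel choices here are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

X, y = load_iris(return_X_y=True)

# One latent GP is fit per class and combined one-versus-rest
gpc = GaussianProcessClassifier(kernel=RBF(), multi_class="one_vs_rest")
gpc.fit(X, y)

# Class probabilities for the first three samples
print(gpc.predict_proba(X[:3]))
```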

Choosing a Kernel

David Duvenaud from the University of Toronto has published an excellent guide on choosing an appropriate kernel. The most common kernels to use are the Squared Exponential kernel and the Rational Quadratic kernel.
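For reference, with length scale \(\ell\), output variance \(\sigma^2\), and shape parameter \(\alpha\), these two kernels are:

\[
k_{\text{SE}}(x, x') = \sigma^2 \exp\!\left(-\frac{(x - x')^2}{2\ell^2}\right),
\qquad
k_{\text{RQ}}(x, x') = \sigma^2 \left(1 + \frac{(x - x')^2}{2\alpha\ell^2}\right)^{-\alpha}
\]

The Rational Quadratic kernel is a scale mixture of Squared Exponential kernels over different length scales, and it recovers the Squared Exponential kernel as \(\alpha \to \infty\).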