{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n========================================================\nGaussian process regression (GPR) on Mauna Loa CO2 data.\n========================================================\n\nThis example is based on Section 5.4.3 of \"Gaussian Processes for Machine\nLearning\" [RW2006]. It illustrates an example of complex kernel engineering and\nhyperparameter optimization using gradient ascent on the\nlog-marginal-likelihood. The data consists of the monthly average atmospheric\nCO2 concentrations (in parts per million by volume (ppmv)) collected at the\nMauna Loa Observatory in Hawaii, between 1958 and 2001. The objective is to\nmodel the CO2 concentration as a function of the time t.\n\nThe kernel is composed of several terms that are responsible for explaining\ndifferent properties of the signal:\n\n- a long term, smooth rising trend is to be explained by an RBF kernel. The\n RBF kernel with a large length-scale enforces this component to be smooth;\n it is not enforced that the trend is rising which leaves this choice to the\n GP. The specific length-scale and the amplitude are free hyperparameters.\n\n- a seasonal component, which is to be explained by the periodic\n ExpSineSquared kernel with a fixed periodicity of 1 year. The length-scale\n of this periodic component, controlling its smoothness, is a free parameter.\n In order to allow decaying away from exact periodicity, the product with an\n RBF kernel is taken. The length-scale of this RBF component controls the\n decay time and is a further free parameter.\n\n- smaller, medium term irregularities are to be explained by a\n RationalQuadratic kernel component, whose length-scale and alpha parameter,\n which determines the diffuseness of the length-scales, are to be determined.\n According to [RW2006], these irregularities can better be explained by\n a RationalQuadratic than an RBF kernel component, probably because it can\n accommodate several length-scales.\n\n- a \"noise\" term, consisting of an RBF kernel contribution, which shall\n explain the correlated noise components such as local weather phenomena,\n and a WhiteKernel contribution for the white noise. The relative amplitudes\n and the RBF's length scale are further free parameters.\n\nMaximizing the log-marginal-likelihood after subtracting the target's mean\nyields the following kernel with an LML of -83.214::\n\n 34.4**2 * RBF(length_scale=41.8)\n + 3.27**2 * RBF(length_scale=180) * ExpSineSquared(length_scale=1.44,\n periodicity=1)\n + 0.446**2 * RationalQuadratic(alpha=17.7, length_scale=0.957)\n + 0.197**2 * RBF(length_scale=0.138) + WhiteKernel(noise_level=0.0336)\n\nThus, most of the target signal (34.4ppm) is explained by a long-term rising\ntrend (length-scale 41.8 years). The periodic component has an amplitude of\n3.27ppm, a decay time of 180 years and a length-scale of 1.44. The long decay\ntime indicates that we have a locally very close to periodic seasonal\ncomponent. The correlated noise has an amplitude of 0.197ppm with a length\nscale of 0.138 years and a white-noise contribution of 0.197ppm. Thus, the\noverall noise level is very small, indicating that the data can be very well\nexplained by the model. The figure shows also that the model makes very\nconfident predictions until around 2015.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Authors: Jan Hendrik Metzen \n#\n# License: BSD 3 clause\n\nfrom __future__ import division, print_function\n\nimport numpy as np\n\nfrom matplotlib import pyplot as plt\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.gaussian_process import GaussianProcessRegressor\nfrom sklearn.gaussian_process.kernels \\\n import RBF, WhiteKernel, RationalQuadratic, ExpSineSquared\n\nprint(__doc__)\n\n\ndef load_mauna_loa_atmospheric_co2():\n ml_data = fetch_openml(data_id=41187)\n months = []\n ppmv_sums = []\n counts = []\n\n y = ml_data.data[:, 0]\n m = ml_data.data[:, 1]\n month_float = y + (m - 1) / 12\n ppmvs = ml_data.target\n\n for month, ppmv in zip(month_float, ppmvs):\n if not months or month != months[-1]:\n months.append(month)\n ppmv_sums.append(ppmv)\n counts.append(1)\n else:\n # aggregate monthly sum to produce average\n ppmv_sums[-1] += ppmv\n counts[-1] += 1\n\n months = np.asarray(months).reshape(-1, 1)\n avg_ppmvs = np.asarray(ppmv_sums) / counts\n return months, avg_ppmvs\n\n\nX, y = load_mauna_loa_atmospheric_co2()\n\n# Kernel with parameters given in GPML book\nk1 = 66.0**2 * RBF(length_scale=67.0) # long term smooth rising trend\nk2 = 2.4**2 * RBF(length_scale=90.0) \\\n * ExpSineSquared(length_scale=1.3, periodicity=1.0) # seasonal component\n# medium term irregularity\nk3 = 0.66**2 \\\n * RationalQuadratic(length_scale=1.2, alpha=0.78)\nk4 = 0.18**2 * RBF(length_scale=0.134) \\\n + WhiteKernel(noise_level=0.19**2) # noise terms\nkernel_gpml = k1 + k2 + k3 + k4\n\ngp = GaussianProcessRegressor(kernel=kernel_gpml, alpha=0,\n optimizer=None, normalize_y=True)\ngp.fit(X, y)\n\nprint(\"GPML kernel: %s\" % gp.kernel_)\nprint(\"Log-marginal-likelihood: %.3f\"\n % gp.log_marginal_likelihood(gp.kernel_.theta))\n\n# Kernel with optimized parameters\nk1 = 50.0**2 * RBF(length_scale=50.0) # long term smooth rising trend\nk2 = 2.0**2 * RBF(length_scale=100.0) \\\n * ExpSineSquared(length_scale=1.0, periodicity=1.0,\n periodicity_bounds=\"fixed\") # seasonal component\n# medium term irregularities\nk3 = 0.5**2 * RationalQuadratic(length_scale=1.0, alpha=1.0)\nk4 = 0.1**2 * RBF(length_scale=0.1) \\\n + WhiteKernel(noise_level=0.1**2,\n noise_level_bounds=(1e-3, np.inf)) # noise terms\nkernel = k1 + k2 + k3 + k4\n\ngp = GaussianProcessRegressor(kernel=kernel, alpha=0,\n normalize_y=True)\ngp.fit(X, y)\n\nprint(\"\\nLearned kernel: %s\" % gp.kernel_)\nprint(\"Log-marginal-likelihood: %.3f\"\n % gp.log_marginal_likelihood(gp.kernel_.theta))\n\nX_ = np.linspace(X.min(), X.max() + 30, 1000)[:, np.newaxis]\ny_pred, y_std = gp.predict(X_, return_std=True)\n\n# Illustration\nplt.scatter(X, y, c='k')\nplt.plot(X_, y_pred)\nplt.fill_between(X_[:, 0], y_pred - y_std, y_pred + y_std,\n alpha=0.5, color='k')\nplt.xlim(X_.min(), X_.max())\nplt.xlabel(\"Year\")\nplt.ylabel(r\"CO$_2$ in ppm\")\nplt.title(r\"Atmospheric CO$_2$ concentration at Mauna Loa\")\nplt.tight_layout()\nplt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 0
}