K Nearest Neighbors Regression with Python

Last Update: December 30, 2020

Supervised machine learning consists of predicting the class or the value of output target data by learning its relationship with input predictor data. The main supervised learning tasks are classification and regression.

This topic is part of the Regression Machine Learning with Python course. Feel free to take a look at the Course Curriculum.

This tutorial has an educational and informational purpose and doesn’t constitute any type of forecasting, business, trading or investment advice. All content, including code and data, is presented for personal educational use exclusively and with no guarantee of exactness or completeness. Past performance doesn’t guarantee future results. Please read the full Disclaimer.

An example of a supervised learning algorithm is k nearest neighbors [1], which stores the training data and predicts the output target feature as the average of the target values of the observations whose input predictor features are nearest to the query. Time series cross-validation is used for estimating or fine tuning the optimal number of nearest neighbors.

1. Distance function definition.

The distance function measures the similarity between observations of the input predictor features, which can be done through the Euclidean, Manhattan or Minkowski functions. A smaller distance means more similar observations.

1.1. Euclidean distance function formula notation.


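d_{Euc}(x,y) = \sqrt{\sum_{t=1}^{n} (x_{t} - y_{t})^{2}}

Where d_{Euc}(x,y) = Euclidean distance between observations x and y, x_{t}, y_{t} = t-th input predictor feature value of each observation, n = number of input predictor features. With a single predictor feature, as in the code example of this tutorial, the distance reduces to the absolute difference |x - y|.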
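The three distance functions named above can be sketched in a few lines of NumPy. This is only an illustration: the function names and the sample vectors are hypothetical and not part of the tutorial code. It relies on the fact that Euclidean and Manhattan distances are the Minkowski distance of order 2 and 1.

```python
import numpy as np


def minkowski_distance(x, y, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))


def euclidean_distance(x, y):
    # Minkowski distance of order 2
    return minkowski_distance(x, y, p=2)


def manhattan_distance(x, y):
    # Minkowski distance of order 1
    return minkowski_distance(x, y, p=1)


# Hypothetical observations with two predictor features each
a = [0.0, 0.0]
b = [3.0, 4.0]
d_euc = euclidean_distance(a, b)
d_man = manhattan_distance(a, b)
print('Euclidean:', d_euc, 'Manhattan:', d_man)
```

For these two vectors the Euclidean distance is 5 (a 3-4-5 right triangle) and the Manhattan distance is 7 (3 + 4).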

2. Nearest neighbors algorithm definition.

The nearest neighbors algorithm searches the training data for the observations whose input predictor features are most similar to those of the observation being predicted, based on the distance function. The search is done with ball tree, k-dimensional tree or brute force algorithms.

  • The algorithm objective is to calculate the output target feature prediction as the average of the nearest neighbors' output target values, weighted equally or by the inverse of their distance. For regression, the average or arithmetic mean function is used.
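As a rough illustration of the search step, the sketch below runs scikit-learn's NearestNeighbors with each of the three search algorithms on the same data. The data are randomly generated and the variable names are hypothetical. The search algorithm is expected to affect speed only, so all three should return the same neighbors here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical data: 200 observations of a single predictor feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
query = np.array([[0.1]])

results = {}
for algo in ('brute', 'kd_tree', 'ball_tree'):
    nn = NearestNeighbors(n_neighbors=3, algorithm=algo, metric='euclidean').fit(X)
    # kneighbors returns distances and row positions, sorted by distance
    dist, idx = nn.kneighbors(query)
    results[algo] = (dist[0], idx[0])
    print(algo, idx[0], np.round(dist[0], 6))
```

Setting algorithm='auto', as the tutorial code does, leaves the choice among these options to scikit-learn.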

2.1. Nearest neighbors algorithm formula notation.


\hat{y}_{t} = \frac{1}{k} \sum_{i=1}^{k} y_{i}

Where \hat{y}_{t} = output target feature prediction, y_{i} = output target feature data of the i-th nearest neighbor, k = number of nearest neighbors. This is the equal weighted case.
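The prediction step can be sketched by hand for a single predictor feature. The function knn_predict and the sample data below are hypothetical and written only for illustration; this is not scikit-learn's implementation, although the 'uniform' and 'distance' options are meant to mirror the equal weighted and inverse of distance weighted averages described above.

```python
import numpy as np


def knn_predict(x_train, y_train, x_query, k, weights='uniform'):
    """Brute force k nearest neighbors prediction for one predictor feature."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # With one feature, Euclidean distance is the absolute difference
    dist = np.abs(x_train - x_query)
    nearest = np.argsort(dist, kind='stable')[:k]
    y_near = y_train[nearest]
    if weights == 'uniform':
        # Equal weighted average of the k nearest targets
        return float(y_near.mean())
    d_near = dist[nearest]
    if np.any(d_near == 0):
        # Assumption: exact matches take all the weight (avoids dividing by zero)
        return float(y_near[d_near == 0].mean())
    w = 1.0 / d_near
    return float(np.sum(w * y_near) / np.sum(w))


# Hypothetical training data
x_train = [1.0, 2.0, 4.0, 7.0]
y_train = [10.0, 20.0, 40.0, 70.0]

pred_uniform = knn_predict(x_train, y_train, 2.4, k=2)
pred_distance = knn_predict(x_train, y_train, 2.4, k=2, weights='distance')
print('Uniform:', pred_uniform, 'Distance weighted:', pred_distance)
```

The two nearest neighbors of 2.4 are 2.0 and 1.0, so the equal weighted prediction is (20 + 10) / 2 = 15, while the inverse of distance weighted prediction is pulled toward 20, the target of the closer neighbor.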

3. Python code example.

3.1. Import Python packages [2].

import numpy as np
import pandas as pd
import sklearn.neighbors as ml

3.2. K nearest neighbors regression data reading, target and predictor features creation, training and testing ranges delimiting.

  • Data: S&P 500® index replicating ETF (ticker symbol: SPY) daily adjusted close prices (2007-2015).
  • Daily arithmetic returns are used for the target feature (current day) and the predictor feature (previous day).
  • Target and predictor feature creation and the training and testing range delimiters are not fixed and are only included for educational purposes.
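# Read daily adjusted close prices indexed by date
spy = pd.read_csv('Data//K-Nearest-Neighbors-Regression-Data.txt', index_col='Date', parse_dates=True)
# Target feature: current day arithmetic return
rspy = spy.pct_change(1)
rspy.columns = ['rspy']
# Predictor feature: previous day arithmetic return
rspy1 = rspy.shift(1)
rspy1.columns = ['rspy1']
# Join both features and drop rows with missing values
rspyall = rspy.join(rspy1).dropna()
# Training and testing ranges (label-based date slices include both endpoints)
rspyt = rspyall.loc['2007-01-01':'2014-01-01']
rspyf = rspyall.loc['2014-01-01':'2016-01-01']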
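Since the data file is not reproduced here, the feature creation steps above can be checked on a small made-up price series. The prices, dates and column names below are hypothetical; the sketch only shows how pct_change and shift line up the current day return with the previous day return.

```python
import pandas as pd

# Hypothetical prices for five consecutive days
dates = pd.date_range('2020-01-01', periods=5, freq='D')
prices = pd.DataFrame({'close': [100.0, 110.0, 99.0, 99.0, 108.9]}, index=dates)

# Current day arithmetic return (target feature)
ret = prices.pct_change(1)
ret.columns = ['ret']
# Previous day arithmetic return (predictor feature)
ret_lag1 = ret.shift(1)
ret_lag1.columns = ['ret_lag1']

# The first two rows are dropped: no return on day 1, no lagged return on day 2
features = ret.join(ret_lag1).dropna()
print(features)
```

Each row's predictor value equals the previous row's target value, which is the alignment the tutorial relies on.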

3.3. K nearest neighbors regression fitting, mean squared error calculation and output.

  • K nearest neighbors fitting and mean squared error calculation within training range.
  • K nearest neighbors number of neighbors, weight function, algorithm and distance metric are not fixed and are only included for educational purposes.
# Predictor must be 2-D (n_samples, n_features); target stays 1-D
xt = np.array(rspyt['rspy1']).reshape(-1, 1)
yt = rspyt['rspy']
# Fit with one and two nearest neighbors
knnt1 = ml.KNeighborsRegressor(n_neighbors=1, weights='uniform', algorithm='auto',
                               metric='euclidean').fit(xt, yt)
knnt2 = ml.KNeighborsRegressor(n_neighbors=2, weights='uniform', algorithm='auto',
                               metric='euclidean').fit(xt, yt)
# Mean squared error within training range
knntmse1 = ((yt - knnt1.predict(xt)) ** 2).mean()
knntmse2 = ((yt - knnt2.predict(xt)) ** 2).mean()
print('== K Nearest Neighbors Regression MSE ==')
print('Nearest Neighbors: 1 , MSE:', np.round(knntmse1, 8))
print('Nearest Neighbors: 2 , MSE:', np.round(knntmse2, 8))
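== K Nearest Neighbors Regression MSE ==
Nearest Neighbors: 1 , MSE: 6.6e-07
Nearest Neighbors: 2 , MSE: 0.00010235

Within the training range, the mean squared error with one nearest neighbor is close to zero because each training observation is its own nearest neighbor. This figure reflects memorization of the training data rather than predictive accuracy, so it should not be used on its own to choose the number of nearest neighbors.
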
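The introduction mentions time series cross-validation for choosing the number of nearest neighbors, and a testing range is created above but not used. One possible way to do both is sketched below with scikit-learn's TimeSeriesSplit and GridSearchCV. The returns are randomly generated stand-ins for the SPY data, and the split point and the grid of candidate values are arbitrary assumptions, so the numbers it prints say nothing about the real series.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical daily returns standing in for the tutorial data
rng = np.random.default_rng(42)
returns = rng.normal(0.0, 0.01, size=501)
X = returns[:-1].reshape(-1, 1)  # previous day return (predictor)
y = returns[1:]                  # current day return (target)

# Chronological split: first 400 observations for training, rest for testing
split = 400
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# One neighbor reproduces the training targets exactly when predictors are unique
knn1 = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)
train_mse_k1 = mean_squared_error(y_train, knn1.predict(X_train))

# Each fold trains on earlier observations and validates on later ones
search = GridSearchCV(
    KNeighborsRegressor(weights='uniform', metric='euclidean'),
    param_grid={'n_neighbors': [1, 2, 5, 10, 20, 50]},
    cv=TimeSeriesSplit(n_splits=5),
    scoring='neg_mean_squared_error',
)
search.fit(X_train, y_train)

best_k = search.best_params_['n_neighbors']
test_mse = mean_squared_error(y_test, search.predict(X_test))
print('Training MSE with 1 neighbor:', train_mse_k1)
print('Selected number of neighbors:', best_k)
print('Testing MSE:', test_mse)
```

TimeSeriesSplit keeps every validation fold after its training fold in time, which avoids using future observations to predict past ones.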
4. References.

[1] N.S. Altman. “An introduction to kernel and nearest-neighbor nonparametric regression”. The American Statistician. 1992.

[2] Travis E. Oliphant. “A guide to NumPy”. USA: Trelgol Publishing. 2006.

Stéfan van der Walt, S. Chris Colbert and Gaël Varoquaux. “The NumPy Array: A Structure for Efficient Numerical Computation”. Computing in Science & Engineering. 2011.

Wes McKinney. “Data Structures for Statistical Computing in Python”. Proceedings of the 9th Python in Science Conference. 2010.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay. “Scikit-learn: Machine Learning in Python”. Journal of Machine Learning Research. 2011.