Random Forest Regression with Python

Last Update: March 9, 2020

Supervised machine learning consists of assigning output target data to a class, or predicting its value, by learning an optimal mapping from input predictor data. The main supervised learning tasks are classification and regression.

This topic is part of the Regression Machine Learning with Python course. Feel free to take a look at the Course Curriculum.

This tutorial has an educational and informational purpose and doesn’t constitute any type of forecasting, business, trading or investment advice. All content, including code and data, is presented for personal educational use exclusively and with no guarantee of exactness or completeness. Past performance doesn’t guarantee future results. Please read the full Disclaimer.

An example of a supervised learning meta-algorithm is the random forest [1], which predicts the output target feature as the average of independently built decision trees combined through bootstrap aggregation, or bagging. Bagging is used to lower the variance error source of the independently built decision trees.

1. Trees algorithm definition.

The classification and regression trees (CART) algorithm is a greedy top-down approach that finds optimal recursive binary node splits by locally minimizing the variance at terminal nodes, measured through a sum of squared errors function at each stage.

1.1. Trees algorithm formula notation.

min\left ( mse \right )=\sum_{t=1}^{n}\left ( y_{t}-\hat{y}_{t} \right )^{2}

\hat{y}_{t}=\frac{1}{m}\sum_{j=1}^{m}y_{j}

Where y_{t} = output target feature data, \hat{y}_{t} = terminal node output target feature mean, n = number of observations, m = number of observations within the terminal node, j = index over the observations within the terminal node.
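As an illustrative sketch of this greedy split search, the snippet below scans candidate thresholds for a single predictor and keeps the one that minimizes the children's total sum of squared errors. The data and function names are hypothetical, for illustration only; a full CART implementation would apply this search recursively at every node.

```python
import numpy as np

def sse(y):
    # Sum of squared errors of y around the node mean
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_split(x, y):
    # Greedy search: try each candidate threshold and keep the one
    # that minimizes the total SSE of the two child nodes
    best_t, best_cost = None, np.inf
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
t, cost = best_split(x, y)  # separates the low-x group from the high-x group
```

With this toy data the search picks the threshold between 3.0 and 10.0, since splitting there leaves each child tightly clustered around its own mean.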

2. Trees bagging algorithm.

The trees bagging algorithm predicts the output target feature as the arithmetic mean of the predictions of independently built decision trees. Random forests combine random feature selection with bootstrap aggregation, or bagging. Bootstrapping consists of random sampling with replacement.

2.1. Trees bagging algorithm formula notation.

\hat{\bar{y}}_{t}=\frac{1}{k}\sum_{i=1}^{k}\hat{y}_{t,i}

Where \hat{\bar{y}}_{t} = independently built decision trees' output target feature mean prediction, \hat{y}_{t,i} = terminal node output target feature mean of the i-th decision tree, k = number of independently built decision trees.
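The bagging formula above can be sketched directly with scikit-learn decision trees: each of k trees is fitted on a bootstrap sample (random sampling with replacement) and the bagged prediction is the arithmetic mean of the k individual tree predictions. The synthetic data, seed and tree settings below are illustrative assumptions, not values prescribed by this tutorial.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)  # illustrative seed
X = rng.normal(size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

k = 25  # number of independently built decision trees
preds = []
for i in range(k):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap: sampling with replacement
    tree = DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx])
    preds.append(tree.predict(X))

y_hat = np.mean(preds, axis=0)  # bagged prediction: arithmetic mean over the k trees
```

This is essentially what `RandomForestRegressor` automates, with the addition of random feature selection at each split.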

3. Python code example.

3.1. Import Python packages [2].

import numpy as np
import pandas as pd
import sklearn.ensemble as ml

3.2. Random forest regression data reading, target and predictor features creation, training and testing ranges delimiting.

  • Data: S&P 500® index replicating ETF (ticker symbol: SPY) daily adjusted close prices (2007-2015).
  • Daily arithmetic returns of the data are used for the target feature (current day) and the predictor feature (previous day).
  • Target and predictor features creation and training and testing ranges delimiting are not fixed and only included for educational purposes.
spy = pd.read_csv('Data//Random-Forest-Regression-Data.txt', index_col='Date', parse_dates=True)
rspy = spy.pct_change(1)  # daily arithmetic returns (target feature, current day)
rspy.columns = ['rspy']
rspy1 = rspy.shift(1)  # previous day's return (predictor feature)
rspy1.columns = ['rspy1']
rspyall = rspy.join(rspy1)
rspyall = rspyall.dropna()  # drop rows left undefined by pct_change and shift
rspyt = rspyall['2007-01-01':'2014-01-01']  # training range
rspyf = rspyall['2014-01-01':'2016-01-01']  # testing range

3.3. Random forest regression fitting and output.

  • Random forest fitting within training range.
  • The random forest's number of independently built decision trees, their maximum depth and the maximum number of input predictor features randomly sampled at each split are not fixed and only included for educational purposes.
  • Random forest output results might be different depending on bootstrap random number generation seed.
# criterion='squared_error' replaces the deprecated 'mse' name used in scikit-learn < 1.0
rft1 = ml.RandomForestRegressor(n_estimators=1, criterion='squared_error', max_depth=1,
                                max_features=1, bootstrap=True).fit(
                                    np.array(rspyt['rspy1']).reshape(-1, 1), rspyt['rspy'])
rft2 = ml.RandomForestRegressor(n_estimators=2, criterion='squared_error', max_depth=1,
                                max_features=1, bootstrap=True).fit(
                                    np.array(rspyt['rspy1']).reshape(-1, 1), rspyt['rspy'])
In:
rfts1 = rft1.score(np.array(rspyt['rspy1']).reshape(-1, 1), rspyt['rspy'])
rfts2 = rft2.score(np.array(rspyt['rspy1']).reshape(-1, 1), rspyt['rspy'])
print('== Random Forest Regression Score ==')
print('')
print('Decision Trees: 1 , Score:', np.round(rfts1, 4))
print('Decision Trees: 2 , Score:', np.round(rfts2, 4))
Out:
== Random Forest Regression Score ==

Decision Trees: 1 , Score: 0.0151
Decision Trees: 2 , Score: 0.0196
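Scores above are computed within the training range; a natural follow-up is to also score a fitted forest within the testing range. The self-contained sketch below repeats the fit-and-score pattern on synthetic returns-like data (a hypothetical stand-in, for illustration only, since the out-of-sample score is not reported above):

```python
import numpy as np
import sklearn.ensemble as ml

# Hypothetical stand-in for the daily returns series (for illustration only)
rng = np.random.RandomState(0)
x = rng.normal(scale=0.01, size=500).reshape(-1, 1)   # predictor: previous-day return
y = 0.1 * x[:, 0] + rng.normal(scale=0.01, size=500)  # target: current-day return
xt, yt = x[:350], y[:350]  # training range
xf, yf = x[350:], y[350:]  # testing range

rf = ml.RandomForestRegressor(n_estimators=2, max_depth=1, max_features=1,
                              bootstrap=True, random_state=0).fit(xt, yt)
print('Training score:', np.round(rf.score(xt, yt), 4))
print('Testing score:', np.round(rf.score(xf, yf), 4))
```

The `score` method returns the coefficient of determination R², which can be negative out of sample when the model predicts worse than the mean of the testing-range target.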
4. References.

[1] L. Breiman. “Random Forests”. Machine Learning. 2001.

[2] Travis E. Oliphant. “A guide to NumPy”. USA: Trelgol Publishing. 2006.

Stéfan van der Walt, S. Chris Colbert and Gaël Varoquaux. “The NumPy Array: A Structure for Efficient Numerical Computation”. Computing in Science & Engineering. 2011.

Wes McKinney. “Data Structures for Statistical Computing in Python”. Proceedings of the 9th Python in Science Conference. 2010.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay. “Scikit-learn: Machine Learning in Python”. Journal of Machine Learning Research. 2011.