Gradient Boosting Machine Regression with Python

Last Update: September 22, 2020

Algorithm learning consists of training the algorithm on a training data subset to estimate optimal parameters, then testing it on a testing data subset using those previously optimized parameters. This corresponds to a supervised regression machine learning task.

This topic is part of the Machine Trading Analysis with Python course.

This tutorial has an educational and informational purpose and doesn’t constitute any type of trading or investment advice. All content, including code and data, is presented for personal educational use exclusively and with no guarantee of exactness or completeness. Past performance doesn’t guarantee future results. Please read the full Disclaimer.

An example of a supervised learning meta-algorithm is the gradient boosting machine [1], which predicts the output target feature by boosting optimally weighted, sequentially built decision trees. Boosting primarily lowers the squared bias error of shallow decision trees, because each new tree is fit to the errors left by the previous ones, while regularization such as the learning rate coefficient helps limit the variance error.

1. Trees algorithm definition.

The classification and regression trees (CART) algorithm uses a greedy top-down approach to find optimal recursive binary node splits. At each stage it locally minimizes the variance at the terminal nodes, measured through the sum of squared errors function.

1.1. Trees algorithm formula notation.

min\left ( sse \right )=\sum_{t=1}^{n}\left ( y_{t}-\hat{y}_{t} \right )^{2}

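\hat{y}_{t}=\frac{1}{m}\sum_{j=1}^{m}y_{j}

Where y_{t} = output target feature data, \hat{y}_{t} = mean of the output target feature in the terminal node that contains observation t, y_{j} = output target feature data of the observations in that terminal node, n = number of observations, m = number of observations in terminal node.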
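As an illustration of the formulas above, the following is a minimal sketch of a single greedy binary split on one predictor. The function name best_split and the toy arrays are hypothetical and are not part of the original tutorial. Each candidate threshold is scored by the sum of squared errors of both terminal nodes, each of which predicts its own mean.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) of the binary split of x that minimizes the SSE of y."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate identical predictor values
        left, right = y_sorted[:i], y_sorted[i:]
        # Each terminal node predicts the mean of its own observations
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            # Place the threshold halfway between the two neighbouring values
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_sse = sse
    return best_threshold, best_sse

# Toy data (hypothetical): two clearly separated groups
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
threshold, sse = best_split(x, y)
print(threshold, sse)
```

With this toy data the selected threshold is 6.5, which separates the two groups. A full CART implementation would apply the same search recursively inside each resulting node and across all predictors.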

2. Tree boosting algorithm.

The tree boosting algorithm predicts the output target feature as a weighted sum of sequentially built decision trees.

  • The gradient descent algorithm finds locally optimal weight coefficients for the sequentially built decision trees by locally minimizing a sum of squared errors, sum of absolute errors or Huber loss function.
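To make the role of the loss function more concrete, the sketch below computes the negative gradient of each of the three losses with respect to the current prediction. In gradient boosting, each new tree is typically fit to these values. The function name negative_gradient and the fixed delta threshold are assumptions for illustration. The squared error case uses the half squared error convention so that the negative gradient equals the plain residual.

```python
import numpy as np

def negative_gradient(y, pred, loss="squared_error", delta=1.0):
    """Negative gradient of the loss with respect to the prediction (hypothetical helper)."""
    residual = y - pred
    if loss == "squared_error":
        # L = 0.5 * residual^2, so the negative gradient is the residual itself
        return residual
    if loss == "absolute_error":
        # L = |residual|, so the negative gradient is the sign of the residual
        return np.sign(residual)
    if loss == "huber":
        # Quadratic for small residuals, linear beyond delta (delta is an assumed fixed value)
        return np.where(np.abs(residual) <= delta, residual, delta * np.sign(residual))
    raise ValueError(f"unknown loss: {loss}")

y = np.array([0.5, 3.0, -2.0])
pred = np.zeros(3)
g_se = negative_gradient(y, pred, "squared_error")
g_ae = negative_gradient(y, pred, "absolute_error")
g_hu = negative_gradient(y, pred, "huber", delta=1.0)
print(g_se, g_ae, g_hu)
```

The absolute error and Huber variants cap the influence of large residuals, which is why they are usually described as more robust to outliers than squared error.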

2.1. Tree boosting algorithm formula notation.

min(sse)=\sum_{t=1}^{n}(y_{t}-\hat{y}_{k(t)})^2

\hat{y}_{k(t)}=\sum_{i=1}^{k}\gamma \omega_{i}\hat{y}_{i(t)}

Where y_{t} = output target feature data, \hat{y}_{k(t)} = weighted output target feature prediction of the k sequentially built decision trees, \gamma = learning rate regularization coefficient, \omega_{i} = locally optimal weight coefficient of sequentially built decision tree i, \hat{y}_{i(t)} = output target feature prediction of sequentially built decision tree i, k = number of sequentially built decision trees.
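The formula above can be sketched from scratch for the squared error case. This is a simplified illustration on synthetic data, not the scikit-learn implementation. It assumes an initial constant prediction equal to the target mean, depth-1 trees fit to the current residuals, and a weight \omega_{i} obtained by a least squares line search. All variable names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a noisy step function of one predictor
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = np.where(x[:, 0] > 0, 1.0, -1.0) + rng.normal(scale=0.1, size=200)

gamma, k = 0.1, 50                      # learning rate and number of trees
prediction = np.full_like(y, y.mean())  # assumed initial constant prediction
sse_path = []
for i in range(k):
    residual = y - prediction
    # Fit a depth-1 tree to the residuals of the current ensemble
    tree = DecisionTreeRegressor(max_depth=1).fit(x, residual)
    h = tree.predict(x)
    # Least squares line search for the tree weight omega_i
    denom = h @ h
    omega = (residual @ h) / denom if denom > 0 else 0.0
    # Add the shrunken, weighted tree prediction to the ensemble
    prediction += gamma * omega * h
    sse_path.append(((y - prediction) ** 2).sum())

print(sse_path[0], sse_path[-1])
```

For squared error the line search weight comes out close to 1, because a regression tree already predicts the mean residual in each terminal node. The weight matters more for other losses such as absolute error or Huber.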

3. Python code example.

3.1. Import Python packages [2].

import numpy as np
import pandas as pd
import sklearn.ensemble as ml

3.2. Gradient boosting machine regression data reading, target and predictor features creation, training and testing ranges delimiting.

  • Data: S&P 500® index replicating ETF (ticker symbol: SPY) daily adjusted close prices (2007-2015).
  • Daily arithmetic returns of the data are used as target feature (current day) and predictor feature (previous day).
  • Target and predictor features creation and training and testing ranges delimiting are not fixed and are only included for educational purposes.
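# Read daily adjusted close prices indexed by date
spy = pd.read_csv('Data//Gradient-Boosting-Machine-Regression-Data.txt', index_col='Date', parse_dates=True)
# Target feature: current day arithmetic returns
rspy = spy.pct_change(1)
rspy.columns = ['rspy']
# Predictor feature: previous day arithmetic returns
rspy1 = rspy.shift(1)
rspy1.columns = ['rspy1']
# Join target and predictor features and drop rows with missing values
rspyall = rspy.join(rspy1).dropna()
# Training range
rspyt = rspyall.loc['2007-01-01':'2014-01-01']
# Testing range
rspyf = rspyall.loc['2014-01-01':'2016-01-01']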
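Since the data file is not included here, the same feature creation steps can be checked on a tiny synthetic price series. The prices, dates and column names below are made up for illustration only.

```python
import pandas as pd

# Synthetic prices (assumption), not SPY data
prices = pd.DataFrame({'close': [100.0, 101.0, 99.99, 102.0, 102.0]},
                      index=pd.date_range('2020-01-01', periods=5, freq='D'))

# Current day arithmetic returns (target feature)
r = prices.pct_change(1)
r.columns = ['r']
# Previous day arithmetic returns (predictor feature)
r1 = r.shift(1)
r1.columns = ['r1']
# The first two rows contain missing values and are dropped
rall = r.join(r1).dropna()
print(rall)
```

Five prices give four returns, and lagging by one day leaves three complete rows. In each row the predictor r1 equals the target r of the previous row, so no same-day information leaks into the predictor.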

3.3. Gradient boosting machine regression fitting and output.

  • Gradient boosting machine is fitted within the training range.
  • Gradient boosting machine loss function, learning rate regularization coefficient, number of sequentially built decision trees and their maximum depth are not fixed and are only included for educational purposes.
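# Predictor feature reshaped into a 2-D array of shape (n_samples, 1)
xt = np.array(rspyt['rspy1']).reshape(-1, 1)
# loss='ls' was renamed to 'squared_error' in scikit-learn 1.0; use 'ls' with older versions
gbmt = ml.GradientBoostingRegressor(loss='squared_error', learning_rate=0.1, n_estimators=2,
                                    max_depth=1).fit(xt, rspyt['rspy'])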
In:
print('== Gradient Boosting Machine Regression Score ==')
print('')
print(gbmt.train_score_)
Out:
== Gradient Boosting Machine Regression Score ==

[0.00021535 0.00021469]
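The train_score_ attribute holds the training loss after each boosting stage, one value per sequentially built decision tree. It is not a testing score. As a hedged sketch of the testing step described in the introduction, the example below fits the same model configuration on synthetic returns and compares the training loss with the mean squared error on a held-out range. The data and variable names are assumptions, and the loss argument is left at its default (squared error) so the code does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic daily returns (assumption), not SPY data
rng = np.random.default_rng(42)
returns = rng.normal(loc=0.0, scale=0.01, size=1000)
x = returns[:-1].reshape(-1, 1)  # previous day return (predictor)
y = returns[1:]                  # current day return (target)

# Chronological split into training and testing ranges
x_train, y_train = x[:800], y[:800]
x_test, y_test = x[800:], y[800:]

model = GradientBoostingRegressor(learning_rate=0.1, n_estimators=2,
                                  max_depth=1).fit(x_train, y_train)

# Training mean squared error after each stage, computed from staged predictions
staged_mse = [mean_squared_error(y_train, p) for p in model.staged_predict(x_train)]
# Mean squared error on the testing range using the fitted model
test_mse = mean_squared_error(y_test, model.predict(x_test))

print(model.train_score_)
print(staged_mse)
print(test_mse)
```

For squared error loss the values in train_score_ are expected to track the staged training mean squared errors. The testing error is the figure that indicates how the fitted model behaves on unseen data.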
4. References

[1] Jerome H. Friedman. “Greedy Function Approximation: A Gradient Boosting Machine”. The Annals of Statistics. 2001.

[2] Travis E. Oliphant. “A guide to NumPy”. USA: Trelgol Publishing. 2006.

Stéfan van der Walt, S. Chris Colbert and Gaël Varoquaux. “The NumPy Array: A Structure for Efficient Numerical Computation”. Computing in Science & Engineering. 2011.

Wes McKinney. “Data Structures for Statistical Computing in Python”. Proceedings of the 9th Python in Science Conference. 2010.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay. “Scikit-learn: Machine Learning in Python”. Journal of Machine Learning Research. 2011.