Model selection using ModelSpec
In this lab we illustrate how to run forward stepwise model selection
using the model specification capability of ModelSpec.
import numpy as np
import pandas as pd
from statsmodels.api import OLS
from ISLP import load_data
from ISLP.models import (ModelSpec,
                         Stepwise,
                         sklearn_selected)
Forward Selection
We will apply the forward-selection approach to the Hitters
data. We wish to predict a baseball player’s Salary on the
basis of various statistics associated with performance in the
previous year.
Hitters = load_data('Hitters')
np.isnan(Hitters['Salary']).sum()
59
We see that Salary is missing for 59 players. The
dropna() method of data frames removes all of the rows that have missing
values in any variable (by default — see Hitters.dropna?).
Hitters = Hitters.dropna()
Hitters.shape
(263, 20)
We first choose the best model using forward selection based on AIC. AIC is not among
sklearn's built-in metrics, so we define a function to compute it ourselves and use
it as a scorer. Because sklearn maximizes a score by default,
our scoring function returns the negative of the AIC statistic.
def negAIC(estimator, X, Y):
    "Negative AIC"
    n, p = X.shape
    Yhat = estimator.predict(X)
    MSE = np.mean((Y - Yhat)**2)
    # AIC up to an additive constant; negated because sklearn maximizes scores
    return -(n + n * np.log(MSE) + 2 * (p + 1))
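As a standalone sanity check of the AIC formula used in this scorer, the sketch below applies it to synthetic data. The `OLSFit` class and the data here are illustrative stand-ins (not part of ISLP): `OLSFit` mimics a fitted estimator with a `predict` method, and we compare a small true model against the same model padded with irrelevant columns.

```python
import numpy as np

class OLSFit:
    "Minimal least-squares fit standing in for a fitted estimator (illustrative only)."
    def __init__(self, X, Y):
        self.coef_, *_ = np.linalg.lstsq(X, Y, rcond=None)
    def predict(self, X):
        return X @ self.coef_

def AIC(estimator, X, Y):
    # n + n*log(MSE) + 2*(p+1), as in the scorer above; lower is better
    n, p = X.shape
    MSE = np.mean((Y - estimator.predict(X))**2)
    return n + n * np.log(MSE) + 2 * (p + 1)

rng = np.random.default_rng(0)
n = 500
X_small = rng.standard_normal((n, 2))             # two real predictors
noise = rng.standard_normal((n, 8))               # eight irrelevant columns
Y = X_small @ np.array([2.0, -1.0]) + rng.standard_normal(n)
X_big = np.hstack([X_small, noise])

aic_small = AIC(OLSFit(X_small, Y), X_small, Y)
aic_big = AIC(OLSFit(X_big, Y), X_big, Y)
print(aic_small, aic_big)
```

On typical draws the padded model buys only a small reduction in MSE at a penalty of 2 per extra column, so the smaller (true) model attains the lower AIC.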
Before running the search, we construct the response and the full design matrix, built from all of the variables other than Salary.
design = ModelSpec(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters)
Along with a score we need to specify the search strategy. This is done through the object
Stepwise() in the ISLP.models package. The method Stepwise.first_peak()
runs forward stepwise until any further additions to the model do not result
in an improvement in the evaluation score. Similarly, the method Stepwise.fixed_steps()
runs a fixed number of steps of stepwise search.
strategy = Stepwise.first_peak(design,
                               direction='forward',
                               max_terms=len(design.terms))
We now fit a linear regression model with Salary as outcome using forward
selection. To do so, we use the function sklearn_selected() from the ISLP.models package. This takes
a model from statsmodels along with a search strategy and selects a model with its
fit method. Without specifying a scoring argument, the score defaults to MSE, and so all 19 variables will be
selected.
hitters_MSE = sklearn_selected(OLS,
                               strategy)
hitters_MSE.fit(Hitters, Y)
hitters_MSE.selected_state_
('Assists',
'AtBat',
'CAtBat',
'CHits',
'CHmRun',
'CRBI',
'CRuns',
'CWalks',
'Division',
'Errors',
'Hits',
'HmRun',
'League',
'NewLeague',
'PutOuts',
'RBI',
'Runs',
'Walks',
'Years')
Using negAIC results in a smaller model, as expected, with just 4 variables selected.
hitters_AIC = sklearn_selected(OLS,
                               strategy,
                               scoring=negAIC)
hitters_AIC.fit(Hitters, Y)
hitters_AIC.selected_state_
('Assists', 'Errors', 'League', 'NewLeague')