Model selection using ModelSpec

In this lab we illustrate how to run forward stepwise model selection using the model specification capability of ModelSpec.
import numpy as np
import pandas as pd
from statsmodels.api import OLS
from ISLP import load_data
from ISLP.models import (ModelSpec,
                         Stepwise,
                         sklearn_selected)
Forward Selection
We will apply the forward-selection approach to the Hitters data. We wish to predict a baseball player's Salary on the basis of various statistics associated with performance in the previous year.
Hitters = load_data('Hitters')
np.isnan(Hitters['Salary']).sum()
59
We see that Salary is missing for 59 players. The dropna() method of data frames removes all of the rows that have missing values in any variable (by default; see Hitters.dropna?).
Hitters = Hitters.dropna()
Hitters.shape
(263, 20)
We first choose the best model using forward selection based on AIC. This score is not built in as a metric to sklearn. We therefore define a function to compute it ourselves, and use it as a scorer. By default, sklearn tries to maximize a score, hence our scoring function computes the negative AIC statistic.
def negAIC(estimator, X, Y):
    "Negative AIC"
    n, p = X.shape
    Yhat = estimator.predict(X)
    MSE = np.mean((Y - Yhat)**2)
    # sklearn maximizes scores, so we return the negative of the AIC
    return -(n + n * np.log(MSE) + 2 * (p + 1))
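For reference, \(n + n\log(\mathrm{MSE}) + 2(p+1)\) is the Gaussian AIC up to the additive constant \(n\log(2\pi)\), which does not affect comparisons between models; the \(p+1\) parameters are the \(p\) regression coefficients (the design includes an intercept column) plus the error variance.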
Before running the search, we build the design matrix using all of the variables and extract the response. Note that the scorer above computes the MSE directly from the fitted model, so no separate estimate of the residual variance \(\sigma^2\) is needed.
design = ModelSpec(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters)
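As a quick sanity check (not part of the original lab), we can evaluate the scorer on the model containing all of the variables, fitting OLS on the full design and passing the fitted results object, whose predict() method negAIC relies on.

# Hypothetical sanity check: score the full 19-variable model.
full_fit = OLS(Y, X).fit()
negAIC(full_fit, X, Y)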
Along with a score we need to specify the search strategy. This is done through the object Stepwise() in the ISLP.models package. The method Stepwise.first_peak() runs forward stepwise until any further additions to the model do not result in an improvement in the evaluation score. Similarly, the method Stepwise.fixed_steps() runs a fixed number of steps of stepwise search, as sketched after the next cell.
strategy = Stepwise.first_peak(design,
                               direction='forward',
                               max_terms=len(design.terms))
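A fixed-length search could be specified instead with Stepwise.fixed_steps(); a minimal sketch, assuming it takes the design and the number of steps in the same style as first_peak() (its exact signature is not shown in this lab):

# Assumed signature: run exactly five forward steps of the search.
five_steps = Stepwise.fixed_steps(design,
                                  5,
                                  direction='forward')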
We now fit a linear regression model with Salary as outcome using forward selection. To do so, we use the function sklearn_selected() from the ISLP.models package. This takes a model from statsmodels along with a search strategy and selects a model with its fit method. Without specifying a scoring argument, the score defaults to MSE, and so all 19 variables will be selected.
hitters_MSE = sklearn_selected(OLS,
                               strategy)
hitters_MSE.fit(Hitters, Y)
hitters_MSE.selected_state_
('Assists', 'AtBat', 'CAtBat', 'CHits', 'CHmRun', 'CRBI', 'CRuns',
 'CWalks', 'Division', 'Errors', 'Hits', 'HmRun', 'League',
 'NewLeague', 'PutOuts', 'RBI', 'Runs', 'Walks', 'Years')
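Since sklearn_selected() produces an sklearn-style estimator, the fitted object should also support prediction from the selected model; a hedged sketch, assuming a standard predict() method (not demonstrated in the original lab):

# Assumption: the fitted selector follows the sklearn API, so
# predict() returns fitted salaries from the selected model.
Yhat_MSE = hitters_MSE.predict(Hitters)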
Using negAIC results in a smaller model, as expected, with just four variables selected.
hitters_AIC = sklearn_selected(OLS,
                               strategy,
                               scoring=negAIC)
hitters_AIC.fit(Hitters, Y)
hitters_AIC.selected_state_
('Assists', 'Errors', 'League', 'NewLeague')
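As a possible follow-up (not part of the original lab), we can refit OLS on just the selected terms in order to inspect the coefficients, reusing ModelSpec with the selected column names:

# Refit OLS using only the terms chosen by the AIC-based search.
selected_design = ModelSpec(list(hitters_AIC.selected_state_)).fit(Hitters)
X_sel = selected_design.transform(Hitters)
OLS(Y, X_sel).fit().params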