ANOVA using ModelSpec
#
In this lab we illustrate how to run create specific ANOVA analyses
using ModelSpec
.
import numpy as np
import pandas as pd
from statsmodels.api import OLS
from statsmodels.stats.anova import anova_lm
from ISLP import load_data
from ISLP.models import (ModelSpec,
derived_feature,
summarize)
We will carry out two simple ANOVA analyses of the Hitters
data.
We wish to predict a baseball player’s Salary
on the
basis of various statistics associated with performance in the
previous year.
Hitters = load_data('Hitters')
np.isnan(Hitters['Salary']).sum()
59
We see that Salary
is missing for 59 players. The
dropna()
method of data frames removes all of the rows that have missing
values in any variable (by default — see Hitters.dropna?
).
Hitters = Hitters.dropna()
Hitters.columns
Index(['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'CAtBat',
'CHits', 'CHmRun', 'CRuns', 'CRBI', 'CWalks', 'League', 'Division',
'PutOuts', 'Assists', 'Errors', 'Salary', 'NewLeague'],
dtype='object')
Grouping variables#
A look at the description of the data shows that there are both career and 1986 offensive stats, as well as some defensive stats.
Let’s group the offensive into recent and career offensive stats, as well as a group of defensive variables.
confounders = derived_feature(['Division', 'League', 'NewLeague'],
name='confounders')
offense_career = derived_feature(['CAtBat', 'CHits', 'CHmRun', 'CRuns', 'CRBI', 'CWalks'],
name='offense_career')
offense_1986 = derived_feature(['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks'],
name='offense_1986')
defense_1986 = derived_feature(['PutOuts', 'Assists', 'Errors'],
name='defense_1986')
We’ll first do a sequential ANOVA where terms are added sequentially
design = ModelSpec([confounders, offense_career, defense_1986, offense_1986]).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters)
Along with a score we need to specify the search strategy. This is done through the object
Stepwise()
in the ISLP.models
package. The method Stepwise.first_peak()
runs forward stepwise until any further additions to the model do not result
in an improvement in the evaluation score. Similarly, the method Stepwise.fixed_steps()
runs a fixed number of steps of stepwise search.
M = OLS(Y, X).fit()
summarize(M)
coef | std err | t | P>|t| | |
---|---|---|---|---|
intercept | 148.2187 | 73.595 | 2.014 | 0.045 |
Division[W] | -116.0404 | 40.188 | -2.887 | 0.004 |
League[N] | 63.7503 | 79.006 | 0.807 | 0.421 |
NewLeague[N] | -24.3989 | 78.843 | -0.309 | 0.757 |
CAtBat | -0.1887 | 0.120 | -1.572 | 0.117 |
CHits | 0.1636 | 0.665 | 0.246 | 0.806 |
CHmRun | -0.1517 | 1.612 | -0.094 | 0.925 |
CRuns | 1.4716 | 0.747 | 1.971 | 0.050 |
CRBI | 0.8021 | 0.691 | 1.161 | 0.247 |
CWalks | -0.8124 | 0.327 | -2.481 | 0.014 |
PutOuts | 0.2827 | 0.077 | 3.661 | 0.000 |
Assists | 0.3755 | 0.220 | 1.705 | 0.089 |
Errors | -3.2940 | 4.377 | -0.753 | 0.452 |
AtBat | -1.9509 | 0.624 | -3.125 | 0.002 |
Hits | 7.4395 | 2.363 | 3.148 | 0.002 |
HmRun | 4.3449 | 6.190 | 0.702 | 0.483 |
Runs | -2.3312 | 2.971 | -0.785 | 0.433 |
RBI | -1.0670 | 2.595 | -0.411 | 0.681 |
Walks | 6.2196 | 1.825 | 3.409 | 0.001 |
We’ll first produce the sequential, or Type I ANOVA results. This builds up a model sequentially and compares two successive models.
df = anova_lm(*[OLS(Y, D).fit() for D in design.build_sequence(Hitters, anova_type='sequential')])
df.index = design.names
df
df_resid | ssr | df_diff | ss_diff | F | Pr(>F) | |
---|---|---|---|---|---|---|
intercept | 262.0 | 5.331911e+07 | 0.0 | NaN | NaN | NaN |
confounders | 259.0 | 5.131263e+07 | 3.0 | 2.006478e+06 | 6.741147 | 2.144265e-04 |
offense_career | 253.0 | 3.059130e+07 | 6.0 | 2.072134e+07 | 34.808656 | 1.470455e-30 |
defense_1986 | 250.0 | 2.730614e+07 | 3.0 | 3.285156e+06 | 11.037111 | 7.880207e-07 |
offense_1986 | 244.0 | 2.420857e+07 | 6.0 | 3.097572e+06 | 5.203444 | 4.648586e-05 |
We can similarly compute the Type II ANOVA results which drops each term and compares to the full model.
D_full = design.transform(Hitters)
OLS_full = OLS(Y, D_full).fit()
dfs = []
for d in design.build_sequence(Hitters, anova_type='drop'):
dfs.append(anova_lm(OLS(Y,d).fit(), OLS_full).iloc[1:])
df = pd.concat(dfs)
df.index = design.names
df
df_resid | ssr | df_diff | ss_diff | F | Pr(>F) | |
---|---|---|---|---|---|---|
intercept | 244.0 | 2.420857e+07 | 1.0 | 4.024254e+05 | 4.056076 | 4.511037e-02 |
confounders | 244.0 | 2.420857e+07 | 3.0 | 9.661738e+05 | 3.246046 | 2.261572e-02 |
offense_career | 244.0 | 2.420857e+07 | 6.0 | 1.051025e+07 | 17.655596 | 5.701196e-17 |
defense_1986 | 244.0 | 2.420857e+07 | 3.0 | 1.467933e+06 | 4.931803 | 2.415732e-03 |
offense_1986 | 244.0 | 2.420857e+07 | 6.0 | 3.097572e+06 | 5.203444 | 4.648586e-05 |