ANOVA using ModelSpec#

In this lab we illustrate how to run create specific ANOVA analyses using ModelSpec.

import numpy as np
import pandas as pd

from statsmodels.api import OLS
from statsmodels.stats.anova import anova_lm

from ISLP import load_data
from ISLP.models import (ModelSpec,
                         derived_feature,
                         summarize)

We will carry out two simple ANOVA analyses of the Hitters data. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year.

Hitters = load_data('Hitters')
np.isnan(Hitters['Salary']).sum()
59

We see that Salary is missing for 59 players. The dropna() method of data frames removes all of the rows that have missing values in any variable (by default — see Hitters.dropna?).

Hitters = Hitters.dropna()
Hitters.columns
Index(['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'CAtBat',
       'CHits', 'CHmRun', 'CRuns', 'CRBI', 'CWalks', 'League', 'Division',
       'PutOuts', 'Assists', 'Errors', 'Salary', 'NewLeague'],
      dtype='object')

Grouping variables#

A look at the description of the data shows that there are both career and 1986 offensive stats, as well as some defensive stats.

Let’s group the offensive into recent and career offensive stats, as well as a group of defensive variables.

confounders = derived_feature(['Division', 'League', 'NewLeague'],
                              name='confounders')
offense_career = derived_feature(['CAtBat', 'CHits', 'CHmRun', 'CRuns', 'CRBI', 'CWalks'],
                                 name='offense_career')
offense_1986 = derived_feature(['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks'],
                               name='offense_1986')
defense_1986 = derived_feature(['PutOuts', 'Assists', 'Errors'],
                               name='defense_1986')

We’ll first do a sequential ANOVA where terms are added sequentially

design = ModelSpec([confounders, offense_career, defense_1986, offense_1986]).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters)

Along with a score we need to specify the search strategy. This is done through the object Stepwise() in the ISLP.models package. The method Stepwise.first_peak() runs forward stepwise until any further additions to the model do not result in an improvement in the evaluation score. Similarly, the method Stepwise.fixed_steps() runs a fixed number of steps of stepwise search.

M = OLS(Y, X).fit()
summarize(M)
coef std err t P>|t|
intercept 148.2187 73.595 2.014 0.045
Division[W] -116.0404 40.188 -2.887 0.004
League[N] 63.7503 79.006 0.807 0.421
NewLeague[N] -24.3989 78.843 -0.309 0.757
CAtBat -0.1887 0.120 -1.572 0.117
CHits 0.1636 0.665 0.246 0.806
CHmRun -0.1517 1.612 -0.094 0.925
CRuns 1.4716 0.747 1.971 0.050
CRBI 0.8021 0.691 1.161 0.247
CWalks -0.8124 0.327 -2.481 0.014
PutOuts 0.2827 0.077 3.661 0.000
Assists 0.3755 0.220 1.705 0.089
Errors -3.2940 4.377 -0.753 0.452
AtBat -1.9509 0.624 -3.125 0.002
Hits 7.4395 2.363 3.148 0.002
HmRun 4.3449 6.190 0.702 0.483
Runs -2.3312 2.971 -0.785 0.433
RBI -1.0670 2.595 -0.411 0.681
Walks 6.2196 1.825 3.409 0.001

We’ll first produce the sequential, or Type I ANOVA results. This builds up a model sequentially and compares two successive models.

df = anova_lm(*[OLS(Y, D).fit() for D in design.build_sequence(Hitters, anova_type='sequential')])
df.index = design.names
df
df_resid ssr df_diff ss_diff F Pr(>F)
intercept 262.0 5.331911e+07 0.0 NaN NaN NaN
confounders 259.0 5.131263e+07 3.0 2.006478e+06 6.741147 2.144265e-04
offense_career 253.0 3.059130e+07 6.0 2.072134e+07 34.808656 1.470455e-30
defense_1986 250.0 2.730614e+07 3.0 3.285156e+06 11.037111 7.880207e-07
offense_1986 244.0 2.420857e+07 6.0 3.097572e+06 5.203444 4.648586e-05

We can similarly compute the Type II ANOVA results which drops each term and compares to the full model.

D_full = design.transform(Hitters)
OLS_full = OLS(Y, D_full).fit()
dfs = []
for d in design.build_sequence(Hitters, anova_type='drop'):
    dfs.append(anova_lm(OLS(Y,d).fit(), OLS_full).iloc[1:])
df = pd.concat(dfs)
df.index = design.names
df
df_resid ssr df_diff ss_diff F Pr(>F)
intercept 244.0 2.420857e+07 1.0 4.024254e+05 4.056076 4.511037e-02
confounders 244.0 2.420857e+07 3.0 9.661738e+05 3.246046 2.261572e-02
offense_career 244.0 2.420857e+07 6.0 1.051025e+07 17.655596 5.701196e-17
defense_1986 244.0 2.420857e+07 3.0 1.467933e+06 4.931803 2.415732e-03
offense_1986 244.0 2.420857e+07 6.0 3.097572e+06 5.203444 4.648586e-05