ANOVA using `ModelSpec`#

In this lab we illustrate how to run create specific ANOVA analyses using ModelSpec.

import numpy as np
import pandas as pd

from statsmodels.api import OLS
from statsmodels.stats.anova import anova_lm

from ISLP import load_data
from ISLP.models import (ModelSpec,
                         derived_feature,
                         summarize)

We will carry out two simple ANOVA analyses of the Hitters data. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year.

Hitters = load_data('Hitters')
np.isnan(Hitters['Salary']).sum()

We see that Salary is missing for 59 players. The dropna() method of data frames removes all of the rows that have missing values in any variable (by default — see Hitters.dropna?).

Hitters = Hitters.dropna()
Hitters.columns

Index(['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'CAtBat',
       'CHits', 'CHmRun', 'CRuns', 'CRBI', 'CWalks', 'League', 'Division',
       'PutOuts', 'Assists', 'Errors', 'Salary', 'NewLeague'],
      dtype='object')

Grouping variables#

A look at the description of the data shows that there are both career and 1986 offensive stats, as well as some defensive stats.

Let’s group the offensive into recent and career offensive stats, as well as a group of defensive variables.

confounders = derived_feature(['Division', 'League', 'NewLeague'],
                              name='confounders')
offense_career = derived_feature(['CAtBat', 'CHits', 'CHmRun', 'CRuns', 'CRBI', 'CWalks'],
                                 name='offense_career')
offense_1986 = derived_feature(['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks'],
                               name='offense_1986')
defense_1986 = derived_feature(['PutOuts', 'Assists', 'Errors'],
                               name='defense_1986')

We’ll first do a sequential ANOVA where terms are added sequentially

design = ModelSpec([confounders, offense_career, defense_1986, offense_1986]).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters)

Along with a score we need to specify the search strategy. This is done through the object Stepwise() in the ISLP.models package. The method Stepwise.first_peak() runs forward stepwise until any further additions to the model do not result in an improvement in the evaluation score. Similarly, the method Stepwise.fixed_steps() runs a fixed number of steps of stepwise search.

M = OLS(Y, X).fit()
summarize(M)

	coef	std err	t	P>\|t\|
intercept	148.2187	73.595	2.014	0.045
Division[W]	-116.0404	40.188	-2.887	0.004
League[N]	63.7503	79.006	0.807	0.421
NewLeague[N]	-24.3989	78.843	-0.309	0.757
CAtBat	-0.1887	0.120	-1.572	0.117
CHits	0.1636	0.665	0.246	0.806
CHmRun	-0.1517	1.612	-0.094	0.925
CRuns	1.4716	0.747	1.971	0.050
CRBI	0.8021	0.691	1.161	0.247
CWalks	-0.8124	0.327	-2.481	0.014
PutOuts	0.2827	0.077	3.661	0.000
Assists	0.3755	0.220	1.705	0.089
Errors	-3.2940	4.377	-0.753	0.452
AtBat	-1.9509	0.624	-3.125	0.002
Hits	7.4395	2.363	3.148	0.002
HmRun	4.3449	6.190	0.702	0.483
Runs	-2.3312	2.971	-0.785	0.433
RBI	-1.0670	2.595	-0.411	0.681
Walks	6.2196	1.825	3.409	0.001

We’ll first produce the sequential, or Type I ANOVA results. This builds up a model sequentially and compares two successive models.

df = anova_lm(*[OLS(Y, D).fit() for D in design.build_sequence(Hitters, anova_type='sequential')])
df.index = design.names
df

	df_resid	ssr	df_diff	ss_diff	F	Pr(>F)
intercept	262.0	5.331911e+07	0.0	NaN	NaN	NaN
confounders	259.0	5.131263e+07	3.0	2.006478e+06	6.741147	2.144265e-04
offense_career	253.0	3.059130e+07	6.0	2.072134e+07	34.808656	1.470455e-30
defense_1986	250.0	2.730614e+07	3.0	3.285156e+06	11.037111	7.880207e-07
offense_1986	244.0	2.420857e+07	6.0	3.097572e+06	5.203444	4.648586e-05

We can similarly compute the Type II ANOVA results which drops each term and compares to the full model.

D_full = design.transform(Hitters)
OLS_full = OLS(Y, D_full).fit()
dfs = []
for d in design.build_sequence(Hitters, anova_type='drop'):
    dfs.append(anova_lm(OLS(Y,d).fit(), OLS_full).iloc[1:])
df = pd.concat(dfs)
df.index = design.names
df

	df_resid	ssr	df_diff	ss_diff	F	Pr(>F)
intercept	244.0	2.420857e+07	1.0	4.024254e+05	4.056076	4.511037e-02
confounders	244.0	2.420857e+07	3.0	9.661738e+05	3.246046	2.261572e-02
offense_career	244.0	2.420857e+07	6.0	1.051025e+07	17.655596	5.701196e-17
defense_1986	244.0	2.420857e+07	3.0	1.467933e+06	4.931803	2.415732e-03
offense_1986	244.0	2.420857e+07	6.0	3.097572e+06	5.203444	4.648586e-05

ANOVA using ModelSpec

Contents

ANOVA using `ModelSpec`#

Grouping variables#

ANOVA using ModelSpec

Contents

ANOVA using ModelSpec#

Grouping variables#

ANOVA using `ModelSpec`#