# Building design matrices with `ModelSpec`
The `ISLP` package provides a facility to build design matrices for regression and classification tasks. It provides functionality similar to the formula notation of R, though it uses Python objects rather than a special formula syntax. Related tools include `patsy` and `ColumnTransformer` from `sklearn.compose`.
Perhaps the most common use is to extract some columns from a `pd.DataFrame` and produce a design matrix, optionally with an intercept.
import pandas as pd
import numpy as np
from ISLP import load_data
from ISLP.models import (ModelSpec,
                         summarize,
                         Column,
                         Feature,
                         build_columns)
import statsmodels.api as sm
Carseats = load_data('Carseats')
Carseats.columns
Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'ShelveLoc', 'Age', 'Education', 'Urban', 'US'],
      dtype='object')
We’ll first build a design matrix that we can use to model `Sales` in terms of the categorical variable `ShelveLoc` and `Price`. We see first that `ShelveLoc` is a categorical variable:
Carseats['ShelveLoc']
0 Bad
1 Good
2 Medium
3 Medium
4 Bad
...
395 Good
396 Medium
397 Medium
398 Bad
399 Good
Name: ShelveLoc, Length: 400, dtype: category
Categories (3, object): ['Bad', 'Good', 'Medium']
This is recognized by `ModelSpec`, and only 2 columns are added for the three levels. The default behavior is to drop the first level of the categories. Later, we will show how other contrasts of the three levels can be produced. The simple example below illustrates how the first argument (its `terms`) is used to construct a design matrix.
MS = ModelSpec(['ShelveLoc', 'Price'])
X = MS.fit_transform(Carseats)
X.iloc[:10]
| | intercept | ShelveLoc[Good] | ShelveLoc[Medium] | Price |
|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 120 |
| 1 | 1.0 | 1.0 | 0.0 | 83 |
| 2 | 1.0 | 0.0 | 1.0 | 80 |
| 3 | 1.0 | 0.0 | 1.0 | 97 |
| 4 | 1.0 | 0.0 | 0.0 | 128 |
| 5 | 1.0 | 0.0 | 0.0 | 72 |
| 6 | 1.0 | 0.0 | 1.0 | 108 |
| 7 | 1.0 | 1.0 | 0.0 | 120 |
| 8 | 1.0 | 0.0 | 1.0 | 124 |
| 9 | 1.0 | 0.0 | 1.0 | 124 |
We note that a column has been added for the intercept by default. This can be changed using the `intercept` argument.
MS_no1 = ModelSpec(['ShelveLoc', 'Price'], intercept=False)
MS_no1.fit_transform(Carseats)[:10]
| | ShelveLoc[Good] | ShelveLoc[Medium] | Price |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 120 |
| 1 | 1.0 | 0.0 | 83 |
| 2 | 0.0 | 1.0 | 80 |
| 3 | 0.0 | 1.0 | 97 |
| 4 | 0.0 | 0.0 | 128 |
| 5 | 0.0 | 0.0 | 72 |
| 6 | 0.0 | 1.0 | 108 |
| 7 | 1.0 | 0.0 | 120 |
| 8 | 0.0 | 1.0 | 124 |
| 9 | 0.0 | 1.0 | 124 |
We see that `ShelveLoc` still contributes only two columns to the design. The `ModelSpec` object does not introspect its arguments to check whether an intercept term is effectively included in the column space of the design matrix. To include this intercept via `ShelveLoc`, we can use all 3 columns to encode the categorical variable. Following the nomenclature of R, we call such an encoding a `Contrast` of the categorical variable.
from ISLP.models import contrast
shelve = contrast('ShelveLoc', None)
MS_contr = ModelSpec([shelve, 'Price'], intercept=False)
MS_contr.fit_transform(Carseats)[:10]
| | ShelveLoc[Bad] | ShelveLoc[Good] | ShelveLoc[Medium] | Price |
|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 120 |
| 1 | 0.0 | 1.0 | 0.0 | 83 |
| 2 | 0.0 | 0.0 | 1.0 | 80 |
| 3 | 0.0 | 0.0 | 1.0 | 97 |
| 4 | 1.0 | 0.0 | 0.0 | 128 |
| 5 | 1.0 | 0.0 | 0.0 | 72 |
| 6 | 0.0 | 0.0 | 1.0 | 108 |
| 7 | 0.0 | 1.0 | 0.0 | 120 |
| 8 | 0.0 | 0.0 | 1.0 | 124 |
| 9 | 0.0 | 0.0 | 1.0 | 124 |
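We can verify that the intercept is effectively included in this design's column space: the three indicator columns sum to one, so appending an explicit intercept does not increase the rank. A quick sketch using numpy:

X_contr = MS_contr.fit_transform(Carseats)
X_with_1 = np.column_stack([np.ones(len(X_contr)), X_contr])
# both ranks are 4: the intercept already lies in the span of the three indicators
print(np.linalg.matrix_rank(X_contr), np.linalg.matrix_rank(X_with_1))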
The example above illustrates that columns need not be identified by name in `terms`. The basic role of an item in the `terms` sequence is to describe how to extract columns from a columnar data object, usually a `pd.DataFrame`.
shelve
Column(idx='ShelveLoc', name='ShelveLoc', is_categorical=True, is_ordinal=False, columns=(), encoder=Contrast(method=None))
The `Column` object can be used to extract the relevant columns directly from a `pd.DataFrame`. If the `encoder` field is not `None`, then the extracted columns will be passed through `encoder`. The `get_columns` method produces these columns as well as names for the columns.
shelve.get_columns(Carseats)
(array([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.],
        ...,
        [0., 0., 1.],
        [1., 0., 0.],
        [0., 1., 0.]]),
 ['ShelveLoc[Bad]', 'ShelveLoc[Good]', 'ShelveLoc[Medium]'])
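For comparison, a quantitative column can be described the same way. Below is a minimal sketch that constructs a `Column` by hand; the field values mirror the reprs shown elsewhere on this page, and spelling every field out explicitly is an assumption about the constructor:

# hypothetical hand-built Column; fields copied from the reprs shown in column_info_ below
price_col = Column(idx='Price', name='Price', is_categorical=False,
                   is_ordinal=False, columns=('Price',), encoder=None)
vals, names = price_col.get_columns(Carseats)   # extracts a single numeric column
names, vals[:3]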
Let’s now fit a simple OLS model with this design.
X = MS_contr.transform(Carseats)
Y = Carseats['Sales']
M_ols = sm.OLS(Y, X).fit()
summarize(M_ols)
| | coef | std err | t | P>\|t\| |
|---|---|---|---|---|
| ShelveLoc[Bad] | 12.0018 | 0.503 | 23.839 | 0.0 |
| ShelveLoc[Good] | 16.8976 | 0.522 | 32.386 | 0.0 |
| ShelveLoc[Medium] | 13.8638 | 0.487 | 28.467 | 0.0 |
| Price | -0.0567 | 0.004 | -13.967 | 0.0 |
## Interactions
One of the common uses of formulae in R is to specify interactions between variables. This is done in `ModelSpec` by including a tuple in the `terms` argument.
ModelSpec([(shelve, 'Price'), 'Price']).fit_transform(Carseats).iloc[:10]
| | intercept | ShelveLoc[Bad]:Price | ShelveLoc[Good]:Price | ShelveLoc[Medium]:Price | Price |
|---|---|---|---|---|---|
| 0 | 1.0 | 120.0 | 0.0 | 0.0 | 120 |
| 1 | 1.0 | 0.0 | 83.0 | 0.0 | 83 |
| 2 | 1.0 | 0.0 | 0.0 | 80.0 | 80 |
| 3 | 1.0 | 0.0 | 0.0 | 97.0 | 97 |
| 4 | 1.0 | 128.0 | 0.0 | 0.0 | 128 |
| 5 | 1.0 | 72.0 | 0.0 | 0.0 | 72 |
| 6 | 1.0 | 0.0 | 0.0 | 108.0 | 108 |
| 7 | 1.0 | 0.0 | 120.0 | 0.0 | 120 |
| 8 | 1.0 | 0.0 | 0.0 | 124.0 | 124 |
| 9 | 1.0 | 0.0 | 0.0 | 124.0 | 124 |
The above design matrix is clearly rank deficient, as `ModelSpec` does not inspect the terms and attempt to produce a corresponding full-rank matrix, which may or may not have matched the user's intent.
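A full-rank alternative includes the main effects alongside the interaction, letting the default drop-first contrast apply to `ShelveLoc`. A sketch, assuming plain column names are accepted inside interaction tuples just as they are at the top level:

MS_full = ModelSpec(['ShelveLoc', 'Price', ('ShelveLoc', 'Price')])
MS_full.fit_transform(Carseats).columns   # intercept, two ShelveLoc contrasts, Price, two interactions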
## Ordinal variables
Ordinal variables are handled by a corresponding encoder:
Carseats['OIncome'] = pd.cut(Carseats['Income'],
                             [0,50,90,200],
                             labels=['L','M','H'])
MS_order = ModelSpec(['OIncome']).fit(Carseats)
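Transforming confirms that the ordinal variable contributes a single, numerically coded column rather than a set of indicator columns:

MS_order.transform(Carseats)[:5]   # intercept plus one ordinal-coded column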
Part of the `fit` method of `ModelSpec` involves inspecting the columns of `Carseats`. The results of that inspection can be found in the `column_info_` attribute:
MS_order.column_info_
{'Sales': Column(idx='Sales', name='Sales', is_categorical=False, is_ordinal=False, columns=('Sales',), encoder=None),
'CompPrice': Column(idx='CompPrice', name='CompPrice', is_categorical=False, is_ordinal=False, columns=('CompPrice',), encoder=None),
'Income': Column(idx='Income', name='Income', is_categorical=False, is_ordinal=False, columns=('Income',), encoder=None),
'Advertising': Column(idx='Advertising', name='Advertising', is_categorical=False, is_ordinal=False, columns=('Advertising',), encoder=None),
'Population': Column(idx='Population', name='Population', is_categorical=False, is_ordinal=False, columns=('Population',), encoder=None),
'Price': Column(idx='Price', name='Price', is_categorical=False, is_ordinal=False, columns=('Price',), encoder=None),
'ShelveLoc': Column(idx='ShelveLoc', name='ShelveLoc', is_categorical=True, is_ordinal=False, columns=('ShelveLoc[Good]', 'ShelveLoc[Medium]'), encoder=Contrast()),
'Age': Column(idx='Age', name='Age', is_categorical=False, is_ordinal=False, columns=('Age',), encoder=None),
'Education': Column(idx='Education', name='Education', is_categorical=False, is_ordinal=False, columns=('Education',), encoder=None),
'Urban': Column(idx='Urban', name='Urban', is_categorical=True, is_ordinal=False, columns=('Urban[Yes]',), encoder=Contrast()),
'US': Column(idx='US', name='US', is_categorical=True, is_ordinal=False, columns=('US[Yes]',), encoder=Contrast()),
'OIncome': Column(idx='OIncome', name='OIncome', is_categorical=True, is_ordinal=True, columns=('OIncome',), encoder=OrdinalEncoder())}
## Structure of a `ModelSpec`
The first argument to `ModelSpec` is stored as the `terms` attribute. Under the hood, this sequence is inspected to produce the `terms_` attribute, which specifies the objects that will ultimately create the design matrix.
MS = ModelSpec(['ShelveLoc', 'Price'])
MS.fit(Carseats)
MS.terms_
[Feature(variables=('ShelveLoc',), name='ShelveLoc', encoder=None, use_transform=True, pure_columns=True, override_encoder_colnames=False),
Feature(variables=('Price',), name='Price', encoder=None, use_transform=True, pure_columns=True, override_encoder_colnames=False)]
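For comparison, the original specification is kept verbatim in the `terms` attribute:

MS.terms   # ['ShelveLoc', 'Price']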
Each element of `terms_` should be a `Feature`, which describes a set of columns to be extracted from a columnar data form, as well as a possible encoder.
shelve_var = MS.terms_[0]
We can find the columns associated with each term using `build_columns`:
df, names = build_columns(MS.column_info_,
                          Carseats,
                          shelve_var)
df
| | ShelveLoc[Good] | ShelveLoc[Medium] |
|---|---|---|
| 0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 |
| 2 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 |
| ... | ... | ... |
| 395 | 1.0 | 0.0 |
| 396 | 0.0 | 1.0 |
| 397 | 0.0 | 1.0 |
| 398 | 0.0 | 0.0 |
| 399 | 1.0 | 0.0 |

400 rows × 2 columns
The design matrix is constructed by running through `terms_` and concatenating the corresponding columns, as sketched below.
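In outline, this is what happens after `fit`; a sketch that ignores the intercept column:

# build each term's columns, then concatenate them side by side
frames = [build_columns(MS.column_info_, Carseats, term)[0]
          for term in MS.terms_]
pd.concat(frames, axis=1).iloc[:5]   # ShelveLoc contrasts followed by Price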
### `Feature` objects
Note that `Feature` objects have a tuple of `variables` as well as an `encoder` attribute. The tuple of `variables` first yields a concatenated dataframe from all of the corresponding variables, which is then run through `encoder.transform`. The `encoder.fit` method of each `Feature` is run once, during the call to `ModelSpec.fit`.
new_var = Feature(('Price', 'Income', 'OIncome'), name='mynewvar', encoder=None)
build_columns(MS.column_info_,
              Carseats,
              new_var)[0]
| | Price | Income | OIncome |
|---|---|---|---|
| 0 | 120.0 | 73.0 | 2.0 |
| 1 | 83.0 | 48.0 | 1.0 |
| 2 | 80.0 | 35.0 | 1.0 |
| 3 | 97.0 | 100.0 | 0.0 |
| 4 | 128.0 | 64.0 | 2.0 |
| ... | ... | ... | ... |
| 395 | 128.0 | 108.0 | 0.0 |
| 396 | 120.0 | 23.0 | 1.0 |
| 397 | 159.0 | 26.0 | 1.0 |
| 398 | 95.0 | 79.0 | 2.0 |
| 399 | 120.0 | 37.0 | 1.0 |

400 rows × 3 columns
Let’s now transform these columns with an encoder. Within `ModelSpec` we will first build the arrays above, then call `pca.fit`, and finally `pca.transform` within `build_columns`.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(build_columns(MS.column_info_, Carseats, new_var)[0]) # this is done within `ModelSpec.fit`
pca_var = Feature(('Price', 'Income', 'OIncome'), name='mynewvar', encoder=pca)
build_columns(MS.column_info_,
              Carseats,
              pca_var)[0]
| | mynewvar[0] | mynewvar[1] |
|---|---|---|
| 0 | 3.595740 | 4.850530 |
| 1 | -15.070401 | -35.706773 |
| 2 | -27.412228 | -40.772377 |
| 3 | 33.983048 | -13.468087 |
| 4 | -6.580644 | 11.287452 |
| ... | ... | ... |
| 395 | 36.856308 | 18.418138 |
| 396 | -45.731520 | -3.243768 |
| 397 | -49.087659 | 35.727136 |
| 398 | 13.565178 | -18.847760 |
| 399 | -31.917072 | -0.976615 |

400 rows × 2 columns
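Since `encoder.fit` is run during `ModelSpec.fit`, we could instead hand an unfitted PCA to a `Feature` and let `ModelSpec` manage the fitting. A sketch, assuming `Feature` objects are accepted directly in `terms`:

# the unfitted PCA will be fit during ModelSpec.fit
pca_term = Feature(('Price', 'Income', 'OIncome'), name='pcavar',
                   encoder=PCA(n_components=2))
ModelSpec([pca_term]).fit_transform(Carseats)[:5]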
The elements of the `variables` attribute may be column identifiers (`"Price"`), `Column` instances (`price`), or `Feature` instances (`pca_var`).
price = MS.column_info_['Price']
fancy_var = Feature(('Income', price, pca_var), name='fancy', encoder=None)
build_columns(MS.column_info_,
              Carseats,
              fancy_var)[0]
| | Income | Price | mynewvar[0] | mynewvar[1] |
|---|---|---|---|---|
| 0 | 73.0 | 120.0 | 3.595740 | 4.850530 |
| 1 | 48.0 | 83.0 | -15.070401 | -35.706773 |
| 2 | 35.0 | 80.0 | -27.412228 | -40.772377 |
| 3 | 100.0 | 97.0 | 33.983048 | -13.468087 |
| 4 | 64.0 | 128.0 | -6.580644 | 11.287452 |
| ... | ... | ... | ... | ... |
| 395 | 108.0 | 128.0 | 36.856308 | 18.418138 |
| 396 | 23.0 | 120.0 | -45.731520 | -3.243768 |
| 397 | 26.0 | 159.0 | -49.087659 | 35.727136 |
| 398 | 79.0 | 95.0 | 13.565178 | -18.847760 |
| 399 | 37.0 | 120.0 | -31.917072 | -0.976615 |

400 rows × 4 columns
## Predicting at new points
MS = ModelSpec(['Price', 'Income']).fit(Carseats)
X = MS.transform(Carseats)
Y = Carseats['Sales']
M_ols = sm.OLS(Y, X).fit()
M_ols.params
intercept 12.661546
Price -0.052213
Income 0.012829
dtype: float64
As `ModelSpec` is a transformer, it can be evaluated at new feature values. Constructing the design matrix at any values is carried out by the `transform` method.
new_data = pd.DataFrame({'Price':[40, 50], 'Income':[10, 20]})
new_X = MS.transform(new_data)
M_ols.get_prediction(new_X).predicted_mean
array([10.70130676, 10.307465 ])
## Using `np.ndarray`
As the basic model is to concatenate columns extracted from a columnar data representation, one can use an `np.ndarray` as the column data. In this case, columns will be selected by integer indices.
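For instance, a sketch with a two-column array, where the terms are the integer positions 0 and 1:

arr = np.asarray(Carseats[['Price', 'Income']])
ModelSpec([0, 1]).fit_transform(arr)[:3]   # intercept plus the two selected columns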
### Caveats using `np.ndarray`
If the `terms` refer to only a few columns of the data frame, the `transform` method needs only a data frame containing those columns. However, if we had used an `np.ndarray`, the column identifiers would be integers identifying specific columns, so, in order to work correctly, `transform` would need another `np.ndarray` whose columns have the same meaning. Moreover, unless all features are floats, an `np.ndarray` built from mixed columns will default to a dtype of `object`, complicating matters.
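The dtype issue is easy to see (a quick check):

# mixed numeric and categorical columns collapse to dtype('O')
np.asarray(Carseats[['Price', 'ShelveLoc']]).dtype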
We illustrate this below, where we build a model for `Sales` from `Price` and `Income` and want to find predictions at new values of `Price` and `Income`. We first find the predictions using a `pd.DataFrame` and then illustrate the difficulties in using an `np.ndarray`.
We will refit this model, using `ModelSpec` with an `np.ndarray` instead:
Carseats_np = np.asarray(Carseats[['Price', 'Education', 'Income']])
MS_np = ModelSpec([0,2]).fit(Carseats_np)
MS_np.transform(Carseats_np)
array([[  1., 120.,  73.],
       [  1.,  83.,  48.],
       [  1.,  80.,  35.],
       ...,
       [  1., 159.,  26.],
       [  1.,  95.,  79.],
       [  1., 120.,  37.]])
M_ols_np = sm.OLS(Y, MS_np.transform(Carseats_np)).fit()
M_ols_np.params
const 12.661546
x1 -0.052213
x2 0.012829
dtype: float64
Now, let’s consider finding the design matrix at new points.
When using a `pd.DataFrame`, we only need to supply the `transform` method a data frame with the columns implicated in the `terms` argument (in this case, `Price` and `Income`).
However, when using an `np.ndarray` with integers as indices, `Price` was column 0 and `Income` was column 2. The only sensible way to produce a design matrix for prediction is to extract the 0th and 2nd columns of the new data. Note that this means the columns of an `np.ndarray` provided to `transform` must mean essentially the same thing as those passed to `fit`.
try:
    new_D = np.array([[40,50], [10,20]]).T
    new_X = MS_np.transform(new_D)
except IndexError as e:
    print(e)
index 2 is out of bounds for axis 1 with size 2
Ultimately, the model expects 3 columns for new predictions because it was fit with a matrix having 3 columns (the first representing an intercept).
We might be tempted, as with the `pd.DataFrame`, to supply only the necessary variables. Instead, we must insert a placeholder column (here filled with `np.nan`) so that `Price` and `Income` keep their original positions.
new_D = np.array([[40,50], [np.nan, np.nan], [10,20]]).T
new_X = MS_np.transform(new_D)
print(new_X)
M_ols.get_prediction(new_X).predicted_mean
[[ 1. 40. 10.]
 [ 1. 50. 20.]]
array([10.70130676, 10.307465 ])
For more complicated design constructions, ensuring that the columns of `new_D` match those of the original data will be more cumbersome. We expect, then, that a `pd.DataFrame` (or a columnar data representation with a similar API) will likely be easier to use with `ModelSpec`.