Building design matrices with ModelSpec#

The ISLP package provides a facility to build design matrices for regression and classification tasks. It offers functionality similar to R's formula notation, though models are specified through Python objects rather than a special formula syntax.

Related tools include patsy and ColumnTransformer from sklearn.compose.

Perhaps the most common use is to extract some columns from a pd.DataFrame and produce a design matrix, optionally with an intercept.

import pandas as pd
import numpy as np

from ISLP import load_data
from ISLP.models import (ModelSpec,
                         summarize,
                         Column,
                         Feature,
                         build_columns)

import statsmodels.api as sm
Carseats = load_data('Carseats')
Carseats.columns
Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'ShelveLoc', 'Age', 'Education', 'Urban', 'US'],
      dtype='object')

We’ll first build a design matrix that we can use to model Sales in terms of the categorical variable ShelveLoc and the quantitative variable Price.

We see first that ShelveLoc is a categorical variable:

Carseats['ShelveLoc']
0         Bad
1        Good
2      Medium
3      Medium
4         Bad
        ...  
395      Good
396    Medium
397    Medium
398       Bad
399      Good
Name: ShelveLoc, Length: 400, dtype: category
Categories (3, object): ['Bad', 'Good', 'Medium']

This is recognized by ModelSpec, and only 2 columns are added for the three levels. The default behavior is to drop the first level of the categories. Later, we will show how other contrasts of the three levels can be produced.

The simple example below illustrates how the first argument (its terms) is used to construct a design matrix.

MS = ModelSpec(['ShelveLoc', 'Price'])
X = MS.fit_transform(Carseats)
X.iloc[:10]
   intercept  ShelveLoc[Good]  ShelveLoc[Medium]  Price
0        1.0              0.0                0.0    120
1        1.0              1.0                0.0     83
2        1.0              0.0                1.0     80
3        1.0              0.0                1.0     97
4        1.0              0.0                0.0    128
5        1.0              0.0                0.0     72
6        1.0              0.0                1.0    108
7        1.0              1.0                0.0    120
8        1.0              0.0                1.0    124
9        1.0              0.0                1.0    124

We note that a column has been added for the intercept by default. This can be changed using the intercept argument.

MS_no1 = ModelSpec(['ShelveLoc', 'Price'], intercept=False)
MS_no1.fit_transform(Carseats)[:10]
   ShelveLoc[Good]  ShelveLoc[Medium]  Price
0              0.0                0.0    120
1              1.0                0.0     83
2              0.0                1.0     80
3              0.0                1.0     97
4              0.0                0.0    128
5              0.0                0.0     72
6              0.0                1.0    108
7              1.0                0.0    120
8              0.0                1.0    124
9              0.0                1.0    124

We see that ShelveLoc still contributes only two columns to the design. The ModelSpec object does no introspection of its arguments to check whether an intercept term is effectively included in the column space of the design matrix.

To include the intercept in the column space via ShelveLoc, we can use all 3 columns to encode this categorical variable. Following the nomenclature of R, we call this a Contrast of the categorical variable.

from ISLP.models import contrast
shelve = contrast('ShelveLoc', None)
MS_contr = ModelSpec([shelve, 'Price'], intercept=False)
MS_contr.fit_transform(Carseats)[:10]
   ShelveLoc[Bad]  ShelveLoc[Good]  ShelveLoc[Medium]  Price
0             1.0              0.0                0.0    120
1             0.0              1.0                0.0     83
2             0.0              0.0                1.0     80
3             0.0              0.0                1.0     97
4             1.0              0.0                0.0    128
5             1.0              0.0                0.0     72
6             0.0              0.0                1.0    108
7             0.0              1.0                0.0    120
8             0.0              0.0                1.0    124
9             0.0              0.0                1.0    124

The example above illustrates that columns need not be identified by name in terms. The basic role of an item in the terms sequence is to describe how to extract columns from a columnar data object, usually a pd.DataFrame.

shelve
Column(idx='ShelveLoc', name='ShelveLoc', is_categorical=True, is_ordinal=False, columns=(), encoder=Contrast(method=None))

The Column object can be used to directly extract relevant columns from a pd.DataFrame. If the encoder field is not None, then the extracted columns will be passed through encoder. The get_columns method produces these columns as well as names for the columns.

shelve.get_columns(Carseats)
(array([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.],
        ...,
        [0., 0., 1.],
        [1., 0., 0.],
        [0., 1., 0.]]),
 ['ShelveLoc[Bad]', 'ShelveLoc[Good]', 'ShelveLoc[Medium]'])

Let’s now fit a simple OLS model with this design.

X = MS_contr.transform(Carseats)
Y = Carseats['Sales']
M_ols = sm.OLS(Y, X).fit()
summarize(M_ols)
                      coef  std err        t  P>|t|
ShelveLoc[Bad]     12.0018    0.503   23.839    0.0
ShelveLoc[Good]    16.8976    0.522   32.386    0.0
ShelveLoc[Medium]  13.8638    0.487   28.467    0.0
Price              -0.0567    0.004  -13.967    0.0
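
Since the three contrast columns sum to the intercept, this fit is just a reparametrization of the default design. As a quick sanity check (a sketch, reusing the objects built above), the fitted values from the two parametrizations should agree:

# The default design (intercept + 2 dummies) spans the same column space
# as the 3-column contrast, so the two OLS fits have identical fitted values.
X_default = ModelSpec(['ShelveLoc', 'Price']).fit_transform(Carseats)
M_default = sm.OLS(Y, X_default).fit()
np.allclose(M_default.fittedvalues, M_ols.fittedvalues)  # expect True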

Interactions#

One of the common uses of formulae in R is to specify interactions between variables. This is done in ModelSpec by including a tuple in the terms argument.

ModelSpec([(shelve, 'Price'), 'Price']).fit_transform(Carseats).iloc[:10]
   intercept  ShelveLoc[Bad]:Price  ShelveLoc[Good]:Price  ShelveLoc[Medium]:Price  Price
0        1.0                 120.0                    0.0                      0.0    120
1        1.0                   0.0                   83.0                      0.0     83
2        1.0                   0.0                    0.0                     80.0     80
3        1.0                   0.0                    0.0                     97.0     97
4        1.0                 128.0                    0.0                      0.0    128
5        1.0                  72.0                    0.0                      0.0     72
6        1.0                   0.0                    0.0                    108.0    108
7        1.0                   0.0                  120.0                      0.0    120
8        1.0                   0.0                    0.0                    124.0    124
9        1.0                   0.0                    0.0                    124.0    124

The design matrix above is clearly rank deficient: the three interaction columns sum to the Price column. ModelSpec does not inspect its terms and attempt to produce a corresponding full-rank matrix, as any such correction may or may not match a user’s intent.
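
We can verify the rank deficiency directly (a quick sketch using np.linalg.matrix_rank):

# The three interaction columns sum to the Price column, so the design
# has 5 columns but rank 4.
X_int = ModelSpec([(shelve, 'Price'), 'Price']).fit_transform(Carseats)
print(X_int.shape[1], np.linalg.matrix_rank(np.asarray(X_int, dtype=float)))  # expect: 5 4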

Ordinal variables#

Ordinal variables are handled by a corresponding encoder:

Carseats['OIncome'] = pd.cut(Carseats['Income'], 
                             [0,50,90,200], 
                             labels=['L','M','H'])
MS_order = ModelSpec(['OIncome']).fit(Carseats)

Part of the fit method of ModelSpec involves inspection of the columns of Carseats. The results of that inspection can be found in the column_info_ attribute:

MS_order.column_info_
{'Sales': Column(idx='Sales', name='Sales', is_categorical=False, is_ordinal=False, columns=('Sales',), encoder=None),
 'CompPrice': Column(idx='CompPrice', name='CompPrice', is_categorical=False, is_ordinal=False, columns=('CompPrice',), encoder=None),
 'Income': Column(idx='Income', name='Income', is_categorical=False, is_ordinal=False, columns=('Income',), encoder=None),
 'Advertising': Column(idx='Advertising', name='Advertising', is_categorical=False, is_ordinal=False, columns=('Advertising',), encoder=None),
 'Population': Column(idx='Population', name='Population', is_categorical=False, is_ordinal=False, columns=('Population',), encoder=None),
 'Price': Column(idx='Price', name='Price', is_categorical=False, is_ordinal=False, columns=('Price',), encoder=None),
 'ShelveLoc': Column(idx='ShelveLoc', name='ShelveLoc', is_categorical=True, is_ordinal=False, columns=('ShelveLoc[Good]', 'ShelveLoc[Medium]'), encoder=Contrast()),
 'Age': Column(idx='Age', name='Age', is_categorical=False, is_ordinal=False, columns=('Age',), encoder=None),
 'Education': Column(idx='Education', name='Education', is_categorical=False, is_ordinal=False, columns=('Education',), encoder=None),
 'Urban': Column(idx='Urban', name='Urban', is_categorical=True, is_ordinal=False, columns=('Urban[Yes]',), encoder=Contrast()),
 'US': Column(idx='US', name='US', is_categorical=True, is_ordinal=False, columns=('US[Yes]',), encoder=Contrast()),
 'OIncome': Column(idx='OIncome', name='OIncome', is_categorical=True, is_ordinal=True, columns=('OIncome',), encoder=OrdinalEncoder())}
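
The entries are ordinary Column objects, so they can be used directly. For instance (a sketch, assuming the OrdinalEncoder was fit during ModelSpec.fit), we can extract the ordinal-coded column just as we did for shelve above:

# OIncome carries an OrdinalEncoder, so get_columns returns a single
# ordinal-coded column rather than a set of dummy columns.
vals, names = MS_order.column_info_['OIncome'].get_columns(Carseats)
vals[:5], names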

Structure of a ModelSpec#

The first argument to ModelSpec is stored as the terms attribute. Under the hood, this sequence is inspected to produce the terms_ attribute, which specifies the objects that will ultimately create the design matrix.

MS = ModelSpec(['ShelveLoc', 'Price'])
MS.fit(Carseats)
MS.terms_
[Feature(variables=('ShelveLoc',), name='ShelveLoc', encoder=None, use_transform=True, pure_columns=True, override_encoder_colnames=False),
 Feature(variables=('Price',), name='Price', encoder=None, use_transform=True, pure_columns=True, override_encoder_colnames=False)]

Each element of terms_ should be a Feature, which describes a set of columns to be extracted from a columnar data form, as well as a possible encoder.

shelve_var = MS.terms_[0]

We can find the columns associated with each term using the build_columns function:

df, names = build_columns(MS.column_info_,
                          Carseats, 
                          shelve_var)
df
     ShelveLoc[Good]  ShelveLoc[Medium]
0                0.0                0.0
1                1.0                0.0
2                0.0                1.0
3                0.0                1.0
4                0.0                0.0
..               ...                ...
395              1.0                0.0
396              0.0                1.0
397              0.0                1.0
398              0.0                0.0
399              1.0                0.0

400 rows × 2 columns

The design matrix is constructed by running through terms_ and concatenating the corresponding columns.
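
A rough sketch of this construction, reusing build_columns with the MS object fit above (and assuming term order matches the transform output):

# Build each term's columns, concatenate them, and prepend the intercept;
# the result should match ModelSpec.transform.
dfs = [build_columns(MS.column_info_, Carseats, t)[0] for t in MS.terms_]
manual_X = pd.concat(dfs, axis=1)
manual_X.insert(0, 'intercept', 1.0)
np.allclose(manual_X, MS.transform(Carseats))  # expect True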

Feature objects#

Note that Feature objects have a tuple of variables as well as an encoder attribute. A concatenated dataframe is first created from all corresponding variables and then run through encoder.transform. The encoder.fit method of each Feature is run once, during the call to ModelSpec.fit.

new_var = Feature(('Price', 'Income', 'OIncome'), name='mynewvar', encoder=None)
build_columns(MS.column_info_,
              Carseats, 
              new_var)[0]
     Price  Income  OIncome
0    120.0    73.0      2.0
1     83.0    48.0      1.0
2     80.0    35.0      1.0
3     97.0   100.0      0.0
4    128.0    64.0      2.0
..     ...     ...      ...
395  128.0   108.0      0.0
396  120.0    23.0      1.0
397  159.0    26.0      1.0
398   95.0    79.0      2.0
399  120.0    37.0      1.0

400 rows × 3 columns

Let’s now transform these columns with an encoder. Within ModelSpec, the arrays above are built first, pca.fit is called during ModelSpec.fit, and finally pca.transform is called within build_columns.

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(build_columns(MS.column_info_, Carseats, new_var)[0]) # this is done within `ModelSpec.fit`
pca_var = Feature(('Price', 'Income', 'OIncome'), name='mynewvar', encoder=pca)
build_columns(MS.column_info_,
              Carseats, 
              pca_var)[0]
     mynewvar[0]  mynewvar[1]
0       3.595740     4.850530
1     -15.070401   -35.706773
2     -27.412228   -40.772377
3      33.983048   -13.468087
4      -6.580644    11.287452
..           ...          ...
395    36.856308    18.418138
396   -45.731520    -3.243768
397   -49.087659    35.727136
398    13.565178   -18.847760
399   -31.917072    -0.976615

400 rows × 2 columns

The elements of the variables attribute may be column identifiers ("Price"), Column instances (price) or Feature instances (pca_var).

price = MS.column_info_['Price']
fancy_var = Feature(('Income', price, pca_var), name='fancy', encoder=None)
build_columns(MS.column_info_,
              Carseats, 
              fancy_var)[0]
     Income  Price  mynewvar[0]  mynewvar[1]
0      73.0  120.0     3.595740     4.850530
1      48.0   83.0   -15.070401   -35.706773
2      35.0   80.0   -27.412228   -40.772377
3     100.0   97.0    33.983048   -13.468087
4      64.0  128.0    -6.580644    11.287452
..      ...    ...          ...          ...
395   108.0  128.0    36.856308    18.418138
396    23.0  120.0   -45.731520    -3.243768
397    26.0  159.0   -49.087659    35.727136
398    79.0   95.0    13.565178   -18.847760
399    37.0  120.0   -31.917072    -0.976615

400 rows × 4 columns

Predicting at new points#

MS = ModelSpec(['Price', 'Income']).fit(Carseats)
X = MS.transform(Carseats)
Y = Carseats['Sales']
M_ols = sm.OLS(Y, X).fit()
M_ols.params
intercept    12.661546
Price        -0.052213
Income        0.012829
dtype: float64

As ModelSpec is a transformer, it can be evaluated at new feature values. Constructing the design matrix at any values is carried out by the transform method.

new_data = pd.DataFrame({'Price':[40, 50], 'Income':[10, 20]})
new_X = MS.transform(new_data)
M_ols.get_prediction(new_X).predicted_mean
array([10.70130676, 10.307465  ])
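
As a quick check, the first prediction is just the inner product of the fitted coefficients with the design row [1, 40, 10]:

# intercept + (-0.052213) * Price + 0.012829 * Income, evaluated at (40, 10)
print(M_ols.params @ np.array([1., 40., 10.]))  # about 10.7013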

Using np.ndarray#

As the basic model is to concatenate columns extracted from a columnar data representation, one can use np.ndarray as the column data. In this case, columns will be selected by integer indices.

Caveats using np.ndarray#

If the terms refer to only a few columns of the data frame, the transform method needs only a dataframe containing those columns. However, unless all features are floats, np.ndarray will default to a dtype of object, which complicates matters.
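
For instance, mixing the categorical ShelveLoc with a numeric column already forces an object dtype (a quick illustration):

# A mixed categorical/numeric selection cannot be held in a float array.
print(np.asarray(Carseats[['Price', 'ShelveLoc']]).dtype)  # object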

Further, if we had used an np.ndarray, the column identifiers would be integers identifying specific columns, so, in order for transform to work correctly, it would need another np.ndarray whose columns have the same meaning.

We illustrate this below, where we build a model for Sales from Price and Income and want to find predictions at new values of Price and Income. We first find the predictions using pd.DataFrame and then illustrate the difficulties in using np.ndarray.

We will refit this model, using ModelSpec with an np.ndarray instead.

Carseats_np = np.asarray(Carseats[['Price', 'Education', 'Income']])
MS_np = ModelSpec([0,2]).fit(Carseats_np)
MS_np.transform(Carseats_np)
array([[  1., 120.,  73.],
       [  1.,  83.,  48.],
       [  1.,  80.,  35.],
       ...,
       [  1., 159.,  26.],
       [  1.,  95.,  79.],
       [  1., 120.,  37.]])
M_ols_np = sm.OLS(Y, MS_np.transform(Carseats_np)).fit()
M_ols_np.params
const    12.661546
x1       -0.052213
x2        0.012829
dtype: float64

Now, let’s consider finding the design matrix at new points. When using pd.DataFrame, we only need to supply the transform method a data frame containing the columns implicated in the terms argument (in this case, Price and Income).

However, when using np.ndarray with integers as indices, Price was column 0 and Income was column 2. The only sensible way for transform to produce a design matrix is to extract the 0th and 2nd columns of its argument. Note this means that the meaning of the columns in an np.ndarray provided to transform must essentially be identical to those passed to fit.

try:
    new_D = np.array([[40,50], [10,20]]).T
    new_X = MS_np.transform(new_D)
except IndexError as e:
    print(e)
index 2 is out of bounds for axis 1 with size 2

Ultimately, M_ols_np expects 3 columns for new predictions because it was fit with a matrix having 3 columns (the first representing an intercept).

Above, we tried, as with the pd.DataFrame, to supply an np.ndarray with only the necessary variables. Instead, the array passed to transform must be wide enough that columns 0 and 2 keep their original meaning, with placeholder values filling the unused Education column.

new_D = np.array([[40,50], [np.nan, np.nan], [10,20]]).T
new_X = MS_np.transform(new_D)
print(new_X)
M_ols_np.get_prediction(new_X).predicted_mean
[[ 1. 40. 10.]
 [ 1. 50. 20.]]
array([10.70130676, 10.307465  ])

For more complicated design constructions, ensuring that the columns of new_D match those of the original data will be more cumbersome. We expect, then, that pd.DataFrame (or a columnar data representation with a similar API) will be easier to use with ModelSpec.