Building design matrices with `ModelSpec`#

The ISLP package provides a facility to build design matrices for regression and classification tasks. It provides similar functionality to the formula notation of R though uses python objects rather than specification through the special formula syntax.

Related tools include patsy and ColumnTransformer from sklearn.compose.

Perhaps the most common use is to extract some columns from a pd.DataFrame and produce a design matrix, optionally with an intercept.

import pandas as pd
import numpy as np

from ISLP import load_data
from ISLP.models import (ModelSpec,
                         summarize,
                         Column,
                         Feature,
                         build_columns)

import statsmodels.api as sm

Carseats = load_data('Carseats')
Carseats.columns

Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'ShelveLoc', 'Age', 'Education', 'Urban', 'US'],
      dtype='object')

We’ll first build a design matrix that we can use to model Sales in terms of the categorical variable ShelveLoc and Price.

We see first that ShelveLoc is a categorical variable:

Carseats['ShelveLoc']

       Bad
      Good
    Medium
    Medium
       Bad
        ...  
    Good
  Medium
  Medium
     Bad
    Good
Name: ShelveLoc, Length: 400, dtype: category
Categories (3, object): ['Bad', 'Good', 'Medium']

This is recognized by ModelSpec and only 2 columns are added for the three levels. The default behavior is to drop the first level of the categories. Later, we will show other contrasts of the 3 columns can be produced.

This simple example below illustrates how the first argument (its terms) is used to construct a design matrix.

MS = ModelSpec(['ShelveLoc', 'Price'])
X = MS.fit_transform(Carseats)
X.iloc[:10]

	intercept	ShelveLoc[Good]	ShelveLoc[Medium]	Price
0	1.0	0.0	0.0	120
1	1.0	1.0	0.0	83
2	1.0	0.0	1.0	80
3	1.0	0.0	1.0	97
4	1.0	0.0	0.0	128
5	1.0	0.0	0.0	72
6	1.0	0.0	1.0	108
7	1.0	1.0	0.0	120
8	1.0	0.0	1.0	124
9	1.0	0.0	1.0	124

We note that a column has been added for the intercept by default. This can be changed using the intercept argument.

MS_no1 = ModelSpec(['ShelveLoc', 'Price'], intercept=False)
MS_no1.fit_transform(Carseats)[:10]

	ShelveLoc[Good]	ShelveLoc[Medium]	Price
0	0.0	0.0	120
1	1.0	0.0	83
2	0.0	1.0	80
3	0.0	1.0	97
4	0.0	0.0	128
5	0.0	0.0	72
6	0.0	1.0	108
7	1.0	0.0	120
8	0.0	1.0	124
9	0.0	1.0	124

We see that ShelveLoc still only contributes two columns to the design. The ModelSpec object does no introspection of its arguments to effectively include an intercept term in the column space of the design matrix.

To include this intercept via ShelveLoc we can use 3 columns to encode this categorical variable. Following the nomenclature of R, we call this a Contrast of the categorical variable.

from ISLP.models import contrast
shelve = contrast('ShelveLoc', None)
MS_contr = ModelSpec([shelve, 'Price'], intercept=False)
MS_contr.fit_transform(Carseats)[:10]

	ShelveLoc[Bad]	ShelveLoc[Good]	ShelveLoc[Medium]	Price
0	1.0	0.0	0.0	120
1	0.0	1.0	0.0	83
2	0.0	0.0	1.0	80
3	0.0	0.0	1.0	97
4	1.0	0.0	0.0	128
5	1.0	0.0	0.0	72
6	0.0	0.0	1.0	108
7	0.0	1.0	0.0	120
8	0.0	0.0	1.0	124
9	0.0	0.0	1.0	124

This example above illustrates that columns need not be identified by name in terms. The basic role of an item in the terms sequence is a description of how to extract a column from a columnar data object, usually a pd.DataFrame.

shelve

Column(idx='ShelveLoc', name='ShelveLoc', is_categorical=True, is_ordinal=False, columns=(), encoder=Contrast(method=None))

The Column object can be used to directly extract relevant columns from a pd.DataFrame. If the encoder field is not None, then the extracted columns will be passed through encoder. The get_columns method produces these columns as well as names for the columns.

shelve.get_columns(Carseats)

(array([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.],
        ...,
        [0., 0., 1.],
        [1., 0., 0.],
        [0., 1., 0.]]),
 ['ShelveLoc[Bad]', 'ShelveLoc[Good]', 'ShelveLoc[Medium]'])

Let’s now fit a simple OLS model with this design.

X = MS_contr.transform(Carseats)
Y = Carseats['Sales']
M_ols = sm.OLS(Y, X).fit()
summarize(M_ols)

	coef	std err	t
ShelveLoc[Bad]	12.0018	0.503	23.839
ShelveLoc[Good]	16.8976	0.522	32.386
ShelveLoc[Medium]	13.8638	0.487	28.467
Price	-0.0567	0.004	-13.967

Interactions#

One of the common uses of formulae in R is to specify interactions between variables. This is done in ModelSpec by including a tuple in the terms argument.

ModelSpec([(shelve, 'Price'), 'Price']).fit_transform(Carseats).iloc[:10]

	intercept	ShelveLoc[Bad]:Price	ShelveLoc[Good]:Price	ShelveLoc[Medium]:Price	Price
0	1.0	120.0	0.0	0.0	120
1	1.0	0.0	83.0	0.0	83
2	1.0	0.0	0.0	80.0	80
3	1.0	0.0	0.0	97.0	97
4	1.0	128.0	0.0	0.0	128
5	1.0	72.0	0.0	0.0	72
6	1.0	0.0	0.0	108.0	108
7	1.0	0.0	120.0	0.0	120
8	1.0	0.0	0.0	124.0	124
9	1.0	0.0	0.0	124.0	124

The above design matrix is clearly rank deficient, as ModelSpec has not inspected the formula and attempted to produce a corresponding matrix that may or may not match a user’s intent.

Ordinal variables#

Ordinal variables are handled by a corresponding encoder)

Carseats['OIncome'] = pd.cut(Carseats['Income'], 
                             [0,50,90,200], 
                             labels=['L','M','H'])
MS_order = ModelSpec(['OIncome']).fit(Carseats)

Part of the fit method of ModelSpec involves inspection of the columns of Carseats. The results of that inspection can be found in the column_info_ attribute:

MS_order.column_info_

{'Sales': Column(idx='Sales', name='Sales', is_categorical=False, is_ordinal=False, columns=('Sales',), encoder=None),
 'CompPrice': Column(idx='CompPrice', name='CompPrice', is_categorical=False, is_ordinal=False, columns=('CompPrice',), encoder=None),
 'Income': Column(idx='Income', name='Income', is_categorical=False, is_ordinal=False, columns=('Income',), encoder=None),
 'Advertising': Column(idx='Advertising', name='Advertising', is_categorical=False, is_ordinal=False, columns=('Advertising',), encoder=None),
 'Population': Column(idx='Population', name='Population', is_categorical=False, is_ordinal=False, columns=('Population',), encoder=None),
 'Price': Column(idx='Price', name='Price', is_categorical=False, is_ordinal=False, columns=('Price',), encoder=None),
 'ShelveLoc': Column(idx='ShelveLoc', name='ShelveLoc', is_categorical=True, is_ordinal=False, columns=('ShelveLoc[Good]', 'ShelveLoc[Medium]'), encoder=Contrast()),
 'Age': Column(idx='Age', name='Age', is_categorical=False, is_ordinal=False, columns=('Age',), encoder=None),
 'Education': Column(idx='Education', name='Education', is_categorical=False, is_ordinal=False, columns=('Education',), encoder=None),
 'Urban': Column(idx='Urban', name='Urban', is_categorical=True, is_ordinal=False, columns=('Urban[Yes]',), encoder=Contrast()),
 'US': Column(idx='US', name='US', is_categorical=True, is_ordinal=False, columns=('US[Yes]',), encoder=Contrast()),
 'OIncome': Column(idx='OIncome', name='OIncome', is_categorical=True, is_ordinal=True, columns=('OIncome',), encoder=OrdinalEncoder())}

Structure of a `ModelSpec`#

The first argument to ModelSpec is stored as the terms attribute. Under the hood, this sequence is inspected to produce the terms_ attribute which specify the objects that will ultimately create the design matrix.

MS = ModelSpec(['ShelveLoc', 'Price'])
MS.fit(Carseats)
MS.terms_

[Feature(variables=('ShelveLoc',), name='ShelveLoc', encoder=None, use_transform=True, pure_columns=True, override_encoder_colnames=False),
 Feature(variables=('Price',), name='Price', encoder=None, use_transform=True, pure_columns=True, override_encoder_colnames=False)]

Each element of terms_ should be a Feature which describes a set of columns to be extracted from a columnar data form as well as possible a possible encoder.

shelve_var = MS.terms_[0]

We can find the columns associated to each term using the build_columns method of ModelSpec:

df, names = build_columns(MS.column_info_,
                          Carseats, 
                          shelve_var)
df

	ShelveLoc[Good]	ShelveLoc[Medium]
0	0.0	0.0
1	1.0	0.0
2	0.0	1.0
3	0.0	1.0
4	0.0	0.0
...	...	...
395	1.0	0.0
396	0.0	1.0
397	0.0	1.0
398	0.0	0.0
399	1.0	0.0

400 rows × 2 columns

The design matrix is constructed by running through terms_ and concatenating the corresponding columns.

`Feature` objects#

Note that Feature objects have a tuple of variables as well as an encoder attribute. The tuple of variables first creates a concatenated dataframe from all corresponding variables and then is run through encoder.transform. The encoder.fit method of each Feature is run once during the call to ModelSpec.fit.

new_var = Feature(('Price', 'Income', 'OIncome'), name='mynewvar', encoder=None)
build_columns(MS.column_info_,
              Carseats, 
              new_var)[0]

	Price	Income	OIncome
0	120.0	73.0	2.0
1	83.0	48.0	1.0
2	80.0	35.0	1.0
3	97.0	100.0	0.0
4	128.0	64.0	2.0
...	...	...	...
395	128.0	108.0	0.0
396	120.0	23.0	1.0
397	159.0	26.0	1.0
398	95.0	79.0	2.0
399	120.0	37.0	1.0

400 rows × 3 columns

Let’s now transform these columns with an encoder. Within ModelSpec we will first build the arrays above and then call pca.fit and finally pca.transform within design.build_columns.

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(build_columns(MS.column_info_, Carseats, new_var)[0]) # this is done within `ModelSpec.fit`
pca_var = Feature(('Price', 'Income', 'OIncome'), name='mynewvar', encoder=pca)
build_columns(MS.column_info_,
              Carseats, 
              pca_var)[0]

	mynewvar[0]	mynewvar[1]
0	3.595740	4.850530
1	-15.070401	-35.706773
2	-27.412228	-40.772377
3	33.983048	-13.468087
4	-6.580644	11.287452
...	...	...
395	36.856308	18.418138
396	-45.731520	-3.243768
397	-49.087659	35.727136
398	13.565178	-18.847760
399	-31.917072	-0.976615

400 rows × 2 columns

The elements of the variables attribute may be column identifiers ( "Price"), Column instances (price) or Feature instances (pca_var).

price = MS.column_info_['Price']
fancy_var = Feature(('Income', price, pca_var), name='fancy', encoder=None)
build_columns(MS.column_info_,
              Carseats, 
              fancy_var)[0]

	Income	Price	mynewvar[0]	mynewvar[1]
0	73.0	120.0	3.595740	4.850530
1	48.0	83.0	-15.070401	-35.706773
2	35.0	80.0	-27.412228	-40.772377
3	100.0	97.0	33.983048	-13.468087
4	64.0	128.0	-6.580644	11.287452
...	...	...	...	...
395	108.0	128.0	36.856308	18.418138
396	23.0	120.0	-45.731520	-3.243768
397	26.0	159.0	-49.087659	35.727136
398	79.0	95.0	13.565178	-18.847760
399	37.0	120.0	-31.917072	-0.976615

400 rows × 4 columns

Predicting at new points#

MS = ModelSpec(['Price', 'Income']).fit(Carseats)
X = MS.transform(Carseats)
Y = Carseats['Sales']
M_ols = sm.OLS(Y, X).fit()
M_ols.params

intercept    12.661546
Price        -0.052213
Income        0.012829
dtype: float64

As ModelSpec is a transformer, it can be evaluated at new feature values. Constructing the design matrix at any values is carried out by the transform method.

new_data = pd.DataFrame({'Price':[40, 50], 'Income':[10, 20]})
new_X = MS.transform(new_data)
M_ols.get_prediction(new_X).predicted_mean

array([10.70130676, 10.307465  ])

Using `np.ndarray`#

As the basic model is to concatenate columns extracted from a columnar data representation, one can use np.ndarray as the column data. In this case, columns will be selected by integer indices.

Caveats using `np.ndarray`#

If the terms only refer to a few columns of the data frame, the transform method only needs a dataframe with those columns. However, unless all features are floats, np.ndarray will default to a dtype of object, complicating issues.

However, if we had used an np.ndarray, the column identifiers would be integers identifying specific columns so, in order to work correctly, transform would need another np.ndarray where the columns have the same meaning.

We illustrate this below, where we build a model from Price and Income for Sales and want to find predictions at new values of Price and Location. We first find the predicitions using pd.DataFrame and then illustrate the difficulties in using np.ndarray.

We will refit this model, using ModelSpec with an np.ndarray instead

Carseats_np = np.asarray(Carseats[['Price', 'Education', 'Income']])
MS_np = ModelSpec([0,2]).fit(Carseats_np)
MS_np.transform(Carseats_np)

array([[  1., 120.,  73.],
       [  1.,  83.,  48.],
       [  1.,  80.,  35.],
       ...,
       [  1., 159.,  26.],
       [  1.,  95.,  79.],
       [  1., 120.,  37.]])

M_ols_np = sm.OLS(Y, MS_np.transform(Carseats_np)).fit()
M_ols_np.params

const    12.661546
x1       -0.052213
x2        0.012829
dtype: float64

Now, let’s consider finding the design matrix at new points. When using pd.DataFrame we only need to supply the transform method a data frame with columns implicated in the terms argument (in this case, Price and Income).

However, when using np.ndarray with integers as indices, Price was column 0 and Income was column 2. The only sensible way to produce a return for predict is to extract its 0th and 2nd columns. Note this means that the meaning of columns in an np.ndarray provided to transform essentially must be identical to those passed to fit.

try:
    new_D = np.array([[40,50], [10,20]]).T
    new_X = MS_np.transform(new_D)
except IndexError as e:
    print(e)

index 2 is out of bounds for axis 1 with size 2

Ultimately, M expects 3 columns for new predictions because it was fit with a matrix having 3 columns (the first representing an intercept).

We might be tempted to try as with the pd.DataFrame and produce an np.ndarray with only the necessary variables.

new_D = np.array([[40,50], [np.nan, np.nan], [10,20]]).T
new_X = MS_np.transform(new_D)
print(new_X)
M_ols.get_prediction(new_X).predicted_mean

[[ 1. 40. 10.]
 [ 1. 50. 20.]]

array([10.70130676, 10.307465  ])

For more complicated design contructions ensuring the columns of new_D match that of the original data will be more cumbersome. We expect then that pd.DataFrame (or a columnar data representation with similar API) will likely be easier to use with ModelSpec.

Building design matrices with ModelSpec

Contents

Building design matrices with `ModelSpec`#

Interactions#

Ordinal variables#

Structure of a `ModelSpec`#

`Feature` objects#

Predicting at new points#

Using `np.ndarray`#

Caveats using `np.ndarray`#

	intercept	ShelveLoc[Good]	ShelveLoc[Medium]	Price
0	1.0	0.0	0.0	120
1	1.0	1.0	0.0	83
2	1.0	0.0	1.0	80
3	1.0	0.0	1.0	97
4	1.0	0.0	0.0	128
5	1.0	0.0	0.0	72
6	1.0	0.0	1.0	108
7	1.0	1.0	0.0	120
8	1.0	0.0	1.0	124
9	1.0	0.0	1.0	124

	ShelveLoc[Good]	ShelveLoc[Medium]	Price
0	0.0	0.0	120
1	1.0	0.0	83
2	0.0	1.0	80
3	0.0	1.0	97
4	0.0	0.0	128
5	0.0	0.0	72
6	0.0	1.0	108
7	1.0	0.0	120
8	0.0	1.0	124
9	0.0	1.0	124

	ShelveLoc[Bad]	ShelveLoc[Good]	ShelveLoc[Medium]	Price
0	1.0	0.0	0.0	120
1	0.0	1.0	0.0	83
2	0.0	0.0	1.0	80
3	0.0	0.0	1.0	97
4	1.0	0.0	0.0	128
5	1.0	0.0	0.0	72
6	0.0	0.0	1.0	108
7	0.0	1.0	0.0	120
8	0.0	0.0	1.0	124
9	0.0	0.0	1.0	124

	ShelveLoc[Good]	ShelveLoc[Medium]
0	0.0	0.0
1	1.0	0.0
2	0.0	1.0
3	0.0	1.0
4	0.0	0.0
...	...	...
395	1.0	0.0
396	0.0	1.0
397	0.0	1.0
398	0.0	0.0
399	1.0	0.0

	intercept	ShelveLoc[Good]	ShelveLoc[Medium]	Price
0	1.0	0.0	0.0	120
1	1.0	1.0	0.0	83
2	1.0	0.0	1.0	80
3	1.0	0.0	1.0	97
4	1.0	0.0	0.0	128
5	1.0	0.0	0.0	72
6	1.0	0.0	1.0	108
7	1.0	1.0	0.0	120
8	1.0	0.0	1.0	124
9	1.0	0.0	1.0	124

	ShelveLoc[Good]	ShelveLoc[Medium]	Price
0	0.0	0.0	120
1	1.0	0.0	83
2	0.0	1.0	80
3	0.0	1.0	97
4	0.0	0.0	128
5	0.0	0.0	72
6	0.0	1.0	108
7	1.0	0.0	120
8	0.0	1.0	124
9	0.0	1.0	124

	ShelveLoc[Bad]	ShelveLoc[Good]	ShelveLoc[Medium]	Price
0	1.0	0.0	0.0	120
1	0.0	1.0	0.0	83
2	0.0	0.0	1.0	80
3	0.0	0.0	1.0	97
4	1.0	0.0	0.0	128
5	1.0	0.0	0.0	72
6	0.0	0.0	1.0	108
7	0.0	1.0	0.0	120
8	0.0	0.0	1.0	124
9	0.0	0.0	1.0	124

	ShelveLoc[Good]	ShelveLoc[Medium]
0	0.0	0.0
1	1.0	0.0
2	0.0	1.0
3	0.0	1.0
4	0.0	0.0
...	...	...
395	1.0	0.0
396	0.0	1.0
397	0.0	1.0
398	0.0	0.0
399	1.0	0.0

Building design matrices with ModelSpec

Contents

Building design matrices with ModelSpec#

Interactions#

Ordinal variables#

Structure of a ModelSpec#

Feature objects#

Predicting at new points#

Using np.ndarray#

Caveats using np.ndarray#

Building design matrices with `ModelSpec`#

Structure of a `ModelSpec`#

`Feature` objects#

Using `np.ndarray`#

Caveats using `np.ndarray`#

	intercept	ShelveLoc[Good]	ShelveLoc[Medium]	Price
0	1.0	0.0	0.0	120
1	1.0	1.0	0.0	83
2	1.0	0.0	1.0	80
3	1.0	0.0	1.0	97
4	1.0	0.0	0.0	128
5	1.0	0.0	0.0	72
6	1.0	0.0	1.0	108
7	1.0	1.0	0.0	120
8	1.0	0.0	1.0	124
9	1.0	0.0	1.0	124

	ShelveLoc[Good]	ShelveLoc[Medium]	Price
0	0.0	0.0	120
1	1.0	0.0	83
2	0.0	1.0	80
3	0.0	1.0	97
4	0.0	0.0	128
5	0.0	0.0	72
6	0.0	1.0	108
7	1.0	0.0	120
8	0.0	1.0	124
9	0.0	1.0	124

	ShelveLoc[Bad]	ShelveLoc[Good]	ShelveLoc[Medium]	Price
0	1.0	0.0	0.0	120
1	0.0	1.0	0.0	83
2	0.0	0.0	1.0	80
3	0.0	0.0	1.0	97
4	1.0	0.0	0.0	128
5	1.0	0.0	0.0	72
6	0.0	0.0	1.0	108
7	0.0	1.0	0.0	120
8	0.0	0.0	1.0	124
9	0.0	0.0	1.0	124

	ShelveLoc[Good]	ShelveLoc[Medium]
0	0.0	0.0
1	1.0	0.0
2	0.0	1.0
3	0.0	1.0
4	0.0	0.0
...	...	...
395	1.0	0.0
396	0.0	1.0
397	0.0	1.0
398	0.0	0.0
399	1.0	0.0