{
"cells": [
{
"cell_type": "markdown",
"id": "typical-correlation",
"metadata": {},
"source": [
"# Building design matrices with `ModelSpec`\n",
"\n",
"The `ISLP` package provides a facility to build design\n",
"matrices for regression and classification tasks. It provides similar functionality to the formula\n",
"notation of `R` though uses python objects rather than specification through the special formula syntax.\n",
"\n",
"Related tools include `patsy` and `ColumnTransformer` from `sklearn.compose`. \n",
"\n",
"Perhaps the most common use is to extract some columns from a `pd.DataFrame` and \n",
"produce a design matrix, optionally with an intercept."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "sticky-desperate",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"from ISLP import load_data\n",
"from ISLP.models import (ModelSpec,\n",
" summarize,\n",
" Column,\n",
" Feature,\n",
" build_columns)\n",
"\n",
"import statsmodels.api as sm"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "devoted-antique",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',\n",
" 'ShelveLoc', 'Age', 'Education', 'Urban', 'US'],\n",
" dtype='object')"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Carseats = load_data('Carseats')\n",
"Carseats.columns"
]
},
{
"cell_type": "markdown",
"id": "b7a2e6ab-491d-4a57-8184-a9fcccb2047b",
"metadata": {},
"source": [
"We'll first build a design matrix that we can use to model `Sales`\n",
"in terms of the categorical variable `ShelveLoc` and `Price`.\n",
"\n",
"We see first that `ShelveLoc` is a categorical variable:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7d3642a6-90c6-48ad-8d35-88231b4991f8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Bad\n",
"1 Good\n",
"2 Medium\n",
"3 Medium\n",
"4 Bad\n",
" ... \n",
"395 Good\n",
"396 Medium\n",
"397 Medium\n",
"398 Bad\n",
"399 Good\n",
"Name: ShelveLoc, Length: 400, dtype: category\n",
"Categories (3, object): ['Bad', 'Good', 'Medium']"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Carseats['ShelveLoc']"
]
},
{
"cell_type": "markdown",
"id": "4afa201d-4b19-4d85-9e1b-1392a54d027b",
"metadata": {},
"source": [
"This is recognized by `ModelSpec` and only 2 columns are added for the three levels. The\n",
"default behavior is to drop the first level of the categories. Later, \n",
"we will show other contrasts of the 3 columns can be produced. \n",
"\n",
"This simple example below illustrates how the first argument (its `terms`) is\n",
"used to construct a design matrix."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "fd5528fe-11da-4e10-8996-06085896c1a0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" intercept | \n",
" ShelveLoc[Good] | \n",
" ShelveLoc[Medium] | \n",
" Price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 120 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 83 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 80 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 97 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 128 | \n",
"
\n",
" \n",
" 5 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 72 | \n",
"
\n",
" \n",
" 6 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 108 | \n",
"
\n",
" \n",
" 7 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 120 | \n",
"
\n",
" \n",
" 8 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 124 | \n",
"
\n",
" \n",
" 9 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 124 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" intercept ShelveLoc[Good] ShelveLoc[Medium] Price\n",
"0 1.0 0.0 0.0 120\n",
"1 1.0 1.0 0.0 83\n",
"2 1.0 0.0 1.0 80\n",
"3 1.0 0.0 1.0 97\n",
"4 1.0 0.0 0.0 128\n",
"5 1.0 0.0 0.0 72\n",
"6 1.0 0.0 1.0 108\n",
"7 1.0 1.0 0.0 120\n",
"8 1.0 0.0 1.0 124\n",
"9 1.0 0.0 1.0 124"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"MS = ModelSpec(['ShelveLoc', 'Price'])\n",
"X = MS.fit_transform(Carseats)\n",
"X.iloc[:10]"
]
},
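{
"cell_type": "markdown",
"id": "sketch-drop-first-coding",
"metadata": {},
"source": [
"The columns above follow the familiar drop-first indicator coding. As a minimal sketch of that idea, the cell below rebuilds such a matrix with plain `pandas` rather than `ISLP`. It is not how `ModelSpec` is implemented; the small data frame `toy` and the bracketed column names are assumptions chosen to mimic the output shown above.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with the same three levels as `ShelveLoc`\n",
"toy = pd.DataFrame({'ShelveLoc': pd.Categorical(['Bad', 'Good', 'Medium', 'Medium']),\n",
"                    'Price': [120, 83, 80, 97]})\n",
"\n",
"# One indicator column per level, dropping the first level ('Bad')\n",
"dummies = pd.get_dummies(toy['ShelveLoc'], drop_first=True, dtype=float)\n",
"dummies.columns = [f'ShelveLoc[{level}]' for level in dummies.columns]\n",
"\n",
"# Concatenate an intercept column, the indicators, and the numeric column\n",
"intercept = pd.Series(1.0, index=toy.index, name='intercept')\n",
"X_toy = pd.concat([intercept, dummies, toy['Price']], axis=1)\n",
"print(X_toy)\n",
"```\n",
"\n",
"A row whose level is `Bad` has zeros in both indicator columns, so `Bad` acts as the baseline level."
]
},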
{
"cell_type": "markdown",
"id": "6948e1ef-3685-4840-a4f2-ef15a1bcfb69",
"metadata": {},
"source": [
"We note that a column has been added for the intercept by default. This can be changed using the\n",
"`intercept` argument."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "682d4c81-eba9-467d-a176-911a0269a21d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ShelveLoc[Good] | \n",
" ShelveLoc[Medium] | \n",
" Price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 120 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 83 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 80 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 97 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 128 | \n",
"
\n",
" \n",
" 5 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 72 | \n",
"
\n",
" \n",
" 6 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 108 | \n",
"
\n",
" \n",
" 7 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 120 | \n",
"
\n",
" \n",
" 8 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 124 | \n",
"
\n",
" \n",
" 9 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 124 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ShelveLoc[Good] ShelveLoc[Medium] Price\n",
"0 0.0 0.0 120\n",
"1 1.0 0.0 83\n",
"2 0.0 1.0 80\n",
"3 0.0 1.0 97\n",
"4 0.0 0.0 128\n",
"5 0.0 0.0 72\n",
"6 0.0 1.0 108\n",
"7 1.0 0.0 120\n",
"8 0.0 1.0 124\n",
"9 0.0 1.0 124"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"MS_no1 = ModelSpec(['ShelveLoc', 'Price'], intercept=False)\n",
"MS_no1.fit_transform(Carseats)[:10]"
]
},
{
"cell_type": "markdown",
"id": "54d8fd20-d8f5-44d6-9965-83e745680798",
"metadata": {},
"source": [
"We see that `ShelveLoc` still only contributes\n",
"two columns to the design. The `ModelSpec` object does no introspection of its arguments to effectively include an intercept term\n",
"in the column space of the design matrix.\n",
"\n",
"To include this intercept via `ShelveLoc` we can use 3 columns to encode this categorical variable. Following the nomenclature of\n",
"`R`, we call this a `Contrast` of the categorical variable."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "555734bb-2682-4721-a1cd-6fb207394b0e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ShelveLoc[Bad] | \n",
" ShelveLoc[Good] | \n",
" ShelveLoc[Medium] | \n",
" Price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 120 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 83 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 80 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 97 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 128 | \n",
"
\n",
" \n",
" 5 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 72 | \n",
"
\n",
" \n",
" 6 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 108 | \n",
"
\n",
" \n",
" 7 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 120 | \n",
"
\n",
" \n",
" 8 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 124 | \n",
"
\n",
" \n",
" 9 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 124 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ShelveLoc[Bad] ShelveLoc[Good] ShelveLoc[Medium] Price\n",
"0 1.0 0.0 0.0 120\n",
"1 0.0 1.0 0.0 83\n",
"2 0.0 0.0 1.0 80\n",
"3 0.0 0.0 1.0 97\n",
"4 1.0 0.0 0.0 128\n",
"5 1.0 0.0 0.0 72\n",
"6 0.0 0.0 1.0 108\n",
"7 0.0 1.0 0.0 120\n",
"8 0.0 0.0 1.0 124\n",
"9 0.0 0.0 1.0 124"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from ISLP.models import contrast\n",
"shelve = contrast('ShelveLoc', None)\n",
"MS_contr = ModelSpec([shelve, 'Price'], intercept=False)\n",
"MS_contr.fit_transform(Carseats)[:10]"
]
},
{
"cell_type": "markdown",
"id": "66db03cf-489c-40b6-8fac-762d66cf9932",
"metadata": {},
"source": [
"This example above illustrates that columns need not be identified by name in `terms`. The basic\n",
"role of an item in the `terms` sequence is a description of how to extract a column\n",
"from a columnar data object, usually a `pd.DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "852ee40e-05d2-4785-ab7d-968fb087f3c0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Column(idx='ShelveLoc', name='ShelveLoc', is_categorical=True, is_ordinal=False, columns=(), encoder=Contrast(method=None))"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"shelve"
]
},
{
"cell_type": "markdown",
"id": "b3be8808-1dbf-4154-882b-f61656a2ed4e",
"metadata": {},
"source": [
"The `Column` object can be used to directly extract relevant columns from a `pd.DataFrame`. If the `encoder` field is not\n",
"`None`, then the extracted columns will be passed through `encoder`.\n",
"The `get_columns` method produces these columns as well as names for the columns."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "0ebadfc0-0ea2-4abc-aac6-ef78be227ce1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(array([[1., 0., 0.],\n",
" [0., 1., 0.],\n",
" [0., 0., 1.],\n",
" ...,\n",
" [0., 0., 1.],\n",
" [1., 0., 0.],\n",
" [0., 1., 0.]]),\n",
" ['ShelveLoc[Bad]', 'ShelveLoc[Good]', 'ShelveLoc[Medium]'])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"shelve.get_columns(Carseats)"
]
},
{
"cell_type": "markdown",
"id": "269e6d18-4ae4-4a77-8498-90281ae7c803",
"metadata": {},
"source": [
"Let's now fit a simple OLS model with this design."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "411238d0-dd36-4878-a869-e8ce0ada099c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" coef | \n",
" std err | \n",
" t | \n",
" P>|t| | \n",
"
\n",
" \n",
" \n",
" \n",
" ShelveLoc[Bad] | \n",
" 12.0018 | \n",
" 0.503 | \n",
" 23.839 | \n",
" 0.0 | \n",
"
\n",
" \n",
" ShelveLoc[Good] | \n",
" 16.8976 | \n",
" 0.522 | \n",
" 32.386 | \n",
" 0.0 | \n",
"
\n",
" \n",
" ShelveLoc[Medium] | \n",
" 13.8638 | \n",
" 0.487 | \n",
" 28.467 | \n",
" 0.0 | \n",
"
\n",
" \n",
" Price | \n",
" -0.0567 | \n",
" 0.004 | \n",
" -13.967 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" coef std err t P>|t|\n",
"ShelveLoc[Bad] 12.0018 0.503 23.839 0.0\n",
"ShelveLoc[Good] 16.8976 0.522 32.386 0.0\n",
"ShelveLoc[Medium] 13.8638 0.487 28.467 0.0\n",
"Price -0.0567 0.004 -13.967 0.0"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = MS_contr.transform(Carseats)\n",
"Y = Carseats['Sales']\n",
"M_ols = sm.OLS(Y, X).fit()\n",
"summarize(M_ols)"
]
},
{
"cell_type": "markdown",
"id": "40ddf68e-7d58-4e30-93a8-5b7fe840d37a",
"metadata": {},
"source": [
"## Interactions\n",
"\n",
"One of the common uses of formulae in `R` is to specify interactions between variables.\n",
"This is done in `ModelSpec` by including a tuple in the `terms` argument."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "3f5e314c-7a7f-4e8d-bb07-295beb42c728",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" intercept | \n",
" ShelveLoc[Bad]:Price | \n",
" ShelveLoc[Good]:Price | \n",
" ShelveLoc[Medium]:Price | \n",
" Price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 120.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 120 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 83.0 | \n",
" 0.0 | \n",
" 83 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 80.0 | \n",
" 80 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 97.0 | \n",
" 97 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 128.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 128 | \n",
"
\n",
" \n",
" 5 | \n",
" 1.0 | \n",
" 72.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 72 | \n",
"
\n",
" \n",
" 6 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 108.0 | \n",
" 108 | \n",
"
\n",
" \n",
" 7 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 120.0 | \n",
" 0.0 | \n",
" 120 | \n",
"
\n",
" \n",
" 8 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 124.0 | \n",
" 124 | \n",
"
\n",
" \n",
" 9 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 124.0 | \n",
" 124 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" intercept ShelveLoc[Bad]:Price ShelveLoc[Good]:Price \\\n",
"0 1.0 120.0 0.0 \n",
"1 1.0 0.0 83.0 \n",
"2 1.0 0.0 0.0 \n",
"3 1.0 0.0 0.0 \n",
"4 1.0 128.0 0.0 \n",
"5 1.0 72.0 0.0 \n",
"6 1.0 0.0 0.0 \n",
"7 1.0 0.0 120.0 \n",
"8 1.0 0.0 0.0 \n",
"9 1.0 0.0 0.0 \n",
"\n",
" ShelveLoc[Medium]:Price Price \n",
"0 0.0 120 \n",
"1 0.0 83 \n",
"2 80.0 80 \n",
"3 97.0 97 \n",
"4 0.0 128 \n",
"5 0.0 72 \n",
"6 108.0 108 \n",
"7 0.0 120 \n",
"8 124.0 124 \n",
"9 124.0 124 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ModelSpec([(shelve, 'Price'), 'Price']).fit_transform(Carseats).iloc[:10]"
]
},
{
"cell_type": "markdown",
"id": "3f85fcb2-f0ef-4c1b-a89f-fcf083937274",
"metadata": {},
"source": [
"The above design matrix is clearly rank deficient, as `ModelSpec` has not inspected the formula\n",
"and attempted to produce a corresponding matrix that may or may not match a user's intent."
]
},
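{
"cell_type": "markdown",
"id": "sketch-rank-check",
"metadata": {},
"source": [
"To check the rank deficiency numerically, here is a small sketch with `numpy`. The arrays below are typed in by hand from the first ten rows displayed above (an illustrative assumption, not output produced by `ModelSpec`).\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Levels and prices copied from the first ten rows shown above\n",
"levels = ['Bad', 'Good', 'Medium', 'Medium', 'Bad', 'Bad', 'Medium', 'Good', 'Medium', 'Medium']\n",
"price = np.array([120, 83, 80, 97, 128, 72, 108, 120, 124, 124], dtype=float)\n",
"\n",
"# Interaction columns: Price multiplied by each level indicator\n",
"inter = np.column_stack([price * np.array([lv == name for lv in levels], dtype=float)\n",
"                         for name in ['Bad', 'Good', 'Medium']])\n",
"design = np.column_stack([np.ones_like(price), inter, price])\n",
"\n",
"# The three interaction columns sum to Price, so one column is redundant\n",
"print(np.allclose(inter.sum(axis=1), price))\n",
"print(design.shape[1], np.linalg.matrix_rank(design))\n",
"```\n",
"\n",
"The matrix has five columns but rank four; dropping either `Price` or one of the interaction columns would remove the redundancy."
]
},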
{
"cell_type": "markdown",
"id": "excellent-hamilton",
"metadata": {},
"source": [
"## Ordinal variables\n",
"\n",
"Ordinal variables are handled by a corresponding encoder)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "going-administrator",
"metadata": {},
"outputs": [],
"source": [
"Carseats['OIncome'] = pd.cut(Carseats['Income'], \n",
" [0,50,90,200], \n",
" labels=['L','M','H'])\n",
"MS_order = ModelSpec(['OIncome']).fit(Carseats)"
]
},
{
"cell_type": "markdown",
"id": "5e1defb1-071b-4751-9358-b8d2f0b3412e",
"metadata": {},
"source": [
"Part of the `fit` method of `ModelSpec` involves inspection of the columns of `Carseats`. \n",
"The results of that inspection can be found in the `column_info_` attribute:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "050fb4ae-648d-429d-9cb2-8423ad9707d7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Sales': Column(idx='Sales', name='Sales', is_categorical=False, is_ordinal=False, columns=('Sales',), encoder=None),\n",
" 'CompPrice': Column(idx='CompPrice', name='CompPrice', is_categorical=False, is_ordinal=False, columns=('CompPrice',), encoder=None),\n",
" 'Income': Column(idx='Income', name='Income', is_categorical=False, is_ordinal=False, columns=('Income',), encoder=None),\n",
" 'Advertising': Column(idx='Advertising', name='Advertising', is_categorical=False, is_ordinal=False, columns=('Advertising',), encoder=None),\n",
" 'Population': Column(idx='Population', name='Population', is_categorical=False, is_ordinal=False, columns=('Population',), encoder=None),\n",
" 'Price': Column(idx='Price', name='Price', is_categorical=False, is_ordinal=False, columns=('Price',), encoder=None),\n",
" 'ShelveLoc': Column(idx='ShelveLoc', name='ShelveLoc', is_categorical=True, is_ordinal=False, columns=('ShelveLoc[Good]', 'ShelveLoc[Medium]'), encoder=Contrast()),\n",
" 'Age': Column(idx='Age', name='Age', is_categorical=False, is_ordinal=False, columns=('Age',), encoder=None),\n",
" 'Education': Column(idx='Education', name='Education', is_categorical=False, is_ordinal=False, columns=('Education',), encoder=None),\n",
" 'Urban': Column(idx='Urban', name='Urban', is_categorical=True, is_ordinal=False, columns=('Urban[Yes]',), encoder=Contrast()),\n",
" 'US': Column(idx='US', name='US', is_categorical=True, is_ordinal=False, columns=('US[Yes]',), encoder=Contrast()),\n",
" 'OIncome': Column(idx='OIncome', name='OIncome', is_categorical=True, is_ordinal=True, columns=('OIncome',), encoder=OrdinalEncoder())}"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"MS_order.column_info_"
]
},
{
"cell_type": "markdown",
"id": "debf7e2e-0a9d-451b-866c-66c0df9f43e5",
"metadata": {},
"source": [
"## Structure of a `ModelSpec`\n",
"\n",
"The first argument to `ModelSpec` is stored as the `terms` attribute. Under the hood,\n",
"this sequence is inspected to produce the `terms_` attribute which specify the objects\n",
"that will ultimately create the design matrix."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "ea51e988-0857-4d49-9987-d7531b34a233",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Feature(variables=('ShelveLoc',), name='ShelveLoc', encoder=None, use_transform=True, pure_columns=True, override_encoder_colnames=False),\n",
" Feature(variables=('Price',), name='Price', encoder=None, use_transform=True, pure_columns=True, override_encoder_colnames=False)]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"MS = ModelSpec(['ShelveLoc', 'Price'])\n",
"MS.fit(Carseats)\n",
"MS.terms_"
]
},
{
"cell_type": "markdown",
"id": "warming-mobile",
"metadata": {},
"source": [
"Each element of `terms_` should be a `Feature` which describes a set of columns to be extracted from\n",
"a columnar data form as well as possible a possible encoder."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "59214a70-1e6b-41c4-9f44-a92d340723c9",
"metadata": {},
"outputs": [],
"source": [
"shelve_var = MS.terms_[0]"
]
},
{
"cell_type": "markdown",
"id": "5fed3ea2-ff50-4e5d-819d-a948f121f9d3",
"metadata": {},
"source": [
"We can find the columns associated to each term using the `build_columns` method of `ModelSpec`:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "5e25ef64-497d-4f42-9f20-3d4a320cda23",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ShelveLoc[Good] | \n",
" ShelveLoc[Medium] | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 395 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 396 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 397 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 398 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 399 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
400 rows × 2 columns
\n",
"
"
],
"text/plain": [
" ShelveLoc[Good] ShelveLoc[Medium]\n",
"0 0.0 0.0\n",
"1 1.0 0.0\n",
"2 0.0 1.0\n",
"3 0.0 1.0\n",
"4 0.0 0.0\n",
".. ... ...\n",
"395 1.0 0.0\n",
"396 0.0 1.0\n",
"397 0.0 1.0\n",
"398 0.0 0.0\n",
"399 1.0 0.0\n",
"\n",
"[400 rows x 2 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df, names = build_columns(MS.column_info_,\n",
" Carseats, \n",
" shelve_var)\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "63edf7a2-e776-45b0-b434-d676d7e13dbd",
"metadata": {},
"source": [
"The design matrix is constructed by running through `terms_` and concatenating the corresponding columns."
]
},
{
"cell_type": "markdown",
"id": "former-spring",
"metadata": {},
"source": [
"### `Feature` objects\n",
"\n",
"Note that `Feature` objects have a tuple of `variables` as well as an `encoder` attribute. The\n",
"tuple of `variables` first creates a concatenated dataframe from all corresponding variables and then\n",
"is run through `encoder.transform`. The `encoder.fit` method of each `Feature` is run once during \n",
"the call to `ModelSpec.fit`."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "floral-liabilities",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Price | \n",
" Income | \n",
" OIncome | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 120.0 | \n",
" 73.0 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 83.0 | \n",
" 48.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 80.0 | \n",
" 35.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 97.0 | \n",
" 100.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 128.0 | \n",
" 64.0 | \n",
" 2.0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 395 | \n",
" 128.0 | \n",
" 108.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 396 | \n",
" 120.0 | \n",
" 23.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 397 | \n",
" 159.0 | \n",
" 26.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 398 | \n",
" 95.0 | \n",
" 79.0 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 399 | \n",
" 120.0 | \n",
" 37.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
400 rows × 3 columns
\n",
"
"
],
"text/plain": [
" Price Income OIncome\n",
"0 120.0 73.0 2.0\n",
"1 83.0 48.0 1.0\n",
"2 80.0 35.0 1.0\n",
"3 97.0 100.0 0.0\n",
"4 128.0 64.0 2.0\n",
".. ... ... ...\n",
"395 128.0 108.0 0.0\n",
"396 120.0 23.0 1.0\n",
"397 159.0 26.0 1.0\n",
"398 95.0 79.0 2.0\n",
"399 120.0 37.0 1.0\n",
"\n",
"[400 rows x 3 columns]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_var = Feature(('Price', 'Income', 'OIncome'), name='mynewvar', encoder=None)\n",
"build_columns(MS.column_info_,\n",
" Carseats, \n",
" new_var)[0]"
]
},
{
"cell_type": "markdown",
"id": "reasonable-canadian",
"metadata": {},
"source": [
"Let's now transform these columns with an encoder. Within `ModelSpec` we will first build the\n",
"arrays above and then call `pca.fit` and finally `pca.transform` within `design.build_columns`."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "imported-measure",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mynewvar[0] | \n",
" mynewvar[1] | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -3.595740 | \n",
" -4.850530 | \n",
"
\n",
" \n",
" 1 | \n",
" 15.070401 | \n",
" 35.706773 | \n",
"
\n",
" \n",
" 2 | \n",
" 27.412228 | \n",
" 40.772377 | \n",
"
\n",
" \n",
" 3 | \n",
" -33.983048 | \n",
" 13.468087 | \n",
"
\n",
" \n",
" 4 | \n",
" 6.580644 | \n",
" -11.287452 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 395 | \n",
" -36.856308 | \n",
" -18.418138 | \n",
"
\n",
" \n",
" 396 | \n",
" 45.731520 | \n",
" 3.243768 | \n",
"
\n",
" \n",
" 397 | \n",
" 49.087659 | \n",
" -35.727136 | \n",
"
\n",
" \n",
" 398 | \n",
" -13.565178 | \n",
" 18.847760 | \n",
"
\n",
" \n",
" 399 | \n",
" 31.917072 | \n",
" 0.976615 | \n",
"
\n",
" \n",
"
\n",
"
400 rows × 2 columns
\n",
"
"
],
"text/plain": [
" mynewvar[0] mynewvar[1]\n",
"0 -3.595740 -4.850530\n",
"1 15.070401 35.706773\n",
"2 27.412228 40.772377\n",
"3 -33.983048 13.468087\n",
"4 6.580644 -11.287452\n",
".. ... ...\n",
"395 -36.856308 -18.418138\n",
"396 45.731520 3.243768\n",
"397 49.087659 -35.727136\n",
"398 -13.565178 18.847760\n",
"399 31.917072 0.976615\n",
"\n",
"[400 rows x 2 columns]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.decomposition import PCA\n",
"pca = PCA(n_components=2)\n",
"pca.fit(build_columns(MS.column_info_, Carseats, new_var)[0]) # this is done within `ModelSpec.fit`\n",
"pca_var = Feature(('Price', 'Income', 'OIncome'), name='mynewvar', encoder=pca)\n",
"build_columns(MS.column_info_,\n",
" Carseats, \n",
" pca_var)[0]"
]
},
{
"cell_type": "markdown",
"id": "institutional-burden",
"metadata": {},
"source": [
"The elements of the `variables` attribute may be column identifiers ( `\"Price\"`), `Column` instances (`price`)\n",
"or `Feature` instances (`pca_var`)."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "western-bloom",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Income | \n",
" Price | \n",
" mynewvar[0] | \n",
" mynewvar[1] | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 73.0 | \n",
" 120.0 | \n",
" -3.595740 | \n",
" -4.850530 | \n",
"
\n",
" \n",
" 1 | \n",
" 48.0 | \n",
" 83.0 | \n",
" 15.070401 | \n",
" 35.706773 | \n",
"
\n",
" \n",
" 2 | \n",
" 35.0 | \n",
" 80.0 | \n",
" 27.412228 | \n",
" 40.772377 | \n",
"
\n",
" \n",
" 3 | \n",
" 100.0 | \n",
" 97.0 | \n",
" -33.983048 | \n",
" 13.468087 | \n",
"
\n",
" \n",
" 4 | \n",
" 64.0 | \n",
" 128.0 | \n",
" 6.580644 | \n",
" -11.287452 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 395 | \n",
" 108.0 | \n",
" 128.0 | \n",
" -36.856308 | \n",
" -18.418138 | \n",
"
\n",
" \n",
" 396 | \n",
" 23.0 | \n",
" 120.0 | \n",
" 45.731520 | \n",
" 3.243768 | \n",
"
\n",
" \n",
" 397 | \n",
" 26.0 | \n",
" 159.0 | \n",
" 49.087659 | \n",
" -35.727136 | \n",
"
\n",
" \n",
" 398 | \n",
" 79.0 | \n",
" 95.0 | \n",
" -13.565178 | \n",
" 18.847760 | \n",
"
\n",
" \n",
" 399 | \n",
" 37.0 | \n",
" 120.0 | \n",
" 31.917072 | \n",
" 0.976615 | \n",
"
\n",
" \n",
"
\n",
"
400 rows × 4 columns
\n",
"
"
],
"text/plain": [
" Income Price mynewvar[0] mynewvar[1]\n",
"0 73.0 120.0 -3.595740 -4.850530\n",
"1 48.0 83.0 15.070401 35.706773\n",
"2 35.0 80.0 27.412228 40.772377\n",
"3 100.0 97.0 -33.983048 13.468087\n",
"4 64.0 128.0 6.580644 -11.287452\n",
".. ... ... ... ...\n",
"395 108.0 128.0 -36.856308 -18.418138\n",
"396 23.0 120.0 45.731520 3.243768\n",
"397 26.0 159.0 49.087659 -35.727136\n",
"398 79.0 95.0 -13.565178 18.847760\n",
"399 37.0 120.0 31.917072 0.976615\n",
"\n",
"[400 rows x 4 columns]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"price = MS.column_info_['Price']\n",
"fancy_var = Feature(('Income', price, pca_var), name='fancy', encoder=None)\n",
"build_columns(MS.column_info_,\n",
" Carseats, \n",
" fancy_var)[0]"
]
},
{
"cell_type": "markdown",
"id": "e289feba-e3f5-48e0-9e29-cdd88d7f9923",
"metadata": {},
"source": [
"## Predicting at new points"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "6efed2fa-9e5d-429c-a8d9-ac544cab2b41",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"intercept 12.661546\n",
"Price -0.052213\n",
"Income 0.012829\n",
"dtype: float64"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"MS = ModelSpec(['Price', 'Income']).fit(Carseats)\n",
"X = MS.transform(Carseats)\n",
"Y = Carseats['Sales']\n",
"M_ols = sm.OLS(Y, X).fit()\n",
"M_ols.params"
]
},
{
"cell_type": "markdown",
"id": "e6b4609b-fcb2-4cc2-b630-509df4c87546",
"metadata": {},
"source": [
"As `ModelSpec` is a transformer, it can be evaluated at new feature values.\n",
"Constructing the design matrix at any values is carried out by the `transform` method."
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "8784b0e8-ce53-4a90-aee6-b935834295c7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([10.70130676, 10.307465 ])"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_data = pd.DataFrame({'Price':[40, 50], 'Income':[10, 20]})\n",
"new_X = MS.transform(new_data)\n",
"M_ols.get_prediction(new_X).predicted_mean"
]
},
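{
"cell_type": "markdown",
"id": "sketch-manual-prediction",
"metadata": {},
"source": [
"The predicted means above are just the rows of the new design matrix multiplied by the fitted coefficients. As a sketch of that arithmetic, the coefficients below are copied (rounded) from the `M_ols.params` output above, so the result agrees with the predictions only up to rounding.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Rounded coefficients copied from the output above: intercept, Price, Income\n",
"beta = np.array([12.661546, -0.052213, 0.012829])\n",
"\n",
"# Rows of the new design matrix: [1, Price, Income]\n",
"new_rows = np.array([[1.0, 40.0, 10.0],\n",
"                     [1.0, 50.0, 20.0]])\n",
"\n",
"# Each prediction is intercept + Price * beta_Price + Income * beta_Income\n",
"manual_pred = new_rows @ beta\n",
"print(manual_pred)\n",
"```"
]
},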
{
"cell_type": "markdown",
"id": "signal-yahoo",
"metadata": {},
"source": [
"## Using `np.ndarray`\n",
"\n",
"As the basic model is to concatenate columns extracted from a columnar data\n",
"representation, one *can* use `np.ndarray` as the column data. In this case,\n",
"columns will be selected by integer indices. \n",
"\n",
"### Caveats using `np.ndarray`\n",
"\n",
"If the `terms` only refer to a few columns of the data frame, the `transform` method only needs a dataframe with those columns.\n",
"However,\n",
"unless all features are floats, `np.ndarray` will default to a dtype of `object`, complicating issues.\n",
"\n",
"However, if we had used an `np.ndarray`, the column identifiers would be integers identifying specific columns so,\n",
"in order to work correctly, `transform` would need another `np.ndarray` where the columns have the same meaning. \n",
"\n",
"We illustrate this below, where we build a model from `Price` and `Income` for `Sales` and want to find predictions at new\n",
"values of `Price` and `Location`. We first find the predicitions using `pd.DataFrame` and then illustrate the difficulties\n",
"in using `np.ndarray`."
]
},
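{
"cell_type": "markdown",
"id": "sketch-object-dtype",
"metadata": {},
"source": [
"The dtype issue is easy to see with a small sketch. The toy data frame here is a hypothetical stand-in for `Carseats`, mixing a numeric and a categorical column.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame with one numeric and one categorical column\n",
"toy = pd.DataFrame({'Price': [120, 83, 80],\n",
"                    'ShelveLoc': pd.Categorical(['Bad', 'Good', 'Medium'])})\n",
"\n",
"mixed = np.asarray(toy)                # numeric and categorical together\n",
"numeric = np.asarray(toy[['Price']])   # numeric column only\n",
"print(mixed.dtype, numeric.dtype)\n",
"```\n",
"\n",
"The mixed array has dtype `object`, while the numeric-only array keeps a numeric dtype."
]
},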
{
"cell_type": "markdown",
"id": "e7ffdd07-4d6b-4a4c-ab38-ad1270e85de6",
"metadata": {},
"source": [
"We will refit this model, using `ModelSpec` with an `np.ndarray` instead"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "4fec9030-7445-48be-a15f-2ac5a789e717",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1., 120., 73.],\n",
" [ 1., 83., 48.],\n",
" [ 1., 80., 35.],\n",
" ...,\n",
" [ 1., 159., 26.],\n",
" [ 1., 95., 79.],\n",
" [ 1., 120., 37.]])"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Carseats_np = np.asarray(Carseats[['Price', 'Education', 'Income']])\n",
"MS_np = ModelSpec([0,2]).fit(Carseats_np)\n",
"MS_np.transform(Carseats_np)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "c864e365-2476-4ca6-9d27-625cac2b2271",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"const 12.661546\n",
"x1 -0.052213\n",
"x2 0.012829\n",
"dtype: float64"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M_ols_np = sm.OLS(Y, MS_np.transform(Carseats_np)).fit()\n",
"M_ols_np.params"
]
},
{
"cell_type": "markdown",
"id": "undefined-sacrifice",
"metadata": {},
"source": [
"Now, let's consider finding the design matrix at new points. \n",
"When using `pd.DataFrame` we only need to supply the `transform` method\n",
"a data frame with columns implicated in the `terms` argument (in this case, `Price` and `Income`). \n",
"\n",
"However, when using `np.ndarray` with integers as indices, `Price` was column 0 and `Income` was column 2. The only\n",
"sensible way to produce a return for predict is to extract its 0th and 2nd columns. Note this means\n",
"that the meaning of columns in an `np.ndarray` provided to `transform` essentially must be identical to those\n",
"passed to `fit`."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "incredible-concert",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"index 2 is out of bounds for axis 1 with size 2\n"
]
}
],
"source": [
"try:\n",
" new_D = np.array([[40,50], [10,20]]).T\n",
" new_X = MS_np.transform(new_D)\n",
"except IndexError as e:\n",
" print(e)"
]
},
{
"cell_type": "markdown",
"id": "allied-botswana",
"metadata": {},
"source": [
"Ultimately, `M` expects 3 columns for new predictions because it was fit\n",
"with a matrix having 3 columns (the first representing an intercept).\n",
"\n",
"We might be tempted to try as with the `pd.DataFrame` and produce\n",
"an `np.ndarray` with only the necessary variables."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "stunning-container",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 1. 40. 10.]\n",
" [ 1. 50. 20.]]\n"
]
},
{
"data": {
"text/plain": [
"array([10.70130676, 10.307465 ])"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_D = np.array([[40,50], [np.nan, np.nan], [10,20]]).T\n",
"new_X = MS_np.transform(new_D)\n",
"print(new_X)\n",
"M_ols.get_prediction(new_X).predicted_mean"
]
},
{
"cell_type": "markdown",
"id": "specific-tobacco",
"metadata": {},
"source": [
"For more complicated design contructions ensuring the columns of `new_D` match that of the original data will be more cumbersome. We expect\n",
"then that `pd.DataFrame` (or a columnar data representation with similar API) will likely be easier to use with `ModelSpec`."
]
}
],
"metadata": {
"jupytext": {
"formats": "source/models///ipynb,jupyterbook/models///md:myst,jupyterbook/models///ipynb"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}