{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "typical-correlation",
   "metadata": {},
   "source": [
    "# Building design matrices with `ModelSpec`\n",
    "\n",
    "The `ISLP` package provides a facility to build design\n",
    "matrices for regression and classification tasks. It provides similar functionality to the formula\n",
    "notation of `R` though uses python objects rather than specification through the special formula syntax.\n",
    "\n",
    "Related tools include `patsy` and `ColumnTransformer` from `sklearn.compose`. \n",
    "\n",
    "Perhaps the most common use is to extract some columns from a `pd.DataFrame` and \n",
    "produce a design matrix, optionally with an intercept."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "sticky-desperate",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "from ISLP import load_data\n",
    "from ISLP.models import (ModelSpec,\n",
    "                         summarize,\n",
    "                         Column,\n",
    "                         Feature,\n",
    "                         build_columns)\n",
    "\n",
    "import statsmodels.api as sm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "devoted-antique",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',\n",
       "       'ShelveLoc', 'Age', 'Education', 'Urban', 'US'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Carseats = load_data('Carseats')\n",
    "Carseats.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7a2e6ab-491d-4a57-8184-a9fcccb2047b",
   "metadata": {},
   "source": [
    "We'll first build a design matrix that we can use to model `Sales`\n",
    "in terms of the categorical variable `ShelveLoc` and `Price`.\n",
    "\n",
    "We see first that `ShelveLoc` is a categorical variable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "7d3642a6-90c6-48ad-8d35-88231b4991f8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0         Bad\n",
       "1        Good\n",
       "2      Medium\n",
       "3      Medium\n",
       "4         Bad\n",
       "        ...  \n",
       "395      Good\n",
       "396    Medium\n",
       "397    Medium\n",
       "398       Bad\n",
       "399      Good\n",
       "Name: ShelveLoc, Length: 400, dtype: category\n",
       "Categories (3, object): ['Bad', 'Good', 'Medium']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Carseats['ShelveLoc']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4afa201d-4b19-4d85-9e1b-1392a54d027b",
   "metadata": {},
   "source": [
    "This is recognized by `ModelSpec` and only 2 columns are added for the three levels. The\n",
    "default behavior is to drop the first level of the categories. Later, \n",
    "we will show other contrasts of the 3 columns can be produced.  \n",
    "\n",
    "This simple example below illustrates how the first argument (its `terms`) is\n",
    "used to construct a design matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "fd5528fe-11da-4e10-8996-06085896c1a0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>intercept</th>\n",
       "      <th>ShelveLoc[Good]</th>\n",
       "      <th>ShelveLoc[Medium]</th>\n",
       "      <th>Price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>120</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>83</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>97</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>128</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>72</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>108</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>120</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   intercept  ShelveLoc[Good]  ShelveLoc[Medium]  Price\n",
       "0        1.0              0.0                0.0    120\n",
       "1        1.0              1.0                0.0     83\n",
       "2        1.0              0.0                1.0     80\n",
       "3        1.0              0.0                1.0     97\n",
       "4        1.0              0.0                0.0    128\n",
       "5        1.0              0.0                0.0     72\n",
       "6        1.0              0.0                1.0    108\n",
       "7        1.0              1.0                0.0    120\n",
       "8        1.0              0.0                1.0    124\n",
       "9        1.0              0.0                1.0    124"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "MS = ModelSpec(['ShelveLoc', 'Price'])\n",
    "X = MS.fit_transform(Carseats)\n",
    "X.iloc[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6948e1ef-3685-4840-a4f2-ef15a1bcfb69",
   "metadata": {},
   "source": [
    "We note that a column has been added for the intercept by default. This can be changed using the\n",
    "`intercept` argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "682d4c81-eba9-467d-a176-911a0269a21d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ShelveLoc[Good]</th>\n",
       "      <th>ShelveLoc[Medium]</th>\n",
       "      <th>Price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>120</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>83</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>97</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>128</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>72</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>108</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>120</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ShelveLoc[Good]  ShelveLoc[Medium]  Price\n",
       "0              0.0                0.0    120\n",
       "1              1.0                0.0     83\n",
       "2              0.0                1.0     80\n",
       "3              0.0                1.0     97\n",
       "4              0.0                0.0    128\n",
       "5              0.0                0.0     72\n",
       "6              0.0                1.0    108\n",
       "7              1.0                0.0    120\n",
       "8              0.0                1.0    124\n",
       "9              0.0                1.0    124"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "MS_no1 = ModelSpec(['ShelveLoc', 'Price'], intercept=False)\n",
    "MS_no1.fit_transform(Carseats)[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54d8fd20-d8f5-44d6-9965-83e745680798",
   "metadata": {},
   "source": [
    "We see that `ShelveLoc` still only contributes\n",
    "two columns to the design. The `ModelSpec` object does no introspection of its arguments to effectively include an intercept term\n",
    "in the column space of the design matrix.\n",
    "\n",
    "To include this intercept via `ShelveLoc` we can use 3 columns to encode this categorical variable. Following the nomenclature of\n",
    "`R`, we call this a `Contrast` of the categorical variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "555734bb-2682-4721-a1cd-6fb207394b0e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ShelveLoc[Bad]</th>\n",
       "      <th>ShelveLoc[Good]</th>\n",
       "      <th>ShelveLoc[Medium]</th>\n",
       "      <th>Price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>120</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>83</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>97</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>128</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>72</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>108</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>120</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ShelveLoc[Bad]  ShelveLoc[Good]  ShelveLoc[Medium]  Price\n",
       "0             1.0              0.0                0.0    120\n",
       "1             0.0              1.0                0.0     83\n",
       "2             0.0              0.0                1.0     80\n",
       "3             0.0              0.0                1.0     97\n",
       "4             1.0              0.0                0.0    128\n",
       "5             1.0              0.0                0.0     72\n",
       "6             0.0              0.0                1.0    108\n",
       "7             0.0              1.0                0.0    120\n",
       "8             0.0              0.0                1.0    124\n",
       "9             0.0              0.0                1.0    124"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from ISLP.models import contrast\n",
    "shelve = contrast('ShelveLoc', None)\n",
    "MS_contr = ModelSpec([shelve, 'Price'], intercept=False)\n",
    "MS_contr.fit_transform(Carseats)[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66db03cf-489c-40b6-8fac-762d66cf9932",
   "metadata": {},
   "source": [
    "This example above illustrates that columns need not be identified by name in `terms`. The basic\n",
    "role of an item in the `terms` sequence is a description of how to extract a column\n",
    "from a columnar data object, usually a `pd.DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "852ee40e-05d2-4785-ab7d-968fb087f3c0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Column(idx='ShelveLoc', name='ShelveLoc', is_categorical=True, is_ordinal=False, columns=(), encoder=Contrast(method=None))"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "shelve"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3be8808-1dbf-4154-882b-f61656a2ed4e",
   "metadata": {},
   "source": [
    "The `Column` object can be used to directly extract relevant columns from a `pd.DataFrame`. If the `encoder` field is not\n",
    "`None`, then the extracted columns will be passed through `encoder`.\n",
    "The `get_columns` method produces these columns as well as names for the columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "0ebadfc0-0ea2-4abc-aac6-ef78be227ce1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([[1., 0., 0.],\n",
       "        [0., 1., 0.],\n",
       "        [0., 0., 1.],\n",
       "        ...,\n",
       "        [0., 0., 1.],\n",
       "        [1., 0., 0.],\n",
       "        [0., 1., 0.]]),\n",
       " ['ShelveLoc[Bad]', 'ShelveLoc[Good]', 'ShelveLoc[Medium]'])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "shelve.get_columns(Carseats)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "269e6d18-4ae4-4a77-8498-90281ae7c803",
   "metadata": {},
   "source": [
    "Let's now fit a simple OLS model with this design."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "411238d0-dd36-4878-a869-e8ce0ada099c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>coef</th>\n",
       "      <th>std err</th>\n",
       "      <th>t</th>\n",
       "      <th>P&gt;|t|</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ShelveLoc[Bad]</th>\n",
       "      <td>12.0018</td>\n",
       "      <td>0.503</td>\n",
       "      <td>23.839</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ShelveLoc[Good]</th>\n",
       "      <td>16.8976</td>\n",
       "      <td>0.522</td>\n",
       "      <td>32.386</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ShelveLoc[Medium]</th>\n",
       "      <td>13.8638</td>\n",
       "      <td>0.487</td>\n",
       "      <td>28.467</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Price</th>\n",
       "      <td>-0.0567</td>\n",
       "      <td>0.004</td>\n",
       "      <td>-13.967</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                      coef  std err       t  P>|t|\n",
       "ShelveLoc[Bad]     12.0018    0.503  23.839    0.0\n",
       "ShelveLoc[Good]    16.8976    0.522  32.386    0.0\n",
       "ShelveLoc[Medium]  13.8638    0.487  28.467    0.0\n",
       "Price              -0.0567    0.004 -13.967    0.0"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = MS_contr.transform(Carseats)\n",
    "Y = Carseats['Sales']\n",
    "M_ols = sm.OLS(Y, X).fit()\n",
    "summarize(M_ols)"
   ]
  },
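  {
   "cell_type": "markdown",
   "id": "sketch-reparametrize-coefs",
   "metadata": {},
   "source": [
    "As a check on the fit above, the three indicator coefficients can be re-expressed in the default drop-first parametrization: the coefficient of `ShelveLoc[Bad]` plays the role of the intercept, and the other two levels become differences from it. The sketch below only redoes this arithmetic on the rounded values printed by `summarize`; the names `coef`, `baseline` and `effects` are illustrative and not part of `ISLP`.\n",
    "\n",
    "```python\n",
    "# Rounded coefficients copied from the summarize(M_ols) output above.\n",
    "coef = {'ShelveLoc[Bad]': 12.0018,\n",
    "        'ShelveLoc[Good]': 16.8976,\n",
    "        'ShelveLoc[Medium]': 13.8638,\n",
    "        'Price': -0.0567}\n",
    "\n",
    "# With an intercept and the first level dropped, 'Bad' is the baseline.\n",
    "baseline = coef['ShelveLoc[Bad]']\n",
    "\n",
    "# The remaining levels are expressed as differences from the baseline.\n",
    "effects = {name: round(value - baseline, 4)\n",
    "           for name, value in coef.items()\n",
    "           if name.startswith('ShelveLoc') and name != 'ShelveLoc[Bad]'}\n",
    "\n",
    "print('intercept:', baseline)\n",
    "print(effects)\n",
    "```\n",
    "\n",
    "Both parametrizations span the same column space, so the fitted values agree and the `Price` coefficient is unchanged; only the interpretation of the `ShelveLoc` coefficients differs."
   ]
  },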
  {
   "cell_type": "markdown",
   "id": "40ddf68e-7d58-4e30-93a8-5b7fe840d37a",
   "metadata": {},
   "source": [
    "## Interactions\n",
    "\n",
    "One of the common uses of formulae in `R` is to specify interactions between variables.\n",
    "This is done in `ModelSpec` by including a tuple in the `terms` argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "3f5e314c-7a7f-4e8d-bb07-295beb42c728",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>intercept</th>\n",
       "      <th>ShelveLoc[Bad]:Price</th>\n",
       "      <th>ShelveLoc[Good]:Price</th>\n",
       "      <th>ShelveLoc[Medium]:Price</th>\n",
       "      <th>Price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>120.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>120</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>83.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>83</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>80.0</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>97.0</td>\n",
       "      <td>97</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1.0</td>\n",
       "      <td>128.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>128</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>72</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>108.0</td>\n",
       "      <td>108</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>120.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>120</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>124.0</td>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>124.0</td>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   intercept  ShelveLoc[Bad]:Price  ShelveLoc[Good]:Price  \\\n",
       "0        1.0                 120.0                    0.0   \n",
       "1        1.0                   0.0                   83.0   \n",
       "2        1.0                   0.0                    0.0   \n",
       "3        1.0                   0.0                    0.0   \n",
       "4        1.0                 128.0                    0.0   \n",
       "5        1.0                  72.0                    0.0   \n",
       "6        1.0                   0.0                    0.0   \n",
       "7        1.0                   0.0                  120.0   \n",
       "8        1.0                   0.0                    0.0   \n",
       "9        1.0                   0.0                    0.0   \n",
       "\n",
       "   ShelveLoc[Medium]:Price  Price  \n",
       "0                      0.0    120  \n",
       "1                      0.0     83  \n",
       "2                     80.0     80  \n",
       "3                     97.0     97  \n",
       "4                      0.0    128  \n",
       "5                      0.0     72  \n",
       "6                    108.0    108  \n",
       "7                      0.0    120  \n",
       "8                    124.0    124  \n",
       "9                    124.0    124  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ModelSpec([(shelve, 'Price'), 'Price']).fit_transform(Carseats).iloc[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f85fcb2-f0ef-4c1b-a89f-fcf083937274",
   "metadata": {},
   "source": [
    "The above design matrix is clearly rank deficient, as `ModelSpec` has not inspected the formula\n",
    "and attempted to produce a corresponding matrix that may or may not match a user's intent."
   ]
  },
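  {
   "cell_type": "markdown",
   "id": "sketch-rank-deficiency-check",
   "metadata": {},
   "source": [
    "A minimal sketch of checking this rank deficiency, using only the ten rows displayed above and plain Python: if the interaction columns add up to `Price` in every row, the columns are linearly dependent. The variable names are illustrative.\n",
    "\n",
    "```python\n",
    "# Rows copied from the design matrix displayed above:\n",
    "# (ShelveLoc[Bad]:Price, ShelveLoc[Good]:Price, ShelveLoc[Medium]:Price, Price)\n",
    "rows = [(120.0, 0.0, 0.0, 120),\n",
    "        (0.0, 83.0, 0.0, 83),\n",
    "        (0.0, 0.0, 80.0, 80),\n",
    "        (0.0, 0.0, 97.0, 97),\n",
    "        (128.0, 0.0, 0.0, 128),\n",
    "        (72.0, 0.0, 0.0, 72),\n",
    "        (0.0, 0.0, 108.0, 108),\n",
    "        (0.0, 120.0, 0.0, 120),\n",
    "        (0.0, 0.0, 124.0, 124),\n",
    "        (0.0, 0.0, 124.0, 124)]\n",
    "\n",
    "# Price is an exact linear combination of the three interaction\n",
    "# columns whenever they add up to it in every row.\n",
    "is_dependent = all(bad + good + medium == price\n",
    "                   for bad, good, medium, price in rows)\n",
    "print(is_dependent)\n",
    "```\n",
    "\n",
    "On the full design matrix, `np.linalg.matrix_rank` would be expected to report one fewer than the number of columns."
   ]
  },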
  {
   "cell_type": "markdown",
   "id": "excellent-hamilton",
   "metadata": {},
   "source": [
    "## Ordinal variables\n",
    "\n",
    "Ordinal variables are handled by a corresponding encoder)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "going-administrator",
   "metadata": {},
   "outputs": [],
   "source": [
    "Carseats['OIncome'] = pd.cut(Carseats['Income'], \n",
    "                             [0,50,90,200], \n",
    "                             labels=['L','M','H'])\n",
    "MS_order = ModelSpec(['OIncome']).fit(Carseats)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e1defb1-071b-4751-9358-b8d2f0b3412e",
   "metadata": {},
   "source": [
    "Part of the `fit` method of `ModelSpec` involves inspection of the columns of `Carseats`. \n",
    "The results of that inspection can be found in the `column_info_` attribute:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "050fb4ae-648d-429d-9cb2-8423ad9707d7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Sales': Column(idx='Sales', name='Sales', is_categorical=False, is_ordinal=False, columns=('Sales',), encoder=None),\n",
       " 'CompPrice': Column(idx='CompPrice', name='CompPrice', is_categorical=False, is_ordinal=False, columns=('CompPrice',), encoder=None),\n",
       " 'Income': Column(idx='Income', name='Income', is_categorical=False, is_ordinal=False, columns=('Income',), encoder=None),\n",
       " 'Advertising': Column(idx='Advertising', name='Advertising', is_categorical=False, is_ordinal=False, columns=('Advertising',), encoder=None),\n",
       " 'Population': Column(idx='Population', name='Population', is_categorical=False, is_ordinal=False, columns=('Population',), encoder=None),\n",
       " 'Price': Column(idx='Price', name='Price', is_categorical=False, is_ordinal=False, columns=('Price',), encoder=None),\n",
       " 'ShelveLoc': Column(idx='ShelveLoc', name='ShelveLoc', is_categorical=True, is_ordinal=False, columns=('ShelveLoc[Good]', 'ShelveLoc[Medium]'), encoder=Contrast()),\n",
       " 'Age': Column(idx='Age', name='Age', is_categorical=False, is_ordinal=False, columns=('Age',), encoder=None),\n",
       " 'Education': Column(idx='Education', name='Education', is_categorical=False, is_ordinal=False, columns=('Education',), encoder=None),\n",
       " 'Urban': Column(idx='Urban', name='Urban', is_categorical=True, is_ordinal=False, columns=('Urban[Yes]',), encoder=Contrast()),\n",
       " 'US': Column(idx='US', name='US', is_categorical=True, is_ordinal=False, columns=('US[Yes]',), encoder=Contrast()),\n",
       " 'OIncome': Column(idx='OIncome', name='OIncome', is_categorical=True, is_ordinal=True, columns=('OIncome',), encoder=OrdinalEncoder())}"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "MS_order.column_info_"
   ]
  },
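  {
   "cell_type": "markdown",
   "id": "sketch-ordinal-binning",
   "metadata": {},
   "source": [
    "The `OIncome` column was created with `pd.cut`, which by default uses right-closed bins, here (0, 50], (50, 90] and (90, 200]. The sketch below mimics that binning in plain Python and then assigns integer codes in level order, as an ordinal encoder typically does. The exact codes produced inside `ISLP` are an assumption here, and the income values are made up for illustration.\n",
    "\n",
    "```python\n",
    "from bisect import bisect_left\n",
    "\n",
    "edges = [0, 50, 90, 200]   # same bin edges as in the pd.cut call\n",
    "labels = ['L', 'M', 'H']   # ordered levels, lowest first\n",
    "\n",
    "def income_level(income):\n",
    "    # Right-closed bins: (0, 50] -> 'L', (50, 90] -> 'M', (90, 200] -> 'H'.\n",
    "    # pd.cut would return NaN for out-of-range values; this sketch raises.\n",
    "    if not edges[0] < income <= edges[-1]:\n",
    "        raise ValueError('income outside the binned range')\n",
    "    return labels[bisect_left(edges, income) - 1]\n",
    "\n",
    "# Hypothetical incomes, not taken from Carseats.\n",
    "incomes = [35, 50, 73, 90, 100]\n",
    "levels = [income_level(v) for v in incomes]\n",
    "\n",
    "# Assumed coding: position of each level in the ordered list.\n",
    "codes = [labels.index(level) for level in levels]\n",
    "\n",
    "print(levels)\n",
    "print(codes)\n",
    "```\n",
    "\n",
    "Unlike the indicator columns used for `ShelveLoc`, this encoding yields a single column, which matches the `columns=('OIncome',)` entry shown in `column_info_`."
   ]
  },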
  {
   "cell_type": "markdown",
   "id": "debf7e2e-0a9d-451b-866c-66c0df9f43e5",
   "metadata": {},
   "source": [
    "## Structure of a `ModelSpec`\n",
    "\n",
    "The first argument to `ModelSpec` is stored as the `terms` attribute. Under the hood,\n",
    "this sequence is inspected to produce the `terms_` attribute which specify the objects\n",
    "that will ultimately create the design matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "ea51e988-0857-4d49-9987-d7531b34a233",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Feature(variables=('ShelveLoc',), name='ShelveLoc', encoder=None, use_transform=True, pure_columns=True, override_encoder_colnames=False),\n",
       " Feature(variables=('Price',), name='Price', encoder=None, use_transform=True, pure_columns=True, override_encoder_colnames=False)]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "MS = ModelSpec(['ShelveLoc', 'Price'])\n",
    "MS.fit(Carseats)\n",
    "MS.terms_"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "warming-mobile",
   "metadata": {},
   "source": [
    "Each element of `terms_` should be a `Feature` which describes a set of columns to be extracted from\n",
    "a columnar data form as well as possible a possible encoder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "59214a70-1e6b-41c4-9f44-a92d340723c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "shelve_var = MS.terms_[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5fed3ea2-ff50-4e5d-819d-a948f121f9d3",
   "metadata": {},
   "source": [
    "We can find the columns associated to each term using the `build_columns` method of `ModelSpec`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "5e25ef64-497d-4f42-9f20-3d4a320cda23",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ShelveLoc[Good]</th>\n",
       "      <th>ShelveLoc[Medium]</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>395</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>396</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>397</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>398</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>399</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>400 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     ShelveLoc[Good]  ShelveLoc[Medium]\n",
       "0                0.0                0.0\n",
       "1                1.0                0.0\n",
       "2                0.0                1.0\n",
       "3                0.0                1.0\n",
       "4                0.0                0.0\n",
       "..               ...                ...\n",
       "395              1.0                0.0\n",
       "396              0.0                1.0\n",
       "397              0.0                1.0\n",
       "398              0.0                0.0\n",
       "399              1.0                0.0\n",
       "\n",
       "[400 rows x 2 columns]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df, names = build_columns(MS.column_info_,\n",
    "                          Carseats, \n",
    "                          shelve_var)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63edf7a2-e776-45b0-b434-d676d7e13dbd",
   "metadata": {},
   "source": [
    "The design matrix is constructed by running through `terms_` and concatenating the corresponding columns."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "former-spring",
   "metadata": {},
   "source": [
    "### `Feature` objects\n",
    "\n",
    "Note that `Feature` objects have a tuple of `variables` as well as an `encoder` attribute. The columns\n",
    "corresponding to the `variables` are first concatenated into a single dataframe, which is then\n",
    "run through `encoder.transform` when an encoder is supplied. The `encoder.fit` method of each `Feature` is run once during\n",
    "the call to `ModelSpec.fit`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "floral-liabilities",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Price</th>\n",
       "      <th>Income</th>\n",
       "      <th>OIncome</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>120.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>83.0</td>\n",
       "      <td>48.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>80.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>97.0</td>\n",
       "      <td>100.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>128.0</td>\n",
       "      <td>64.0</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>395</th>\n",
       "      <td>128.0</td>\n",
       "      <td>108.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>396</th>\n",
       "      <td>120.0</td>\n",
       "      <td>23.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>397</th>\n",
       "      <td>159.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>398</th>\n",
       "      <td>95.0</td>\n",
       "      <td>79.0</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>399</th>\n",
       "      <td>120.0</td>\n",
       "      <td>37.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>400 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     Price  Income  OIncome\n",
       "0    120.0    73.0      2.0\n",
       "1     83.0    48.0      1.0\n",
       "2     80.0    35.0      1.0\n",
       "3     97.0   100.0      0.0\n",
       "4    128.0    64.0      2.0\n",
       "..     ...     ...      ...\n",
       "395  128.0   108.0      0.0\n",
       "396  120.0    23.0      1.0\n",
       "397  159.0    26.0      1.0\n",
       "398   95.0    79.0      2.0\n",
       "399  120.0    37.0      1.0\n",
       "\n",
       "[400 rows x 3 columns]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_var = Feature(('Price', 'Income', 'OIncome'), name='mynewvar', encoder=None)\n",
    "build_columns(MS.column_info_,\n",
    "              Carseats, \n",
    "              new_var)[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "reasonable-canadian",
   "metadata": {},
   "source": [
    "Let's now transform these columns with an encoder. Within `ModelSpec.fit`, the columns above are built first and\n",
    "`pca.fit` is called on them; `pca.transform` is then applied within `build_columns`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "imported-measure",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mynewvar[0]</th>\n",
       "      <th>mynewvar[1]</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-3.595740</td>\n",
       "      <td>-4.850530</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>15.070401</td>\n",
       "      <td>35.706773</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>27.412228</td>\n",
       "      <td>40.772377</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-33.983048</td>\n",
       "      <td>13.468087</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6.580644</td>\n",
       "      <td>-11.287452</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>395</th>\n",
       "      <td>-36.856308</td>\n",
       "      <td>-18.418138</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>396</th>\n",
       "      <td>45.731520</td>\n",
       "      <td>3.243768</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>397</th>\n",
       "      <td>49.087659</td>\n",
       "      <td>-35.727136</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>398</th>\n",
       "      <td>-13.565178</td>\n",
       "      <td>18.847760</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>399</th>\n",
       "      <td>31.917072</td>\n",
       "      <td>0.976615</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>400 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     mynewvar[0]  mynewvar[1]\n",
       "0      -3.595740    -4.850530\n",
       "1      15.070401    35.706773\n",
       "2      27.412228    40.772377\n",
       "3     -33.983048    13.468087\n",
       "4       6.580644   -11.287452\n",
       "..           ...          ...\n",
       "395   -36.856308   -18.418138\n",
       "396    45.731520     3.243768\n",
       "397    49.087659   -35.727136\n",
       "398   -13.565178    18.847760\n",
       "399    31.917072     0.976615\n",
       "\n",
       "[400 rows x 2 columns]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "pca = PCA(n_components=2)\n",
    "pca.fit(build_columns(MS.column_info_, Carseats, new_var)[0]) # this is done within `ModelSpec.fit`\n",
    "pca_var = Feature(('Price', 'Income', 'OIncome'), name='mynewvar', encoder=pca)\n",
    "build_columns(MS.column_info_,\n",
    "              Carseats, \n",
    "              pca_var)[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "institutional-burden",
   "metadata": {},
   "source": [
    "The elements of the `variables` attribute may be column identifiers (`\"Income\"`), `Column` instances (`price`)\n",
    "or `Feature` instances (`pca_var`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "western-bloom",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Income</th>\n",
       "      <th>Price</th>\n",
       "      <th>mynewvar[0]</th>\n",
       "      <th>mynewvar[1]</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>73.0</td>\n",
       "      <td>120.0</td>\n",
       "      <td>-3.595740</td>\n",
       "      <td>-4.850530</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>48.0</td>\n",
       "      <td>83.0</td>\n",
       "      <td>15.070401</td>\n",
       "      <td>35.706773</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>35.0</td>\n",
       "      <td>80.0</td>\n",
       "      <td>27.412228</td>\n",
       "      <td>40.772377</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>100.0</td>\n",
       "      <td>97.0</td>\n",
       "      <td>-33.983048</td>\n",
       "      <td>13.468087</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>64.0</td>\n",
       "      <td>128.0</td>\n",
       "      <td>6.580644</td>\n",
       "      <td>-11.287452</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>395</th>\n",
       "      <td>108.0</td>\n",
       "      <td>128.0</td>\n",
       "      <td>-36.856308</td>\n",
       "      <td>-18.418138</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>396</th>\n",
       "      <td>23.0</td>\n",
       "      <td>120.0</td>\n",
       "      <td>45.731520</td>\n",
       "      <td>3.243768</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>397</th>\n",
       "      <td>26.0</td>\n",
       "      <td>159.0</td>\n",
       "      <td>49.087659</td>\n",
       "      <td>-35.727136</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>398</th>\n",
       "      <td>79.0</td>\n",
       "      <td>95.0</td>\n",
       "      <td>-13.565178</td>\n",
       "      <td>18.847760</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>399</th>\n",
       "      <td>37.0</td>\n",
       "      <td>120.0</td>\n",
       "      <td>31.917072</td>\n",
       "      <td>0.976615</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>400 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     Income  Price  mynewvar[0]  mynewvar[1]\n",
       "0      73.0  120.0    -3.595740    -4.850530\n",
       "1      48.0   83.0    15.070401    35.706773\n",
       "2      35.0   80.0    27.412228    40.772377\n",
       "3     100.0   97.0   -33.983048    13.468087\n",
       "4      64.0  128.0     6.580644   -11.287452\n",
       "..      ...    ...          ...          ...\n",
       "395   108.0  128.0   -36.856308   -18.418138\n",
       "396    23.0  120.0    45.731520     3.243768\n",
       "397    26.0  159.0    49.087659   -35.727136\n",
       "398    79.0   95.0   -13.565178    18.847760\n",
       "399    37.0  120.0    31.917072     0.976615\n",
       "\n",
       "[400 rows x 4 columns]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "price = MS.column_info_['Price']\n",
    "fancy_var = Feature(('Income', price, pca_var), name='fancy', encoder=None)\n",
    "build_columns(MS.column_info_,\n",
    "              Carseats, \n",
    "              fancy_var)[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e289feba-e3f5-48e0-9e29-cdd88d7f9923",
   "metadata": {},
   "source": [
    "## Predicting at new points"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "6efed2fa-9e5d-429c-a8d9-ac544cab2b41",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "intercept    12.661546\n",
       "Price        -0.052213\n",
       "Income        0.012829\n",
       "dtype: float64"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "MS = ModelSpec(['Price', 'Income']).fit(Carseats)\n",
    "X = MS.transform(Carseats)\n",
    "Y = Carseats['Sales']\n",
    "M_ols = sm.OLS(Y, X).fit()\n",
    "M_ols.params"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6b4609b-fcb2-4cc2-b630-509df4c87546",
   "metadata": {},
   "source": [
    "As `ModelSpec` is a transformer, it can be evaluated at new feature values.\n",
    "Constructing the design matrix at any values is carried out by the `transform` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "8784b0e8-ce53-4a90-aee6-b935834295c7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([10.70130676, 10.307465  ])"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_data = pd.DataFrame({'Price':[40, 50], 'Income':[10, 20]})\n",
    "new_X = MS.transform(new_data)\n",
    "M_ols.get_prediction(new_X).predicted_mean"
   ]
  },
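  {
   "cell_type": "markdown",
   "id": "sketch-manual-prediction",
   "metadata": {},
   "source": [
    "As a sanity check, the predictions above are just the rows of the new design matrix multiplied by the fitted\n",
    "coefficients. The following is a minimal sketch that re-enters the rounded coefficients printed by `M_ols.params`\n",
    "by hand (the names `params`, `new_X_manual` and `manual_pred` are introduced here only for illustration), so the\n",
    "result should agree with `get_prediction` only up to rounding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-manual-prediction-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Rounded coefficients copied from the `M_ols.params` output above\n",
    "# (intercept, Price, Income); hard-coded here for illustration only.\n",
    "params = np.array([12.661546, -0.052213, 0.012829])\n",
    "\n",
    "# Design matrix at the new points: an intercept column followed by Price and Income.\n",
    "new_X_manual = np.array([[1., 40., 10.],\n",
    "                         [1., 50., 20.]])\n",
    "\n",
    "# Each prediction is intercept + coef_Price * Price + coef_Income * Income.\n",
    "manual_pred = new_X_manual @ params\n",
    "manual_pred"
   ]
  },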
  {
   "cell_type": "markdown",
   "id": "signal-yahoo",
   "metadata": {},
   "source": [
    "## Using `np.ndarray`\n",
    "\n",
    "As the basic model is to concatenate columns extracted from a columnar data\n",
    "representation, one *can* use `np.ndarray` as the column data. In this case,\n",
    "columns will be selected by integer indices. \n",
    "\n",
    "### Caveats using `np.ndarray`\n",
    "\n",
    "Unless all features are numeric, converting a data frame to an `np.ndarray` yields a dtype of `object`,\n",
    "which complicates matters.\n",
    "\n",
    "If the `terms` only refer to a few columns of a data frame, the `transform` method only needs a data frame with those columns.\n",
    "With an `np.ndarray`, however, the column identifiers are integers identifying specific columns so,\n",
    "in order to work correctly, `transform` needs another `np.ndarray` in which the columns have the same meaning. \n",
    "\n",
    "We illustrate this below with the model for `Sales` built from `Price` and `Income`, where we want predictions at new\n",
    "values of `Price` and `Income`. The predictions found above used `pd.DataFrame`; we now illustrate the difficulties\n",
    "in using `np.ndarray`."
   ]
  },
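  {
   "cell_type": "markdown",
   "id": "sketch-object-dtype",
   "metadata": {},
   "source": [
    "The dtype issue can be seen without `ModelSpec` at all. Here is a small sketch with a made-up two-row data frame\n",
    "(`toy` is a hypothetical name, and its values are chosen only for illustration): converting purely numeric columns\n",
    "gives a floating-point array, while including a string column forces a dtype of `object`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-object-dtype-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# A toy data frame mixing numeric and string columns (values are illustrative only).\n",
    "toy = pd.DataFrame({'Price': [120, 83],\n",
    "                    'Income': [73.0, 48.0],\n",
    "                    'ShelveLoc': ['Bad', 'Good']})\n",
    "\n",
    "# Numeric columns only: integers and floats are upcast to a common float dtype.\n",
    "numeric_dtype = np.asarray(toy[['Price', 'Income']]).dtype\n",
    "\n",
    "# Adding the string column leaves `object` as the only common dtype.\n",
    "mixed_dtype = np.asarray(toy).dtype\n",
    "\n",
    "numeric_dtype, mixed_dtype"
   ]
  },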
  {
   "cell_type": "markdown",
   "id": "e7ffdd07-4d6b-4a4c-ab38-ad1270e85de6",
   "metadata": {},
   "source": [
    "We will now refit this model, using `ModelSpec` with an `np.ndarray` instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "4fec9030-7445-48be-a15f-2ac5a789e717",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[  1., 120.,  73.],\n",
       "       [  1.,  83.,  48.],\n",
       "       [  1.,  80.,  35.],\n",
       "       ...,\n",
       "       [  1., 159.,  26.],\n",
       "       [  1.,  95.,  79.],\n",
       "       [  1., 120.,  37.]])"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Carseats_np = np.asarray(Carseats[['Price', 'Education', 'Income']])\n",
    "MS_np = ModelSpec([0,2]).fit(Carseats_np)\n",
    "MS_np.transform(Carseats_np)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "c864e365-2476-4ca6-9d27-625cac2b2271",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "const    12.661546\n",
       "x1       -0.052213\n",
       "x2        0.012829\n",
       "dtype: float64"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "M_ols_np = sm.OLS(Y, MS_np.transform(Carseats_np)).fit()\n",
    "M_ols_np.params"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "undefined-sacrifice",
   "metadata": {},
   "source": [
    "Now, let's consider finding the design matrix at new points. \n",
    "When using `pd.DataFrame` we only need to supply the `transform` method\n",
    "with a data frame containing the columns implicated in the `terms` argument (in this case, `Price` and `Income`). \n",
    "\n",
    "However, when using `np.ndarray` with integers as indices, `Price` was column 0 and `Income` was column 2. The only\n",
    "sensible way for `transform` to build the design matrix is to extract columns 0 and 2 of the array it is given. Note this means\n",
    "that the meaning of columns in an `np.ndarray` provided to `transform` essentially must be identical to those\n",
    "passed to `fit`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "incredible-concert",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "index 2 is out of bounds for axis 1 with size 2\n"
     ]
    }
   ],
   "source": [
    "try:\n",
    "    new_D = np.array([[40,50], [10,20]]).T\n",
    "    new_X = MS_np.transform(new_D)\n",
    "except IndexError as e:\n",
    "    print(e)"
   ]
  },
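  {
   "cell_type": "markdown",
   "id": "allied-botswana",
   "metadata": {},
   "source": [
    "The call fails because `MS_np` selects columns 0 and 2, while `new_D` has only 2 columns. Ultimately, `M_ols_np` expects\n",
    "3 columns for new predictions because it was fit with a matrix having 3 columns (the first representing an intercept),\n",
    "and `MS_np` can only build them from an array laid out like the one passed to `fit`.\n",
    "\n",
    "Although it is tempting to proceed as with the `pd.DataFrame` and produce an `np.ndarray` with only the\n",
    "necessary variables, the unused column must still be present. Below, it is filled with `np.nan`."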
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "stunning-container",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 1. 40. 10.]\n",
      " [ 1. 50. 20.]]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([10.70130676, 10.307465  ])"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_D = np.array([[40,50], [np.nan, np.nan], [10,20]]).T\n",
    "new_X = MS_np.transform(new_D)\n",
    "print(new_X)\n",
    "M_ols_np.get_prediction(new_X).predicted_mean"
   ]
  },
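  {
   "cell_type": "markdown",
   "id": "sketch-pad-columns",
   "metadata": {},
   "source": [
    "The padding used above can be wrapped in a small helper. This is only a sketch: `pad_columns` is a hypothetical\n",
    "function, not part of `ISLP`, which places the supplied columns at their original integer positions and fills every\n",
    "other column with `np.nan`. The resulting array could then be passed to `MS_np.transform` in place of `new_D`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-pad-columns-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def pad_columns(values, n_columns):\n",
    "    # Hypothetical helper: `values` maps a column index to its new values;\n",
    "    # all columns not mentioned are filled with NaN.\n",
    "    n_rows = len(next(iter(values.values())))\n",
    "    padded = np.full((n_rows, n_columns), np.nan)\n",
    "    for idx, col in values.items():\n",
    "        padded[:, idx] = col\n",
    "    return padded\n",
    "\n",
    "# Price was column 0 and Income was column 2 of the array passed to `fit`.\n",
    "new_D_padded = pad_columns({0: [40, 50], 2: [10, 20]}, n_columns=3)\n",
    "new_D_padded"
   ]
  },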
  {
   "cell_type": "markdown",
   "id": "specific-tobacco",
   "metadata": {},
   "source": [
    "For more complicated design constructions, ensuring that the columns of `new_D` match those of the original data will be more cumbersome. We expect\n",
    "then that `pd.DataFrame` (or a columnar data representation with a similar API) will likely be easier to use with `ModelSpec`."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "formats": "source/models///ipynb,jupyterbook/models///md:myst,jupyterbook/models///ipynb"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}