models.generic_selector#

Module: models.generic_selector#

Inheritance diagram for ISLP.models.generic_selector:

sklearn.base.MetaEstimatorMixin → models.generic_selector.FeatureSelector

Stepwise model selection#

This package defines objects to carry out custom stepwise model selection.

FeatureSelector#

class ISLP.models.generic_selector.FeatureSelector(estimator, strategy, verbose=0, scoring=None, cv=None, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True, fixed_features=None)#

Bases: MetaEstimatorMixin

Feature Selection for Classification and Regression.

Parameters:
estimator: scikit-learn classifier or regressor
strategy: Strategy

Description of search strategy: a named tuple with fields initial_state, candidate_states, build_submodel, check_finished and postprocess.

verbose: int (default: 0)

Level of verbosity to use in logging. If 0, no output; if 1, the number of features in the current set; if 2, detailed logging including timestamps and CV scores at each step.

scoring: str, callable, or None (default: None)

If None (default), uses ‘accuracy’ for sklearn classifiers and ‘r2’ for sklearn regressors. If str, uses a sklearn scoring metric string identifier, for example {‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’} for classifiers and {‘mean_absolute_error’, ‘mean_squared_error’/’neg_mean_squared_error’, ‘median_absolute_error’, ‘r2’} for regressors. If a callable object or function is provided, it must conform to sklearn’s scorer signature scorer(estimator, X, y); see http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html for more information.
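
For instance, a minimal sketch of a conforming callable built with sklearn.metrics.make_scorer (the commented construction is illustrative):

    from sklearn.metrics import make_scorer, mean_squared_error

    # Satisfies the scorer(estimator, X, y) signature required above;
    # greater_is_better=False flips the sign so that higher is better.
    neg_mse = make_scorer(mean_squared_error, greater_is_better=False)
    # selector = FeatureSelector(estimator, strategy, scoring=neg_mse)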

cv: int (default: None)

Integer or iterable yielding train/test splits. If cv is an integer and the estimator is a classifier (or y consists of integer class labels), stratified k-fold cross-validation is performed; otherwise regular k-fold cross-validation is performed. No cross-validation is performed if cv is None, False, or 0.

n_jobs: int (default: 1)

The number of CPUs to use for evaluating different feature subsets in parallel. -1 means ‘all CPUs’.

pre_dispatch: int or string (default: ‘2*n_jobs’)

Controls the number of jobs that get dispatched during parallel execution if n_jobs > 1 or n_jobs=-1. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.

An int, giving the exact number of total jobs that are spawned.

A string, giving an expression as a function of n_jobs, as in ‘2*n_jobs’.

clone_estimator: bool (default: True)

Clones the estimator if True; works with the original estimator instance if False. Set to False if the estimator doesn’t implement scikit-learn’s set_params and get_params methods; in that case it is also required to set cv=0 and n_jobs=1.

Attributes:
results_: dict

A dictionary of selected feature subsets during the selection, where the dictionary keys are the states of this feature selector. The dictionary values are dictionaries themselves with the following keys:

‘scores’ (list of individual cross-validation scores)

‘avg_score’ (average cross-validation score)
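
After fitting a selector (see the Examples section below), these entries can be inspected directly; a sketch:

    # Iterate over every state visited during the search and report
    # its cross-validated performance (keys per the description above).
    for state, info in selector.results_.items():
        print(state, info['avg_score'], info['scores'])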

Methods

fit(X, y[, groups])

Perform feature selection and learn model from training data.

fit_transform(X, y[, groups])

Fit to training data then reduce X to its most important features.

get_metric_dict([confidence_interval])

Return metric dictionary

transform(X)

Reduce X to its most important features.

update_results_check(results, path, best, ...)

Update results_ with current batch and return a boolean about whether we should continue or not.

Notes

See Strategy for explanation of the fields.

Examples

For usage examples, please see TBD
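
Pending those, here is a minimal sketch (not from the package documentation) wiring FeatureSelector to a forward-stepwise Strategy built with Stepwise.first_peak from ISLP.models; the toy data and the selected_state_ attribute read at the end are assumptions:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from ISLP.models import ModelSpec, Stepwise
    from ISLP.models.generic_selector import FeatureSelector

    # Toy regression data: y depends on the first two columns only.
    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.standard_normal((100, 5)),
                     columns=[f'V{i}' for i in range(5)])
    y = X['V0'] + 0.5 * X['V1'] + rng.standard_normal(100)

    # Forward stepwise search over the design's terms, stopping at
    # the first peak in cross-validated score.
    design = ModelSpec(list(X.columns)).fit(X)
    strategy = Stepwise.first_peak(design,
                                   direction='forward',
                                   max_terms=len(design.terms))

    selector = FeatureSelector(LinearRegression(), strategy, cv=5)
    selector.fit(X, y)
    print(selector.selected_state_)  # attribute name assumed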

__init__(estimator, strategy, verbose=0, scoring=None, cv=None, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True, fixed_features=None)#
fit(X, y, groups=None, **params)#

Perform feature selection and learn model from training data.

Parameters:
X: {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.

y: array-like, shape = [n_samples]

Target values. New in v 0.13.0: pandas DataFrames are now also accepted as argument for y.

groups: array-like, with shape (n_samples,), optional

Group labels for the samples used while splitting the dataset into train/test set. Passed to the fit method of the cross-validator.

params: various, optional

Additional parameters passed to the estimator. For example, sample_weight=weights.

Returns:
self: object
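
For group-aware splitting, a sketch (assuming cv accepts a scikit-learn splitter so that groups reaches its split method; group_labels is hypothetical):

    from sklearn.model_selection import GroupKFold

    selector = FeatureSelector(LinearRegression(), strategy,
                               cv=GroupKFold(n_splits=5))
    # group_labels: array-like of shape (n_samples,) tagging each sample
    selector.fit(X, y, groups=group_labels)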
fit_transform(X, y, groups=None, **params)#

Fit to training data then reduce X to its most important features.

Parameters:
X: {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.

y: array-like, shape = [n_samples]

Target values. New in v 0.13.0: a pandas Series is now also accepted as argument for y.

groups: array-like, with shape (n_samples,), optional

Group labels for the samples used while splitting the dataset into train/test set. Passed to the fit method of the cross-validator.

params: various, optional

Additional parameters passed to the estimator. For example, sample_weight=weights.

Returns:
Reduced feature subset of X, shape=[n_samples, k_features]
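
In use (a sketch continuing the example from the Examples section):

    # Fit the selector and return only the selected columns in one call.
    X_reduced = selector.fit_transform(X, y)
    print(X_reduced.shape)  # (n_samples, k_features)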
get_metric_dict(confidence_interval=0.95)#

Return metric dictionary

Parameters:
confidence_interval: float (default: 0.95)

A positive float between 0.0 and 1.0 to compute the confidence interval bounds of the CV score averages.

Returns:
Dictionary with items where each dictionary value is a list with the number of iterations (number of feature subsets) as its length. The dictionary keys corresponding to these lists are as follows:

‘state’: tuple of the indices of the feature subset

‘scores’: list with individual CV scores

‘avg_score’: average of CV scores

‘std_dev’: standard deviation of the CV score average

‘std_err’: standard error of the CV score average

‘ci_bound’: confidence interval bound of the CV score average
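
Since every value is a list of equal length, the dictionary tabulates directly; a sketch (assuming a fitted selector as in the Examples section):

    import pandas as pd

    # One row per feature subset visited during the search.
    metrics = selector.get_metric_dict(confidence_interval=0.95)
    print(pd.DataFrame(metrics))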

transform(X)#

Reduce X to its most important features.

Parameters:
X: {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.

Returns:
Reduced feature subset of X, shape=[n_samples, k_features]
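
For example (a sketch; the selector must already be fitted, and the data must carry the same columns as at fit time):

    # Apply the already-fitted selector to data with the same columns.
    print(selector.transform(X).shape)  # (n_samples, k_features)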
update_results_check(results, path, best, batch_results, check_finished)#

Update results_ with current batch and return a boolean about whether we should continue or not.

Parameters:
results: dict

Dictionary of all results. Keys are states; values are dictionaries with keys ‘scores’ and ‘avg_score’.

best: (state, score)

Current best state and score.

batch_results: dict

Dictionary of results from a batch fit. Keys are states; values are dictionaries with keys ‘scores’ and ‘avg_score’.

check_finished: callable

Callable taking three arguments (results, best_state, batch_results) which determines whether the state generator should take another step. Often this will just check whether there is a better score than that at the current best state, but it can use the entire set of results if desired.

Returns:
best_state: object

State that had the best avg_score

fitted: bool

True if batch_results is empty (fitting has terminated); otherwise False.
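
A hypothetical check_finished callable matching the three-argument signature described above (the first-peak-style stopping rule shown is illustrative, not the package’s own):

    def check_finished(results, best_state, batch_results):
        # Stop once no candidate in the current batch improves on
        # the running best average CV score.
        best_avg = results[best_state]['avg_score']
        return not any(info['avg_score'] > best_avg
                       for info in batch_results.values())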