models.generic_selector#
Module: models.generic_selector#
Inheritance diagram for ISLP.models.generic_selector:
Stepwise model selection#
This package defines objects to carry out custom stepwise model selection.
FeatureSelector#
- class ISLP.models.generic_selector.FeatureSelector(estimator, strategy, verbose=0, scoring=None, cv=None, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True, fixed_features=None)#
Bases:
MetaEstimatorMixin
Feature Selection for Classification and Regression.
- Parameters:
- estimator: scikit-learn classifier or regressor
- strategy: Strategy
Description of search strategy: a named tuple with fields initial_state, candidate_states, build_submodel, check_finished and postprocess.
- verbose: int (default: 0)
Level of verbosity to use in logging. If 0, no output; if 1, the number of features in the current set; if 2, detailed logging including timestamp and CV scores at each step.
- scoring: str, callable, or None (default: None)
If None (default), uses ‘accuracy’ for sklearn classifiers and ‘r2’ for sklearn regressors. If str, uses a sklearn scoring metric string identifier, for example {‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’} for classifiers, {‘mean_absolute_error’, ‘mean_squared_error’/‘neg_mean_squared_error’, ‘median_absolute_error’, ‘r2’} for regressors. If a callable object or function is provided, it must conform to sklearn’s scorer signature scorer(estimator, X, y); see http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html for more information.
- cv: int, iterable, or None (default: None)
Integer or iterable yielding train, test splits. If cv is an integer and estimator is a classifier (or y consists of integer class labels), stratified k-fold cross-validation is performed; otherwise regular k-fold cross-validation is performed. No cross-validation is performed if cv is None, False, or 0.
- n_jobs: int (default: 1)
The number of CPUs to use for evaluating different feature subsets in parallel. -1 means ‘all CPUs’.
- pre_dispatch: int, or string (default: ‘2*n_jobs’)
Controls the number of jobs that get dispatched during parallel execution if n_jobs > 1 or n_jobs=-1. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
- None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.
- An int, giving the exact number of total jobs that are spawned.
- A string, giving an expression as a function of n_jobs, as in ‘2*n_jobs’.
- clone_estimator: bool (default: True)
Clones the estimator if True; works with the original estimator instance if False. Set to False if the estimator doesn’t implement scikit-learn’s set_params and get_params methods. In that case, cv=0 and n_jobs=1 are also required.
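As a minimal sketch of the callable form of scoring described above, the function below follows the scorer(estimator, X, y) signature and returns the negative mean absolute error, so that larger values are better. MeanPredictor and neg_mae_scorer are hypothetical names used only for illustration; they are not part of ISLP or scikit-learn.

```python
class MeanPredictor:
    """Hypothetical stand-in estimator: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def neg_mae_scorer(estimator, X, y):
    """Scorer with the (estimator, X, y) signature; higher is better."""
    predictions = estimator.predict(X)
    errors = [abs(p - t) for p, t in zip(predictions, y)]
    return -sum(errors) / len(errors)


# Tiny made-up data set, just to exercise the scorer.
X = [[0.0], [1.0], [2.0]]
y = [1.0, 2.0, 3.0]
model = MeanPredictor().fit(X, y)
score = neg_mae_scorer(model, X, y)
```

A function of this shape could then be supplied as scoring=neg_mae_scorer in place of a string identifier.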
- Attributes:
- results_: dict
A dictionary of the feature subsets selected during the search, where the dictionary keys are the states of the feature selector. The dictionary values are dictionaries themselves with the following keys:
- ‘scores’: list of individual cross-validation scores
- ‘avg_score’: average cross-validation score
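To illustrate the layout described for results_, here is a hand-built dictionary of the same shape. The states, the scores, and the name toy_results are made up for illustration; the real keys depend on the strategy in use.

```python
# Hypothetical results_-shaped dictionary: keys are states (here, tuples of
# feature names), values hold the per-fold scores and their average.
toy_results = {
    (): {"scores": [0.10, 0.20, 0.15]},
    ("x1",): {"scores": [0.50, 0.60, 0.55]},
    ("x1", "x2"): {"scores": [0.70, 0.60, 0.65]},
}
for entry in toy_results.values():
    entry["avg_score"] = sum(entry["scores"]) / len(entry["scores"])

# Pick the state with the highest average cross-validation score.
best_state = max(toy_results, key=lambda s: toy_results[s]["avg_score"])
print(best_state, round(toy_results[best_state]["avg_score"], 2))
```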
Methods
- fit(X, y[, groups]): Perform feature selection and learn model from training data.
- fit_transform(X, y[, groups]): Fit to training data then reduce X to its most important features.
- get_metric_dict([confidence_interval]): Return metric dictionary.
- transform(X): Reduce X to its most important features.
- update_results_check(results, path, best, ...): Update results_ with current batch and return a boolean about whether we should continue or not.
Notes
See Strategy for explanation of the fields.
Examples
For usage examples, please see TBD
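In the meantime, the toy sketch below shows one way the five Strategy fields could fit together in a forward stepwise search. It is not the ISLP implementation: ToyStrategy, the callable signatures of candidate_states, build_submodel and postprocess, the boolean return value of check_finished, the GAIN table standing in for cross-validation scores, and the driver loop are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in with the five fields named in the docs; the real
# Strategy in ISLP may use different callable signatures.
ToyStrategy = namedtuple(
    "ToyStrategy",
    ["initial_state", "candidate_states", "build_submodel",
     "check_finished", "postprocess"],
)

FEATURES = ("x1", "x2", "x3")
# Made-up "usefulness" of each feature, standing in for a CV score.
GAIN = {"x1": 0.5, "x2": 0.3, "x3": -0.1}


def candidate_states(state):
    """Forward step: every state obtained by adding one unused feature."""
    return [tuple(sorted(state + (f,))) for f in FEATURES if f not in state]


def build_submodel(state):
    """Toy 'model': here just the score of the chosen feature subset."""
    return sum(GAIN[f] for f in state)


def check_finished(results, best_state, batch_results):
    """Stop when the batch is empty or no candidate beats the current best."""
    if not batch_results:
        return True
    best_in_batch = max(v["avg_score"] for v in batch_results.values())
    return best_in_batch <= results[best_state]["avg_score"]


def best_of(results):
    """Return the state with the highest average score."""
    return max(results, key=lambda s: results[s]["avg_score"])


strategy = ToyStrategy(
    initial_state=(),
    candidate_states=candidate_states,
    build_submodel=build_submodel,
    check_finished=check_finished,
    postprocess=best_of,
)

# Minimal driver loop (an assumption about how the fields fit together).
state = strategy.initial_state
results = {state: {"avg_score": strategy.build_submodel(state)}}
while True:
    batch = {s: {"avg_score": strategy.build_submodel(s)}
             for s in strategy.candidate_states(state)}
    if strategy.check_finished(results, state, batch):
        break
    results.update(batch)
    state = best_of(batch)

selected = strategy.postprocess(results)
```

Keeping the search logic in plain callables like these means the same loop can serve forward, backward, or exhaustive searches by swapping only the strategy object.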
- __init__(estimator, strategy, verbose=0, scoring=None, cv=None, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True, fixed_features=None)#
- fit(X, y, groups=None, **params)#
Perform feature selection and learn model from training data.
- Parameters:
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
- y: array-like, shape = [n_samples]
Target values. New in v 0.13.0: pandas DataFrames are now also accepted as argument for y.
- groups: array-like, with shape (n_samples,), optional
Group labels for the samples used while splitting the dataset into train/test set. Passed to the fit method of the cross-validator.
- params: various, optional
Additional parameters passed to the estimator, for example sample_weight=weights.
- Returns:
- self: object
- fit_transform(X, y, groups=None, **params)#
Fit to training data then reduce X to its most important features.
- Parameters:
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
- y: array-like, shape = [n_samples]
Target values. New in v 0.13.0: pandas Series are now also accepted as an argument for y.
- groups: array-like, with shape (n_samples,), optional
Group labels for the samples used while splitting the dataset into train/test set. Passed to the fit method of the cross-validator.
- params: various, optional
Additional parameters passed to the estimator, for example sample_weight=weights.
- Returns:
- Reduced feature subset of X, shape = [n_samples, k_features]
- get_metric_dict(confidence_interval=0.95)#
Return metric dictionary
- Parameters:
- confidence_interval: float (default: 0.95)
A positive float between 0.0 and 1.0 to compute the confidence interval bounds of the CV score averages.
- Returns:
- Dictionary with items where each dictionary value is a list with the number of iterations (number of feature subsets) as its length. The dictionary keys corresponding to these lists are as follows:
- ‘state’: tuple of the indices of the feature subset
- ‘scores’: list with individual CV scores
- ‘avg_score’: average of the CV scores
- ‘std_dev’: standard deviation of the CV score average
- ‘std_err’: standard error of the CV score average
- ‘ci_bound’: confidence interval bound of the CV score average
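As a rough sketch of how summary statistics like these can be derived from a list of CV scores, the snippet below computes the mean, sample standard deviation, standard error, and a confidence bound using a normal approximation. This is an assumption for illustration: the library may use different formulas (for example a Student's t quantile or a population standard deviation), and summarize_scores is a hypothetical helper, not part of ISLP.

```python
import math
import statistics


def summarize_scores(scores, confidence_interval=0.95):
    """Hypothetical helper: summary statistics for a list of CV scores."""
    n = len(scores)
    avg_score = statistics.mean(scores)
    std_dev = statistics.stdev(scores)  # sample standard deviation
    std_err = std_dev / math.sqrt(n)
    # Two-sided normal quantile; an approximation chosen for this sketch.
    z = statistics.NormalDist().inv_cdf((1 + confidence_interval) / 2)
    return {
        "scores": list(scores),
        "avg_score": avg_score,
        "std_dev": std_dev,
        "std_err": std_err,
        "ci_bound": z * std_err,
    }


summary = summarize_scores([0.8, 0.9, 1.0])
```

With only a handful of folds, a t-based bound would be wider than this normal approximation, so the sketch should be read as a lower estimate of the uncertainty.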
- transform(X)#
Reduce X to its most important features.
- Parameters:
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
- Returns:
- Reduced feature subset of X, shape = [n_samples, k_features]
- update_results_check(results, path, best, batch_results, check_finished)#
Update results_ with current batch and return a boolean about whether we should continue or not.
- Parameters:
- results: dict
Dictionary of all results. Keys are states; values are dictionaries with keys scores and avg_scores.
- best: (state, score)
Current best state and score.
- batch_results: dict
Dictionary of results from a batch fit. Keys are states; values are dictionaries with keys scores and avg_scores.
- check_finished: callable
Callable taking three arguments (results, best_state, batch_results) which determines if the state generator should step. Often this just checks whether there is a better score than that of the current best state, but it can use the entire set of results if desired.
- Returns:
- best_state: object
State that had the best avg_score
- fitted: bool
If batch_results is empty, fitting has terminated, so True is returned; otherwise False.