flotilla.compute.predict module

Compute predictors on data, e.g. classify or regress on features/samples

class flotilla.compute.predict.Classifier(data_name, trait_name, predictor_name=None, *args, **kwargs)[source]

Bases: flotilla.compute.predict.PredictorBase

Classifier for categorical response variables. A dataset-predictor pair from PredictorDataSetManager.

One dataset, one predictor, from the dataset manager; see the usage sketch below.
Parameters:

predictor_name : str

Name for predictor

data_name : str

Name for this (subset of the) data

trait_name : str

Name for this trait

X_data : pandas.DataFrame, optional

Samples-by-features (row x col) dataset to train the predictor on

trait : pandas.Series, optional

A variable you want to predict using X_data. Indexed like X_data.

predictor_obj : sklearn predictor, optional

A scikit-learn predictor that implements fit and score on (X_data, trait). Default: ExtraTreesClassifier.

predictor_scoring_fun : function, optional

Function to get the feature scores for a scikit-learn classifier. This can be different for different classifiers, e.g. for a classifier named x it could be x.scores_, for others it's x.feature_importances_. Default: lambda x: x.feature_importances_

score_cutoff_fun : function, optional

Function to cut off insignificant scores. Default: lambda scores: np.mean(scores) + 2 * np.std(scores)

n_features_dependent_kwargs : dict, optional

Keyword arguments to the predictor that depend on n_features. Default: {}

constant_kwargs : dict, optional

Keyword arguments to the predictor that are constant, e.g.: {'n_estimators': 100, 'bootstrap': True, 'max_features': 'auto', 'random_state': 0, 'oob_score': True, 'n_jobs': 2, 'verbose': True}

categorical = True
class flotilla.compute.predict.ConfigOptimizer[source]

Bases: object

Choose the coef that makes some result most likely across all n_features (or some other function of the dataset).

static objective_average_times_seen(n_features, coef=2.5, max_feature_scaler=<function max_feature_scaler>, n_estimators_scaler=<function n_estimators_scaler>)[source]

Objective function used when choosing coef; judging by its name, it estimates the average number of times each feature is seen across the estimators.

Parameters:

n_features : int

Number of features in the data

coef : float

Scaling coefficient passed through to the scaler functions

max_feature_scaler : function

Function that computes the maximum number of features per estimator from n_features

n_estimators_scaler : function

Function that computes the number of estimators from n_features

Returns:

objective : float

Value of the objective function at n_features

class flotilla.compute.predict.PredictorBase(predictor_name, data_name, trait_name, X_data=None, trait=None, predictor_obj=None, predictor_scoring_fun=None, score_cutoff_fun=None, n_features_dependent_kwargs=None, constant_kwargs=None, is_categorical_trait=None, predictor_dataset_manager=None, predictor_config_manager=None, feature_renamer=None, groupby=None, color=None, pooled=None, order=None, violinplot_kws=None, data_type=None, label_to_color=None, label_to_marker=None, singles=None, outliers=None)[source]

Bases: object

A dataset-predictor pair from PredictorDataSetManager

One dataset, one predictor, from the dataset manager.

Parameters:

predictor_name : str

Name for predictor

data_name : str

Name for this (subset of the) data

trait_name : str

Name for this trait

X_data : pandas.DataFrame, optional

Samples-by-features (row x col) dataset to train the predictor on

trait : pandas.Series, optional

A variable you want to predict using X_data. Indexed like X_data.

predictor_obj : sklearn predictor, optional

A scikit-learn predictor that implements fit and score on (X_data, trait). Default: ExtraTreesClassifier.

predictor_scoring_fun : function, optional

Function to get the feature scores for a scikit-learn classifier. This can be different for different classifiers, e.g. for a classifier named x it could be x.scores_, for others it's x.feature_importances_. Default: lambda x: x.feature_importances_

score_cutoff_fun : function, optional

Function to cut off insignificant scores. Default: lambda scores: np.mean(scores) + 2 * np.std(scores)

n_features_dependent_kwargs : dict, optional

Keyword arguments to the predictor that depend on n_features. Default: {}

constant_kwargs : dict, optional

Keyword arguments to the predictor that are constant, e.g.: {'n_estimators': 100, 'bootstrap': True, 'max_features': 'auto', 'random_state': 0, 'oob_score': True, 'n_jobs': 2, 'verbose': True}

X[source]

Predictive variables, aligned with target.

Thin reference to dataset.X

dataset[source]

Thin reference to dataset

fit()[source]

Fit predictor to the dataset

has_been_fit[source]

Thin reference to predictor.has_been_fit

has_been_scored[source]

Thin reference to predictor.has_been_scored

important_features_[source]

Get all features with scores greater than score_cutoff_

n_good_features_[source]

Get the number of good features

nmf(*args, **kwargs)[source]

Perform NMF on the top-performing features

oob_score_[source]

Thin reference to predictor.oob_score_

pca(*args, **kwargs)[source]

Perform PCA on the top-performing features

predict(*args, **kwargs)[source]

Predict the response on new data

Parameters:

other : pandas.DataFrame

A (m_samples, n_features) dataframe for which to predict the response

Returns:

prediction : pandas.Series

(m_samples,) sized series of prediction of response

Raises:

TypeError

If other is not a pandas DataFrame

predictor[source]

Thin reference to dataset.predictor

score_coefficient[source]

Thin reference to predictor._score_coefficient

score_cutoff_[source]

Get the minimum score of the ‘good’ features

scores_[source]

Scores of these features’ importances in this predictor

subset_[source]

Get the subset of the data with only important features

y[source]

Target variable, aligned with predictive variables

Thin reference to dataset.y

class flotilla.compute.predict.PredictorConfig(predictor_name, obj, predictor_scoring_fun=<function default_predictor_scoring_fun>, score_cutoff_fun=<function default_score_cutoff_fun>, n_features_dependent_kwargs=None, **kwargs)[source]

Bases: object

A configuration for a predictor: names it and tracks/sets its parameters

Dynamically configures some arguments for the predictor based on n_features (if that attribute exists). Set general parameters with __init__; yield instances, set by your parameters, with __call__.

Construct a predictor configuration

Parameters:

predictor_name : str

A name for this predictor

obj : sklearn predictor

A scikit-learn predictor, e.g. sklearn.ensemble.ExtraTreesClassifier

predictor_scoring_fun : function, optional

A function which returns the scores of a predictor. May be different for different predictor objects

score_cutoff_fun : function, optional

A function which returns the minimum “good” score of a predictor

n_features_dependent_kwargs : dict, optional (default None)

A dictionary of (key, function) arguments for the classifier, for keyword arguments that are dependent on the dataset input size

kwargs : other keyword arguments, optional

All other keyword arguments are passed along to the predictor

parameters(*args, **kwargs)[source]

Given a number of features, return the appropriately scaled keyword arguments

Parameters:

n_features : int

Number of features in the data to scale appropriate keyword arguments to the predictor object

class flotilla.compute.predict.PredictorConfigManager[source]

Bases: object

Manage several predictor configurations

A container for predictor configurations, including several built-ins (ExtraTreesClassifier, ExtraTreesRegressor, GradientBoostingClassifier, and GradientBoostingRegressor). new_predictor_config creates and registers a PredictorConfig, while predictor_config returns an initialized scikit-learn predictor from a configuration.

Attributes

predictor_config : Create a predictor configuration and return an initialized predictor
predictor_configs : Dict of predictor configurations
builtin_predictor_configs : Names of the built-in predictor configurations

Methods

new_predictor_config(*args, **kwargs) Create a new predictor configuration
>>> pcm = PredictorConfigManager()
>>> # add a new type of predictor
>>> pcm.new_predictor_config(ExtraTreesClassifier, 'ExtraTreesClassifier',
...     n_features_dependent_kwargs={
...         'max_features': PredictorConfigScalers.max_feature_scaler,
...         'n_estimators': PredictorConfigScalers.n_estimators_scaler,
...         'n_jobs': PredictorConfigScalers.n_jobs_scaler},
...     bootstrap=True, random_state=0,
...     oob_score=True,
...     verbose=True)

Construct a predictor configuration manager with ExtraTreesClassifier, ExtraTreesRegressor, GradientBoostingClassifier, and GradientBoostingRegressor as default predictors.

builtin_predictor_configs[source]

Names of the predictor configurations

new_predictor_config(*args, **kwargs)[source]

Create a new predictor configuration

Parameters:

name : str

Name of the predictor configuration

obj : sklearn predictor object, optional (default=None)

A scikit-learn predictor class. May be None only when a configuration is already registered under name (see Raises below).

predictor_scoring_fun : function, optional (default=None)

If None, get feature scores from obj.feature_importances_

score_cutoff_fun : function, optional (default=None)

If None, get the cutoff for important features by taking features with scores that are 2 standard deviations away from the mean score

n_features_dependent_kwargs : dict, optional (default=None)

A (key, function) dictionary of keyword argument names and functions which scale their values based on the dataset input size

kwargs : other keyword arguments

All other keyword arguments are passed to PredictorConfig

Returns:

predictorconfig : PredictorConfig

A predictor configuration

Raises:

ValueError

If obj is None and any of the other keyword arguments are None

KeyError

If obj is None and “name” is not already in predictor_configs

predictor_config(name, **kwargs)[source]

Create a new predictor configuration, added to predictors

Parameters:

name : str

Name of the predictor

kwargs : other keyword arguments, optional

All other keyword arguments are passed to new_predictor_config()

Returns:

predictor : sklearn predictor

An initialized scikit-learn predictor

predictor_configs[source]

Dict of predictor configurations

class flotilla.compute.predict.PredictorConfigScalers[source]

Bases: object

Scale parameters specified in the keyword arguments based on the dataset size

static max_feature_scaler(n_features=500, coef=2.5)[source]

Scale the maximum number of features per estimator

The intent appears to be scaling so that each feature is seen a target number of times across the estimators.

Parameters:

n_features : int, optional (default 500)

Number of features in the data

coef : float, optional (default 2.5)

Scaling coefficient used in computing the maximum number of features

Returns:

n_features : int

Maximum number of features per estimator

Raises:

ValueError

If n_features is None

static n_estimators_scaler(n_features=500, coef=2.5)[source]

Scale the number of estimators based on the input features

The intent appears to be scaling so that each feature is seen a target number of times across the estimators.

Parameters:

n_features : int, optional (default 500)

Number of features in the data

coef : float, optional (default 2.5)

Scaling coefficient used in computing the number of estimators

Returns:

n_estimators : int

Number of estimators to use

Raises:

ValueError

If n_features is None

static n_jobs_scaler(n_features=500)[source]

Scale the number of jobs based on how many features are in the data


Parameters:

n_features : int

Number of features in the data

Returns:

n_jobs : int

Number of jobs to use

Raises:

ValueError

If n_features is None

class flotilla.compute.predict.PredictorDataSet(data, trait, data_name='MyDataset', categorical_trait=False, predictor_config_manager=None)[source]

Bases: object

Store a (n_samples, n_features) matrix and (n_samples,) trait pair

In scikit-learn parlance, store an X (data of independent variables) and y (target prediction) pair

Parameters:

data : pandas.DataFrame

A (n_samples, n_features) dataframe

trait : pandas.Series

A (n_samples,) response variable (y) to predict from the data

data_name : str, optional (default 'MyDataset')

Name under which to store this dataset, used with trait.name

categorical_trait : bool, optional (default False)

Whether y is categorical

predictor_config_manager : PredictorConfigManager, optional (default None)

A predictor configuration manager. If None, instantiate a new one.

X[source]

(n_samples, n_features) matrix

check_if_equal(data, trait, categorical_trait)[source]

Check if this is the same as another dataset.

Parameters:

data : pandas.DataFrame

Input data of another dataset

trait : pandas.Series

Response variable of another dataset

categorical_trait : bool

Whether or not trait is categorical

Raises:

AssertionError

If datasets are not the same

predictor(*args, **kwargs)[source]

A single, initialized PredictorConfig instance

Parameters:

name : str

Name of the predictor to retrieve or initialize

kwargs : other keyword arguments

All other keyword arguments are passed to PredictorConfig

Returns:

predictorconfig : PredictorConfig

An initialized scikit-learn classifier or regressor

predictors[source]

dict of PredictorConfig instances

The idea here is to keep the predictors tied to their datasets

traitset[source]

All unique values in self.trait

y[source]

(n_samples,) vector of traits

class flotilla.compute.predict.PredictorDataSetManager(predictor_config_manager=None)[source]

Bases: object

A collection of PredictorDataSet instances.

Parameters:

predictor_config_manager : PredictorConfigManager, optional (default None)

A predictor configuration manager. If None, instantiate a new one.

Attributes

datasets 3-layer deep dict of {data: {trait: {categorical: dataset}}}
dataset(data_name, trait_name, categorical_trait=False, **kwargs)[source]

Retrieve the PredictorDataSet stored under data_name and trait_name, creating it if necessary.

Parameters:

data_name : str

Name of this data

trait_name : str

Name of this trait

categorical_trait : bool, optional (default=False)

If True, then this trait is treated as a categorical, rather than a sequential trait

Returns:

dataset : PredictorDataSet

The dataset stored under these names

datasets[source]

3-layer deep dict of {data: {trait: {categorical: dataset}}}

new_dataset(*args, **kwargs)[source]

Create a new PredictorDataSet and register it in datasets.

Parameters:

data_name : str

Name of this data

trait_name : str

Name of this trait

categorical_trait : bool, optional (default=False)

If True, then this trait is treated as a categorical, rather than a sequential trait

data : pandas.DataFrame, optional (default=None)

A (n_samples, n_features) dataframe of input data

trait : pandas.Series, optional (default=None)

A (n_samples,) response variable to predict, indexed like data

predictor_config_manager : PredictorConfigManager, optional (default=None)

A predictor configuration manager. If None, instantiate a new one.

Returns:

dataset : PredictorDataSet

The newly created dataset

class flotilla.compute.predict.Regressor(data_name, trait_name, predictor_name=None, *args, **kwargs)[source]

Bases: flotilla.compute.predict.PredictorBase

Regressor for continuous response variables. A dataset-predictor pair from PredictorDataSetManager.

One dataset, one predictor, from the dataset manager; see the usage sketch below.
Parameters:

predictor_name : str

Name for predictor

data_name : str

Name for this (subset of the) data

trait_name : str

Name for this trait

X_data : pandas.DataFrame, optional

Samples-by-features (row x col) dataset to train the predictor on

trait : pandas.Series, optional

A variable you want to predict using X_data. Indexed like X_data.

predictor_obj : sklearn predictor, optional

A scikit-learn predictor that implements fit and score on (X_data, trait). Default: ExtraTreesClassifier.

predictor_scoring_fun : function, optional

Function to get the feature scores for a scikit-learn classifier. This can be different for different classifiers, e.g. for a classifier named x it could be x.scores_, for others it's x.feature_importances_. Default: lambda x: x.feature_importances_

score_cutoff_fun : function, optional

Function to cut off insignificant scores. Default: lambda scores: np.mean(scores) + 2 * np.std(scores)

n_features_dependent_kwargs : dict, optional

Keyword arguments to the predictor that depend on n_features. Default: {}

constant_kwargs : dict, optional

Keyword arguments to the predictor that are constant, e.g.: {'n_estimators': 100, 'bootstrap': True, 'max_features': 'auto', 'random_state': 0, 'oob_score': True, 'n_jobs': 2, 'verbose': True}

categorical = False
flotilla.compute.predict.default_predictor_scoring_fun(cls)[source]

Return scores of how important a feature is to the prediction

Most predictors store feature scores in the attribute cls.feature_importances_, but others use a different name for scores; this function bridges the gap

Parameters:

cls : sklearn predictor

A scikit-learn prediction class, such as ExtraTreesClassifier or ExtraTreesRegressor

Returns:

scores : pandas.Series

A (n_features,) size series of how important each feature was to the classification (bigger is better)

flotilla.compute.predict.default_score_cutoff_fun(arr, std_multiplier=2)[source]

Calculate a minimum score cutoff for the best features

By default, this function calculates \(f(x) = \mathrm{mean}(x) + k \cdot \mathrm{std}(x)\), where \(k\) is std_multiplier (2 by default).

Parameters:

arr : numpy.ndarray

A numpy array of scores

std_multiplier : float, optional (default=2)

What to multiply the standard deviation by. E.g. if you want only features that are 6 standard deviations away, set this to 6.

Returns:

cutoff : float

Minimum score of “best” features, given these parameters
