flotilla.compute.predict module

Compute predictors on data, e.g. classify or regress on features/samples

class flotilla.compute.predict.Classifier(data_name, trait_name, predictor_name=None, *args, **kwargs)[source]

Bases: flotilla.compute.predict.PredictorBase

Classifier for categorical response variables. A dataset-predictor pair from PredictorDataSetManager.

One dataset, one predictor, from dataset manager.

predictor_name : str

Name for predictor

data_name : str

Name for this (subset of the) data

trait_name : str

Name for this trait

X_data : pandas.DataFrame, optional

Samples-by-features (row x col) dataset to train the predictor on

trait : pandas.Series, optional

A variable you want to predict using X_data. Indexed like X_data.

predictor_obj : sklearn predictor, optional

A scikit-learn predictor that implements fit and score on (X_data, trait). Default: ExtraTreesClassifier

predictor_scoring_fun : function, optional

Function to get the feature scores for a scikit-learn classifier. This can differ between classifiers, e.g. for a classifier named “x” it could be x.scores_, while for others it is x.feature_importances_. Default: lambda x: x.feature_importances_

score_cutoff_fun : function, optional

Function to cut off insignificant scores. Default: lambda scores: np.mean(scores) + 2 * np.std(scores)

n_features_dependent_kwargs : dict, optional

kwargs to the predictor that depend on n_features. Default: {}

constant_kwargs : dict, optional

kwargs to the predictor that are constant, e.g.: {'n_estimators': 100, 'bootstrap': True, 'max_features': 'auto', 'random_state': 0, 'oob_score': True, 'n_jobs': 2, 'verbose': True}

categorical = True
class flotilla.compute.predict.ConfigOptimizer[source]

Bases: object

Choose the coef that makes some result most likely at all n_features (or some other function of the dataset).

static objective_average_times_seen(n_features, coef=2.5, max_feature_scaler=<function max_feature_scaler>, n_estimators_scaler=<function n_estimators_scaler>)[source]

I have no idea what this does. @mlovci


n_features : int


coef : float


max_feature_scaler : function


n_estimators_scaler : function




class flotilla.compute.predict.PredictorBase(predictor_name, data_name, trait_name, X_data=None, trait=None, predictor_obj=None, predictor_scoring_fun=None, score_cutoff_fun=None, n_features_dependent_kwargs=None, constant_kwargs=None, is_categorical_trait=None, predictor_dataset_manager=None, predictor_config_manager=None, feature_renamer=None, groupby=None, color=None, pooled=None, order=None, violinplot_kws=None, data_type=None, label_to_color=None, label_to_marker=None, singles=None, outliers=None)[source]

Bases: object

A dataset-predictor pair from PredictorDataSetManager.

One dataset, one predictor, from dataset manager.


predictor_name : str

Name for predictor

data_name : str

Name for this (subset of the) data

trait_name : str

Name for this trait

X_data : pandas.DataFrame, optional

Samples-by-features (row x col) dataset to train the predictor on

trait : pandas.Series, optional

A variable you want to predict using X_data. Indexed like X_data.

predictor_obj : sklearn predictor, optional

A scikit-learn predictor that implements fit and score on (X_data, trait). Default: ExtraTreesClassifier

predictor_scoring_fun : function, optional

Function to get the feature scores for a scikit-learn classifier. This can differ between classifiers, e.g. for a classifier named “x” it could be x.scores_, while for others it is x.feature_importances_. Default: lambda x: x.feature_importances_

score_cutoff_fun : function, optional

Function to cut off insignificant scores. Default: lambda scores: np.mean(scores) + 2 * np.std(scores)

n_features_dependent_kwargs : dict, optional

kwargs to the predictor that depend on n_features. Default: {}

constant_kwargs : dict, optional

kwargs to the predictor that are constant, e.g.: {'n_estimators': 100, 'bootstrap': True, 'max_features': 'auto', 'random_state': 0, 'oob_score': True, 'n_jobs': 2, 'verbose': True}


Predictive variables, aligned with target.

Thin reference to dataset.X


Thin reference to dataset


Fit predictor to the dataset


Thin reference to predictor.has_been_fit


Thin reference to predictor.has_been_scored


Get all features with scores greater than score_cutoff_


Get the number of good features
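
The two descriptions above (features scoring above the cutoff, and how many there are) can be illustrated with a minimal sketch. The names `important_features`, `scores`, and `cutoff` are hypothetical and not part of flotilla, and a plain dict stands in for the pandas objects the class presumably uses.

```python
def important_features(scores, cutoff):
    """Return the features whose score is strictly greater than cutoff.

    `scores` maps feature name -> importance score; it is a hypothetical
    stand-in for the predictor's score series.
    """
    return {name: score for name, score in scores.items() if score > cutoff}


# Made-up scores, for illustration only
scores = {"geneA": 0.40, "geneB": 0.05, "geneC": 0.30, "geneD": 0.02}

good = important_features(scores, cutoff=0.25)  # features above the cutoff
n_good = len(good)                              # number of good features
```

Keeping the cutoff as an argument mirrors the way the class separates the scores from the function that decides which of them count as significant.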

nmf(*args, **kwargs)[source]

Perform NMF on the top-performing features


Thin reference to predictor.oob_score_

pca(*args, **kwargs)[source]

Perform PCA on the top-performing features

predict(*args, **kwargs)[source]



other : pandas.DataFrame

Given a (m_samples, n_features) dataframe, predict the response


prediction : pandas.Series

(m_samples,) sized series of prediction of response



If other is not a pandas DataFrame


Thin reference to dataset.predictor


Thin reference to predictor._score_coefficient


Get the minimum score of the ‘good’ features


Scores of these features’ importances in this predictor


Get the subset of the data with only important features


Target variable, aligned with predictive variables

Thin reference to dataset.y

class flotilla.compute.predict.PredictorConfig(predictor_name, obj, predictor_scoring_fun=<function default_predictor_scoring_fun>, score_cutoff_fun=<function default_score_cutoff_fun>, n_features_dependent_kwargs=None, **kwargs)[source]

Bases: object

A configuration for a predictor, names and tracks/sets parameters

Dynamically configures some arguments for the predictor based on n_features (if this attribute exists). Set general parameters with __init__; yield instances, set by your parameters, with __call__.

Construct a predictor configuration


predictor_name : str

A name for this predictor

obj : sklearn predictor

A scikit-learn predictor, e.g. sklearn.ensemble.ExtraTreesClassifier

predictor_scoring_fun : function, optional

A function which returns the scores of a predictor. May be different for different predictor objects

score_cutoff_fun : function, optional

A function which returns the minimum “good” score of a predictor

n_features_dependent_kwargs : dict, optional (default None)

A dictionary of (key, function) arguments for the classifier, for keyword arguments that are dependent on the dataset input size

kwargs : other keyword arguments, optional

All other keyword arguments are passed along to the predictor

parameters(*args, **kwargs)[source]

Given a number of features, return the appropriately scaled keyword arguments


n_features : int

Number of features in the data to scale appropriate keyword arguments to the predictor object
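
As a rough illustration of what "appropriately scaled keyword arguments" could mean, the sketch below merges constant keyword arguments with values computed from n_features via a (key, function) dictionary. The function `scaled_parameters` and both example scalers are hypothetical; they are not flotilla's implementation or its actual scaling rules.

```python
def scaled_parameters(n_features, constant_kwargs, n_features_dependent_kwargs):
    """Merge constant kwargs with kwargs computed from n_features."""
    params = dict(constant_kwargs)  # copy so the caller's dict is untouched
    for key, scaler in n_features_dependent_kwargs.items():
        params[key] = scaler(n_features)  # each scaler maps n_features -> value
    return params


# Hypothetical scalers, chosen only for illustration
dependent = {
    "n_estimators": lambda n: max(100, n // 10),
    "max_features": lambda n: max(1, int(n ** 0.5)),
}
constant = {"bootstrap": True, "random_state": 0}

params = scaled_parameters(2500, constant, dependent)
```

The resulting dictionary could then be unpacked into a predictor constructor, e.g. `obj(**params)`.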

class flotilla.compute.predict.PredictorConfigManager[source]

Bases: object

Manage several predictor configurations

A container for predictor configurations; includes several built-ins.

@mlovci: built-ins such as ........ ? What is predictor_config vs new_predictor_config? Why are they separate?


predictor_config :  
predictor_configs :  
builtin_predictor_configs :  


new_predictor_config(*args, **kwargs) Create a new predictor configuration
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from flotilla.compute.predict import (PredictorConfigManager,
...                                       PredictorConfigScalers)
>>> pcm = PredictorConfigManager()
>>> # add a new type of predictor (name first, then the predictor object)
>>> pcm.new_predictor_config('ExtraTreesClassifier', ExtraTreesClassifier,
...     n_features_dependent_kwargs={
...         'max_features': PredictorConfigScalers.max_feature_scaler,
...         'n_estimators': PredictorConfigScalers.n_estimators_scaler,
...         'n_jobs': PredictorConfigScalers.n_jobs_scaler},
...     bootstrap=True, random_state=0,
...     oob_score=True,
...     verbose=True)

Construct a predictor configuration manager with ExtraTreesClassifier, ExtraTreesRegressor, GradientBoostingClassifier, and GradientBoostingRegressor as default predictors.


Names of the predictor configurations

new_predictor_config(*args, **kwargs)[source]

Create a new predictor configuration


name : str

Name of the predictor configuration

obj : sklearn predictor object, optional (default=None)

@mlovci: what is the point of setting the default to None if it’s not really allowed?

predictor_scoring_fun : function, optional (default=None)

If None, get feature scores from obj.feature_importances_

score_cutoff_fun : function, optional (default=None)

If None, get the cutoff for important features by taking features with scores that are 2 standard deviations above the mean score

n_features_dependent_kwargs : dict, optional (default=None)

A (key, function) dictionary of keyword argument names and functions which scale their values based on the dataset input size

kwargs : other keyword arguments

All other keyword arguments are passed to PredictorConfig


predictorconfig : PredictorConfig

A predictor configuration



If obj is None and any of the other keyword arguments are None


If obj is None and “name” is not already in predictor_configs

predictor_config(name, **kwargs)[source]

Create a new predictor configuration, added to predictors


name : str

Name of the predictor

kwargs : other keyword arguments, optional

All other keyword arguments are passed to predictor_configs()


predictor : sklearn predictor

An initialized scikit-learn predictor


Dict of predictor configurations

class flotilla.compute.predict.PredictorConfigScalers[source]

Bases: object

Scale parameters specified in the keyword arguments based on the dataset size

static max_feature_scaler(n_features=500, coef=2.5)[source]

Scale the maximum number of features per estimator

# TODO: @mlovci what are the principles behind this scaler? to see each feature “x” number of times?


n_features : int, optional (default 500)

Number of features in the data

coef : float, optional (default 2.5)

# TODO: What does this do?


n_features : int

Maximum number of features per estimator



If n_features is None

static n_estimators_scaler(n_features=500, coef=2.5)[source]

Scale the number of estimators based on the input features

# TODO: @mlovci what are the principles behind this scaler? to see each feature “x” number of times?


n_features : int, optional (default 500)

Number of features in the data

coef : float, optional (default 2.5)

# @mlovci TODO: What does this do?


n_estimators : int

Number of estimators to use



If n_features is None

static n_jobs_scaler(n_features=500)[source]

Scale the number of jobs based on how many features are in the data

# TODO: @mlovci what are the principles behind this scaler? to see each feature “x” number of times?


n_features : int

Number of features in the data


n_jobs : int

Number of jobs to use



If n_features is None

class flotilla.compute.predict.PredictorDataSet(data, trait, data_name='MyDataset', categorical_trait=False, predictor_config_manager=None)[source]

Bases: object

Store a (n_samples, n_features) matrix and (n_samples,) trait pair

In scikit-learn parlance, store an X (data of independent variables) and y (target prediction) pair


data : pandas.DataFrame

A (n_samples, n_features) DataFrame

trait : pandas.Series


data - X

trait - y

data_name - name to store this dataset, to be used with trait.name

categorical_trait - is y categorical?


(n_samples, n_features) matrix

check_if_equal(data, trait, categorical_trait)[source]

Check if this is the same as another dataset.


data : pandas.DataFrame

Input data of another dataset

trait : pandas.Series

Response variable of another dataset

categorical_trait : bool

Whether or not trait is categorical



If datasets are not the same

predictor(*args, **kwargs)[source]

A single, initialized PredictorConfig instance


name : str

Name of the predictor to retrieve or initialize

kwargs : other keyword arguments

All other keyword arguments are passed to PredictorConfig


predictorconfig : PredictorConfig

An initialized scikit-learn classifier or regressor


dict of PredictorConfig instances

The idea here is to keep the predictors tied to their datasets


All unique values in self.trait


(n_samples,) vector of traits

class flotilla.compute.predict.PredictorDataSetManager(predictor_config_manager=None)[source]

Bases: object

A collection of PredictorDataSet instances.


predictor_config_manager : PredictorConfigManager, optional (default None)

A predictor configuration manager. If None, instantiate a new one.


datasets 3-layer deep dict of {data: {trait: {categorical: dataset}}}
dataset(data_name, trait_name, categorical_trait=False, **kwargs)[source]

???? @mlovci please fill in


data_name : str

Name of this data

trait_name : str

Name of this trait

categorical_trait : bool, optional (default=False)

If True, then this trait is treated as a categorical, rather than a sequential trait


dataset : PredictorDataSet



3-layer deep dict of {data: {trait: {categorical: dataset}}}
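
The three-layer {data: {trait: {categorical: dataset}}} layout can be sketched with nested dictionaries. This is only an illustration of the data structure; `nested_store`, `get_or_create`, and the `factory` argument are hypothetical names, not flotilla's API.

```python
from collections import defaultdict


def nested_store():
    """Build a {data: {trait: {categorical: dataset}}} container."""
    return defaultdict(lambda: defaultdict(dict))


def get_or_create(store, data_name, trait_name, categorical_trait, factory):
    """Return the dataset stored under this key triple, creating it once."""
    leaf = store[data_name][trait_name]
    if categorical_trait not in leaf:
        leaf[categorical_trait] = factory()
    return leaf[categorical_trait]


store = nested_store()
# `dict` stands in for a real dataset constructor
first = get_or_create(store, "expression", "celltype", True, factory=dict)
second = get_or_create(store, "expression", "celltype", True, factory=dict)
other = get_or_create(store, "expression", "celltype", False, factory=dict)
```

Repeating the same (data, trait, categorical) triple returns the stored object instead of building a new one, while changing only the categorical flag yields a separate entry.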

new_dataset(*args, **kwargs)[source]

??? Difference between this and dataset??? @mlovci


data_name : str

Name of this data

trait_name : str

Name of this trait

categorical_trait : bool, optional (default=False)

If True, then this trait is treated as a categorical, rather than a sequential trait

data : pandas.DataFrame, optional (default=None)

??? Why is this optional!?!??!?!

trait : pandas.Series, optional (default=None)

???? Why is this optional!?!?!?

predictor_config_manager : PredictorConfigManager (default=None)


dataset : PredictorDataSet


class flotilla.compute.predict.Regressor(data_name, trait_name, predictor_name=None, *args, **kwargs)[source]

Bases: flotilla.compute.predict.PredictorBase

Regressor for continuous response variables. A dataset-predictor pair from PredictorDataSetManager.

One dataset, one predictor, from dataset manager.

predictor_name : str

Name for predictor

data_name : str

Name for this (subset of the) data

trait_name : str

Name for this trait

X_data : pandas.DataFrame, optional

Samples-by-features (row x col) dataset to train the predictor on

trait : pandas.Series, optional

A variable you want to predict using X_data. Indexed like X_data.

predictor_obj : sklearn predictor, optional

A scikit-learn predictor that implements fit and score on (X_data, trait). Default: ExtraTreesClassifier

predictor_scoring_fun : function, optional

Function to get the feature scores for a scikit-learn classifier. This can differ between classifiers, e.g. for a classifier named “x” it could be x.scores_, while for others it is x.feature_importances_. Default: lambda x: x.feature_importances_

score_cutoff_fun : function, optional

Function to cut off insignificant scores. Default: lambda scores: np.mean(scores) + 2 * np.std(scores)

n_features_dependent_kwargs : dict, optional

kwargs to the predictor that depend on n_features. Default: {}

constant_kwargs : dict, optional

kwargs to the predictor that are constant, e.g.: {'n_estimators': 100, 'bootstrap': True, 'max_features': 'auto', 'random_state': 0, 'oob_score': True, 'n_jobs': 2, 'verbose': True}

categorical = False

flotilla.compute.predict.default_predictor_scoring_fun(cls)[source]

Return scores of how important a feature is to the prediction

Most predictors store their feature scores in the attribute cls.feature_importances_, but others may use another name for the scores, so this function bridges the gap


cls : sklearn predictor

A scikit-learn prediction class, such as ExtraTreesClassifier or ExtraTreesRegressor


scores : pandas.Series

A (n_features,) size series of how important each feature was to the classification (bigger is better)
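
A minimal sketch of the "bridging" idea, assuming the scores live in an attribute of the fitted predictor. `FakeFittedPredictor` and `scoring_fun` are hypothetical stand-ins so the example runs without scikit-learn; unlike the function documented above, the sketch returns the raw attribute rather than a pandas.Series.

```python
class FakeFittedPredictor:
    """Stand-in for a fitted scikit-learn estimator (hypothetical)."""

    def __init__(self, importances):
        # Tree ensembles in scikit-learn expose this attribute after fitting
        self.feature_importances_ = importances


def scoring_fun(predictor, attribute="feature_importances_"):
    """Return the per-feature scores stored on a fitted predictor."""
    return getattr(predictor, attribute)


scores = scoring_fun(FakeFittedPredictor([0.7, 0.2, 0.1]))

# A predictor that keeps its scores elsewhere only needs another attribute name
other = FakeFittedPredictor([0.5, 0.5])
other.scores_ = [3.0, 1.0]
other_scores = scoring_fun(other, attribute="scores_")
```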

flotilla.compute.predict.default_score_cutoff_fun(arr, std_multiplier=2)[source]

Calculate a minimum score cutoff for the best features

By default, this function calculates: \(f(x) = \mathrm{mean}(x) + 2 \cdot \mathrm{std}(x)\)


arr : numpy.ndarray

A numpy array of scores

std_multiplier : float, optional (default=2)

What to multiply the standard deviation by. E.g. if you want only features that are 6 standard deviations above the mean, set this to 6.


cutoff : float

Minimum score of “best” features, given these parameters
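
The cutoff formula above can be sketched without dependencies. This is an illustration rather than flotilla's code: `score_cutoff` is a hypothetical name, and the standard library's `statistics` module is swapped in for numpy, with `pstdev` chosen because it matches the default behaviour of `np.std` (population standard deviation, `ddof=0`).

```python
from statistics import fmean, pstdev


def score_cutoff(scores, std_multiplier=2):
    """Return mean(scores) + std_multiplier * std(scores)."""
    return fmean(scores) + std_multiplier * pstdev(scores)


# Made-up scores, for illustration only
scores = [1.0, 2.0, 3.0, 4.0, 5.0]

cutoff = score_cutoff(scores)                    # default: mean + 2 * std
strict = score_cutoff(scores, std_multiplier=6)  # stricter: mean + 6 * std
```

A larger `std_multiplier` raises the cutoff, so fewer features qualify as "best".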

Olga B. Botvinnik is funded by the NDSEG fellowship and is a NumFOCUS John Hunter Technology Fellow.
Michael T. Lovci was partially funded by a fellowship from Genentech.
Partially funded by NIH grants NS075449 and HG004659 and CIRM grants RB4-06045 and TR3-05676 to Gene Yeo.