flotilla.compute.predict module

Compute predictors on data, e.g. classify or regress on features/samples

class flotilla.compute.predict.Classifier(data_name, trait_name, predictor_name=None, *args, **kwargs)[source]

Bases: flotilla.compute.predict.PredictorBase

Classifier for categorical response variables. A dataset-predictor pair from PredictorDataSetManager.

One dataset, one predictor, from the dataset manager; see the usage sketch below.
Parameters:

predictor_name : str

Name for predictor

data_name : str

Name for this (subset of the) data

trait_name : str

Name for this trait

X_data : pandas.DataFrame, optional

Samples-by-features (row x col) dataset to train the predictor on

trait : pandas.Series, optional

A variable you want to predict using X_data. Indexed like X_data.

predictor_obj : sklearn predictor, optional

A scikit-learn predictor that implements fit and score on (X_data, trait). Default: ExtraTreesClassifier.

predictor_scoring_fun : function, optional

Function to get the feature scores for a scikit-learn classifier. This can be different for different classifiers, e.g. for a classifier named x it could be x.scores_, for others it's x.feature_importances_. Default: lambda x: x.feature_importances_

score_cutoff_fun : function, optional

Function to cut off insignificant scores. Default: lambda scores: np.mean(scores) + 2 * np.std(scores)

n_features_dependent_kwargs : dict, optional

Keyword arguments to the predictor that depend on n_features. Default: {}

constant_kwargs : dict, optional

Keyword arguments to the predictor that are constant, e.g.: {'n_estimators': 100, 'bootstrap': True, 'max_features': 'auto', 'random_state': 0, 'oob_score': True, 'n_jobs': 2, 'verbose': True}

categorical = True
class flotilla.compute.predict.ConfigOptimizer[source]

Bases: object

Choose the coef that makes some result most likely across all n_features (or some other function of the dataset).

static objective_average_times_seen(n_features, coef=2.5, max_feature_scaler=<function max_feature_scaler>, n_estimators_scaler=<function n_estimators_scaler>)[source]

Objective function used when choosing coef; judging by its name, it estimates the average number of times each feature is seen across the estimators.

Parameters:

n_features : int

Number of features in the data

coef : float

Scaling coefficient passed through to the scaler functions

max_feature_scaler : function

Function that computes the maximum number of features per estimator from n_features

n_estimators_scaler : function

Function that computes the number of estimators from n_features

Returns:

objective : float

Value of the objective function at n_features

class flotilla.compute.predict.PredictorBase(predictor_name, data_name, trait_name, X_data=None, trait=None, predictor_obj=None, predictor_scoring_fun=None, score_cutoff_fun=None, n_features_dependent_kwargs=None, constant_kwargs=None, is_categorical_trait=None, predictor_dataset_manager=None, predictor_config_manager=None, feature_renamer=None, groupby=None, color=None, pooled=None, order=None, violinplot_kws=None, data_type=None, label_to_color=None, label_to_marker=None, singles=None, outliers=None)[source]

Bases: object

A dataset-predictor pair from PredictorDataSetManager

One dataset, one predictor, from the dataset manager.

Parameters:

predictor_name : str

Name for predictor

data_name : str

Name for this (subset of the) data

trait_name : str

Name for this trait

X_data : pandas.DataFrame, optional

Samples-by-features (row x col) dataset to train the predictor on

trait : pandas.Series, optional

A variable you want to predict using X_data. Indexed like X_data.

predictor_obj : sklearn predictor, optional

A scikit-learn predictor that implements fit and score on (X_data, trait). Default: ExtraTreesClassifier.

predictor_scoring_fun : function, optional

Function to get the feature scores for a scikit-learn classifier. This can be different for different classifiers, e.g. for a classifier named x it could be x.scores_, for others it's x.feature_importances_. Default: lambda x: x.feature_importances_

score_cutoff_fun : function, optional

Function to cut off insignificant scores. Default: lambda scores: np.mean(scores) + 2 * np.std(scores)

n_features_dependent_kwargs : dict, optional

Keyword arguments to the predictor that depend on n_features. Default: {}

constant_kwargs : dict, optional

Keyword arguments to the predictor that are constant, e.g.: {'n_estimators': 100, 'bootstrap': True, 'max_features': 'auto', 'random_state': 0, 'oob_score': True, 'n_jobs': 2, 'verbose': True}

X[source]

Predictive variables, aligned with target.

Thin reference to dataset.X

dataset[source]

Thin reference to dataset

fit()[source]

Fit predictor to the dataset

has_been_fit[source]

Thin reference to predictor.has_been_fit

has_been_scored[source]

Thin reference to predictor.has_been_scored

important_features_[source]

Get all features with scores greater than score_cutoff_

n_good_features_[source]

Get the number of good features

nmf(*args, **kwargs)[source]

Perform NMF on the top-performing features

oob_score_[source]

Thin reference to predictor.oob_score_

pca(*args, **kwargs)[source]

Perform PCA on the top-performing features

predict(*args, **kwargs)[source]

Predict the response on new data

Parameters:

other : pandas.DataFrame

A (m_samples, n_features) dataframe for which to predict the response

Returns:

prediction : pandas.Series

(m_samples,) sized series of prediction of response

Raises:

TypeError

If other is not a pandas DataFrame

predictor[source]

Thin reference to dataset.predictor

score_coefficient[source]

Thin reference to predictor._score_coefficient

score_cutoff_[source]

Get the minimum score of the ‘good’ features

scores_[source]

Scores of these features’ importances in this predictor

subset_[source]

Get the subset of the data with only important features

y[source]

Target variable, aligned with predictive variables

Thin reference to dataset.y

class flotilla.compute.predict.PredictorConfig(predictor_name, obj, predictor_scoring_fun=<function default_predictor_scoring_fun>, score_cutoff_fun=<function default_score_cutoff_fun>, n_features_dependent_kwargs=None, **kwargs)[source]

Bases: object

A configuration for a predictor: names it and tracks/sets its parameters

Dynamically configures some arguments for the predictor based on n_features (if that attribute exists). Set general parameters with __init__; yield instances, set by your parameters, with __call__.

Construct a predictor configuration

Parameters:

predictor_name : str

A name for this predictor

obj : sklearn predictor

A scikit-learn predictor, e.g. sklearn.ensemble.ExtraTreesClassifier

predictor_scoring_fun : function, optional

A function which returns the scores of a predictor. May be different for different predictor objects

score_cutoff_fun : function, optional

A function which returns the minimum “good” score of a predictor

n_features_dependent_kwargs : dict, optional (default None)

A dictionary of (key, function) arguments for the classifier, for keyword arguments that are dependent on the dataset input size

kwargs : other keyword arguments, optional

All other keyword arguments are passed along to the predictor

parameters(*args, **kwargs)[source]

Given a number of features, return the appropriately scaled keyword arguments

Parameters:

n_features : int

Number of features in the data to scale appropriate keyword arguments to the predictor object

class flotilla.compute.predict.PredictorConfigManager[source]

Bases: object

Manage several predictor configurations

A container for predictor configurations, including several built-ins (ExtraTreesClassifier, ExtraTreesRegressor, GradientBoostingClassifier, and GradientBoostingRegressor). new_predictor_config creates and registers a PredictorConfig, while predictor_config returns an initialized scikit-learn predictor from a configuration.

Attributes

predictor_config : Create a predictor configuration and return an initialized predictor
predictor_configs : Dict of predictor configurations
builtin_predictor_configs : Names of the built-in predictor configurations

Methods

new_predictor_config(*args, **kwargs) Create a new predictor configuration
>>> pcm = PredictorConfigManager()
>>> # add a new type of predictor
>>> pcm.new_predictor_config(ExtraTreesClassifier, 'ExtraTreesClassifier',
...     n_features_dependent_kwargs={
...         'max_features': PredictorConfigScalers.max_feature_scaler,
...         'n_estimators': PredictorConfigScalers.n_estimators_scaler,
...         'n_jobs': PredictorConfigScalers.n_jobs_scaler},
...     bootstrap=True, random_state=0,
...     oob_score=True,
...     verbose=True)

Construct a predictor configuration manager with ExtraTreesClassifier, ExtraTreesRegressor, GradientBoostingClassifier, and GradientBoostingRegressor as default predictors.

builtin_predictor_configs[source]

Names of the predictor configurations

new_predictor_config(*args, **kwargs)[source]

Create a new predictor configuration

Parameters:

name : str

Name of the predictor configuration

obj : sklearn predictor object, optional (default=None)

A scikit-learn predictor class. May be None only when a configuration is already registered under name (see Raises below).

predictor_scoring_fun : function, optional (default=None)

If None, get feature scores from obj.feature_importances_

score_cutoff_fun : function, optional (default=None)

If None, get the cutoff for important features by taking features with scores that are 2 standard deviations away from the mean score

n_features_dependent_kwargs : dict, optional (default=None)

A (key, function) dictionary of keyword argument names and functions which scale their values based on the dataset input size

kwargs : other keyword arguments

All other keyword arguments are passed to PredictorConfig

Returns:

predictorconfig : PredictorConfig

A predictor configuration

Raises:

ValueError

If obj is None and any of the other keyword arguments are None

KeyError

If obj is None and “name” is not already in predictor_configs

predictor_config(name, **kwargs)[source]

Create a new predictor configuration, added to predictors

Parameters:

name : str

Name of the predictor

kwargs : other keyword arguments, optional

All other keyword arguments are passed to new_predictor_config()

Returns:

predictor : sklearn predictor

An initialized scikit-learn predictor

predictor_configs[source]

Dict of predictor configurations

class flotilla.compute.predict.PredictorConfigScalers[source]

Bases: object

Scale parameters specified in the keyword arguments based on the dataset size

static max_feature_scaler(n_features=500, coef=2.5)[source]

Scale the maximum number of features per estimator

The intent appears to be scaling so that each feature is seen a target number of times across the estimators.

Parameters:

n_features : int, optional (default 500)

Number of features in the data

coef : float, optional (default 2.5)

Scaling coefficient used in computing the maximum number of features

Returns:

n_features : int

Maximum number of features per estimator

Raises:

ValueError

If n_features is None

static n_estimators_scaler(n_features=500, coef=2.5)[source]

Scale the number of estimators based on the input features

The intent appears to be scaling so that each feature is seen a target number of times across the estimators.

Parameters:

n_features : int, optional (default 500)

Number of features in the data

coef : float, optional (default 2.5)

Scaling coefficient used in computing the number of estimators

Returns:

n_estimators : int

Number of estimators to use

Raises:

ValueError

If n_features is None

static n_jobs_scaler(n_features=500)[source]

Scale the number of jobs based on how many features are in the data


Parameters:

n_features : int

Number of features in the data

Returns:

n_jobs : int

Number of jobs to use

Raises:

ValueError

If n_features is None

class flotilla.compute.predict.PredictorDataSet(data, trait, data_name='MyDataset', categorical_trait=False, predictor_config_manager=None)[source]

Bases: object

Store a (n_samples, n_features) matrix and (n_samples,) trait pair

In scikit-learn parlance, store an X (data of independent variables) and y (target prediction) pair

Parameters:

data : pandas.DataFrame

A (n_samples, n_features) dataframe

trait : pandas.Series

A (n_samples,) response variable (y) to predict from the data

data_name : str, optional (default 'MyDataset')

Name under which to store this dataset, used with trait.name

categorical_trait : bool, optional (default False)

Whether y is categorical

predictor_config_manager : PredictorConfigManager, optional (default None)

A predictor configuration manager. If None, instantiate a new one.

X[source]

(n_samples, n_features) matrix

check_if_equal(data, trait, categorical_trait)[source]

Check if this is the same as another dataset.

Parameters:

data : pandas.DataFrame

Input data of another dataset

trait : pandas.Series

Response variable of another dataset

categorical_trait : bool

Whether or not trait is categorical

Raises:

AssertionError

If datasets are not the same

predictor(*args, **kwargs)[source]

A single, initialized PredictorConfig instance

Parameters:

name : str

Name of the predictor to retrieve or initialize

kwargs : other keyword arguments

All other keyword arguments are passed to PredictorConfig

Returns:

predictorconfig : PredictorConfig

An initialized scikit-learn classifier or regressor

predictors[source]

dict of PredictorConfig instances

The idea here is to keep the predictors tied to their datasets

traitset[source]

All unique values in self.trait

y[source]

(n_samples,) vector of traits

class flotilla.compute.predict.PredictorDataSetManager(predictor_config_manager=None)[source]

Bases: object

A collection of PredictorDataSet instances.

Parameters:

predictor_config_manager : PredictorConfigManager, optional (default None)

A predictor configuration manager. If None, instantiate a new one.

Attributes

datasets 3-layer deep dict of {data: {trait: {categorical: dataset}}}
dataset(data_name, trait_name, categorical_trait=False, **kwargs)[source]

Retrieve the PredictorDataSet stored under data_name and trait_name, creating it if necessary.

Parameters:

data_name : str

Name of this data

trait_name : str

Name of this trait

categorical_trait : bool, optional (default=False)

If True, then this trait is treated as a categorical, rather than a sequential trait

Returns:

dataset : PredictorDataSet

The dataset stored under these names

datasets[source]

3-layer deep dict of {data: {trait: {categorical: dataset}}}

new_dataset(*args, **kwargs)[source]

Create a new PredictorDataSet and register it in datasets.

Parameters:

data_name : str

Name of this data

trait_name : str

Name of this trait

categorical_trait : bool, optional (default=False)

If True, then this trait is treated as a categorical, rather than a sequential trait

data : pandas.DataFrame, optional (default=None)

A (n_samples, n_features) dataframe of input data

trait : pandas.Series, optional (default=None)

A (n_samples,) response variable to predict, indexed like data

predictor_config_manager : PredictorConfigManager, optional (default=None)

A predictor configuration manager. If None, instantiate a new one.

Returns:

dataset : PredictorDataSet

The newly created dataset

class flotilla.compute.predict.Regressor(data_name, trait_name, predictor_name=None, *args, **kwargs)[source]

Bases: flotilla.compute.predict.PredictorBase

Regressor for continuous response variables. A dataset-predictor pair from PredictorDataSetManager.

One dataset, one predictor, from the dataset manager; see the usage sketch below.
Parameters:

predictor_name : str

Name for predictor

data_name : str

Name for this (subset of the) data

trait_name : str

Name for this trait

X_data : pandas.DataFrame, optional

Samples-by-features (row x col) dataset to train the predictor on

trait : pandas.Series, optional

A variable you want to predict using X_data. Indexed like X_data.

predictor_obj : sklearn predictor, optional

A scikit-learn predictor that implements fit and score on (X_data, trait). Default: ExtraTreesClassifier.

predictor_scoring_fun : function, optional

Function to get the feature scores for a scikit-learn classifier. This can be different for different classifiers, e.g. for a classifier named x it could be x.scores_, for others it's x.feature_importances_. Default: lambda x: x.feature_importances_

score_cutoff_fun : function, optional

Function to cut off insignificant scores. Default: lambda scores: np.mean(scores) + 2 * np.std(scores)

n_features_dependent_kwargs : dict, optional

Keyword arguments to the predictor that depend on n_features. Default: {}

constant_kwargs : dict, optional

Keyword arguments to the predictor that are constant, e.g.: {'n_estimators': 100, 'bootstrap': True, 'max_features': 'auto', 'random_state': 0, 'oob_score': True, 'n_jobs': 2, 'verbose': True}

categorical = False
flotilla.compute.predict.default_predictor_scoring_fun(cls)[source]

Return scores of how important a feature is to the prediction

Most predictors store feature scores in the attribute cls.feature_importances_, but others use a different name for scores; this function bridges the gap

Parameters:

cls : sklearn predictor

A scikit-learn prediction class, such as ExtraTreesClassifier or ExtraTreesRegressor

Returns:

scores : pandas.Series

A (n_features,) size series of how important each feature was to the classification (bigger is better)

flotilla.compute.predict.default_score_cutoff_fun(arr, std_multiplier=2)[source]

Calculate a minimum score cutoff for the best features

By default, this function calculates \(f(x) = \mathrm{mean}(x) + k \cdot \mathrm{std}(x)\), where \(k\) is std_multiplier (2 by default).

Parameters:

arr : numpy.ndarray

A numpy array of scores

std_multiplier : float, optional (default=2)

What to multiply the standard deviation by. E.g. if you want only features that are 6 standard deviations away, set this to 6.

Returns:

cutoff : float

Minimum score of “best” features, given these parameters
