flotilla.data_model.base module

Base data class for all data types. All data types in flotilla inherit from this, or a child object (like ExpressionData).

class flotilla.data_model.base.BaseData(data, thresh=-inf, minimum_samples=0, feature_data=None, feature_rename_col=None, feature_ignore_subset_cols=None, technical_outliers=None, outliers=None, pooled=None, predictor_config_manager=None, data_type=None)

Bases: object

Base class for biological data measurements.

All data types in flotilla inherit from this class and have all the functionality described here.

Attributes

`feature_subsets`	Dict of feature subset names to their list of feature ids
`variant`	Features whose variance is 2 standard deviations away from the mean variance

data	(pandas.DataFrame) A (n_samples, m_features) sized DataFrame of filtered input data, from which features detected at `thresh` in too few samples (fewer than `minimum_samples`) have been removed. Compared to `data_original`, `m_features <= n_features`
data_type	(str) String indicating what kind of data this is, e.g. “splicing” or “expression”
data_original	(pandas.DataFrame) A (n_samples, n_features) sized DataFrame of all input data, before removing features for having too few samples
feature_data	(pandas.DataFrame) A (k_features, n_features_about_features) sized DataFrame of features about the feature data. Notice that this DataFrame does not need to be the same size as the data, but must at least include all the features from `data`. Compared to `data`, `k_features >= m_features`
predictor_config_manager	(PredictorConfigManager) Manage different combinations of predictor on different data subtypes

Methods

maybe_renamed_to_feature_id(feature_id) To be able to give a simple gene name, e.g. “RBFOX2”, and get the official ENSG ids or MISO ids

feature_renamer If feature_rename_col is specified in BaseData.__init__(), this will rename the feature ID to a new name. If feature_rename_col is not specified, then this will return the original id

Abstract base class for biological measurements

Parameters:


data : pandas.DataFrame

A samples x features (samples on rows, features on columns) dataframe with some kind of measurements of cells, e.g. gene expression values such as TPM, RPKM or FPKM, alternative splicing “Percent-spliced-in” (PSI) values, or RNA editing scores. Note: If the columns are a multi-index, the “level 0” is assumed to be the unique, crazy ID like ‘ENSG00000100320’, and “level 1” is assumed to be the convenient gene name like “RBFOX2”

thresh : float, optional (default=-np.inf)

Minimum value to accept for this data.

minimum_samples : int, optional (default=0)

Minimum number of samples with values greater than thresh. E.g., for use with “at least 3 single cells expressing the gene at greater than 1 TPM.”

feature_data : pandas.DataFrame, optional (default=None)

A features x attributes dataframe of metadata about the features, e.g. annotating whether the gene is a housekeeping gene

feature_rename_col : str, optional (default=None)

Which column in the feature_data to use to rename feature IDs from a crazy ID to a common gene symbol, e.g. to transform ‘ENSG00000100320’ into ‘RBFOX2’

feature_ignore_subset_cols : list-like, optional (default=None)

Columns in the feature data to ignore when making subsets, e.g. “gene_name” shouldn’t be used to create subsets, since it’s just a small number of them.

technical_outliers : list-like, optional (default=None)

List of sample IDs which should be completely ignored because they didn’t pass the technical quality control

outliers : list-like, optional (default=None)

List of sample IDs which should be marked as outliers for plotting and interpretation purposes

pooled : list-like, optional (default=None)

List of sample IDs which should be marked as pooled for plotting and interpretation purposes.

predictor_config_manager : PredictorConfigManager, optional (default=None)

Object used to organize inputs to compute.predict.Regressor and compute.predict.Classifier. If None, one is initialized for this instance.

data_type : str, optional (default=None)

A string indicating what kind of data this is, e.g. “expression” or “splicing”

Notes

Any cells not marked as “technical_outliers”, “outliers” or “pooled” are considered as single-cell samples.
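How thresh and minimum_samples interact can be sketched with plain pandas. This is an illustration only, not flotilla's implementation: the helper name filter_features and the toy values are hypothetical, and it assumes a feature counts as detected in a sample when its value is strictly greater than thresh.

```python
import numpy as np
import pandas as pd


def filter_features(data, thresh=-np.inf, minimum_samples=0):
    """Drop features detected in too few samples (hypothetical helper)."""
    # Count, per feature (column), the samples with a value above `thresh`
    n_detected = (data > thresh).sum(axis=0)
    # Keep only the features detected in at least `minimum_samples` samples
    return data.loc[:, n_detected >= minimum_samples]


# Toy samples x features table of TPM-like values (made-up numbers)
data_original = pd.DataFrame(
    {"gene_a": [5.0, 3.0, 2.0, 0.0],
     "gene_b": [0.5, 0.0, 9.0, 0.2],
     "gene_c": [1.5, 2.5, 0.0, 4.0]},
    index=["cell1", "cell2", "cell3", "cell4"])

# "At least 3 single cells expressing the gene at greater than 1 TPM"
data = filter_features(data_original, thresh=1, minimum_samples=3)
print(list(data.columns))
```

In this toy table, gene_b exceeds 1 in only one cell, so it is dropped, while gene_a and gene_c are kept. With the default arguments no feature is removed.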

big_nmf_space_transitions(groupby, phenotype_transitions, n=0.5)

Get features whose change in NMF space between phenotypes is large

Parameters:


groupby : mappable

A sample id to phenotype group mapping

phenotype_transitions : list of length-2 tuples of str

List of (‘phenotype1’, ‘phenotype2’) transitions whose change in distribution you are interested in

n : int

Minimum number of samples per phenotype, per event

Returns:

big_transitions : pandas.DataFrame

A (n_events, n_transitions) dataframe of the NMF distances between splicing events

binify(data, bins=None)

binned_nmf_reduced(*args, **kwargs)

classify(trait, sample_ids, feature_ids, standardize=True, data_name='expression', predictor_name='ExtraTreesClassifier', predictor_obj=None, predictor_scoring_fun=None, score_cutoff_fun=None, n_features_dependent_kwargs=None, constant_kwargs=None, plotting_kwargs=None, color=None, groupby=None, label_to_color=None, label_to_marker=None, order=None, bins=None)

Make and memoize a predictor on a categorical trait (associated with samples) and a subset of genes

Parameters:


trait : pandas.Series

samples x categorical feature

sample_ids : None or list of strings

If None, all sample ids will be used, else only the sample ids specified

feature_ids : None or list of strings

If None, all features will be used, else only the features specified

standardize : bool

Whether or not to mean-center all the data and scale it to unit variance via sklearn.preprocessing.StandardScaler. Note that this standardizes each variable separately; it does not make the variables uncorrelated.

predictor : flotilla.visualize.predict classifier

Must inherit from flotilla.visualize.PredictorBaseViz. Default is flotilla.visualize.predict.ClassifierViz

predictor_kwargs : dict or None

Additional ‘keyword arguments’ to supply to the predictor class

predictor_scoring_fun : function

Function to get the feature scores for a scikit-learn classifier. This can be different for different classifiers, e.g. for a classifier named “x” it could be x.scores_, for other it’s x.feature_importances_. Default: lambda x: x.feature_importances_

score_cutoff_fun : function

Function to cut off insignificant scores. Default: lambda scores: np.mean(scores) + 2 * np.std(scores)

Returns:

predictor : flotilla.compute.predict.PredictorBaseViz

A ready-to-plot object containing the predictions
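The default scoring and cutoff functions described above can be tried out on their own. The sketch below is illustrative and does not call flotilla: the scores array stands in for the feature_importances_ of a fitted scikit-learn classifier, and its values are made up.

```python
import numpy as np


def score_cutoff(scores):
    """Cutoff two standard deviations above the mean score."""
    return np.mean(scores) + 2 * np.std(scores)


# Made-up feature importances: one feature stands out from the rest
scores = np.array([0.01, 0.02, 0.01, 0.03, 0.02, 0.02, 0.01, 0.88])

cutoff = score_cutoff(scores)
# Positions of the features whose score exceeds the cutoff
important = np.flatnonzero(scores > cutoff)
print(cutoff, important)
```

Only the last feature clears the cutoff here. A different score_cutoff_fun, such as a fixed quantile, would be passed in the same way.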

detect_outliers(sample_ids=None, feature_ids=None, featurewise=False, reducer=None, standardize=True, reducer_kwargs=None, bins=None, outlier_detection_method=None, outlier_detection_method_kwargs=None)

feature_renamer_series

A pandas Series mapping the original feature ids to the renamed ids

feature_subset_to_feature_ids(feature_subset, rename=True)

Convert a feature subset name to a list of feature ids

feature_subsets

Dict of feature subset names to their list of feature ids

jsd_2d(groupby=None, n_iter=100, n_bins=10)

Mean Jensen-Shannon divergence of features across phenotypes

Parameters:


groupby : mappable

A samples to phenotypes mapping

n_iter : int

Number of bootstrap resampling iterations to perform for the within-group comparisons

n_bins : int

Number of bins to binify the singles data on

Returns:

jsd_2d : pandas.DataFrame

A (n_phenotypes, n_phenotypes) symmetric dataframe of the mean JSD between and within phenotypes

jsd_df(groupby=None, n_iter=100, n_bins=10)

Jensen-Shannon divergence of features across phenotypes

Parameters:


groupby : mappable

A samples to phenotypes mapping

n_iter : int

Number of bootstrap resampling iterations to perform for the within-group comparisons

n_bins : int

Number of bins to binify the singles data on

Returns:

jsd_df : pandas.DataFrame

A (n_features, n_phenotypes^2) dataframe of the JSD between each feature between and within phenotypes
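For readers unfamiliar with the measure, the following sketch computes the textbook Jensen-Shannon divergence between two binned distributions of one feature. It is not flotilla's code: the helper names and the PSI-like values are made up, and flotilla may differ in details such as the logarithm base, the bin edges, or the bootstrap resampling controlled by n_iter.

```python
import numpy as np


def binned_distribution(values, bins):
    """Fraction of values falling in each bin (sums to 1)."""
    counts, _ = np.histogram(values, bins=bins)
    return counts / counts.sum()


def entropy(p):
    """Shannon entropy in bits, treating 0 * log(0) as 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    return entropy(m) - 0.5 * (entropy(p) + entropy(q))


bins = np.linspace(0, 1, 11)  # 10 equal-width bins on [0, 1]
group1 = np.array([0.05, 0.02, 0.08, 0.01])  # made-up values near 0
group2 = np.array([0.95, 0.98, 0.91, 0.99])  # made-up values near 1

p = binned_distribution(group1, bins)
q = binned_distribution(group2, bins)
print(jsd(p, p), jsd(p, q))
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (distributions with no bins in common), which are the two cases printed here.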

maybe_renamed_to_feature_id(feature_id)

To be able to give a simple gene name, e.g. “RBFOX2” and get the official ENSG ids or MISO ids

Parameters:


feature_id : str

The name of a feature ID. Could be either a common gene name, that is, the name that feature_renamer() maps the crazy IDs to, or

Returns:

feature_id : str or list-like

Valid Feature ID(s) that can be used to subset self.data
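The idea of going from a convenient name back to valid feature IDs can be sketched with a pandas Series in the spirit of feature_renamer_series. Everything below is hypothetical: the helper to_feature_ids, the made-up IDs, and the assumption that an input which is already an original ID is returned unchanged are illustrative and are not taken from flotilla's source.

```python
import pandas as pd

# Hypothetical mapping of original feature IDs to renamed IDs
renamer = pd.Series({"ENSG00000100320": "RBFOX2",
                     "made_up_id_1": "GENE_X",   # made-up ID and name
                     "made_up_id_2": "GENE_X"})  # two IDs can share a name


def to_feature_ids(name_or_id, renamer):
    """Return the original feature ID(s) matching a gene name or an ID."""
    if name_or_id in renamer.index:
        # Already a valid original feature ID
        return name_or_id
    # Otherwise collect every original ID that was renamed to this name
    matches = renamer.index[renamer == name_or_id]
    if len(matches) == 0:
        raise KeyError(name_or_id)
    return list(matches)


print(to_feature_ids("RBFOX2", renamer))
print(to_feature_ids("ENSG00000100320", renamer))
print(to_feature_ids("GENE_X", renamer))
```

Returning a list for names is one way to handle several original IDs sharing a gene symbol, which matches the documented "str or list-like" return type.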

nmf_space_positions(groupby, n=0.5)

Calculate NMF-space position of splicing events in phenotype groups

Parameters:


groupby : mappable

A sample id to phenotype mapping

n : int or float

If an int, this is the minimum absolute number of cells required to calculate modalities. If a float, this is the minimum fraction of samples required to calculate modalities, e.g. if 0.6, then at least 60% of samples must have an event detected for modality detection

Returns:

df : pandas.DataFrame

A (n_events, n_groups) dataframe of NMF positions
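The two readings of n (absolute count versus fraction) can be made concrete with a small helper. This is a sketch under assumptions: minimum_cells is a hypothetical name, and rounding a fractional requirement up to the next whole cell is one reasonable reading of "at least 60%", not a documented flotilla behavior.

```python
import math


def minimum_cells(n, n_samples):
    """Translate `n` into an absolute minimum number of cells."""
    if isinstance(n, int):
        # An int is already an absolute count
        return n
    # A float is a fraction of the samples in the group; round up so that
    # "at least 60%" of 7 cells means 5 cells, not 4
    return math.ceil(n * n_samples)


print(minimum_cells(3, 10))    # absolute count, used as given
print(minimum_cells(0.5, 10))  # half of 10 samples
print(minimum_cells(0.6, 7))   # 60% of 7 samples, rounded up
```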

nmf_space_transitions(groupby, phenotype_transitions, n=0.5)

Get distance in NMF space of different splicing events

Parameters:


groupby : mappable

A sample id to phenotype mapping

phenotype_transitions : list of str pairs

Pairs of phenotypes, each naming a phenotype and the one that follows it, between which distances are calculated

n : int or float

If an int, this is the minimum absolute number of cells required to calculate modalities. If a float, this is the minimum fraction of samples required to calculate modalities, e.g. if 0.6, then at least 60% of samples must have an event detected for modality detection

Returns:

nmf_space_transitions : pandas.DataFrame

A (n_events, n_phenotype_transitions) sized DataFrame of the distances of these events in NMF space

outliers

Data from only the outlier samples

plot_big_nmf_space_transitions(phenotype_groupby, phenotype_transitions, phenotype_order, color, phenotype_to_color, phenotype_to_marker, n=0.5)

Violinplots and NMF transitions of features different in phenotypes

Plot violinplots and NMF-space transitions of features that have large NMF-space transitions between different phenotypes

Parameters:


n : int

Minimum number of samples per phenotype, per event

plot_classifier(trait, sample_ids=None, feature_ids=None, predictor_name=None, standardize=True, score_coefficient=None, data_name=None, groupby=None, label_to_color=None, label_to_marker=None, order=None, color=None, **plotting_kwargs)

Classify samples on boolean or categorical traits

Parameters:


trait : pandas.Series

A (n_samples,) series of categorical features. Must have the same index as data

sample_ids : list-like, optional (default=None)

Which samples to use to classify

feature_ids : list-like, optional (default=None)

Which features to use

predictor_name : str

Name of the predictor to use, in predictor_config_manager

standardize : bool, optional (default=True)

If True, mean-center the data so the mean of all features is 0, and divide by the standard deviation so the standard deviation of all features is 1. This allows us to compare lowly expressed features and highly expressed features on the same playing field

data_name : str, optional (default=None)

Name for this subset of the data

groupby : mappable, optional (default=None)

Map each sample id to a group, such as a phenotype label

label_to_color : dict, optional (default=None)

For each phenotype label, assign a color

label_to_marker : dict, optional (default=None)

For each phenotype label, assign a plotting marker symbol/shape

order : list, optional (default=None)

For violinplots, the order of the phenotype groups

color : list, optional (default=None)

For violinplots, the colors of the phenotypes in their order

plotting_kwargs : other keyword arguments

All other keyword arguments are passed to Classifier.__call__(), which passes them to DecompositionViz.__call__()

Returns:

cv : ClassifierViz

Visualization of the classifier
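The effect of standardize=True, as described for the standardize parameter, can be reproduced on a toy table with plain pandas. The values are made up, and this sketch only mirrors the described arithmetic; it does not show how flotilla applies it internally.

```python
import pandas as pd

# Toy samples x features table: one lowly and one highly expressed feature
data = pd.DataFrame({"low": [1.0, 2.0, 3.0],
                     "high": [1000.0, 2000.0, 3000.0]},
                    index=["cell1", "cell2", "cell3"])

# Mean-center each feature, then divide by its standard deviation.
# ddof=0 is the population standard deviation, which is also what
# sklearn.preprocessing.StandardScaler uses.
standardized = (data - data.mean()) / data.std(ddof=0)

print(standardized)
```

After standardization the two columns are identical even though their raw values differ by a factor of 1000, which is what puts lowly and highly expressed features on the same footing.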

plot_clustermap(sample_ids=None, feature_ids=None, data=None, feature_colors=None, sample_id_to_color=None, metric='euclidean', method='average', norm_features=True, scale_fig_by_data=True, **kwargs)

plot_correlations(sample_ids=None, feature_ids=None, data=None, featurewise=False, sample_id_to_color=None, metric='euclidean', method='average', scale_fig_by_data=True, **kwargs)

plot_dimensionality_reduction(x_pc=1, y_pc=2, sample_ids=None, feature_ids=None, featurewise=False, reducer=None, plot_violins=False, groupby=None, label_to_color=None, label_to_marker=None, order=None, reduce_kwargs=None, title='', most_variant_features=False, std_multiplier=2, scale_by_variance=True, **plotting_kwargs)

Principal component-like analysis of measurements

Parameters:


x_pc : int, optional

Which principal component to plot on the x-axis (default 1)

y_pc : int, optional

Which principal component to plot on the y-axis (default 2)

sample_ids : list, optional

If None, plot all the samples. If a list of strings, must be valid sample ids of the data. (default None)

feature_ids : list, optional

If None, plot all the features. If a list of strings, perform and plot dimensionality reduction on only these feature ids

featurewise : bool, optional

If True, the features are reduced on the samples, and the plotted points are features, not samples. (default False)

reducer : DataFrameReducerBase, optional

Which decomposition object to use. Must be a child of DataFrameReducerBase as this has built-in compatibility with pandas.DataFrames. (default DataFramePCA)

plot_violins : bool, optional

If True, plot the violinplots of the top features. This can take a long time, so to save time you can turn it off if you just want a quick look at the PCA. (default False)

groupby : mappable, optional

Map each sample id to a group, such as a phenotype label (default None)

label_to_color : dict, optional

For each phenotype label, assign a color (default None)

label_to_marker : dict, optional

For each phenotype label, assign a plotting marker symbol/shape (default None)

order : list, optional

For violinplots, the order of the phenotype groups (default None)

reduce_kwargs : dict, optional

Keyword arguments to the reducer (default None)

title : str, optional

Title of the reduced space plot (default ‘’)

most_variant_features : bool, optional

If True, then only take the most variant of the provided features. The most variant are determined by taking the features whose variance is std_multiplier standard deviations away from the mean feature variance (default False)

std_multiplier : float, optional

If most_variant_features is True, then use this as a cutoff for the minimum variance of a feature to be included (default 2)

scale_by_variance : bool, optional

If True, then scale the x- and y-axes by the explained variance ratio of the principal component dimensions. Only valid for PCA and its variations, not for NMF or tSNE. (default True)

plotting_kwargs : other keyword arguments

All other keyword arguments are passed to DecompositionViz.plot()

Returns:

viz : DecompositionViz

Object with plotted dimensionality reduction
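The most_variant_features rule can be sketched as follows. This is an illustration, not flotilla's code: the data are random noise with one deliberately inflated feature, and it assumes "away from the mean" means above the mean, consistent with std_multiplier being described as a cutoff for the minimum variance.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# 20 samples x 30 features of unit-variance noise (made-up data)
data = pd.DataFrame(rng.normal(size=(20, 30)),
                    columns=["feature%d" % i for i in range(30)])
# Make one feature far more variable than the rest
data["feature7"] *= 10

std_multiplier = 2
var = data.var()  # variance of each feature across samples
# Keep features whose variance exceeds the mean feature variance by more
# than `std_multiplier` standard deviations of the feature variances
cutoff = var.mean() + std_multiplier * var.std()
most_variant = var.index[var > cutoff]
print(list(most_variant))
```

Only the inflated feature passes the cutoff, so a reduction restricted to the most variant features would run on that column alone in this toy case.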

plot_feature(feature_id, sample_ids=None, phenotype_groupby=None, phenotype_order=None, color=None, phenotype_to_color=None, phenotype_to_marker=None, nmf_xlabel=None, nmf_ylabel=None, nmf_space=False, fig=None, axesgrid=None)

Plot the violinplot of a feature, with the option to show NMF movement

plot_nmf_space_transitions(feature_id, groupby, phenotype_to_color, phenotype_to_marker, order, ax=None, xlabel=None, ylabel=None)

plot_outliers(reducer, outlier_detector, **pca_args)

plot_pca(**kwargs)

Call plot_dimensionality_reduction with PCA specifically

plot_two_features(feature1, feature2, groupby=None, label_to_color=None, fillna=None, **kwargs)

Plot the values of two features

plot_two_samples(sample1, sample2, fillna=None, **kwargs)

Parameters:

sample1 : str

Name of the sample to plot on the x-axis

sample2 : str

Name of the sample to plot on the y-axis

fillna : float

Value to replace NAs with

kwargs : other keyword arguments

Any other keyword arguments valid for seaborn.jointplot

Returns:

jointgrid : seaborn.axisgrid.JointGrid

Returns a JointGrid instance

See Also

seaborn.jointplot