flotilla.data_model.base module

Base data class for all data types. All data types in flotilla inherit from this, or a child object (like ExpressionData).

class flotilla.data_model.base.BaseData(data, thresh=-inf, minimum_samples=0, feature_data=None, feature_rename_col=None, feature_ignore_subset_cols=None, technical_outliers=None, outliers=None, pooled=None, predictor_config_manager=None, data_type=None)[source]

Bases: object

Base class for biological data measurements.

All data types in flotilla inherit from this and have all the functionality described here.

Attributes

feature_subsets Dict of feature subset names to their list of feature ids
variant Features whose variance is 2 std devs away from mean variance
data (pandas.DataFrame) A (n_samples, m_features) sized DataFrame of filtered input data, with features with too few samples (minimum_samples) detected at thresh removed. Compared to data_original, m_features <= n_features
data_type (str) String indicating what kind of data this is, e.g. “splicing” or “expression”
data_original (pandas.DataFrame) A (n_samples, n_features) sized DataFrame of all input data, before removing features for having too few samples
feature_data (pandas.DataFrame) A (k_features, n_features_about_features) sized DataFrame of features about the feature data. Notice that this DataFrame does not need to be the same size as the data, but must at least include all the features from data. Compared to data, k_features >= m_features
predictor_config_manager (PredictorConfigManager) Manage different combinations of predictor on different data subtypes

Methods

maybe_renamed_to_feature_id(feature_id) To be able to give a simple gene name, e.g. “RBFOX2”, and get the official ENSG ids or MISO ids
feature_renamer If feature_rename_col is specified in BaseData.__init__(), this will rename the feature ID to a new name. If feature_rename_col is not specified, then this will return the original id

Abstract base class for biological measurements

Parameters:

data : pandas.DataFrame

A samples x features (samples on rows, features on columns) dataframe with some kind of measurements of cells, e.g. gene expression values such as TPM, RPKM or FPKM, alternative splicing “Percent-spliced-in” (PSI) values, or RNA editing scores. Note: If the columns are a multi-index, the “level 0” is assumed to be the unique, crazy ID like ‘ENSG00000100320’, and “level 1” is assumed to be the convenient gene name like “RBFOX2”

thresh : float, optional (default=-np.inf)

Minimum value to accept for this data.

minimum_samples : int, optional (default=0)

Minimum number of samples with values greater than thresh. E.g., for use with “at least 3 single cells expressing the gene at greater than 1 TPM.”

feature_data : pandas.DataFrame, optional (default=None)

A features x attributes dataframe of metadata about the features, e.g. annotating whether the gene is a housekeeping gene

feature_rename_col : str, optional (default=None)

Which column in the feature_data to use to rename feature IDs from a crazy ID to a common gene symbol, e.g. to transform ‘ENSG00000100320’ into ‘RBFOX2’

feature_ignore_subset_cols : list-like (default=None)

Columns in the feature data to ignore when making subsets, e.g. “gene_name” shouldn’t be used to create subsets, since each name applies to only a small number of features.

technical_outliers : list-like, optional (default=None)

List of sample IDs which should be completely ignored because they didn’t pass the technical quality control

outliers : list-like, optional (default=None)

List of sample IDs which should be marked as outliers for plotting and interpretation purposes

pooled : list-like, optional (default=None)

List of sample IDs which should be marked as pooled for plotting and interpretation purposes.

predictor_config_manager : PredictorConfigManager, optional (default=None)

Object used to organize inputs to compute.predict.Regressor and compute.predict.Classifier. If None, one is initialized for this instance.

data_type : str, optional (default=None)

A string indicating what kind of data this is, e.g. “expression” or “splicing”

Notes

Any cells not marked as “technical_outliers”, “outliers” or “pooled” are considered as single-cell samples.
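
The relationship between data_original and data described above can be sketched with plain pandas. This is only an illustration of the thresh and minimum_samples rule as documented, not flotilla's actual implementation; the helper name filter_features and the toy values are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def filter_features(data, thresh=-np.inf, minimum_samples=0):
    """Keep features with values above `thresh` in at least `minimum_samples` samples.

    Hypothetical helper: mirrors the documented rule, not flotilla's code.
    """
    # Count, per feature (column), how many samples exceed the threshold
    n_detected = (data > thresh).sum(axis=0)
    return data.loc[:, n_detected >= minimum_samples]


# Toy samples x features table (e.g. TPM values)
data_original = pd.DataFrame(
    {
        "geneA": [5.0, 2.0, 3.0, 0.0],
        "geneB": [0.0, 0.5, 4.0, 0.0],
        "geneC": [1.5, 9.0, 0.2, 7.0],
    },
    index=["cell1", "cell2", "cell3", "cell4"],
)

# "At least 3 single cells expressing the gene at greater than 1 TPM"
data = filter_features(data_original, thresh=1, minimum_samples=3)
```

Here geneB exceeds 1 TPM in only one cell, so it is dropped and m_features (2) is smaller than n_features (3).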

big_nmf_space_transitions(groupby, phenotype_transitions, n=0.5)[source]

Get features whose change in NMF space between phenotypes is large

Parameters:

groupby : mappable

A sample id to phenotype group mapping

phenotype_transitions : list of length-2 tuples of str

List of (‘phenotype1’, ‘phenotype2’) transitions whose change in distribution you are interested in

n : int or float

Minimum number of samples per phenotype, per event

Returns:

big_transitions : pandas.DataFrame

A (n_events, n_transitions) dataframe of the NMF distances between splicing events

binify(data, bins=None)[source]
binned_nmf_reduced(*args, **kwargs)[source]
classify(trait, sample_ids, feature_ids, standardize=True, data_name='expression', predictor_name='ExtraTreesClassifier', predictor_obj=None, predictor_scoring_fun=None, score_cutoff_fun=None, n_features_dependent_kwargs=None, constant_kwargs=None, plotting_kwargs=None, color=None, groupby=None, label_to_color=None, label_to_marker=None, order=None, bins=None)[source]

Make and memoize a predictor on a categorical trait (associated with samples), using a subset of genes

Parameters:

trait : pandas.Series

samples x categorical feature

sample_ids : None or list of strings

If None, all sample ids will be used, else only the sample ids specified

feature_ids : None or list of strings

If None, all features will be used, else only the features specified

standardize : bool

Whether or not to standardize the data, i.e. mean-center each feature and scale it to unit variance, via sklearn.preprocessing.StandardScaler. Note that StandardScaler does not “whiten” (decorrelate) the variables

predictor : flotilla.visualize.predict classifier

Must inherit from flotilla.visualize.PredictorBaseViz. Default is flotilla.visualize.predict.ClassifierViz

predictor_kwargs : dict or None

Additional ‘keyword arguments’ to supply to the predictor class

predictor_scoring_fun : function

Function to get the feature scores for a scikit-learn classifier. This can be different for different classifiers, e.g. for a classifier named “x” it could be x.scores_, for other it’s x.feature_importances_. Default: lambda x: x.feature_importances_

score_cutoff_fun : function

Function to cut off insignificant scores. Default: lambda scores: np.mean(scores) + 2 * np.std(scores)

Returns:

predictor : flotilla.compute.predict.PredictorBaseViz

A ready-to-plot object containing the predictions
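
As a rough illustration of how a scoring function and a cutoff function like the defaults described above fit together, the sketch below applies a mean-plus-two-standard-deviations cutoff to a made-up array of feature importances. The scores are invented for this example and no classifier is actually fit.

```python
import numpy as np

# Made-up feature importances, standing in for x.feature_importances_
scores = np.array([0.01, 0.02, 0.01, 0.03, 0.02, 0.01, 0.02, 0.9])


def score_cutoff_fun(scores):
    """Cutoff two standard deviations above the mean score."""
    return np.mean(scores) + 2 * np.std(scores)


cutoff = score_cutoff_fun(scores)

# Boolean mask of the features whose score exceeds the cutoff
important = scores > cutoff
```

Only the last feature clears the cutoff here, because its score is far above the rest.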

detect_outliers(sample_ids=None, feature_ids=None, featurewise=False, reducer=None, standardize=True, reducer_kwargs=None, bins=None, outlier_detection_method=None, outlier_detection_method_kwargs=None)[source]
feature_renamer_series[source]

A pandas Series of the original feature ids to the renamed ids
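
A minimal sketch of the renaming idea, assuming feature_data has a column of gene symbols indexed by the original IDs. The variable renamer_series and the function rename_feature are hypothetical names for this example and are not flotilla attributes.

```python
import pandas as pd

# Toy feature metadata: index holds the original IDs
feature_data = pd.DataFrame(
    {"gene_name": ["RBFOX2"], "is_housekeeping": [False]},
    index=["ENSG00000100320"],
)
feature_rename_col = "gene_name"

# Series mapping original feature ids to renamed ids
renamer_series = feature_data[feature_rename_col]


def rename_feature(feature_id):
    """Return the renamed id, or the original id if no renaming is known."""
    return renamer_series.get(feature_id, feature_id)


renamed = rename_feature("ENSG00000100320")
unchanged = rename_feature("id_not_in_feature_data")
```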

feature_subset_to_feature_ids(feature_subset, rename=True)[source]

Convert a feature subset name to a list of feature ids

feature_subsets[source]

Dict of feature subset names to their list of feature ids

jsd_2d(groupby=None, n_iter=100, n_bins=10)[source]

Mean Jensen-Shannon divergence of features across phenotypes

Parameters:

groupby : mappable

A samples to phenotypes mapping

n_iter : int

Number of bootstrap resampling iterations to perform for the within-group comparisons

n_bins : int

Number of bins to binify the singles data on

Returns:

jsd_2d : pandas.DataFrame

A (n_phenotypes, n_phenotypes) symmetric dataframe of the mean JSD between and within phenotypes

jsd_df(groupby=None, n_iter=100, n_bins=10)[source]

Jensen-Shannon divergence of features across phenotypes

Parameters:

groupby : mappable

A samples to phenotypes mapping

n_iter : int

Number of bootstrap resampling iterations to perform for the within-group comparisons

n_bins : int

Number of bins to binify the singles data on

Returns:

jsd_df : pandas.DataFrame

A (n_features, n_phenotypes^2) dataframe of the JSD between each feature between and within phenotypes
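
For readers unfamiliar with the quantity, the sketch below computes the Jensen-Shannon divergence between the binned values of one feature in two phenotypes. It assumes base-2 logarithms (so the result lies between 0 and 1) and equal-width bins on [0, 1]; the helper names bin_fractions and jsd are hypothetical, and flotilla's own binning and bootstrap resampling are not reproduced here.

```python
import numpy as np


def bin_fractions(values, bin_edges):
    """Fraction of values falling in each bin."""
    counts, _ = np.histogram(values, bins=bin_edges)
    return counts / counts.sum()


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits, skipping empty bins of p."""
    nonzero = p > 0
    return np.sum(p[nonzero] * np.log2(p[nonzero] / q[nonzero]))


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


bin_edges = np.linspace(0, 1, 11)  # n_bins=10

# Toy PSI values of one splicing event in two phenotypes
phenotype1 = bin_fractions([0.02, 0.05, 0.03], bin_edges)
phenotype2 = bin_fractions([0.95, 0.97, 0.99], bin_edges)

between = jsd(phenotype1, phenotype2)  # non-overlapping distributions
within = jsd(phenotype1, phenotype1)   # identical distributions
```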

maybe_renamed_to_feature_id(feature_id)[source]

To be able to give a simple gene name, e.g. “RBFOX2” and get the official ENSG ids or MISO ids

Parameters:

feature_id : str

The name of a feature ID. Could be either a common gene name, i.e. what the crazy IDs are renamed to by feature_renamer(), or

Returns:

feature_id : str or list-like

Valid Feature ID(s) that can be used to subset self.data
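
The reverse lookup can be pictured as follows. This is a sketch under the assumption that the renaming is available as a pandas Series of original id to renamed id; the function to_feature_ids and the two "hypothetical_event" ids are invented for this example.

```python
import pandas as pd

# Original feature id -> renamed id. Several ids may share one gene name.
renamer = pd.Series(
    {
        "ENSG00000100320": "RBFOX2",
        "hypothetical_event_1": "GENE_X",
        "hypothetical_event_2": "GENE_X",
    }
)


def to_feature_ids(feature_id):
    """Map a gene name back to its original id(s); pass original ids through."""
    if feature_id in renamer.index:
        return feature_id
    matches = renamer[renamer == feature_id].index.tolist()
    if not matches:
        raise KeyError(feature_id)
    return matches[0] if len(matches) == 1 else matches


single = to_feature_ids("RBFOX2")
several = to_feature_ids("GENE_X")
already_id = to_feature_ids("ENSG00000100320")
```

Returning either a string or a list matches the documented return type of "str or list-like".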

nmf_space_positions(groupby, n=0.5)[source]

Calculate NMF-space position of splicing events in phenotype groups

Parameters:

groupby : mappable

A sample id to phenotype mapping

n : int or float

If int, then this is the absolute number of cells that are minimum required to calculate modalities. If a float, then require this fraction of samples to calculate modalities, e.g. if 0.6, then at least 60% of samples must have an event detected for modality detection

Returns:

df : pandas.DataFrame

A (n_events, n_groups) dataframe of NMF positions
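
The int-versus-float meaning of n described above can be sketched as a small check. The function enough_samples is a hypothetical helper written for this example, not part of flotilla.

```python
def enough_samples(n_detected, n_total, n=0.5):
    """Decide whether an event was detected in enough samples.

    If `n` is a float it is read as a fraction of `n_total`;
    if it is an int it is read as an absolute number of samples.
    """
    if isinstance(n, float):
        return n_detected >= n * n_total
    return n_detected >= n


# Fraction: at least 60% of 10 samples must have the event detected
fraction_ok = enough_samples(n_detected=7, n_total=10, n=0.6)
fraction_too_few = enough_samples(n_detected=5, n_total=10, n=0.6)

# Absolute count: at least 3 samples must have the event detected
count_ok = enough_samples(n_detected=3, n_total=10, n=3)
count_too_few = enough_samples(n_detected=2, n_total=10, n=3)
```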

nmf_space_transitions(groupby, phenotype_transitions, n=0.5)[source]

Get distance in NMF space of different splicing events

Parameters:

groupby : mappable

A sample id to phenotype mapping

phenotype_transitions : list of str pairs

Which phenotype follows from one to the next, for calculating distances between

n : int or float

If int, then this is the absolute number of cells that are minimum required to calculate modalities. If a float, then require this fraction of samples to calculate modalities, e.g. if 0.6, then at least 60% of samples must have an event detected for modality detection

Returns:

nmf_space_transitions : pandas.DataFrame

A (n_events, n_phenotype_transitions) sized DataFrame of the distances of these events in NMF space

outliers[source]

Data from only the outlier samples

plot_big_nmf_space_transitions(phenotype_groupby, phenotype_transitions, phenotype_order, color, phenotype_to_color, phenotype_to_marker, n=0.5)[source]

Violinplots and NMF transitions of features different in phenotypes

Plot violinplots and NMF-space transitions of features that have large NMF-space transitions between different phenotypes

Parameters:

n : int or float

Minimum number of samples per phenotype, per event

plot_classifier(trait, sample_ids=None, feature_ids=None, predictor_name=None, standardize=True, score_coefficient=None, data_name=None, groupby=None, label_to_color=None, label_to_marker=None, order=None, color=None, **plotting_kwargs)[source]

Classify samples on boolean or categorical traits

Parameters:

trait : pandas.Series

A (n_samples,) series of categorical features. Must have the same index as data

sample_ids : list-like, optional (default=None)

Which samples to use to classify

feature_ids : list-like, optional (default=None)

Which features to use

predictor_name : str

Name of the predictor to use, in predictor_config_manager

standardize : bool, optional (default=True)

If True, mean-center the data so the mean of all features is 0, and divide by the standard deviation so the standard deviation of all features is 1. This allows us to compare lowly expressed features and highly expressed features on the same playing field

data_name : str, optional (default=None)

Name for this subset of the data

groupby : mappable, optional (default=None)

Map each sample id to a group, such as a phenotype label

label_to_color : dict, optional (default=None)

For each phenotype label, assign a color

label_to_marker : dict, optional (default=None)

For each phenotype label, assign a plotting marker symbol/shape

order : list, optional (default=None)

For violinplots, the order of the phenotype groups

color : list, optional (default=None)

For violinplots, the colors of the phenotypes in their order

plotting_kwargs : other keyword arguments

All other keyword arguments are passed to Classifier.__call__(), which passes them to DecompositionViz.__call__()

Returns:

cv : ClassifierViz

Visualization of the classifier
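
To make the effect of standardize concrete, the sketch below z-scores each feature of a toy matrix with NumPy, which is the same arithmetic sklearn.preprocessing.StandardScaler performs by default. The numbers are invented. The correlation between features is left unchanged, since standardization rescales each feature independently.

```python
import numpy as np

# Toy samples x features matrix: one lowly and one highly expressed feature
data = np.array(
    [
        [1.0, 100.0],
        [2.0, 220.0],
        [3.0, 290.0],
        [4.0, 410.0],
    ]
)

# Mean-center each feature and divide by its (population) standard deviation
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# Correlation between the two features before and after standardizing
corr_before = np.corrcoef(data.T)[0, 1]
corr_after = np.corrcoef(standardized.T)[0, 1]
```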

plot_clustermap(sample_ids=None, feature_ids=None, data=None, feature_colors=None, sample_id_to_color=None, metric='euclidean', method='average', norm_features=True, scale_fig_by_data=True, **kwargs)[source]
plot_correlations(sample_ids=None, feature_ids=None, data=None, featurewise=False, sample_id_to_color=None, metric='euclidean', method='average', scale_fig_by_data=True, **kwargs)[source]
plot_dimensionality_reduction(x_pc=1, y_pc=2, sample_ids=None, feature_ids=None, featurewise=False, reducer=None, plot_violins=False, groupby=None, label_to_color=None, label_to_marker=None, order=None, reduce_kwargs=None, title='', most_variant_features=False, std_multiplier=2, scale_by_variance=True, **plotting_kwargs)[source]

Principal component-like analysis of measurements

Parameters:

x_pc : int, optional

Which principal component to plot on the x-axis (default 1)

y_pc : int, optional

Which principal component to plot on the y-axis (default 2)

sample_ids : list, optional

If None, plot all the samples. If a list of strings, must be valid sample ids of the data. (default None)

feature_ids : list, optional

If None, plot all the features. If a list of strings, perform and plot dimensionality reduction on only these feature ids

featurewise : bool, optional

If True, the features are reduced on the samples, and the plotted points are features, not samples. (default False)

reducer : DataFrameReducerBase, optional

Which decomposition object to use. Must be a child of DataFrameReducerBase as this has built-in compatibility with pandas.DataFrames. (default DataFramePCA)

plot_violins : bool, optional

If True, plot the violinplots of the top features. This can take a long time, so to save time you can turn it off if you just want a quick look at the PCA. (default False)

groupby : mappable, optional

Map each sample id to a group, such as a phenotype label (default None)

label_to_color : dict, optional

For each phenotype label, assign a color (default None)

label_to_marker : dict, optional

For each phenotype label, assign a plotting marker symbol/shape (default None)

order : list, optional

For violinplots, the order of the phenotype groups (default None)

reduce_kwargs : dict, optional

Keyword arguments to the reducer (default None)

title : str, optional

Title of the reduced space plot (default ‘’)

most_variant_features : bool, optional

If True, then only take the most variant of the provided features. The most variant are determined by taking the features whose variance is std_multiplier standard deviations away from the mean feature variance (default False)

std_multiplier : float, optional

If most_variant_features is True, then use this as a cutoff for the minimum variance of a feature to be included (default 2)

scale_by_variance : bool, optional

If True, then scale the x- and y-axes by the explained variance ratio of the principal component dimensions. Only valid for PCA and its variations, not for NMF or tSNE. (default True)

plotting_kwargs : other keyword arguments

All other keyword arguments are passed to DecompositionViz.plot()

Returns:

viz : DecompositionViz

Object with plotted dimensionality reduction

plot_feature(feature_id, sample_ids=None, phenotype_groupby=None, phenotype_order=None, color=None, phenotype_to_color=None, phenotype_to_marker=None, nmf_xlabel=None, nmf_ylabel=None, nmf_space=False, fig=None, axesgrid=None)[source]

Plot the violinplot of a feature. Have the option to show NMF movement

plot_nmf_space_transitions(feature_id, groupby, phenotype_to_color, phenotype_to_marker, order, ax=None, xlabel=None, ylabel=None)[source]
plot_outliers(reducer, outlier_detector, **pca_args)[source]
plot_pca(**kwargs)[source]

Call plot_dimensionality_reduction with PCA specifically

plot_two_features(feature1, feature2, groupby=None, label_to_color=None, fillna=None, **kwargs)[source]

Plot the values of two features

plot_two_samples(sample1, sample2, fillna=None, **kwargs)[source]

Parameters:

sample1 : str

Name of the sample to plot on the x-axis

sample2 : str

Name of the sample to plot on the y-axis

fillna : float

Value to replace NAs with

kwargs : other keyword arguments

Any other keyword arguments valid for seaborn.jointplot

Returns:

jointgrid : seaborn.axisgrid.JointGrid

Returns a JointGrid instance

See Also

seaborn.jointplot

pooled[source]

Data from only the pooled samples

reduce(sample_ids=None, feature_ids=None, featurewise=False, reducer=<class 'flotilla.compute.decomposition.DataFramePCA'>, standardize=True, reducer_kwargs=None, bins=None, most_variant_features=False, std_multiplier=2, cosine_transform=False)[source]

Make and memoize a reduced dimensionality representation of data

Parameters:

data : pandas.DataFrame

samples x features data to reduce

sample_ids : None or list of strings

If None, all sample ids will be used, else only the sample ids specified

feature_ids : None or list of strings

If None, all features will be used, else only the features specified

featurewise : bool

Whether or not to use the features as the “samples”, e.g. if you want to reduce the features into “sample-space” instead of reducing the samples into “feature-space”

standardize : bool

Whether or not to standardize the data, i.e. mean-center each feature and scale it to unit variance, via sklearn.preprocessing.StandardScaler. Note that StandardScaler does not “whiten” (decorrelate) the variables

title : str

Title of the plot

reducer_kwargs : dict

Any additional arguments to send to the reducer

Returns:

reducer_object : flotilla.compute.reduce.ReducerViz

A ready-to-plot object containing the reduced space
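
The difference between the default and the featurewise reduction can be sketched with a bare-bones PCA built on NumPy's SVD. This is a stand-in for illustration only: the function pca_reduce is hypothetical, it is not DataFramePCA, and it skips standardization and memoization.

```python
import numpy as np


def pca_reduce(data, n_components=2, featurewise=False):
    """Project rows of `data` onto the first principal components.

    With featurewise=True the matrix is transposed first, so the
    returned points are features rather than samples.
    """
    matrix = data.T if featurewise else data
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    reduced = u[:, :n_components] * s[:n_components]
    explained_variance_ratio = (s ** 2) / np.sum(s ** 2)
    return reduced, explained_variance_ratio[:n_components]


rng = np.random.default_rng(0)
data = rng.normal(size=(5, 3))  # 5 samples x 3 features

sample_space, sample_ratio = pca_reduce(data)
feature_space, feature_ratio = pca_reduce(data, featurewise=True)
```

The first call yields one point per sample, the second one point per feature.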

singles[source]

Data from only the single cells

static transition_distances(positions, transitions)[source]

Get NMF distance of features between phenotype transitions

Parameters:

positions : pandas.DataFrame

A ((n_features, phenotypes), 2) MultiIndex dataframe of the NMF positions of splicing events for different phenotypes

transitions : list of 2-string tuples

List of (phenotype1, phenotype2) transitions

Returns:

transitions : pandas.DataFrame

A (n_features, n_transitions) DataFrame of the NMF distances of features between different phenotypes
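
One way to picture the computation is shown below, assuming the distance is the Euclidean distance between a feature's two-dimensional NMF positions in consecutive phenotypes. The distance metric, the index level names, the column labels and the phenotype names are all assumptions made for this sketch; the function name is deliberately different from the real static method.

```python
import numpy as np
import pandas as pd

# ((n_features, phenotypes), 2) positions with a (feature, phenotype) MultiIndex
index = pd.MultiIndex.from_product(
    [["event1", "event2"], ["iPSC", "NPC", "MN"]],
    names=["feature", "phenotype"],
)
positions = pd.DataFrame(
    [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0], [1.0, 1.0], [1.0, 1.0], [1.0, 2.0]],
    index=index,
    columns=["x", "y"],
)
transitions = [("iPSC", "NPC"), ("NPC", "MN")]


def euclidean_transition_distances(positions, transitions):
    """Distance each feature moves between the two phenotypes of a transition."""
    distances = {}
    for phenotype1, phenotype2 in transitions:
        start = positions.xs(phenotype1, level="phenotype")
        end = positions.xs(phenotype2, level="phenotype")
        delta = end - start  # aligned on the feature index
        distances[f"{phenotype1}->{phenotype2}"] = np.sqrt((delta ** 2).sum(axis=1))
    return pd.DataFrame(distances)


result = euclidean_transition_distances(positions, transitions)
```

The result has one row per feature and one column per transition, matching the documented (n_features, n_transitions) shape.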

variant[source]

Features whose variance is 2 std devs away from mean variance
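
A sketch of this selection rule in pandas, assuming "away from" means above the mean feature variance. The helper most_variant and the toy data are invented for this example; the std_multiplier argument plays the same role as the parameter of that name in plot_dimensionality_reduction.

```python
import pandas as pd


def most_variant(data, std_multiplier=2):
    """Features whose variance exceeds mean + std_multiplier * std of all variances."""
    variances = data.var(axis=0)
    cutoff = variances.mean() + std_multiplier * variances.std()
    return variances[variances > cutoff].index


# Seven quiet features and one highly variable feature (samples on rows)
columns = {f"feature{i}": [0.0, 1.0, 0.0, 1.0] for i in range(7)}
columns["feature7"] = [0.0, 100.0, 0.0, 100.0]
data = pd.DataFrame(columns, index=["cell1", "cell2", "cell3", "cell4"])

variant = most_variant(data)
```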

flotilla.data_model.base.subsets_from_metadata(metadata, minimum, subset_type, ignore=None)[source]

Get subsets from metadata, including boolean and categorical columns

Parameters:

metadata : pandas.DataFrame

The dataframe whose columns to use to create subsets of the rows

minimum : int

Minimum number of rows required for a column or group in the column to be included

subset_type : str

The name of the kind of subset. e.g. “samples” or “features”

ignore : list-like

List of columns to ignore

Returns:

subsets : dict

A name: row_ids mapping of which samples correspond to which group
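
The general idea can be sketched as follows: boolean columns contribute one subset of the rows marked True, and categorical columns contribute one subset per group, each kept only if it has at least minimum rows. The function name, the subset naming scheme ("column: group") and the toy metadata are assumptions for this illustration, and the subset_type argument is omitted.

```python
import pandas as pd


def subsets_sketch(metadata, minimum, ignore=None):
    """Build a name -> row ids mapping from boolean and categorical columns."""
    ignore = set() if ignore is None else set(ignore)
    subsets = {}
    for column in metadata.columns:
        if column in ignore:
            continue
        values = metadata[column].dropna()
        if pd.api.types.is_bool_dtype(values):
            # Boolean column: one subset of the rows marked True
            row_ids = values[values].index.tolist()
            if len(row_ids) >= minimum:
                subsets[column] = row_ids
        else:
            # Categorical column: one subset per group
            for group in values.unique():
                row_ids = values[values == group].index.tolist()
                if len(row_ids) >= minimum:
                    subsets[f"{column}: {group}"] = row_ids
    return subsets


metadata = pd.DataFrame(
    {
        "gene_name": ["A", "B", "C", "D"],
        "housekeeping": [True, True, False, True],
        "biotype": ["protein_coding", "protein_coding", "lncRNA", "protein_coding"],
    },
    index=["gene1", "gene2", "gene3", "gene4"],
)

subsets = subsets_sketch(metadata, minimum=2, ignore=["gene_name"])
```

The single lncRNA row falls below the minimum, so no subset is created for it.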

Olga B. Botvinnik is funded by the NDSEG fellowship and is a NumFOCUS John Hunter Technology Fellow.
Michael T. Lovci was partially funded by a fellowship from Genentech.
Partially funded by NIH grants NS075449 and HG004659 and CIRM grants RB4-06045 and TR3-05676 to Gene Yeo.