flotilla.data_model.base module

Common operations performed on all kinds of data types

class flotilla.data_model.base.BaseData(data, thresh=-inf, minimum_samples=0, feature_data=None, feature_rename_col=None, feature_ignore_subset_cols=None, technical_outliers=None, outliers=None, pooled=None, predictor_config_manager=None, data_type=None)[source]

Bases: object

Base class for biological data measurements. All data types in flotilla inherit from this

Attributes

feature_subsets Dict of feature subset names to their list of feature ids
variant Genes whose variance among all cells is 2 standard deviations away from the mean variance
data (pandas.DataFrame) A (n_samples, m_features) sized DataFrame of filtered input data, with features with too few samples (minimum_samples) detected at thresh removed. Compared to data_original, m_features <= n_features
data_type (str) String indicating what kind of data this is, e.g. “splicing” or “expression”
data_original (pandas.DataFrame) A (n_samples, n_features) sized DataFrame of all input data, before removing features for having too few samples
feature_data (pandas.DataFrame) A (k_features, n_features_about_features) sized DataFrame of features about the feature data. Notice that this DataFrame does not need to be the same size as the data, but must at least include all the features from data. Compared to data, k_features >= m_features
predictor_config_manager (PredictorConfigManager) Manages different combinations of predictors on different data subtypes

Methods

maybe_renamed_to_feature_id(feature_id) To be able to give a simple gene name, e.g. “RBFOX2” and get the official ENSG ids or MISO ids
feature_renamer If feature_rename_col is specified in BaseData.__init__(), this will rename the feature ID to a new name. If feature_rename_col is not specified, then this will return the original id

Abstract base class for biological measurements

Parameters:

data : pandas.DataFrame

A samples x features (samples on rows, features on columns) dataframe with some kind of measurements of cells, e.g. gene expression values such as TPM, RPKM or FPKM, alternative splicing “Percent-spliced-in” (PSI) values, or RNA editing scores. Note: If the columns are a multi-index, the “level 0” is assumed to be the unique, crazy ID like ‘ENSG00000100320’, and “level 1” is assumed to be the convenient gene name like “RBFOX2”

thresh : float, optional (default=-np.inf)

Minimum value to accept for this data.

minimum_samples : int, optional (default=0)

Minimum number of samples with values greater than thresh. E.g., for use with “at least 3 single cells expressing the gene at greater than 1 TPM.”

feature_data : pandas.DataFrame, optional (default=None)

A features x attributes dataframe of metadata about the features, e.g. annotating whether the gene is a housekeeping gene

feature_rename_col : str, optional (default=None)

Which column in the feature_data to use to rename feature IDs from a crazy ID to a common gene symbol, e.g. to transform ‘ENSG00000100320’ into ‘RBFOX2’

feature_ignore_subset_cols : list-like (default=None)

Columns in the feature data to ignore when making subsets, e.g. “gene_name” shouldn’t be used to create subsets, since each gene name applies to only a small number of features.

technical_outliers : list-like, optional (default=None)

List of sample IDs which should be completely ignored because they didn’t pass the technical quality control

outliers : list-like, optional (default=None)

List of sample IDs which should be marked as outliers for plotting and interpretation purposes

pooled : list-like, optional (default=None)

List of sample IDs which should be marked as pooled for plotting and interpretation purposes.

predictor_config_manager : PredictorConfigManager, optional (default=None)

Object used to organize inputs to compute.predict.Regressor and compute.predict.Classifier. If None, one is initialized for this instance.

data_type : str, optional (default=None)

A string indicating what kind of data this is, e.g. “expression” or “splicing”

Notes

Any cells not marked as “technical_outliers”, “outliers” or “pooled” are considered as single-cell samples.
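
The thresh and minimum_samples filtering described above can be sketched in plain pandas. This is a minimal illustration, not flotilla's own code: the helper name filter_features and the toy data are hypothetical, and it assumes "greater than thresh" is a strict comparison and that a feature is kept when at least minimum_samples samples pass it.

```python
import numpy as np
import pandas as pd


def filter_features(data, thresh=-np.inf, minimum_samples=0):
    """Hypothetical helper: keep features (columns) whose value is greater
    than `thresh` in at least `minimum_samples` samples (rows)."""
    # Count, per feature, how many samples exceed the threshold
    n_detected = (data > thresh).sum(axis=0)
    return data.loc[:, n_detected >= minimum_samples]


# Toy samples x features table (values stand in for e.g. TPM)
data_original = pd.DataFrame(
    {'gene_a': [0.0, 2.0, 5.0, 3.0],
     'gene_b': [0.0, 0.5, 4.0, 0.0],
     'gene_c': [1.5, 2.5, 0.0, 9.0]},
    index=['cell1', 'cell2', 'cell3', 'cell4'])

# "At least 3 single cells expressing the gene at greater than 1 TPM"
data = filter_features(data_original, thresh=1, minimum_samples=3)
print(list(data.columns))
```

Here gene_b exceeds the threshold in only one cell, so it is dropped, which matches the relation m_features <= n_features given for data and data_original.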

big_nmf_space_transitions(groupby, phenotype_transitions)[source]
binify(data, bins=None)[source]
binned_nmf_reduced(*args, **kwargs)[source]
classify(trait, sample_ids, feature_ids, standardize=True, data_name='expression', predictor_name='ExtraTreesClassifier', predictor_obj=None, predictor_scoring_fun=None, score_cutoff_fun=None, n_features_dependent_kwargs=None, constant_kwargs=None, plotting_kwargs=None, color=None, groupby=None, label_to_color=None, label_to_marker=None, order=None, bins=None)[source]

Make and memoize a predictor on a categorical trait (associated with samples), using a subset of genes

Parameters:

trait : pandas.Series

samples x categorical feature

sample_ids : None or list of strings

If None, all sample ids will be used, else only the sample ids specified

feature_ids : None or list of strings

If None, all features will be used, else only the features specified

standardize : bool

Whether or not to standardize the data (mean-center each variable and scale it to unit variance) via sklearn.preprocessing.StandardScaler

predictor : flotilla.visualize.predict classifier

Must inherit from flotilla.visualize.PredictorBaseViz. Default is flotilla.visualize.predict.ClassifierViz

predictor_kwargs : dict or None

Additional ‘keyword arguments’ to supply to the predictor class

predictor_scoring_fun : function

Function to get the feature scores for a scikit-learn classifier. This can be different for different classifiers, e.g. for a classifier named “x” it could be x.scores_, for others it’s x.feature_importances_. Default: lambda x: x.feature_importances_

score_cutoff_fun : function

Function to cut off insignificant scores. Default: lambda scores: np.mean(scores) + 2 * np.std(scores)

Returns:

predictor : flotilla.compute.predict.PredictorBaseViz

A ready-to-plot object containing the predictions
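
To make the default score_cutoff_fun concrete, the sketch below applies it to a made-up array of feature scores. The scores are invented for illustration, and reading the cutoff as "keep features scoring above it" is an assumption rather than something this page states.

```python
import numpy as np

# Default cutoff as documented: mean plus two standard deviations
score_cutoff_fun = lambda scores: np.mean(scores) + 2 * np.std(scores)

# Made-up feature scores: nine uninformative features and one strong one
scores = np.array([0.01] * 9 + [0.91])

cutoff = score_cutoff_fun(scores)
# Assumption: features scoring above the cutoff count as significant
selected = np.flatnonzero(scores > cutoff)
print(cutoff, selected)
```

Note that np.std computes the population standard deviation (ddof=0) by default, so the cutoff is slightly lower than one computed with pandas' default sample standard deviation.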

detect_outliers(sample_ids=None, feature_ids=None, featurewise=False, reducer=None, standardize=True, reducer_kwargs=None, bins=None, outlier_detection_method=None, outlier_detection_method_kwargs=None)[source]
drop_outliers(data, outliers)[source]

Remove outlier samples from this data

feature_renamer_series[source]

A pandas Series of the original feature ids to the renamed ids

feature_subset_to_feature_ids(feature_subset, rename=True)[source]

Convert a feature subset name to a list of feature ids

feature_subsets[source]

Dict of feature subset names to their list of feature ids

maybe_renamed_to_feature_id(feature_id)[source]

To be able to give a simple gene name, e.g. “RBFOX2” and get the official ENSG ids or MISO ids

Parameters:

feature_id : str

The name of a feature ID. Could be either a common gene name, i.e. what feature_renamer() renames the crazy IDs to, or

Returns:

feature_id : str or list-like

Valid Feature ID(s) that can be used to subset self.data
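
The relationship between feature_rename_col, feature_renamer_series and maybe_renamed_to_feature_id can be sketched with a plain pandas Series. The function names rename and to_feature_ids, the column name gene_name and the second feature ID are hypothetical; only the ENSG00000100320 / RBFOX2 pair comes from this page, and flotilla's actual lookup may differ.

```python
import pandas as pd

# Toy feature metadata indexed by feature ID; 'gene_name' plays the
# role of feature_rename_col
feature_data = pd.DataFrame(
    {'gene_name': ['RBFOX2', 'GENE2']},
    index=['ENSG00000100320', 'hypothetical_id_2'])

# Original feature ID -> renamed ID
renamer_series = feature_data['gene_name']


def rename(feature_id):
    """Return the common name if one is known, else the original ID."""
    return renamer_series.get(feature_id, feature_id)


def to_feature_ids(name):
    """Map a common name back to the feature ID(s) that carry it."""
    if name in renamer_series.index:
        # Already a valid feature ID
        return name
    return renamer_series[renamer_series == name].index.tolist()


print(rename('ENSG00000100320'))
print(to_feature_ids('RBFOX2'))
```

The reverse lookup returns a list because several feature IDs may share one gene name, which is consistent with the "str or list-like" return type documented above.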

nmf_space_positions(groupby, min_samples_per_group=5)[source]
outliers[source]

Data from only the outlier samples

plot_big_nmf_space_transitions(phenotype_groupby, phenotype_transitions, phenotype_order, color, phenotype_to_color, phenotype_to_marker)[source]
plot_classifier(trait, sample_ids=None, feature_ids=None, predictor_name=None, standardize=True, score_coefficient=None, data_name=None, groupby=None, label_to_color=None, label_to_marker=None, order=None, color=None, **plotting_kwargs)[source]

Classify samples on boolean or categorical traits

Parameters:

trait : pandas.Series

A (n_samples,) series of categorical features. Must have the same index as data

sample_ids : list-like, optional (default=None)

Which samples to use to classify

feature_ids : list-like, optional (default=None)

Which features to use

predictor_name : str

Name of the predictor to use, in predictor_config_manager

standardize : bool, optional (default=True)

If True, mean-center the data so the mean of all features is 0, and divide by the standard deviation so the standard deviation of all features is 1. This allows us to compare lowly expressed features and highly expressed features on the same playing field

data_name : str, optional (default=None)

Name for this subset of the data

groupby : mappable, optional (default=None)

Map each sample id to a group, such as a phenotype label

label_to_color : dict, optional (default=None)

For each phenotype label, assign a color

label_to_marker : dict, optional (default=None)

For each phenotype label, assign a plotting marker symbol/shape

order : list, optional (default=None)

For violinplots, the order of the phenotype groups

color : list, optional (default=None)

For violinplots, the colors of the phenotypes in their order

plotting_kwargs : other keyword arguments

All other keyword arguments are passed to Classifier.__call__(), which passes them to DecompositionViz.__call__()

Returns:

self : BaseData
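
The effect of standardize=True described above can be illustrated without scikit-learn. This is a sketch in plain pandas with a hypothetical helper name and toy data; it uses the population standard deviation (ddof=0), which is what sklearn.preprocessing.StandardScaler uses.

```python
import pandas as pd


def standardize(data):
    """Hypothetical helper: mean-center each feature (column) and divide
    by its population standard deviation, so every feature has mean 0
    and standard deviation 1."""
    centered = data - data.mean(axis=0)
    return centered / data.std(axis=0, ddof=0)


# A lowly expressed and a highly expressed feature with the same pattern
data = pd.DataFrame(
    {'low_gene': [1.0, 2.0, 3.0],
     'high_gene': [100.0, 200.0, 300.0]},
    index=['cell1', 'cell2', 'cell3'])

standardized = standardize(data)
print(standardized)
```

After standardization the two columns are identical, which is the "same playing field" the parameter description refers to.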

plot_dimensionality_reduction(x_pc=1, y_pc=2, sample_ids=None, feature_ids=None, featurewise=False, reducer=None, plot_violins=True, groupby=None, label_to_color=None, label_to_marker=None, order=None, reduce_kwargs=None, title='', **plotting_kwargs)[source]

Principal component-like analysis of measurements

Parameters:

x_pc : int, optional (default=1)

Which principal component to plot on the x-axis

y_pc : int, optional (default=2)

Which principal component to plot on the y-axis

sample_ids : list, optional (default=None)

If None, plot all the samples. If a list of strings, must be valid sample ids of the data.

feature_ids : list, optional (default=None)

If None, plot all the features. If a list of strings, perform and plot dimensionality reduction on only these feature ids

featurewise : bool, optional (default=False)

Whether to keep the features and reduce on the samples (default is to keep the samples and reduce the features)

reducer : DataFrameReducerBase, optional (default=DataFramePCA)

Which decomposition object to use. Must be a child of DataFrameReducerBase as this has built-in compatibility with pandas.DataFrames.

plot_violins : bool, optional (default=True)

If True, plot the violinplots of the top features. This can take a long time, so to save time you can turn it off if you just want a quick look at the PCA.

groupby : mappable, optional (default=None)

Map each sample id to a group, such as a phenotype label

label_to_color : dict, optional (default=None)

For each phenotype label, assign a color

label_to_marker : dict, optional (default=None)

For each phenotype label, assign a plotting marker symbol/shape

order : list, optional (default=None)

For violinplots, the order of the phenotype groups

color : list, optional (default=None)

For violinplots, the colors of the phenotypes in their order

plotting_kwargs : other keyword arguments

All other keyword arguments are passed to DecompositionViz.__call__()

Returns:

viz : DecompositionViz

Object with plotted dimensionality reduction

plot_feature(feature_id, sample_ids=None, phenotype_groupby=None, phenotype_order=None, color=None, phenotype_to_color=None, phenotype_to_marker=None, xlabel=None, ylabel=None, nmf_space=False)[source]

Plot the violinplot of a splicing event (should also show NMF movement)

plot_nmf_space_transitions(feature_id, groupby, phenotype_to_color, phenotype_to_marker, order, ax=None, xlabel=None, ylabel=None)[source]
plot_outliers(reducer, outlier_detector, **pca_args)[source]
plot_pca(**kwargs)[source]

Call plot_dimensionality_reduction with PCA specifically

plot_two_features(feature1, feature2, groupby=None, label_to_color=None, **kwargs)[source]

Plot the values of two features

plot_two_samples(sample1, sample2, **kwargs)[source]

Parameters:

sample1 : str

Name of the sample to plot on the x-axis

sample2 : str

Name of the sample to plot on the y-axis

kwargs : other keyword arguments

Any other keyword arguments valid for seaborn.jointplot

Returns:

jointgrid : seaborn.axisgrid.JointGrid

Returns a JointGrid instance

See Also

seaborn.jointplot

pooled[source]

Data from only the pooled samples

reduce(sample_ids=None, feature_ids=None, featurewise=False, reducer=None, standardize=None, reducer_kwargs=None, bins=None)[source]

Make and memoize a reduced dimensionality representation of data

Parameters:

data : pandas.DataFrame

samples x features data to reduce

sample_ids : None or list of strings

If None, all sample ids will be used, else only the sample ids specified

feature_ids : None or list of strings

If None, all features will be used, else only the features specified

featurewise : bool

Whether or not to use the features as the “samples”, e.g. if you want to reduce the features into “sample-space” instead of reducing the samples into “feature-space”

standardize : bool

Whether or not to standardize the data (mean-center each variable and scale it to unit variance) via sklearn.preprocessing.StandardScaler

title : str

Title of the plot

reducer_kwargs : dict

Any additional arguments to send to the reducer

Returns:

reducer_object : flotilla.compute.reduce.ReducerViz

A ready-to-plot object containing the reduced space
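
The featurewise switch is easiest to see on a small example. The sketch below is a stand-in that runs PCA through numpy's SVD; it is not flotilla's DataFramePCA, and the function name pca_reduce, the n_components argument and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd


def pca_reduce(data, n_components=2, featurewise=False):
    """Hypothetical stand-in for a reducer: PCA via SVD on a
    samples x features DataFrame."""
    if featurewise:
        # Treat the features as the "samples" by transposing
        data = data.T
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered.values, full_matrices=False)
    # Coordinates of each row in the space of the leading components
    scores = u[:, :n_components] * s[:n_components]
    columns = ['pc_{}'.format(i + 1) for i in range(n_components)]
    return pd.DataFrame(scores, index=data.index, columns=columns)


data = pd.DataFrame(
    {'gene_a': [1.0, 2.0, 3.0, 4.0],
     'gene_b': [2.0, 1.0, 4.0, 3.0],
     'gene_c': [0.0, 5.0, 1.0, 6.0]},
    index=['cell1', 'cell2', 'cell3', 'cell4'])

sample_space = pca_reduce(data)                     # one row per sample
feature_space = pca_reduce(data, featurewise=True)  # one row per feature
print(sample_space.shape, feature_space.shape)
```

With featurewise=False the result has one row per sample; with featurewise=True it has one row per feature.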

singles[source]

Data from only the single cells
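
How the technical_outliers, outliers and pooled sample lists could partition the data into the outliers, pooled and singles views is sketched below. This follows the note that any cell not marked in one of those lists is treated as a single cell; the sample IDs and data are invented, and the exact behaviour inside flotilla is an assumption.

```python
import pandas as pd

data = pd.DataFrame(
    {'gene_a': [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=['cell1', 'cell2', 'cell3', 'pool1', 'bad1'])

technical_outliers = ['bad1']   # failed technical quality control
outliers = ['cell3']
pooled = ['pool1']

# Technical outliers are ignored entirely
usable = data.drop(index=technical_outliers)

outlier_data = usable.loc[usable.index.intersection(outliers)]
pooled_data = usable.loc[usable.index.intersection(pooled)]

# Everything left over is considered a single-cell sample
single_ids = usable.index.difference(outliers).difference(pooled)
singles = usable.loc[single_ids]
print(list(singles.index))
```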

static transition_distances(df, transitions)[source]
variant[source]

Genes whose variance among all cells is 2 standard deviations away from the mean variance
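
One way to read this definition is sketched below: compute each feature's variance across samples, then keep features whose variance lies more than two standard deviations above the mean of those variances. Treating "away from" as "above", the helper name variant_features and the toy data are assumptions, not flotilla's confirmed behaviour.

```python
import pandas as pd


def variant_features(data, n_std=2):
    """Hypothetical helper: features whose variance across samples is
    more than `n_std` standard deviations above the mean variance."""
    variances = data.var(axis=0)
    cutoff = variances.mean() + n_std * variances.std()
    return variances.index[variances > cutoff].tolist()


# Seven low-variance features and one highly variable feature
columns = {'gene_{}'.format(i): [1, 2, 1, 2] for i in range(7)}
columns['gene_hv'] = [0, 100, 0, 100]
data = pd.DataFrame(columns, index=['cell1', 'cell2', 'cell3', 'cell4'])

variant = variant_features(data)
print(variant)
```

With very few features no variance can reach a two-standard-deviation cutoff, so a criterion like this only becomes meaningful on reasonably wide tables.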

flotilla.data_model.base.subsets_from_metadata(metadata, minimum, subset_type, ignore=None)[source]

Parameters:

metadata : pandas.DataFrame

The dataframe whose columns to use to create subsets of the rows

minimum : int

Minimum number of rows required for a column or group in the column to be included

subset_type : str

The name of the kind of subset. e.g. “samples” or “features”

ignore : list-like

List of columns to ignore

Returns:

subsets : dict

A name: row_ids mapping of which samples correspond to which group
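
The idea behind subsets_from_metadata can be sketched as follows. This is a simplified illustration under stated assumptions rather than the library's implementation: the helper name subsets_from_columns, the "column: value" naming of subsets and the toy metadata are hypothetical, and the subset_type argument is omitted.

```python
import pandas as pd


def subsets_from_columns(metadata, minimum, ignore=None):
    """Hypothetical helper: for each column not in `ignore`, group the
    rows by value and keep groups with at least `minimum` rows."""
    ignore = set() if ignore is None else set(ignore)
    subsets = {}
    for column in metadata.columns:
        if column in ignore:
            continue
        for value, row_ids in metadata.groupby(column).groups.items():
            if len(row_ids) >= minimum:
                subsets['{}: {}'.format(column, value)] = list(row_ids)
    return subsets


metadata = pd.DataFrame(
    {'gene_type': ['protein_coding', 'protein_coding', 'lncRNA',
                   'protein_coding'],
     'gene_name': ['A', 'B', 'C', 'D']},
    index=['id1', 'id2', 'id3', 'id4'])

subsets = subsets_from_columns(metadata, minimum=2, ignore=['gene_name'])
print(subsets)
```

Ignoring gene_name here mirrors the feature_ignore_subset_cols example: every gene name is unique, so grouping on it would yield no usable subsets.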

Olga B. Botvinnik is funded by the NDSEG fellowship and is a NumFOCUS John Hunter Technology Fellow.
Michael T. Lovci was partially funded by a fellowship from Genentech.
Partially funded by NIH grants NS075449 and HG004659 and CIRM grants RB4-06045 and TR3-05676 to Gene Yeo.