flotilla.data_model.base module

Common operations performed on all kinds of data types

class flotilla.data_model.base.BaseData(data, thresh=-inf, minimum_samples=0, feature_data=None, feature_rename_col=None, feature_ignore_subset_cols=None, technical_outliers=None, outliers=None, pooled=None, predictor_config_manager=None, data_type=None)

Bases: object

Base class for biological data measurements. All data types in flotilla inherit from this

Attributes

`feature_subsets`	Dict of feature subset names to their list of feature ids
`variant`	Genes whose variance among all cells is 2 standard deviations away

data	(pandas.DataFrame) A (n_samples, m_features) sized DataFrame of filtered input data, with features with too few samples (`minimum_samples`) detected at `thresh` removed. Compared to `data_original`, `m_features <= n_features`
data_type	(str) String indicating what kind of data this is, e.g. “splicing” or “expression”
data_original	(pandas.DataFrame) A (n_samples, n_features) sized DataFrame of all input data, before removing features for having too few samples
feature_data	(pandas.DataFrame) A (k_features, n_features_about_features) sized DataFrame of features about the feature data. Notice that this DataFrame does not need to be the same size as the data, but must at least include all the features from `data`. Compared to `data`, `k_features >= m_features`
predictor_config_manager	(PredictorConfigManager) Manage different combinations of predictor on different data subtypes

Methods

maybe_renamed_to_feature_id(feature_id) To be able to give a simple gene name, e.g. “RBFOX2” and get the official ENSG ids or MISO ids

feature_renamer If feature_rename_col is specified in BaseData.__init__(), this will rename the feature ID to a new name. If feature_rename_col is not specified, then this will return the original id

Abstract base class for biological measurements

Parameters:


data : pandas.DataFrame

A samples x features (samples on rows, features on columns) dataframe with some kind of measurements of cells, e.g. gene expression values such as TPM, RPKM or FPKM, alternative splicing “Percent-spliced-in” (PSI) values, or RNA editing scores. Note: If the columns are a multi-index, the “level 0” is assumed to be the unique, crazy ID like ‘ENSG00000100320’, and “level 1” is assumed to be the convenient gene name like “RBFOX2”

thresh : float, optional (default=-np.inf)

Minimum value to accept for this data.

minimum_samples : int, optional (default=0)

Minimum number of samples with values greater than thresh. E.g., for use with “at least 3 single cells expressing the gene at greater than 1 TPM.”

feature_data : pandas.DataFrame, optional (default=None)

A features x attributes dataframe of metadata about the features, e.g. annotating whether the gene is a housekeeping gene

feature_rename_col : str, optional (default=None)

Which column in the feature_data to use to rename feature IDs from a crazy ID to a common gene symbol, e.g. to transform ‘ENSG00000100320’ into ‘RBFOX2’

feature_ignore_subset_cols : list-like, optional (default=None)

Columns in the feature data to ignore when making subsets, e.g. “gene_name” shouldn’t be used to create subsets, since it’s just a small number of them.

technical_outliers : list-like, optional (default=None)

List of sample IDs which should be completely ignored because they didn’t pass the technical quality control

outliers : list-like, optional (default=None)

List of sample IDs which should be marked as outliers for plotting and interpretation purposes

pooled : list-like, optional (default=None)

List of sample IDs which should be marked as pooled for plotting and interpretation purposes.

predictor_config_manager : PredictorConfigManager, optional (default=None)

Object used to organize inputs to compute.predict.Regressor and compute.predict.Classifier. If None, one is initialized for this instance.

data_type : str, optional (default=None)

A string indicating what kind of data this is, e.g. “expression” or “splicing”

Notes

Any cells not marked as “technical_outliers”, “outliers” or “pooled” are considered as single-cell samples.
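The interaction between thresh and minimum_samples can be sketched in plain pandas. This is a minimal illustration, assuming a feature is kept when at least minimum_samples samples have a value greater than thresh; the helper name filter_features and the toy values are hypothetical and are not part of flotilla.

```python
import numpy as np
import pandas as pd


def filter_features(data, thresh=-np.inf, minimum_samples=0):
    """Keep features detected above `thresh` in at least `minimum_samples` samples.

    Hypothetical helper for illustration only; not flotilla's implementation.
    """
    # Count, per feature (column), the samples whose value exceeds `thresh`.
    n_detected = (data > thresh).sum(axis=0)
    # Keep only the columns with enough detected samples.
    return data.loc[:, n_detected >= minimum_samples]


# Toy samples x features table (made-up values).
data_original = pd.DataFrame(
    {
        "gene_a": [5.0, 2.0, 3.0, 0.0],
        "gene_b": [0.0, 0.5, 4.0, 0.0],
        "gene_c": [9.0, 8.0, 7.0, 6.0],
    },
    index=["cell_1", "cell_2", "cell_3", "cell_4"],
)

# "At least 3 cells expressing the gene at greater than 1"
filtered = filter_features(data_original, thresh=1, minimum_samples=3)
print(list(filtered.columns))
```

With the defaults (thresh=-np.inf, minimum_samples=0) every feature passes in this sketch, which is consistent with `m_features <= n_features`.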

big_nmf_space_transitions(groupby, phenotype_transitions)

binify(data, bins=None)

binned_nmf_reduced(*args, **kwargs)

classify(trait, sample_ids, feature_ids, standardize=True, data_name='expression', predictor_name='ExtraTreesClassifier', predictor_obj=None, predictor_scoring_fun=None, score_cutoff_fun=None, n_features_dependent_kwargs=None, constant_kwargs=None, plotting_kwargs=None, color=None, groupby=None, label_to_color=None, label_to_marker=None, order=None, bins=None)

Make and memoize a predictor on a categorical trait (associated with samples), using a subset of genes

Parameters:


trait : pandas.Series

samples x categorical feature

sample_ids : None or list of strings

If None, all sample ids will be used, else only the sample ids specified

feature_ids : None or list of strings

If None, all features will be used, else only the features specified

standardize : bool

Whether or not to mean-center all the data and scale it to unit variance via sklearn.preprocessing.StandardScaler. Note that this standardizes each variable independently; it does not decorrelate the variables.

predictor : flotilla.visualize.predict classifier

Must inherit from flotilla.visualize.PredictorBaseViz. Default is flotilla.visualize.predict.ClassifierViz

predictor_kwargs : dict or None

Additional ‘keyword arguments’ to supply to the predictor class

predictor_scoring_fun : function

Function to get the feature scores for a scikit-learn classifier. This can be different for different classifiers, e.g. for a classifier named “x” it could be x.scores_, for other it’s x.feature_importances_. Default: lambda x: x.feature_importances_

score_cutoff_fun : function

Function to cut off insignificant scores. Default: lambda scores: np.mean(scores) + 2 * np.std(scores)

Returns:

predictor : flotilla.compute.predict.PredictorBaseViz

A ready-to-plot object containing the predictions
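The default score_cutoff_fun described above (mean plus two standard deviations) can be sketched as follows. The scores here are made-up numbers standing in for a fitted classifier's feature_importances_, and the names score_cutoff and important are hypothetical; this does not call flotilla or scikit-learn.

```python
import numpy as np
import pandas as pd


def score_cutoff(scores):
    """Cutoff at the mean plus two (population) standard deviations."""
    values = np.asarray(scores, dtype=float)
    return np.mean(values) + 2 * np.std(values)


# Made-up feature scores: nine uninformative features and one strong one.
scores = pd.Series(
    [0.01] * 9 + [0.91],
    index=["feature_{}".format(i) for i in range(10)],
)

cutoff = score_cutoff(scores)
# Features whose score exceeds the cutoff are treated as "significant".
important = scores[scores > cutoff].index.tolist()
print(cutoff, important)
```

A cutoff of this form only separates features well when a few scores stand far above the rest; with evenly spread scores, few or no features would pass.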

detect_outliers(sample_ids=None, feature_ids=None, featurewise=False, reducer=None, standardize=True, reducer_kwargs=None, bins=None, outlier_detection_method=None, outlier_detection_method_kwargs=None)

drop_outliers(data, outliers): Remove outlier samples from this data

feature_renamer_series: A pandas Series of the original feature ids to the renamed ids

feature_subset_to_feature_ids(feature_subset, rename=True): Convert a feature subset name to a list of feature ids

feature_subsets: Dict of feature subset names to their list of feature ids

maybe_renamed_to_feature_id(feature_id)

To be able to give a simple gene name, e.g. “RBFOX2” and get the official ENSG ids or MISO ids

Parameters:


feature_id : str

The name of a feature ID. Could be either a common gene name (the name that feature_renamer() maps the crazy IDs to) or an original feature ID

Returns:

feature_id : str or list-like

Valid Feature ID(s) that can be used to subset self.data
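The lookup from a common gene name back to original feature IDs can be sketched with a plain pandas Series shaped like feature_renamer_series (original id to renamed id). This is an illustration under assumptions, not flotilla's code: the helper to_feature_ids, the IDs "ID_1" and "ID_2", and the name "GENE_X" are hypothetical.

```python
import pandas as pd

# Hypothetical renamer: original feature id -> renamed (common) id.
feature_renamer_series = pd.Series(
    {
        "ENSG00000100320": "RBFOX2",
        "ID_1": "GENE_X",  # made-up IDs for illustration
        "ID_2": "GENE_X",  # two original IDs may share one common name
    }
)


def to_feature_ids(name, renamer):
    """Return the original feature id(s) that `name` refers to."""
    # Original IDs whose renamed value equals `name`.
    matches = renamer.index[renamer == name].tolist()
    if matches:
        return matches
    # Otherwise accept `name` if it is already an original ID.
    if name in renamer.index:
        return [name]
    raise KeyError(name)


print(to_feature_ids("RBFOX2", feature_renamer_series))
print(to_feature_ids("GENE_X", feature_renamer_series))
```

Returning a list in every case keeps the result usable for column selection whether one or several original IDs share the same common name.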

nmf_space_positions(groupby, min_samples_per_group=5)

outliers: Data from only the outlier samples

plot_big_nmf_space_transitions(phenotype_groupby, phenotype_transitions, phenotype_order, color, phenotype_to_color, phenotype_to_marker)

plot_classifier(trait, sample_ids=None, feature_ids=None, predictor_name=None, standardize=True, score_coefficient=None, data_name=None, groupby=None, label_to_color=None, label_to_marker=None, order=None, color=None, **plotting_kwargs)

Classify samples on boolean or categorical traits

Parameters:


trait : pandas.Series

A (n_samples,) series of categorical features. Must have the same index as data

sample_ids : list-like, optional (default=None)

Which samples to use to classify

feature_ids : list-like, optional (default=None)

Which features to use

predictor_name : str

Name of the predictor to use, in predictor_config_manager

standardize : bool, optional (default=True)

If True, mean-center the data so the mean of all features is 0, and divide by the standard deviation so the standard deviation of all features is 1. This allows us to compare lowly expressed features and highly expressed features on the same playing field

data_name : str, optional (default=None)

Name for this subset of the data

groupby : mappable, optional (default=None)

Map each sample id to a group, such as a phenotype label

label_to_color : dict, optional (default=None)

For each phenotype label, assign a color

label_to_marker : dict, optional (default=None)

For each phenotype label, assign a plotting marker symbol/shape

order : list, optional (default=None)

For violinplots, the order of the phenotype groups

color : list, optional (default=None)

For violinplots, the colors of the phenotypes in their order

plotting_kwargs : other keyword arguments

All other keyword arguments are passed to Classifier.__call__(), which passes them to DecompositionViz.__call__()

Returns:

self : BaseData
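What standardize=True does to the data, as described above, can be sketched in pandas. The table below is made up, and the arithmetic is written out by hand rather than calling flotilla; it uses the population standard deviation (ddof=0), which is also what sklearn.preprocessing.StandardScaler uses.

```python
import pandas as pd

# Toy samples x features table: one lowly and one highly expressed feature.
data = pd.DataFrame(
    {
        "low_gene": [1.0, 2.0, 3.0],
        "high_gene": [1000.0, 2000.0, 3000.0],
    },
    index=["cell_1", "cell_2", "cell_3"],
)

# Mean-center each feature, then divide by its standard deviation
# (ddof=0, i.e. the population standard deviation).
standardized = (data - data.mean(axis=0)) / data.std(axis=0, ddof=0)
print(standardized)
```

After standardization the two features have identical values even though their raw scales differ by a factor of 1000, which is what puts them "on the same playing field".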

plot_dimensionality_reduction(x_pc=1, y_pc=2, sample_ids=None, feature_ids=None, featurewise=False, reducer=None, plot_violins=True, groupby=None, label_to_color=None, label_to_marker=None, order=None, reduce_kwargs=None, title='', **plotting_kwargs)

Principal component-like analysis of measurements

Parameters:


x_pc : int, optional (default=1)

Which principal component to plot on the x-axis

y_pc : int, optional (default=2)

Which principal component to plot on the y-axis

sample_ids : list, optional (default=None)

If None, plot all the samples. If a list of strings, must be valid sample ids of the data.

feature_ids : list, optional (default=None)

If None, plot all the features. If a list of strings, perform and plot dimensionality reduction on only these feature ids

featurewise : bool, optional (default=False)

Whether to keep the features and reduce on the samples (default is to keep the samples and reduce the features)

reducer : DataFrameReducerBase, optional (default=DataFramePCA)

Which decomposition object to use. Must be a child of DataFrameReducerBase as this has built-in compatibility with pandas.DataFrames.

plot_violins : bool, optional (default=True)

If True, plot the violinplots of the top features. This can take a long time, so to save time you can turn it off if you just want a quick look at the PCA.

groupby : mappable, optional (default=None)

Map each sample id to a group, such as a phenotype label

label_to_color : dict, optional (default=None)

For each phenotype label, assign a color

label_to_marker : dict, optional (default=None)

For each phenotype label, assign a plotting marker symbol/shape

order : list, optional (default=None)

For violinplots, the order of the phenotype groups

color : list, optional (default=None)

For violinplots, the colors of the phenotypes in their order

plotting_kwargs : other keyword arguments

All other keyword arguments are passed to DecompositionViz.__call__()

Returns:

viz : DecompositionViz

Object with plotted dimensionality reduction
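The roles of x_pc, y_pc and featurewise can be sketched with a small PCA written directly on numpy's SVD. This is an illustration only: it assumes the component numbers are 1-indexed (as the defaults x_pc=1, y_pc=2 suggest), the helper reduce_to_components is hypothetical, and it does not use flotilla's DataFramePCA or produce any plot.

```python
import numpy as np
import pandas as pd


def reduce_to_components(data, x_pc=1, y_pc=2, featurewise=False):
    """Project the rows of `data` onto two principal components (1-indexed)."""
    # featurewise=True keeps the features as the points to plot,
    # so the features must be on the rows.
    matrix = data.T if featurewise else data
    centered = matrix - matrix.mean(axis=0)
    # SVD of the centered matrix; U * S gives the component scores,
    # ordered from the largest to the smallest singular value.
    u, s, _ = np.linalg.svd(centered.values, full_matrices=False)
    scores = u * s
    return pd.DataFrame(
        scores[:, [x_pc - 1, y_pc - 1]],
        index=matrix.index,
        columns=["pc_{}".format(x_pc), "pc_{}".format(y_pc)],
    )


# Random toy data: 6 samples x 4 features.
rng = np.random.RandomState(0)
data = pd.DataFrame(
    rng.rand(6, 4),
    index=["sample_{}".format(i) for i in range(6)],
    columns=["feature_{}".format(j) for j in range(4)],
)

sample_space = reduce_to_components(data)                     # one row per sample
feature_space = reduce_to_components(data, featurewise=True)  # one row per feature
print(sample_space.shape, feature_space.shape)
```

The returned two-column frame is the kind of table a scatter plot of the reduced space would be drawn from, with one point per sample by default and one point per feature when featurewise=True.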

plot_feature(feature_id, sample_ids=None, phenotype_groupby=None, phenotype_order=None, color=None, phenotype_to_color=None, phenotype_to_marker=None, xlabel=None, ylabel=None, nmf_space=False): Plot the violinplot of a splicing event (should also show NMF movement)

plot_nmf_space_transitions(feature_id, groupby, phenotype_to_color, phenotype_to_marker, order, ax=None, xlabel=None, ylabel=None)

plot_outliers(reducer, outlier_detector, **pca_args)

plot_pca(**kwargs): Call plot_dimensionality_reduction with PCA specifically

plot_two_features(feature1, feature2, groupby=None, label_to_color=None, **kwargs): Plot the values of two features

plot_two_samples(sample1, sample2, **kwargs)

Parameters:

sample1 : str

Name of the sample to plot on the x-axis

sample2 : str

Name of the sample to plot on the y-axis

**kwargs : other keyword arguments

Any other keyword arguments valid for seaborn.jointplot

Returns:

jointgrid : seaborn.axisgrid.JointGrid

Returns a JointGrid instance

See Also

seaborn.jointplot