flotilla.data_model package

Module contents

class flotilla.data_model.Study(sample_metadata, version='0.1.0', expression_data=None, expression_feature_data=None, expression_feature_rename_col=None, expression_feature_ignore_subset_cols=None, expression_log_base=None, expression_thresh=-inf, expression_plus_one=False, splicing_data=None, splicing_feature_data=None, splicing_feature_rename_col=None, splicing_feature_ignore_subset_cols=None, splicing_feature_expression_id_col=None, mapping_stats_data=None, mapping_stats_number_mapped_col=None, mapping_stats_min_reads=500000.0, spikein_data=None, spikein_feature_data=None, drop_outliers=True, species=None, gene_ontology_data=None, predictor_config_manager=None, metadata_pooled_col='pooled', metadata_minimum_samples=0, metadata_phenotype_col='phenotype', metadata_phenotype_order=None, metadata_phenotype_to_color=None, metadata_phenotype_to_marker=None, metadata_outlier_col='outlier', license=None, title=None, sources=None, default_sample_subset='all_samples', default_feature_subset='variant')

Bases: object

A biological study, with associated metadata, expression, and splicing data.

Construct a biological study

This class only accepts data, no filenames. All data must already have been read in and exist as Python objects.

Parameters:

sample_metadata : pandas.DataFrame

The only required parameter. Samples as the index, with features as columns. Required column: “phenotype”. If there is a boolean column “pooled”, this will be used to separate pooled from single cells. Similarly, the column “outlier” will be used to separate outlier cells from the rest.

version : str

A string describing the semantic version of the data. Must be in: major.minor.patch format, as the “patch” number will be increased if you change something in the study and then study.save() it. (default “0.1.0”)
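The patch increment described above can be sketched as follows. This is a minimal illustration that assumes a plain major.minor.patch string; `bump_patch` is a hypothetical helper and is not part of flotilla.

```python
def bump_patch(version):
    """Return `version` with its patch number increased by one.

    Hypothetical helper: assumes a plain "major.minor.patch" string.
    """
    parts = version.split(".")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError("version must be in major.minor.patch format")
    major, minor, patch = (int(part) for part in parts)
    return "{}.{}.{}".format(major, minor, patch + 1)


print(bump_patch("0.1.0"))  # 0.1.1
```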

expression_data : pandas.DataFrame

Samples x feature dataframe of gene expression measurements, e.g. from an RNA-Seq or a microarray experiment. Assumed to be log-transformed, i.e. you took the log of it. (default None)

expression_feature_data : pandas.DataFrame

Features x annotations dataframe describing other parameters of the gene expression features, e.g. mapping Ensembl IDs to gene symbols or gene biotypes. (default None)

expression_feature_rename_col : str

A column name in the expression_feature_data dataframe that you’d like to rename the expression features to, in the plots. For example, if your gene IDs are Ensembl IDs, but you want to plot UCSC IDs, make sure the column you want, e.g. “ucsc_id” is in your dataframe and specify that. (default “gene_name”)
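To illustrate the renaming idea, here is a small sketch that uses plain dictionaries in place of a pandas DataFrame. The helper `rename_features` and the fallback to the original ID when no entry exists are assumptions made for illustration, not flotilla's documented behavior.

```python
def rename_features(feature_ids, feature_data, rename_col):
    """Map feature ids to the names stored under `rename_col`.

    `feature_data` is a dict of {feature_id: {column: value}}, standing in
    for a features x annotations dataframe. Ids without an entry are kept
    unchanged (an assumption of this sketch).
    """
    renamed = []
    for feature_id in feature_ids:
        annotations = feature_data.get(feature_id, {})
        renamed.append(annotations.get(rename_col, feature_id))
    return renamed


feature_data = {"ENSG00000100320": {"gene_name": "RBFOX2"}}
print(rename_features(["ENSG00000100320", "unknown_id"], feature_data, "gene_name"))
```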

expression_log_base : float

If you want to log-transform your expression data (and it’s not already log-transformed), use this number as the base of the transform. E.g. expression_log_base=10 will take the log10 of your data. (default None)

expression_thresh : float

Minimum (non log-transformed) expression value. (default -inf)

expression_plus_one : bool

Whether or not to add 1 to the expression data. (default False)
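The three expression options above (threshold, plus one, log base) could interact roughly as in the sketch below. The order of operations and the use of None for below-threshold values are assumptions of this sketch; `transform_expression` is a hypothetical function, not flotilla's implementation.

```python
import math


def transform_expression(values, log_base=None, plus_one=False,
                         thresh=float("-inf")):
    """Threshold raw values, optionally add one, then optionally take a log.

    Assumed order: threshold on the untransformed value, add 1, then log.
    Values below `thresh` become None, standing in for a missing value.
    """
    transformed = []
    for value in values:
        if value < thresh:
            transformed.append(None)
            continue
        if plus_one:
            value = value + 1
        if log_base is not None:
            # math.log raises ValueError for non-positive input, which is
            # one common reason to add 1 before taking the log.
            value = math.log(value, log_base)
        transformed.append(value)
    return transformed
```

For example, `transform_expression([0, 9, 99], log_base=10, plus_one=True)` takes the base-10 log of 1, 10, and 100.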

splicing_data : pandas.DataFrame

Samples x feature dataframe of percent spliced in scores, e.g. as measured by the program MISO. Assumed that these values only fall between 0 and 1.

splicing_feature_data : pandas.DataFrame

features x other_features dataframe describing other parameters of the splicing features, e.g. mapping MISO IDs to Ensembl IDs or gene symbols or transcript types

splicing_feature_rename_col : str

A column name in the splicing_feature_data dataframe that you’d like to rename the splicing features to, in the plots. For example, if your splicing IDs are MISO IDs, but you want to plot Ensembl IDs, make sure the column you want, e.g. “ensembl_id” is in your dataframe and specify that. Default “gene_name”.

splicing_feature_expression_id_col : str

A column name in the splicing_feature_data dataframe that corresponds to the row names of the expression data

mapping_stats_data : pandas.DataFrame

Samples x feature dataframe of mapping stats measurements. Currently, this

mapping_stats_number_mapped_col : str

A column name in the mapping_stats_data which specifies the number of (uniquely or not) mapped reads. Default “Uniquely mapped reads number”

spikein_data : pandas.DataFrame

samples x features DataFrame of spike-in expression values

spikein_feature_data : pandas.DataFrame

Features x other_features dataframe, e.g. of the molecular concentration of particular spikein transcripts

drop_outliers : bool

Whether or not to drop samples indicated as outliers in the sample_metadata from the other data, i.e. with a column named ‘outlier’ in sample_metadata, then remove those samples from expression_data for further analysis
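A minimal sketch of the outlier-dropping idea, using plain dictionaries keyed by sample id rather than pandas DataFrames. The helper name `drop_outlier_samples` and the dictionary layout are assumptions made for illustration.

```python
def drop_outlier_samples(expression, sample_metadata, outlier_col="outlier"):
    """Return the expression rows whose sample is not flagged as an outlier.

    `expression` maps sample id -> measurements; `sample_metadata` maps
    sample id -> {column: value}. Samples without the outlier column are kept.
    """
    outliers = {sample for sample, meta in sample_metadata.items()
                if meta.get(outlier_col, False)}
    return {sample: row for sample, row in expression.items()
            if sample not in outliers}


metadata = {"cell_1": {"outlier": False}, "cell_2": {"outlier": True}}
expression = {"cell_1": [1.0, 2.0], "cell_2": [50.0, 0.0]}
print(sorted(drop_outlier_samples(expression, metadata)))  # ['cell_1']
```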

species : str

Name of the species and genome version, e.g. ‘hg19’ or ‘mm10’.

gene_ontology_data : pandas.DataFrame

Gene ids x ontology categories dataframe used for GO analysis.

metadata_pooled_col : str

Column in metadata_data which specifies as a boolean whether or not this sample was pooled.

big_nmf_space_transitions(phenotype_transitions='all', data_type='splicing', n=0.5)

Splicing events whose change in NMF space is large

By large, we mean that the difference is 2 standard deviations away from the mean

Parameters:

phenotype_transitions : list of length-2 tuples of str

List of (‘phenotype1’, ‘phenotype2’) transitions whose change in distribution you are interested in

data_type : ‘splicing’ | ‘expression’

Which data type to calculate this on. (default=’splicing’)

n : int

Minimum number of samples per phenotype, per event

Returns:

big_transitions : pandas.DataFrame

A (n_events, n_transitions) dataframe of the NMF distances between splicing events

celltype_event_counts

Number of cells in which each event was detected, per celltype

celltype_sizes(data_type='splicing')
compute_expression_splicing_covariance()
default_feature_set_ids = ['not (gene_category: LPS Response)', 'gene_category: LPS Response', 'variant', 'all features', 'gene_category: Housekeeping', 'not (gene_category: Housekeeping)']
default_feature_subsets
default_sample_subsets
detect_outliers(data_type='expression', sample_subset=None, feature_subset=None, featurewise=False, reducer=None, standardize=None, reducer_kwargs=None, bins=None, outlier_detection_method=None, outlier_detection_method_kwargs=None)
drop_outliers()

Propagate the samples marked as “outlier” in the metadata to the other data types

expression_vs_inconsistent_splicing(bins=None)

Percentage of events inconsistent with pooled samples at different expression thresholds

Parameters:

bins : list-like

List of expression cutoffs

Returns:

expression_vs_inconsistent : pd.DataFrame

A (len(bins), n_phenotypes) dataframe of the percentage of events in single cells that are inconsistent with pooled

feature_subset_to_feature_ids(data_type, feature_subset=None, rename=False)

Given a name of a feature subset, get the associated feature ids

Parameters:

data_type : str

A string describing the data type, e.g. “expression”

feature_subset : str

A string describing the subset of data type (must be already calculated)

Returns:

feature_ids : list of strings

List of feature ids from the specified data type

filter_splicing_on_expression(expression_thresh, sample_subset=None)

Filter splicing events on expression values

Parameters:

expression_thresh : float

Minimum expression value, of the original input. E.g. if the original input is already log-transformed, then this threshold is on the log values.

Returns:

psi : pandas.DataFrame

A (n_samples, n_features) dataframe of the splicing (psi) values that pass the expression filter

classmethod from_datapackage(datapackage, datapackage_dir='./', load_species_data=True, species_datapackage_base_url='https://s3-us-west-2.amazonaws.com/flotilla-projects')

Create a study object from a datapackage dictionary

Parameters:

datapackage : dict

Returns:

study : flotilla.Study

Study object

classmethod from_datapackage_file(datapackage_filename, load_species_data=True, species_datapackage_base_url='https://s3-us-west-2.amazonaws.com/flotilla-projects')
classmethod from_datapackage_url(datapackage_url, load_species_data=True, species_datapackage_base_url='https://s3-us-west-2.amazonaws.com/flotilla-projects')

Create a study from a url of a datapackage.json file

Parameters:

datapackage_url : str

HTTP url of a datapackage.json file, following the specification described here: http://dataprotocols.org/data-packages/ and requiring the following data resources: metadata, expression, splicing

species_datapackage_base_url : str

Base URL to fetch species-specific _ and splicing event metadata from. Default ‘https://s3-us-west-2.amazonaws.com/flotilla-projects/’

Returns:

study : Study

A “study” object containing the data described in the datapackage_url file

Raises:

AttributeError

If the datapackage.json file does not contain the required resources of metadata, expression, and splicing.
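The resource requirement can be sketched as a check on the parsed datapackage dictionary. This assumes the usual data package layout of a "resources" list whose entries carry a "name" key; `check_required_resources` is a hypothetical helper, and how flotilla itself performs the check is not shown here. The exception type mirrors the one documented above.

```python
REQUIRED_RESOURCES = ("metadata", "expression", "splicing")


def check_required_resources(datapackage):
    """Raise AttributeError if any required resource name is missing."""
    names = {resource.get("name")
             for resource in datapackage.get("resources", [])}
    missing = [name for name in REQUIRED_RESOURCES if name not in names]
    if missing:
        raise AttributeError(
            "datapackage is missing resources: " + ", ".join(missing))
    return True


package = {"name": "example", "resources": [
    {"name": "metadata"}, {"name": "expression"}, {"name": "splicing"}]}
print(check_required_resources(package))  # True
```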

go_enrichment(feature_ids, background=None, domain=None, p_value_cutoff=1000000, min_feature_size=3, min_background_size=5)

Calculate gene ontology enrichment of provided features

Parameters:

feature_ids : list-like

Features to calculate gene ontology enrichment on

background : list-like, optional

Features to use as the background

domain : str or list, optional

Only calculate GO enrichment for a particular GO category or subset of categories. Valid domains: ‘biological_process’, ‘molecular_function’, ‘cellular_component’

p_value_cutoff : float, optional

Maximum accepted Bonferroni-corrected p-value

min_feature_size : int, optional

Minimum number of features of interest overlapping in a GO Term, to calculate enrichment

min_background_size : int, optional

Minimum number of features in the background overlapping a GO Term

Returns:

enrichment : pandas.DataFrame

A (go_categories, columns) dataframe showing the GO enrichment categories that were enriched in the features

initializers = {'splicing_data': <class 'flotilla.data_model.splicing.SplicingData'>, 'expression_data': <class 'flotilla.data_model.expression.ExpressionData'>, 'mapping_stats_data': <class 'flotilla.data_model.quality_control.MappingStatsData'>, 'spikein_data': <class 'flotilla.data_model.expression.SpikeInData'>, 'metadata_data': <class 'flotilla.data_model.metadata.MetaData'>}
interactive_choose_outliers(data_types=('expression', 'splicing'), sample_subsets=None, feature_subsets=None, featurewise=False, x_pc=(1, 3), y_pc=(1, 3), show_point_labels=False, kernel=('rbf', 'linear', 'poly', 'sigmoid'), gamma=(0, 25), nu=(0.1, 9.9))
interactive_classifier(data_types=('expression', 'splicing'), sample_subsets=None, feature_subsets=None, categorical_variables=None, predictor_types=None, score_coefficient=(0.1, 20), draw_labels=False)
interactive_clustermap()
interactive_correlations()
interactive_graph(data_types=('expression', 'splicing'), sample_subsets=None, feature_subsets=None, featurewise=False, cov_std_cut=(0.1, 3), degree_cut=(0, 10), n_pcs=(2, 100), draw_labels=False, feature_of_interest='RBFOX2', weight_fun=None, use_pc_1=True, use_pc_2=True, use_pc_3=True, use_pc_4=True, savefile='figures/last.graph.pdf')
interactive_lavalamp_pooled_inconsistent(sample_subsets=None, feature_subsets=None, difference_threshold=(0.001, 1.0), colors=['red', 'green', 'blue', 'purple', 'yellow'], savefile='')
interactive_pca(data_types=('expression', 'splicing'), sample_subsets=None, feature_subsets=None, color_samples_by=None, featurewise=False, x_pc=(1, 10), y_pc=(1, 10), show_point_labels=False, list_link='', plot_violins=False, scale_by_variance=True, savefile='figures/last.pca.pdf')
interactive_reset_outliers()

User selects from columns that start with ‘outlier_‘ to merge multiple outlier classifications

jsd()

Performs Jensen-Shannon Divergence on both splicing and expression study_data

Jensen-Shannon divergence is a method of quantifying the amount of change in distribution of one measurement (e.g. a splicing event or a gene expression) from one celltype to another.
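For reference, the Jensen-Shannon divergence between two discrete distributions can be computed as in the sketch below. It assumes the inputs are already-binned probability vectors of equal length that each sum to 1; how flotilla bins its data before this step is not shown here, and the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between p and q, in [0, 1] when using log base 2."""
    # m is the pointwise average of p and q, so m_i > 0 wherever p_i or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


print(jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0]))  # 1.0
print(jensen_shannon_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
```

With base-2 logarithms, identical distributions give 0 and completely non-overlapping distributions give 1.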

static load_species_data(species, readers, species_datapackage_base_url='https://s3-us-west-2.amazonaws.com/flotilla-projects')
static maybe_make_directory(filename)
modality_assignments(sample_subset=None, feature_subset=None, expression_thresh=-inf, min_samples=0.5)

Get modality assignments of splicing data

Parameters:

sample_subset : str or None, optional

Which subset of the samples to use, based on some phenotype column in the experiment design data. If None, all samples are used.

feature_subset : str or None, optional

Which subset of the features to use, based on some feature type in the expression data (e.g. “variant”). If None, all features are used.

expression_thresh : float, optional

Minimum expression value, of the original input. E.g. if the original input is already log-transformed, then this threshold is on the log values.

Returns:

modalities : pandas.DataFrame

A (n_phenotypes, n_events) shaped DataFrame of the assigned modality

modality_counts(sample_subset=None, feature_subset=None, expression_thresh=-inf, min_samples=0.5)

Get number of splicing events in modality categories

Parameters:

sample_subset : str or None, optional

Which subset of the samples to use, based on some phenotype column in the experiment design data. If None, all samples are used.

feature_subset : str or None, optional

Which subset of the features to use, based on some feature type in the expression data (e.g. “variant”). If None, all features are used.

expression_thresh : float, optional

Minimum expression value, of the original input. E.g. if the original input is already log-transformed, then this threshold is on the log values.

Returns:

modalities : pandas.DataFrame

A (n_phenotypes, n_modalities) shaped DataFrame of the number of events assigned to each modality

nmf_space_positions(data_type='splicing')
nmf_space_transitions(phenotype_transitions='all', data_type='splicing', n=0.5)

The change in NMF space of splicing events across phenotypes

Parameters:

phenotype_transitions : list of length-2 tuples of str

List of (‘phenotype1’, ‘phenotype2’) transitions whose change in distribution you are interested in

data_type : ‘splicing’ | ‘expression’

Which data type to calculate this on. (default=’splicing’)

n : int

Minimum number of samples per phenotype, per event

Returns:

big_transitions : pandas.DataFrame

A (n_events, n_transitions) dataframe of the NMF distances between splicing events

normalize_to_spikein()
percent_pooled_inconsistent(sample_subset=None, feature_subset=None, fraction_diff_thresh=0.1, expression_thresh=-inf)
percent_unique_celltype_events(n=1)
phenotype_col
phenotype_color_ordered
phenotype_order
phenotype_to_color
phenotype_to_marker
phenotype_transitions
plot_big_nmf_space_transitions(data_type='expression', n=5)
plot_classifier(trait, sample_subset=None, feature_subset='all_genes', data_type='expression', title='', show_point_labels=False, **kwargs)

Plot a predictor for the specified data type and trait(s)

Parameters:

data_type : str

One of the names of the data types, e.g. “expression” or “splicing”

trait : str

Column name in the metadata data that you would like to classify on

plot_clustermap(sample_subset=None, feature_subset=None, data_type='expression', metric='euclidean', method='average', figsize=None, scale_fig_by_data=True, **kwargs)

Visualize hierarchical relationships within samples and features

plot_correlations(sample_subset=None, feature_subset=None, data_type='expression', metric='euclidean', method='average', figsize=None, featurewise=False, scale_fig_by_data=True, **kwargs)

Visualize clustered correlations of samples across features

plot_event(feature_id, sample_subset=None, nmf_space=False)

Plot the violinplot and NMF transitions of a splicing event

plot_event_modality_estimation(event_id, sample_subset=None, expression_thresh=-inf)
plot_expression_vs_inconsistent_splicing(bins=None)
plot_gene(feature_id, sample_subset=None, nmf_space=False)
plot_graph(data_type='expression', sample_subset=None, feature_subset=None, featurewise=False, **kwargs)

Plot the graph (network) of these data

Parameters:

data_type : str

One of the names of the data types, e.g. “expression” or “splicing”

sample_subset : str or None

Which subset of the samples to use, based on some phenotype column in the experiment design data. If None, all samples are used.

feature_subset : str or None

Which subset of the features to use, based on some feature type in the expression data (e.g. “variant”). If None, all features are used.

plot_lavalamp_pooled_inconsistent(sample_subset=None, feature_subset=None, fraction_diff_thresh=0.1, expression_thresh=-inf)
plot_lavalamps(sample_subset=None, feature_subset=None, expression_thresh=-inf)
plot_modalities_bars(sample_subset=None, feature_subset=None, expression_thresh=-inf, percentages=True)

Make grouped barplots of the number of modalities per phenotype

Parameters:

sample_subset : str or None

Which subset of the samples to use, based on some phenotype column in the experiment design data. If None, all samples are used.

feature_subset : str or None

Which subset of the features to use, based on some feature type in the expression data (e.g. “variant”). If None, all features are used.

expression_thresh : float

If greater than -inf, then filter on splicing events in genes with expression at least this value

percentages : bool

If True, plot percentages instead of counts

plot_modalities_lavalamps(sample_subset=None, feature_subset=None, expression_thresh=-inf)

Plot each modality in each celltype on a separate axes

Parameters:

sample_subset : str or None

Which subset of the samples to use, based on some phenotype column in the experiment design data. If None, all samples are used.

feature_subset : str or None

Which subset of the features to use, based on some feature type in the expression data (e.g. “variant”). If None, all features are used.

expression_thresh : float

If greater than -inf, then filter on splicing events in genes with expression at least this value

plot_modalities_reduced(sample_subset=None, feature_subset=None, expression_thresh=-inf)

Plot splicing events with modality assignments in NMF space

This will plot a separate NMF space for each celltype in the data, as well as one for all samples.

Parameters:

sample_subset : str or None

Which subset of the samples to use, based on some phenotype column in the experiment design data. If None, all samples are used.

feature_subset : str or None

Which subset of the features to use, based on some feature type in the expression data (e.g. “variant”). If None, all features are used.

expression_thresh : float

If greater than -inf, then filter on splicing events in genes with expression at least this value

plot_pca(data_type='expression', x_pc=1, y_pc=2, sample_subset=None, feature_subset=None, title='', featurewise=False, plot_violins=False, show_point_labels=False, reduce_kwargs=None, color_samples_by=None, bokeh=False, most_variant_features=False, std_multiplier=2, scale_by_variance=True, **kwargs)

Performs DataFramePCA on both expression and splicing study_data

Parameters:

data_type : str

One of the names of the data types, e.g. “expression” or “splicing” (default “expression”)

x_pc : int, optional

Which principal component to plot on the x-axis (default 1)

y_pc : int, optional

Which principal component to plot on the y-axis (default 2)

sample_subset : str or None

Which subset of the samples to use, based on some phenotype column in the experiment design data. If None, all samples are used. (default None)

feature_subset : str or None

Which subset of the features to use, based on some feature type in the expression data (e.g. “variant”). If None, all features are used. (default None)

title : str, optional

Title of the reduced space plot (default ‘’)

featurewise : bool, optional

If True, the features are reduced on the samples, and the plotted points are features, not samples. (default False)

plot_violins : bool

Whether or not to make the violinplots of the top features. This can take a long time, so to save time you can turn it off if you just want a quick look at the PCA. (default False)

show_point_labels : bool, optional

Whether or not to show the labels of the points. If this is samplewise (default), then this labels the samples. If this is featurewise, then this labels the features. (default False)

reduce_kwargs : dict, optional

Keyword arguments to the reducer (default None)

color_samples_by : str, optional

Instead of coloring the samples by their phenotype, color them by this column in the metadata. (default None)

bokeh : bool, optional

If True, plot a javascripty/interactive bokeh plot instead of a static printable figure (default False)

most_variant_features : bool, optional

If True, then only take the most variant of the provided features. The most variant are determined by taking the features whose variance is std_multiplier standard deviations away from the mean feature variance (default False)

std_multiplier : float, optional

If most_variant_features is True, then use this as a cutoff for the minimum variance of a feature to be included (default 2)

scale_by_variance : bool, optional

If True, then scale the x- and y-axes by the explained variance ratio of the principal component dimensions. Only valid for PCA and its variations, not for NMF or tSNE. (default True)

kwargs : other keyword arguments

All other keyword arguments are passed to DecompositionViz.plot()
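The most_variant_features and std_multiplier options describe a variance cutoff, which can be sketched as below. This sketch assumes "away from the mean" means above the mean feature variance, and it uses population variance on plain lists rather than a DataFrame; `most_variant_features` here is a hypothetical standalone function.

```python
import statistics


def most_variant_features(data, std_multiplier=2):
    """Return ids of features whose variance exceeds mean + k * std.

    `data` maps feature id -> list of values across samples. The mean and
    standard deviation are taken over the per-feature variances.
    """
    variances = {feature: statistics.pvariance(values)
                 for feature, values in data.items()}
    all_variances = list(variances.values())
    cutoff = (statistics.mean(all_variances)
              + std_multiplier * statistics.pstdev(all_variances))
    return [feature for feature, variance in variances.items()
            if variance > cutoff]
```

With only a handful of features, very few can exceed a 2 standard deviation cutoff, so the filter is mainly useful on large feature sets.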

plot_regressor(data_type='expression', **kwargs)
plot_two_features(feature1, feature2, data_type='expression', **kwargs)

Make a scatterplot of two features’ data

Parameters:

feature1 : str

Name of the feature to plot on the x-axis. If you have a feature_data dataframe for this data type, will attempt to map the common name, e.g. “RBFOX2” back to the crazy name, e.g. “ENSG00000100320”

feature2 : str

Name of the feature to plot on the y-axis. If you have a feature_data dataframe for this data type, will attempt to map the common name, e.g. “RBFOX2” back to the crazy name, e.g. “ENSG00000100320”

plot_two_samples(sample1, sample2, data_type='expression', **kwargs)

Plot a scatterplot of two samples’ data

Parameters:

sample1 : str

Name of the sample to plot on the x-axis

sample2 : str

Name of the sample to plot on the y-axis

data_type : “expression” | “splicing”

Type of data to plot. Default “expression”

Any other keyword arguments valid for seaborn.jointplot

Returns:

jointgrid : seaborn.axisgrid.JointGrid

Returns a JointGrid instance

See Also

seaborn.jointplot

readers = {'tsv': <function load_tsv at 0x2ba837697578>, 'json': <function load_json at 0x2ba8376975f0>, 'csv': <function load_csv at 0x2ba8376976e0>, 'gzip_pickle_df': <function load_gzip_pickle_df at 0x2ba837697488>, 'pickle_df': <function load_pickle_df at 0x2ba837697398>}
sample_id_to_color
sample_id_to_phenotype
sample_subset_to_sample_ids(phenotype_subset=None)

Convert a string naming a subset of phenotypes in the data into sample ids

Parameters:

phenotype_subset : str

A valid string describing a boolean phenotype described in the metadata data

Returns:

sample_ids : list of strings

List of sample ids in the data
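The subset-to-ids lookup can be pictured with a small sketch over plain dictionaries. The helper `select_sample_ids` is hypothetical, and treating None or 'all_samples' (the default_sample_subset shown in the Study signature) as "every sample" is an assumption of this sketch.

```python
def select_sample_ids(sample_metadata, phenotype_subset=None):
    """Return the sample ids whose boolean column `phenotype_subset` is true.

    `sample_metadata` maps sample id -> {column: value}, standing in for the
    metadata dataframe.
    """
    if phenotype_subset is None or phenotype_subset == "all_samples":
        return list(sample_metadata)
    return [sample for sample, meta in sample_metadata.items()
            if meta.get(phenotype_subset, False)]


metadata = {"cell_1": {"pooled": False}, "pool_1": {"pooled": True}}
print(select_sample_ids(metadata, "pooled"))  # ['pool_1']
```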

save(name, flotilla_dir='/home/travis/flotilla_projects')
unique_celltype_event_counts(n=1)
class flotilla.data_model.ExpressionData(data, feature_data=None, thresh=-inf, feature_rename_col=None, feature_ignore_subset_cols=None, outliers=None, log_base=None, pooled=None, plus_one=False, minimum_samples=0, technical_outliers=None, predictor_config_manager=None)

Bases: flotilla.data_model.base.BaseData

Object for holding and operating on expression data

binify(*args, **kwargs)
class flotilla.data_model.SpikeInData(data, feature_data=None, predictor_config_manager=None, technical_outliers=None)

Bases: flotilla.data_model.expression.ExpressionData

Class for spike-in data and associated functions

Constructor for

Parameters: data, experiment_design_data
class flotilla.data_model.GeneOntologyData(data)

Bases: object

Object to calculate enrichment of Gene Ontology terms

Acceptable Gene Ontology tables can be downloaded from ENSEMBL’s BioMart tool: http://www.ensembl.org/biomart

  1. Choose “Ensembl Genes ##” (## = version number, for me it’s 78)
  2. Click “Attributes”
  3. Expand “EXTERNAL”
  4. Check the boxes for ‘GO Term Accession’, ‘Ensembl Gene ID’, ‘GO Term Name’, and ‘GO domain’
Parameters:

data : pandas.DataFrame

A dataframe with at least the following columns: ‘GO Term Accession’, ‘Ensembl Gene ID’, ‘GO Term Name’, ‘GO domain’

domains = frozenset(['molecular_function', 'cellular_component', 'biological_process'])
enrichment(features_of_interest, background=None, p_value_cutoff=1000000, cross_reference=None, min_feature_size=3, min_background_size=5, domain=None)

Bonferroni-corrected hypergeometric p-values of GO enrichment

Calculates hypergeometric enrichment of the features of interest, in each GO category.

Parameters:

features_of_interest : list-like

List of features. Must match the identifiers in the ontology database exactly, i.e. if your ontology database is ENSEMBL ids, then you can only provide those and not common names like “RBFOX2”

background : list-like, optional

Background genes to use. It is best to use a relevant background such as all expressed genes. If None, defaults to all genes.

p_value_cutoff : float, optional

Maximum accepted Bonferroni-corrected p-value

cross_reference : dict-like, optional

A mapping of gene ids to gene symbols, e.g. a pandas Series of ENSEMBL genes e.g. ENSG00000139675 to gene symbols e.g. HNRNPA1L2

min_feature_size : int, optional

Minimum number of features of interest overlapping in a GO Term, to calculate enrichment

min_background_size : int, optional

Minimum number of features in the background overlapping a GO Term

domain : str or list, optional

Only calculate GO enrichment for a particular GO category or subset of categories. Valid domains: ‘biological_process’, ‘molecular_function’, ‘cellular_component’

Returns:

enrichment_df : pandas.DataFrame

A (n_go_categories, columns) DataFrame of the enrichment scores

Raises:

ValueError

If features of interest and background do not overlap, or invalid GO domains are given
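The statistic named above, a Bonferroni-corrected hypergeometric p-value, can be sketched in pure Python. This is an illustration of the general technique for a single GO term, not flotilla's code; the function names and argument layout are assumptions.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement.

    population: number of background features
    successes:  background features annotated with the GO term
    draws:      number of features of interest
    k:          features of interest annotated with the GO term
    """
    upper = min(successes, draws)
    favorable = sum(comb(successes, i) * comb(population - successes, draws - i)
                    for i in range(k, upper + 1))
    return favorable / comb(population, draws)


def bonferroni(p_value, n_tests):
    """Multiply by the number of tests (GO terms), capped at 1."""
    return min(1.0, p_value * n_tests)


p = hypergeom_sf(k=5, population=10, successes=5, draws=5)
print(bonferroni(p, n_tests=20))
```

A term would then be reported when its corrected p-value is at most p_value_cutoff.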

class flotilla.data_model.MetaData(data, phenotype_order=None, phenotype_to_color=None, phenotype_to_marker=None, phenotype_col='phenotype', pooled_col='pooled', outlier_col='outlier', predictor_config_manager=None, minimum_sample_subset=10)

Bases: flotilla.data_model.base.BaseData

n_phenotypes
phenotype_color_order
phenotype_order
phenotype_series
phenotype_to_color
phenotype_to_marker
phenotype_transitions
sample_id_to_color
sample_id_to_phenotype
sample_subsets
unique_phenotypes
class flotilla.data_model.MappingStatsData(data, number_mapped_col, min_reads=500000.0, predictor_config_manager=None)

Bases: flotilla.data_model.base.BaseData

Constructor for mapping statistics data from STAR

Constructor for MappingStatsData

Parameters: data, sample_descriptors
number_mapped
too_few_mapped
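As a rough picture of how number_mapped and min_reads relate to too_few_mapped, consider the sketch below. It assumes a strict "fewer than min_reads" comparison and uses a dictionary of read counts in place of the mapping stats dataframe; the standalone function is hypothetical.

```python
def too_few_mapped(number_mapped, min_reads=500000.0):
    """Return sample ids whose mapped read count is below `min_reads`.

    `number_mapped` maps sample id -> number of mapped reads.
    """
    return [sample for sample, n_reads in number_mapped.items()
            if n_reads < min_reads]


counts = {"cell_1": 2000000, "cell_2": 120000}
print(too_few_mapped(counts))  # ['cell_2']
```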
class flotilla.data_model.SplicingData(data, feature_data=None, binsize=0.1, outliers=None, feature_rename_col=None, feature_ignore_subset_cols=None, excluded_max=0.2, included_min=0.8, pooled=None, predictor_config_manager=None, technical_outliers=None, minimum_samples=0, feature_expression_id_col=None)

Bases: flotilla.data_model.base.BaseData

Instantiate an object for percent spliced in (PSI) scores

Parameters:

data : pandas.DataFrame

A [n_events, n_samples] dataframe of data events

n_components : int

Number of components to use in the reducer

binsize : float

Value between 0 and 1, the bin size for binning the study_data scores

excluded_max : float

Maximum value for the “excluded” bin of psi scores. Default 0.2.

included_min : float

Minimum value for the “included” bin of psi scores. Default 0.8.

Notes

‘thresh’ from BaseData is not used.
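The excluded_max and included_min cutoffs split the 0 to 1 psi range into three regions, as in this sketch. The labels and the inclusive boundaries are assumptions made for illustration; `psi_bin` is a hypothetical helper.

```python
def psi_bin(psi, excluded_max=0.2, included_min=0.8):
    """Classify one psi score as 'excluded', 'middle', or 'included'."""
    if not 0 <= psi <= 1:
        raise ValueError("psi scores must fall between 0 and 1")
    if psi <= excluded_max:
        return "excluded"
    if psi >= included_min:
        return "included"
    return "middle"


print([psi_bin(score) for score in (0.05, 0.5, 0.95)])
# ['excluded', 'middle', 'included']
```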

binify(data)
binned_reducer = None
excluded_label = 'excluded >>'
included_label = 'included >>'
modality_assignments(*args, **kwargs)

Assigned modalities for these samples and features.

Parameters:

sample_ids : list of str, optional

Which samples to use. If None, use all. Default None.

feature_ids : list of str, optional

Which features to use. If None, use all. Default None.

data : pandas.DataFrame, optional

If provided, use this dataframe instead of the sample_ids and feature_ids provided

Returns:

modality_assignments : pandas.Series

The modality assignments of each feature given these samples

modality_counts(*args, **kwargs)

Count the number of events in each modality for these samples and features

Parameters:

sample_ids : list of str

Which samples to use. If None, use all. Default None.

feature_ids : list of str

Which features to use. If None, use all. Default None.

data : pandas.DataFrame, optional

If provided, use this dataframe instead of the sample_ids and feature_ids provided

Returns:

modalities_counts : pandas.Series

The number of events detected in each modality
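Conceptually, the counts are a tally over the per-event assignments. The sketch below shows that tally with a plain dictionary in place of a pandas Series; the event ids and modality labels are hypothetical placeholders.

```python
from collections import Counter


def count_modalities(assignments):
    """Tally how many events were assigned to each modality.

    `assignments` maps event id -> modality label.
    """
    return Counter(assignments.values())


assignments = {"event_1": "included", "event_2": "bimodal",
               "event_3": "included"}
print(count_modalities(assignments)["included"])  # 2
```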

n_components = 2
plot_event_modality_estimation(event_id, sample_ids=None, data=None, groupby=None, min_samples=0.5)
plot_feature(feature_id, sample_ids=None, phenotype_groupby=None, phenotype_order=None, color=None, phenotype_to_color=None, phenotype_to_marker=None, nmf_xlabel=None, nmf_ylabel=None, nmf_space=False, fig=None, axesgrid=None)
plot_hist_single_vs_pooled_diff(data, feature_ids=None, color=None, title='', hist_kws=None)

Plot histogram of distances between singles and pooled

plot_lavalamp(phenotype_to_color, sample_ids=None, feature_ids=None, data=None, groupby=None, order=None)
plot_lavalamp_pooled_inconsistent(data, feature_ids=None, fraction_diff_thresh=0.1, color=None)
plot_modalities_bars(sample_ids=None, feature_ids=None, data=None, groupby=None, phenotype_to_color=None, percentages=False, ax=None)

Make grouped barplots of the number of modalities per group

Parameters:

sample_ids : None or list of str

Which samples to use. If None, use all

feature_ids : None or list of str

Which features to use. If None, use all

color : None or matplotlib color

Which color to use for plotting the lavalamps of these features and samples

x_offset : numeric

How much to offset the x-axis of each event. Useful if you want to plot the same event, but in several iterations with different celltypes or colors

plot_modalities_lavalamps(sample_ids=None, feature_ids=None, data=None, groupby=None, phenotype_to_color=None)

Plot “lavalamp” scatterplot of each event

Parameters:

sample_ids : None or list of str

Which samples to use. If None, use all

feature_ids : None or list of str

Which features to use. If None, use all

color : None or matplotlib color

Which color to use for plotting the lavalamps of these features and samples

x_offset : numeric

How much to offset the x-axis of each event. Useful if you want to plot the same event, but in several iterations with different celltypes or colors

plot_modalities_reduced(sample_ids=None, feature_ids=None, data=None, ax=None, title=None)

Plot events modality assignments in NMF space

This will calculate modalities on all samples provided, without grouping them by celltype. This is because each NMF axis can only show one set of sample ids’ modalities.

Parameters:

sample_ids : list of str

Which samples to use. If None, use all. Default None.

feature_ids : list of str

Which features to use. If None, use all. Default None.

data : pandas.DataFrame, optional

If provided, use this dataframe instead of the sample_ids and feature_ids provided

ax : matplotlib.axes.Axes object

Axes to plot on. If None, gets current axes

title : str

Title of the reduced space plot

plot_two_features(feature1, feature2, groupby=None, label_to_color=None, fillna=None, **kwargs)
plot_two_samples(sample1, sample2, fillna=None, **kwargs)
pooled_inconsistent(*args, **kwargs)

Return splicing events for which the pooled samples are consistently different from the single cells.

Parameters:

singles_ids : list-like

List of sample ids of single cells (in the main “.data” DataFrame)

pooled_ids : list-like

List of sample ids of pooled cells (in the other “.pooled” DataFrame)

feature_ids : None or list-like

List of feature ids. If None, use all

fraction_diff_thresh : float

Returns:

large_diff : pandas.DataFrame

All splicing events which have a scaled difference larger than the fraction diff thresh

raw_reducer = None
Olga B. Botvinnik is funded by the NDSEG fellowship and is a NumFOCUS John Hunter Technology Fellow.
Michael T. Lovci was partially funded by a fellowship from Genentech.
Partially funded by NIH grants NS075449 and HG004659 and CIRM grants RB4-06045 and TR3-05676 to Gene Yeo.