flotilla.data_model.study module

Data models for “studies”. Studies include attributes about the data and are heavier in terms of data load.

class flotilla.data_model.study.Study(sample_metadata, version='0.1.0', expression_data=None, expression_feature_data=None, expression_feature_rename_col=None, expression_feature_ignore_subset_cols=None, expression_log_base=None, expression_thresh=-inf, expression_plus_one=False, splicing_data=None, splicing_feature_data=None, splicing_feature_rename_col=None, splicing_feature_ignore_subset_cols=None, splicing_feature_expression_id_col=None, mapping_stats_data=None, mapping_stats_number_mapped_col=None, mapping_stats_min_reads=500000.0, spikein_data=None, spikein_feature_data=None, drop_outliers=True, species=None, gene_ontology_data=None, predictor_config_manager=None, metadata_pooled_col='pooled', metadata_minimum_samples=0, metadata_phenotype_col='phenotype', metadata_phenotype_order=None, metadata_phenotype_to_color=None, metadata_phenotype_to_marker=None, metadata_outlier_col='outlier', license=None, title=None, sources=None, default_sample_subset='all_samples', default_feature_subset='variant')[source]¶

Bases: object

A biological study, with associated metadata, expression, and splicing data.

Construct a biological study

This class only accepts data, no filenames. All data must already have been read in and exist as Python objects.

Parameters:

sample_metadata : pandas.DataFrame

The only required parameter. Samples as the index, with features as columns. Required column: “phenotype”. If there is a boolean column “pooled”, this will be used to separate pooled from single cells. Similarly, the column “outlier” will also be used to separate outlier cells from the rest.

version : str

A string describing the semantic version of the data. Must be in: major.minor.patch format, as the “patch” number will be increased if you change something in the study and then study.save() it. (default “0.1.0”)

expression_data : pandas.DataFrame

Samples x feature dataframe of gene expression measurements, e.g. from an RNA-Seq or a microarray experiment. Assumed to be log-transformed, i.e. you took the log of it. (default None)

expression_feature_data : pandas.DataFrame

Features x annotations dataframe describing other parameters of the gene expression features, e.g. mapping Ensembl IDs to gene symbols or gene biotypes. (default None)

expression_feature_rename_col : str

A column name in the expression_feature_data dataframe that you’d like to rename the expression features to, in the plots. For example, if your gene IDs are Ensembl IDs, but you want to plot UCSC IDs, make sure the column you want, e.g. “ucsc_id” is in your dataframe and specify that. (default “gene_name”)

expression_log_base : float

If you want to log-transform your expression data (and it’s not already log-transformed), use this number as the base of the transform. E.g. expression_log_base=10 will take the log10 of your data. (default None)

expression_thresh : float

Minimum (non log-transformed) expression value. (default -inf)

expression_plus_one : bool

Whether or not to add 1 to the expression data. (default False)

splicing_data : pandas.DataFrame

Samples x feature dataframe of percent spliced in scores, e.g. as measured by the program MISO. Assumed that these values only fall between 0 and 1.

splicing_feature_data : pandas.DataFrame

features x other_features dataframe describing other parameters of the splicing features, e.g. mapping MISO IDs to Ensembl IDs or gene symbols or transcript types

splicing_feature_rename_col : str

A column name in the splicing_feature_data dataframe that you’d like to rename the splicing features to, in the plots. For example, if your splicing IDs are MISO IDs, but you want to plot Ensembl IDs, make sure the column you want, e.g. “ensembl_id” is in your dataframe and specify that. Default “gene_name”.

splicing_feature_expression_id_col : str

A column name in the splicing_feature_data dataframe that corresponds to the row names of the expression data

mapping_stats_data : pandas.DataFrame

Samples x feature dataframe of mapping stats measurements. Currently, this

mapping_stats_number_mapped_col : str

A column name in the mapping_stats_data which specifies the number of (uniquely or not) mapped reads. Default “Uniquely mapped reads number”

spikein_data : pandas.DataFrame

samples x features DataFrame of spike-in expression values

spikein_feature_data : pandas.DataFrame

Features x other_features dataframe, e.g. of the molecular concentration of particular spikein transcripts

drop_outliers : bool

Whether or not to drop samples indicated as outliers in the sample_metadata from the other data, i.e. with a column named ‘outlier’ in sample_metadata, then remove those samples from expression_data for further analysis

species : str

Name of the species and genome version, e.g. ‘hg19’ or ‘mm10’.

gene_ontology_data : pandas.DataFrame

Gene ids x ontology categories dataframe used for GO analysis.

metadata_pooled_col : str

Column in metadata_data which specifies as a boolean whether or not this sample was pooled.
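As a rough illustration of how expression_plus_one, expression_thresh, and expression_log_base could interact, the sketch below applies them to a plain list of values. The helper name preprocess_expression, the order of operations, and the use of None for values under the threshold are assumptions made for illustration; they are not taken from flotilla's source.

```python
import math


def preprocess_expression(values, log_base=None, thresh=-math.inf, plus_one=False):
    """Hypothetical helper, not part of flotilla.

    Assumed order: optionally add 1, mask values below ``thresh`` (on the
    non-log scale), then optionally log-transform with base ``log_base``.
    """
    processed = []
    for value in values:
        if plus_one:
            value = value + 1
        if value < thresh:
            # Assumption: values under the threshold are treated as missing
            processed.append(None)
            continue
        if log_base is not None:
            value = math.log(value, log_base)
        processed.append(value)
    return processed


raw = [0, 9, 99, 999]
print(preprocess_expression(raw, log_base=10, thresh=5, plus_one=True))
```

With these made-up inputs, the first value falls under the threshold after adding 1 and is masked, while the remaining values come out close to 1, 2, and 3 on the log10 scale.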

big_nmf_space_transitions(phenotype_transitions='all', data_type='splicing', n=0.5)[source]¶

Splicing events whose change in NMF space is large

By large, we mean that the difference is 2 standard deviations away from the mean.

Parameters:

phenotype_transitions : list of length-2 tuples of str

List of (‘phenotype1’, ‘phenotype2’) transitions whose change in distribution you are interested in

data_type : ‘splicing’ | ‘expression’

Which data type to calculate this on. (default=’splicing’)

n : int

Minimum number of samples per phenotype, per event

Returns:

big_transitions : pandas.DataFrame

A (n_events, n_transitions) dataframe of the NMF distances between splicing events
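The “2 standard deviations away from the mean” criterion can be sketched in plain Python. This is only an illustration: the helper name big_changes, the dictionary input, the one-sided cutoff (mean plus 2 standard deviations), and the use of the population standard deviation are all assumptions, not flotilla's implementation.

```python
import statistics


def big_changes(distances, n_std=2):
    """Hypothetical helper: keep entries more than ``n_std`` standard
    deviations above the mean distance."""
    values = list(distances.values())
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    cutoff = mean + n_std * std
    return {event: d for event, d in distances.items() if d > cutoff}


# Made-up distances for ten events; only one is far from the rest
distances = {"event{}".format(i): 0.1 for i in range(9)}
distances["event9"] = 1.0
print(big_changes(distances))
```

Here the mean is 0.19 and the standard deviation is 0.27, so the cutoff is 0.73 and only event9 is kept.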

celltype_event_counts[source]¶: Number of cells that detected each event, per celltype

celltype_sizes(data_type='splicing')[source]¶

compute_expression_splicing_covariance()[source]¶

default_feature_set_ids = ['not (gene_category: LPS Response)', 'gene_category: LPS Response', 'variant', 'all features', 'gene_category: Housekeeping', 'not (gene_category: Housekeeping)', 'not (gene_category: LPS Response)', 'gene_category: LPS Response', 'variant', 'all features', 'gene_category: Housekeeping', 'not (gene_category: Housekeeping)', 'not (gene_category: LPS Response)', 'gene_category: LPS Response', 'variant', 'all features', 'gene_category: Housekeeping', 'not (gene_category: Housekeeping)', 'not (gene_category: LPS Response)', 'gene_category: LPS Response', 'variant', 'all features', 'gene_category: Housekeeping', 'not (gene_category: Housekeeping)', 'not (gene_category: LPS Response)', 'gene_category: LPS Response', 'variant', 'all features', 'gene_category: Housekeeping', 'not (gene_category: Housekeeping)', 'not (gene_category: LPS Response)', 'gene_category: LPS Response', 'variant', 'all features', 'gene_category: Housekeeping', 'not (gene_category: Housekeeping)', 'not (gene_category: LPS Response)', 'gene_category: LPS Response', 'variant', 'all features', 'gene_category: Housekeeping', 'not (gene_category: Housekeeping)', 'not (gene_category: LPS Response)', 'gene_category: LPS Response', 'variant', 'all features', 'gene_category: Housekeeping', 'not (gene_category: Housekeeping)', 'not (gene_category: LPS Response)', 'gene_category: LPS Response', 'variant', 'all features', 'gene_category: Housekeeping', 'not (gene_category: Housekeeping)', 'not (gene_category: LPS Response)', 'gene_category: LPS Response', 'variant', 'all features', 'gene_category: Housekeeping', 'not (gene_category: Housekeeping)', 'not (gene_category: LPS Response)', 'gene_category: LPS Response', 'variant', 'all features', 'gene_category: Housekeeping', 'not (gene_category: Housekeeping)', 'not (gene_category: LPS Response)', 'gene_category: LPS Response', 'variant', 'all features', 'gene_category: Housekeeping', 'not (gene_category: Housekeeping)', 'not 
(gene_category: LPS Response)', 'gene_category: LPS Response', 'variant', 'all features', 'gene_category: Housekeeping', 'not (gene_category: Housekeeping)', 'variant', 'all features']¶

default_feature_subsets[source]¶

default_sample_subsets[source]¶

detect_outliers(data_type='expression', sample_subset=None, feature_subset=None, featurewise=False, reducer=None, standardize=None, reducer_kwargs=None, bins=None, outlier_detection_method=None, outlier_detection_method_kwargs=None)[source]¶

drop_outliers()[source]¶: Assign samples marked as “outlier” in the metadata to the other data objects

expression_vs_inconsistent_splicing(bins=None)[source]¶

Percentage of events inconsistent with pooled samples at each expression threshold

Parameters:

bins : list-like

List of expression cutoffs

Returns:

expression_vs_inconsistent : pd.DataFrame

A (len(bins), n_phenotypes) dataframe of the percentage of events in single cells that are inconsistent with pooled

feature_subset_to_feature_ids(data_type, feature_subset=None, rename=False)[source]¶

Given a name of a feature subset, get the associated feature ids

Parameters:

data_type : str

A string describing the data type, e.g. “expression”

feature_subset : str

A string describing the subset of data type (must be already calculated)

Returns:

feature_ids : list of strings

List of feature ids from the specified data type

filter_splicing_on_expression(expression_thresh, sample_subset=None)[source]¶

Filter splicing events on expression values

Parameters:

expression_thresh : float

Minimum expression value, of the original input. E.g. if the original input is already log-transformed, then this threshold is on the log values.

Returns:

psi : pandas.DataFrame

A (n_samples, n_features)
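One way to picture filtering splicing events on expression is the sketch below, which uses nested dictionaries instead of DataFrames. The helper name filter_psi_on_expression, the event-to-gene mapping, and the choice to filter per sample and drop (rather than mask) low-expression entries are assumptions for illustration only.

```python
def filter_psi_on_expression(psi, expression, event_to_gene, expression_thresh):
    """Hypothetical helper: keep a sample's psi score only when the event's
    gene is expressed at or above ``expression_thresh`` in that sample."""
    filtered = {}
    for sample, events in psi.items():
        kept = {}
        for event, score in events.items():
            gene = event_to_gene.get(event)
            level = expression.get(sample, {}).get(gene)
            if level is not None and level >= expression_thresh:
                kept[event] = score
        filtered[sample] = kept
    return filtered


# Made-up psi scores, expression values, and event-to-gene mapping
psi = {"cell1": {"exon_a": 0.9, "exon_b": 0.2}, "cell2": {"exon_a": 0.5}}
expression = {"cell1": {"GENE1": 4.0, "GENE2": 0.5}, "cell2": {"GENE1": 0.1}}
event_to_gene = {"exon_a": "GENE1", "exon_b": "GENE2"}
print(filter_psi_on_expression(psi, expression, event_to_gene, expression_thresh=1.0))
```

The event-to-gene mapping plays the role that splicing_feature_expression_id_col is described as playing in the constructor: it links each splicing event to a row name in the expression data.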

classmethod from_datapackage(datapackage, datapackage_dir='./', load_species_data=True, species_datapackage_base_url='https://s3-us-west-2.amazonaws.com/flotilla-projects')[source]¶

Create a study object from a datapackage dictionary

Parameters:

datapackage : dict

Returns:

study : flotilla.Study

Study object

classmethod from_datapackage_file(datapackage_filename, load_species_data=True, species_datapackage_base_url='https://s3-us-west-2.amazonaws.com/flotilla-projects')[source]¶

classmethod from_datapackage_url(datapackage_url, load_species_data=True, species_datapackage_base_url='https://s3-us-west-2.amazonaws.com/flotilla-projects')[source]¶

Create a study from a url of a datapackage.json file

Parameters:

datapackage_url : str

HTTP url of a datapackage.json file, following the specification described here: http://dataprotocols.org/data-packages/ and requiring the following data resources: metadata, expression, splicing

species_datapackage_base_url : str

Base URL to fetch species-specific _ and splicing event metadata from. Default ‘https://s3-us-west-2.amazonaws.com/flotilla-projects/’

Returns:

study : Study

A “study” object containing the data described in the datapackage_url file

Raises:

AttributeError

If the datapackage.json file does not contain the required resources of metadata, expression, and splicing.
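The required-resources check described above might look roughly like the following. It assumes the datapackage dictionary follows the Data Package layout, with a "resources" list whose entries each carry a "name"; the helper check_required_resources and the example dictionary are hypothetical and are not flotilla code.

```python
REQUIRED_RESOURCES = ("metadata", "expression", "splicing")


def check_required_resources(datapackage):
    """Hypothetical helper: raise AttributeError if a required resource
    is missing from a datapackage dictionary."""
    names = {resource.get("name") for resource in datapackage.get("resources", [])}
    missing = [name for name in REQUIRED_RESOURCES if name not in names]
    if missing:
        raise AttributeError(
            "datapackage is missing required resources: " + ", ".join(missing))
    return True


# Made-up datapackage with no "splicing" resource
datapackage = {
    "name": "example_study",
    "resources": [
        {"name": "metadata", "path": "metadata.csv"},
        {"name": "expression", "path": "expression.csv"},
    ],
}
try:
    check_required_resources(datapackage)
except AttributeError as error:
    print(error)
```

Running a check like this before constructing a study surfaces a missing resource early, with a message naming what is absent.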

go_enrichment(feature_ids, background=None, domain=None, p_value_cutoff=1000000, min_feature_size=3, min_background_size=5)[source]¶

Calculate gene ontology enrichment of provided features

Parameters:

feature_ids : list-like

Features to calculate gene ontology enrichment on

background : list-like, optional

Features to use as the background

domain : str or list, optional

Only calculate GO enrichment for a particular GO category or subset of categories. Valid domains: ‘biological_process’, ‘molecular_function’, ‘cellular_component’

p_value_cutoff : float, optional

Maximum accepted Bonferroni-corrected p-value

min_feature_size : int, optional

Minimum number of features of interest overlapping in a GO Term, to calculate enrichment

min_background_size : int, optional

Minimum number of features in the background overlapping a GO Term

Returns:

enrichment : pandas.DataFrame

A (go_categories, columns) dataframe showing the GO enrichment categories that were enriched in the features
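For readers unfamiliar with the Bonferroni correction that p_value_cutoff refers to, here is a minimal sketch: each p-value is multiplied by the number of tests and capped at 1. The helper name bonferroni, the term labels, and the p-values are made up, and how flotilla computes the uncorrected p-values is not shown here.

```python
def bonferroni(p_values, p_value_cutoff=0.05):
    """Hypothetical helper: Bonferroni-correct p-values and keep the terms
    whose corrected value is at or below ``p_value_cutoff``."""
    n_tests = len(p_values)
    corrected = {term: min(1.0, p * n_tests) for term, p in p_values.items()}
    return {term: p for term, p in corrected.items() if p <= p_value_cutoff}


# Made-up uncorrected p-values for three GO terms
p_values = {"term_a": 0.001, "term_b": 0.02, "term_c": 0.5}
print(bonferroni(p_values))
```

Since a corrected p-value cannot exceed 1, a cutoff as large as the signature's default of 1000000 would keep every term in this sketch; a stricter value such as 0.05 has to be passed explicitly.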

initializers = {'splicing_data': <class 'flotilla.data_model.splicing.SplicingData'>, 'expression_data': <class 'flotilla.data_model.expression.ExpressionData'>, 'mapping_stats_data': <class 'flotilla.data_model.quality_control.MappingStatsData'>, 'spikein_data': <class 'flotilla.data_model.expression.SpikeInData'>, 'metadata_data': <class 'flotilla.data_model.metadata.MetaData'>}¶

interactive_choose_outliers(data_types=('expression', 'splicing'), sample_subsets=None, feature_subsets=None, featurewise=False, x_pc=(1, 3), y_pc=(1, 3), show_point_labels=False, kernel=('rbf', 'linear', 'poly', 'sigmoid'), gamma=(0, 25), nu=(0.1, 9.9))¶

interactive_classifier(data_types=('expression', 'splicing'), sample_subsets=None, feature_subsets=None, categorical_variables=None, predictor_types=None, score_coefficient=(0.1, 20), draw_labels=False)¶

interactive_clustermap()¶

interactive_correlations()¶

interactive_graph(data_types=('expression', 'splicing'), sample_subsets=None, feature_subsets=None, featurewise=False, cov_std_cut=(0.1, 3), degree_cut=(0, 10), n_pcs=(2, 100), draw_labels=False, feature_of_interest='RBFOX2', weight_fun=None, use_pc_1=True, use_pc_2=True, use_pc_3=True, use_pc_4=True, savefile='figures/last.graph.pdf')¶

interactive_lavalamp_pooled_inconsistent(sample_subsets=None, feature_subsets=None, difference_threshold=(0.001, 1.0), colors=['red', 'green', 'blue', 'purple', 'yellow'], savefile='')¶

interactive_pca(data_types=('expression', 'splicing'), sample_subsets=None, feature_subsets=None, color_samples_by=None, featurewise=False, x_pc=(1, 10), y_pc=(1, 10), show_point_labels=False, list_link='', plot_violins=False, scale_by_variance=True, savefile='figures/last.pca.pdf')¶

interactive_reset_outliers()¶: User selects from columns that start with ‘outlier_’ to merge multiple outlier classifications

jsd()[source]¶

Performs Jensen-Shannon Divergence on both splicing and expression study_data

Jensen-Shannon divergence is a method of quantifying the amount of change in distribution of one measurement (e.g. a splicing event or a gene expression) from one celltype to another.
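The quantity itself can be sketched in a few lines of plain Python for two discrete distributions. Treating each celltype's measurements as binned probabilities is an assumption made for this example, and the function names are illustrative rather than flotilla's.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """JSD of two discrete distributions given as equal-length lists of
    probabilities that each sum to 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Binned values of one feature in two made-up celltypes
celltype_a = [0.8, 0.1, 0.1]
celltype_b = [0.1, 0.1, 0.8]
print(jensen_shannon_divergence(celltype_a, celltype_a))  # identical: 0.0
print(jensen_shannon_divergence(celltype_a, celltype_b))
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (no overlap), and it is symmetric in its two arguments.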

static load_species_data(species, readers, species_datapackage_base_url='https://s3-us-west-2.amazonaws.com/flotilla-projects')[source]¶

static maybe_make_directory(filename)[source]¶

modality_assignments(sample_subset=None, feature_subset=None, expression_thresh=-inf, min_samples=0.5)[source]¶

Get modality assignments of splicing data

Parameters:

sample_subset : str or None, optional

Which subset of the samples to use, based on some phenotype column in the experiment design data. If None, all samples are used.

feature_subset : str or None, optional

Which subset of the features to use, based on some feature type in the expression data (e.g. “variant”). If None, all features are used.

expression_thresh : float, optional

Minimum expression value, of the original input. E.g. if the original input is already log-transformed, then this threshold is on the log values.

Returns:

modalities : pandas.DataFrame

A (n_phenotypes, n_events) shaped DataFrame of the assigned modality

modality_counts(sample_subset=None, feature_subset=None, expression_thresh=-inf, min_samples=0.5)[source]¶

Get number of splicing events in modality categories

Parameters:

sample_subset : str or None, optional

Which subset of the samples to use, based on some phenotype column in the experiment design data. If None, all samples are used.

feature_subset : str or None, optional

Which subset of the features to use, based on some feature type in the expression data (e.g. “variant”). If None, all features are used.

expression_thresh : float, optional

Minimum expression value, of the original input. E.g. if the original input is already log-transformed, then this threshold is on the log values.

Returns:

modalities : pandas.DataFrame

A (n_phenotypes, n_modalities) shaped DataFrame of the number of events assigned to each modality

nmf_space_positions(data_type='splicing')[source]¶

nmf_space_transitions(phenotype_transitions='all', data_type='splicing', n=0.5)[source]¶

The change in NMF space of splicing events across phenotypes

Parameters:

phenotype_transitions : list of length-2 tuples of str

List of (‘phenotype1’, ‘phenotype2’) transitions whose change in distribution you are interested in

data_type : ‘splicing’ | ‘expression’

Which data type to calculate this on. (default=’splicing’)

n : int

Minimum number of samples per phenotype, per event

Returns:

big_transitions : pandas.DataFrame

A (n_events, n_transitions) dataframe of the NMF distances between splicing events

normalize_to_spikein()[source]¶

percent_pooled_inconsistent(sample_subset=None, feature_subset=None, fraction_diff_thresh=0.1, expression_thresh=-inf)[source]¶

percent_unique_celltype_events(n=1)[source]¶

phenotype_col[source]¶

phenotype_color_ordered[source]¶

phenotype_order[source]¶

phenotype_to_color[source]¶

phenotype_to_marker[source]¶

phenotype_transitions[source]¶

plot_big_nmf_space_transitions(data_type='expression', n=5)[source]¶

plot_classifier(trait, sample_subset=None, feature_subset='all_genes', data_type='expression', title='', show_point_labels=False, **kwargs)[source]¶

Plot a predictor for the specified data type and trait(s)

Parameters:

data_type : str

One of the names of the data types, e.g. “expression” or “splicing”

trait : str

Column name in the metadata data that you would like to classify on

plot_clustermap(sample_subset=None, feature_subset=None, data_type='expression', metric='euclidean', method='average', figsize=None, scale_fig_by_data=True, **kwargs)[source]¶: Visualize hierarchical relationships within samples and features

plot_correlations(sample_subset=None, feature_subset=None, data_type='expression', metric='euclidean', method='average', figsize=None, featurewise=False, scale_fig_by_data=True, **kwargs)[source]¶: Visualize clustered correlations of samples across features

plot_event(feature_id, sample_subset=None, nmf_space=False)[source]¶: Plot the violinplot and NMF transitions of a splicing event

plot_event_modality_estimation(event_id, sample_subset=None, expression_thresh=-inf)[source]¶

plot_expression_vs_inconsistent_splicing(bins=None)[source]¶

plot_gene(feature_id, sample_subset=None, nmf_space=False)[source]¶

plot_graph(data_type='expression', sample_subset=None, feature_subset=None, featurewise=False, **kwargs)[source]¶

Plot the graph (network) of these data

Parameters:

data_type : str

One of the names of the data types, e.g. “expression” or “splicing”

sample_subset : str or None

Which subset of the samples to use, based on some phenotype column in the experiment design data. If None, all samples are used.

feature_subset : str or None

Which subset of the features to use, based on some feature type in the expression data (e.g. “variant”). If None, all features are used.

plot_lavalamp_pooled_inconsistent(sample_subset=None, feature_subset=None, fraction_diff_thresh=0.1, expression_thresh=-inf)[source]¶

plot_lavalamps(sample_subset=None, feature_subset=None, expression_thresh=-inf)[source]¶

plot_modalities_bars(sample_subset=None, feature_subset=None, expression_thresh=-inf, percentages=True)[source]¶

Make grouped barplots of the number of modalities per phenotype

Parameters:

sample_subset : str or None

Which subset of the samples to use, based on some phenotype column in the experiment design data. If None, all samples are used.

feature_subset : str or None

Which subset of the features to use, based on some feature type in the expression data (e.g. “variant”). If None, all features are used.

expression_thresh : float

If greater than -inf, then filter on splicing events in genes with expression at least this value

percentages : bool

If True, plot percentages instead of counts

plot_modalities_lavalamps(sample_subset=None, feature_subset=None, expression_thresh=-inf)[source]¶

Plot each modality in each celltype on a separate axes

Parameters:

sample_subset : str or None

Which subset of the samples to use, based on some phenotype column in the experiment design data. If None, all samples are used.

feature_subset : str or None

Which subset of the features to use, based on some feature type in the expression data (e.g. “variant”). If None, all features are used.

expression_thresh : float

If greater than -inf, then filter on splicing events in genes with expression at least this value

plot_modalities_reduced(sample_subset=None, feature_subset=None, expression_thresh=-inf)[source]¶

Plot splicing events with modality assignments in NMF space

This will plot a separate NMF space for each celltype in the data, as well as one for all samples.

Parameters:

sample_subset : str or None

Which subset of the samples to use, based on some phenotype column in the experiment design data. If None, all samples are used.

feature_subset : str or None

Which subset of the features to use, based on some feature type in the expression data (e.g. “variant”). If None, all features are used.

expression_thresh : float

If greater than -inf, then filter on splicing events in genes with expression at least this value

plot_pca(data_type='expression', x_pc=1, y_pc=2, sample_subset=None, feature_subset=None, title='', featurewise=False, plot_violins=False, show_point_labels=False, reduce_kwargs=None, color_samples_by=None, bokeh=False, most_variant_features=False, std_multiplier=2, scale_by_variance=True, **kwargs)[source]¶

Performs DataFramePCA on both expression and splicing study_data

Parameters:

data_type : str

One of the names of the data types, e.g. “expression” or “splicing” (default “expression”)

x_pc : int, optional

Which principal component to plot on the x-axis (default 1)

y_pc : int, optional

Which principal component to plot on the y-axis (default 2)

sample_subset : str or None

Which subset of the samples to use, based on some phenotype column in the experiment design data. If None, all samples are used. (default None)

feature_subset : str or None

Which subset of the features to use, based on some feature type in the expression data (e.g. “variant”). If None, all features are used. (default None)

title : str, optional

Title of the reduced space plot (default ‘’)

featurewise : bool, optional

If True, the features are reduced on the samples, and the plotted points are features, not samples. (default False)

plot_violins : bool

Whether or not to make the violinplots of the top features. This can take a long time, so to save time you can turn it off if you just want a quick look at the PCA. (default False)

show_point_labels : bool, optional

Whether or not to show the labels of the points. If this is samplewise (default), then this labels the samples. If this is featurewise, then this labels the features. (default False)

reduce_kwargs : dict, optional

Keyword arguments to the reducer (default None)

color_samples_by : str, optional

Instead of coloring the samples by their phenotype, color them by this column in the metadata. (default None)

bokeh : bool, optional

If True, plot an interactive (JavaScript-based) bokeh plot instead of a static printable figure (default False)

most_variant_features : bool, optional

If True, then only take the most variant of the provided features. The most variant features are those whose variance is std_multiplier standard deviations away from the mean feature variance (default False)

std_multiplier : float, optional

If most_variant_features is True, then use this as a cutoff for the minimum variance of a feature to be included (default 2)

scale_by_variance : bool, optional

If True, then scale the x- and y-axes by the explained variance ratio of the principal component dimensions. Only valid for PCA and its variations, not for NMF or tSNE. (default True)

kwargs : other keyword arguments

All other keyword arguments are passed to DecompositionViz.plot()
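The interaction between most_variant_features and std_multiplier described above can be sketched in plain Python. This is a minimal illustration, not flotilla's implementation: the function name `select_most_variant`, the dict-of-lists data layout, and the reading of “away from the mean” as “above the mean” are all assumptions.

```python
from statistics import mean, pstdev, pvariance


def select_most_variant(data, std_multiplier=2):
    """Return the feature names whose variance exceeds the cutoff.

    `data` is a hypothetical stand-in for a samples x features table:
    a dict mapping each feature name to its list of per-sample values.
    """
    # Variance of each feature across samples
    variances = {name: pvariance(values) for name, values in data.items()}
    spread = list(variances.values())
    # Assumed cutoff: mean feature variance plus std_multiplier
    # standard deviations of the feature variances
    cutoff = mean(spread) + std_multiplier * pstdev(spread)
    return [name for name, var in variances.items() if var > cutoff]


# Nine flat features and one that swings widely between samples
data = {"gene_%d" % i: [1, 1, 1, 1] for i in range(9)}
data["gene_9"] = [0, 10, 0, 10]
selected = select_most_variant(data, std_multiplier=2)
```

Raising std_multiplier tightens the cutoff, so fewer features survive; with a large enough value, none do.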

plot_regressor(data_type='expression', **kwargs)[source]¶

plot_two_features(feature1, feature2, data_type='expression', **kwargs)[source]¶

Make a scatterplot of two features’ data

Parameters:


feature1 : str

Name of the feature to plot on the x-axis. If you have a feature_data dataframe for this data type, this will attempt to map the common name, e.g. “RBFOX2”, back to the underlying feature ID, e.g. “ENSG00000100320”

feature2 : str

Name of the feature to plot on the y-axis. If you have a feature_data dataframe for this data type, this will attempt to map the common name, e.g. “RBFOX2”, back to the underlying feature ID, e.g. “ENSG00000100320”
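The name lookup described for feature1 and feature2 can be pictured as a reverse dictionary search. The sketch below is only an illustration of the idea: `resolve_feature_id` and the `id_to_common_name` dict are hypothetical names, and flotilla's own lookup against the feature_data dataframe may behave differently (for example when several IDs share a common name).

```python
def resolve_feature_id(name, id_to_common_name):
    """Map a common name back to a feature ID, if a mapping is known.

    `id_to_common_name` is a hypothetical dict standing in for one column
    of a feature_data dataframe: feature ID -> common name.
    """
    if name in id_to_common_name:
        # The name is already a feature ID
        return name
    for feature_id, common_name in id_to_common_name.items():
        if common_name == name:
            # First feature ID carrying this common name
            return feature_id
    # No mapping found: fall back to the name as given
    return name


feature_names = {"ENSG00000100320": "RBFOX2"}
feature_id = resolve_feature_id("RBFOX2", feature_names)
```

Falling back to the input when no mapping exists mirrors the “will attempt” wording: the lookup is best-effort rather than required.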

plot_two_samples(sample1, sample2, data_type='expression', **kwargs)[source]¶

Plot a scatterplot of two samples’ data

Parameters:

sample1 : str

Name of the sample to plot on the x-axis

sample2 : str

Name of the sample to plot on the y-axis

data_type : “expression” | “splicing”

Type of data to plot. Default “expression”

kwargs : other keyword arguments

Any other keyword arguments valid for seaborn.jointplot

Returns:

jointgrid : seaborn.axisgrid.JointGrid

Returns a JointGrid instance

See Also:

seaborn.jointplot