flotilla.data_model.study module

Data models for “studies”. Studies include attributes about the data and are heavier in terms of data load.

class flotilla.data_model.study.Study(sample_metadata, version='0.1.0', expression_data=None, expression_feature_data=None, expression_feature_rename_col=None, expression_feature_ignore_subset_cols=None, expression_log_base=None, expression_thresh=-inf, expression_plus_one=False, splicing_data=None, splicing_feature_data=None, splicing_feature_rename_col=None, splicing_feature_ignore_subset_cols=None, mapping_stats_data=None, mapping_stats_number_mapped_col=None, mapping_stats_min_reads=500000.0, spikein_data=None, spikein_feature_data=None, drop_outliers=True, species=None, gene_ontology_data=None, predictor_config_manager=None, metadata_pooled_col='pooled', metadata_phenotype_col='phenotype', metadata_phenotype_order=None, metadata_phenotype_to_color=None, metadata_phenotype_to_marker=None, license=None, title=None, sources=None, default_sample_subset='all_samples', default_feature_subset='variant', metadata_minimum_samples=0)[source]

Bases: object

A biological study, with associated metadata, expression, and splicing data.

Construct a biological study

This class only accepts data, no filenames. All data must already have been read in and exist as Python objects.

Parameters:

sample_metadata : pandas.DataFrame

The only required parameter. Samples as the index, with features as columns. Required column: “phenotype”. If there is a boolean column “pooled”, this will be used to separate pooled from single cells. Similarly, the column “outliers” will also be used to separate outlier cells from the rest.

version : str

A string describing the semantic version of the data. Must be in major.minor.patch format, because the “patch” number is increased if you change something in the study and then call study.save(). (default “0.1.0”)

expression_data : pandas.DataFrame

Samples x feature dataframe of gene expression measurements, e.g. from an RNA-Seq or a microarray experiment. Assumed to be log-transformed, i.e. you took the log of it. (default None)

expression_feature_data : pandas.DataFrame

Features x annotations dataframe describing other parameters of the gene expression features, e.g. mapping Ensembl IDs to gene symbols or gene biotypes. (default None)

expression_feature_rename_col : str

A column name in the expression_feature_data dataframe that you’d like to rename the expression features to, in the plots. For example, if your gene IDs are Ensembl IDs, but you want to plot UCSC IDs, make sure the column you want, e.g. “ucsc_id” is in your dataframe and specify that. (default “gene_name”)

expression_log_base : float

If you want to log-transform your expression data (and it’s not already log-transformed), use this number as the base of the transform. E.g. expression_log_base=10 will take the log10 of your data. (default None)

expression_thresh : float

Minimum (non log-transformed) expression value. (default -inf)

expression_plus_one : bool

Whether or not to add 1 to the expression data. (default False)

splicing_data : pandas.DataFrame

Samples x feature dataframe of percent spliced in scores, e.g. as measured by the program MISO. Assumed that these values only fall between 0 and 1.

splicing_feature_data : pandas.DataFrame

Features x other_features dataframe describing other parameters of the splicing features, e.g. mapping MISO IDs to Ensembl IDs, gene symbols, or transcript types.

splicing_feature_rename_col : str

A column name in the splicing_feature_data dataframe that you’d like to rename the splicing features to, in the plots. For example, if your splicing IDs are MISO IDs, but you want to plot Ensembl IDs, make sure the column you want, e.g. “ensembl_id” is in your dataframe and specify that. Default “gene_name”.

mapping_stats_data : pandas.DataFrame

Samples x feature dataframe of mapping stats measurements. Currently, this

mapping_stats_number_mapped_col : str

A column name in the mapping_stats_data which specifies the number of (uniquely or not) mapped reads. Default “Uniquely mapped reads number”

spikein_data : pandas.DataFrame

samples x features DataFrame of spike-in expression values

spikein_feature_data : pandas.DataFrame

Features x other_features dataframe, e.g. of the molecular concentration of particular spikein transcripts

drop_outliers : bool

Whether or not to drop samples marked as outliers in sample_metadata from the other data. That is, if sample_metadata has a column named ‘outlier’, the samples it flags are removed from expression_data before further analysis.

species : str

Name of the species and genome version, e.g. ‘hg19’ or ‘mm10’.

gene_ontology_data : pandas.DataFrame

Gene ids x ontology categories dataframe used for GO analysis.

metadata_pooled_col : str

Column in metadata_data which specifies as a boolean whether or not this sample was pooled.
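The version parameter described above must follow the major.minor.patch pattern because the patch number is increased when a changed study is saved. As a rough illustration of that kind of bump, here is a minimal sketch. The helper name bump_patch is hypothetical and is not part of flotilla, whose own implementation may differ.

```python
def bump_patch(version):
    """Return `version` with its patch number increased by one.

    Hypothetical helper for illustration only; not flotilla's API.
    """
    parts = version.split('.')
    # Require exactly three non-negative integer fields: major.minor.patch
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(
            'version must be in major.minor.patch format, got %r' % version)
    major, minor, patch = (int(part) for part in parts)
    return '%d.%d.%d' % (major, minor, patch + 1)


print(bump_patch('0.1.0'))  # 0.1.1
```

A string such as '1.2' is rejected with a ValueError, which is one way to surface a malformed version early rather than at save time.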

celltype_event_counts[source]

Number of cells that detected this event in that celltype

celltype_modalities[source]

Return modality assignments of each celltype

celltype_sizes(data_type='splicing')[source]
compute_expression_splicing_covariance()[source]
default_feature_set_ids = ['not (gene_type: snoRNA)', 'gene_type: IG_V_gene', 'not (transcription_factor)', 'gene_type: IG_V_pseudogene', 'not (gene_type: snRNA)', 'gene_type: lincRNA', 'not (transcript_type: misc_RNA)', 'gene_type: snoRNA', 'transcript_type: snRNA', 'not (level: 2)', 'not (gene_status: NOVEL)', 'not (gene_type: pseudogene)', 'not (gene_type: protein_coding)', 'not (transcript_type: protein_coding)', 'not (transcript_type: sense_intronic)', 'ribosomal', 'not (gene_status: KNOWN)', 'biomark_neural_panel', 'not (gene_type: miRNA)', 'gene_type: snRNA', 'not (gene_type: sense_intronic)', 'transcript_type: IG_V_gene', 'transcript_type: pseudogene', 'not (biomark_neural_panel)', 'transcript_type: protein_coding', 'not (tag: pseudo_consens)', 'transcript_type: lincRNA', 'not (transcript_type: lincRNA)', 'not (ribosomal_subunit)', 'tag: ncRNA_host', 'not (level: 1)', 'not (transcript_status: NOVEL)', 'gene_status: NOVEL', 'transcription_factor', 'not (rbp)', 'not (gene_type: IG_V_gene)', 'not (transcript_type: snoRNA)', 'not (transcript_status: KNOWN)', 'transcript_status: NOVEL', 'not (gene_type: lincRNA)', 'gene_type: misc_RNA', 'not (transcript_type: IG_V_pseudogene)', 'transcript_type: snoRNA', 'transcript_status: KNOWN', 'confident_rbp', 'not (gene_type: IG_V_pseudogene)', 'variant', 'synapse', 'all features', 'gene_type: miRNA', 'gene_type: sense_intronic', 'not (transcript_type: IG_V_gene)', 'gene_type: antisense', 'not (transcript_type: snRNA)', 'not (tag: ncRNA_host)', 'gene_type: protein_coding', 'gene_status: KNOWN', 'transcript_type: misc_RNA', 'not (level: 3)', 'not (transcript_type: pseudogene)', 'not (transcript_type: antisense)', 'not (gene_type: antisense)', 'ribosomal_subunit', 'not (gene_type: misc_RNA)', 'level: 1', 'level: 2', 'level: 3', 'gene_type: pseudogene', 'transcript_type: antisense', 'not (transcript_type: miRNA)', 'rbp', 'not (synapse)', 'not (ribosomal)', 'not (confident_rbp)', 'transcript_type: miRNA', 'transcript_type: sense_intronic', 'tag: pseudo_consens', 'transcript_type: IG_V_pseudogene', 'not (gene_category: LPS Response)', 'gene_category: LPS Response', 'gene_category: Housekeeping', 'not (gene_category: Housekeeping)']
default_feature_subsets[source]
default_sample_subsets[source]
detect_outliers(data_type='expression', sample_subset=None, feature_subset=None, featurewise=False, reducer=None, standardize=None, reducer_kwargs=None, bins=None, outlier_detection_method=None, outlier_detection_method_kwargs=None)[source]
drop_outliers()[source]

Remove samples labeled “outlier” in self.metadata, and replace the data in self.expression and self.splicing with the smaller versions.

feature_subset_to_feature_ids(data_type, feature_subset=None, rename=False)[source]

Given a name of a feature subset, get the associated feature ids

Parameters:

data_type : str

A string describing the data type, e.g. “expression”

feature_subset : str

A string describing the subset of data type (must be already calculated)

Returns:

feature_ids : list of strings

List of feature ids from the specified data type

classmethod from_datapackage(datapackage, datapackage_dir='./', load_species_data=True, species_datapackage_base_url='http://sauron.ucsd.edu/flotilla_projects')[source]

Create a study object from a datapackage dictionary

Parameters:

datapackage : dict

Returns:

study : flotilla.Study

Study object

classmethod from_datapackage_file(datapackage_filename, load_species_data=True, species_datapackage_base_url='http://sauron.ucsd.edu/flotilla_projects')[source]
classmethod from_datapackage_url(datapackage_url, load_species_data=True, species_data_package_base_url='http://sauron.ucsd.edu/flotilla_projects')[source]

Create a study from a url of a datapackage.json file

Parameters:

datapackage_url : str

HTTP url of a datapackage.json file, following the specification described here: http://dataprotocols.org/data-packages/ and requiring the following data resources: metadata, expression, splicing

species_data_package_base_url : str

Base URL to fetch species-specific gene and splicing event metadata from. Default ‘http://sauron.ucsd.edu/flotilla_projects’

Returns:

study : Study

A “study” object containing the data described in the datapackage_url file

Raises:

AttributeError

If the datapackage.json file does not contain the required resources of metadata, expression, and splicing.
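The required-resources rule above can be pictured as a check over the names listed in the datapackage. The following is only a sketch under stated assumptions: it assumes the datapackage dictionary has a "resources" list whose items each carry a "name" key, as in the Data Package specification. The function check_required_resources, the package name, and the file paths are hypothetical and not taken from flotilla.

```python
import json

# Resource names the documentation above lists as required
REQUIRED_RESOURCES = ('metadata', 'expression', 'splicing')


def check_required_resources(datapackage):
    """Raise AttributeError if a required resource name is missing.

    Hypothetical helper; assumes a "resources" list of dicts with "name" keys.
    """
    names = {resource.get('name')
             for resource in datapackage.get('resources', [])}
    missing = [name for name in REQUIRED_RESOURCES if name not in names]
    if missing:
        raise AttributeError(
            'datapackage is missing required resources: ' + ', '.join(missing))


# A made-up datapackage.json that lacks the splicing resource
datapackage = json.loads('''
{
  "name": "example_study",
  "resources": [
    {"name": "metadata", "path": "metadata.csv", "format": "csv"},
    {"name": "expression", "path": "expression.csv", "format": "csv"}
  ]
}
''')

try:
    check_required_resources(datapackage)
except AttributeError as error:
    print(error)  # datapackage is missing required resources: splicing
```

Running a check like this on the parsed JSON before calling from_datapackage_url may make a missing resource easier to diagnose than the AttributeError raised during loading.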

initializers = {'splicing_data': <class 'flotilla.data_model.splicing.SplicingData'>, 'expression_data': <class 'flotilla.data_model.expression.ExpressionData'>, 'mapping_stats_data': <class 'flotilla.data_model.quality_control.MappingStatsData'>, 'spikein_data': <class 'flotilla.data_model.expression.SpikeInData'>, 'metadata_data': <class 'flotilla.data_model.metadata.MetaData'>}
interactive_choose_outliers(data_types=('expression', 'splicing'), sample_subsets=None, feature_subsets=None, featurewise=False, x_pc=(1, 3), y_pc=(1, 3), show_point_labels=False, kernel=['rbf', 'linear', 'poly', 'sigmoid'], gamma=(0, 25), nu=(0.1, 9.9))
interactive_classifier(data_types=('expression', 'splicing'), sample_subsets=None, feature_subsets=None, categorical_variables=None, predictor_types=None, score_coefficient=(0.1, 20), draw_labels=False, savefile='')
interactive_graph(data_types=('expression', 'splicing'), sample_subsets=None, feature_subsets=None, featurewise=False, cov_std_cut=(0.1, 3), degree_cut=(0, 10), n_pcs=(2, 100), draw_labels=False, feature_of_interest='RBFOX2', weight_fun=None, use_pc_1=True, use_pc_2=True, use_pc_3=True, use_pc_4=True, savefile='')
interactive_lavalamp_pooled_inconsistent(sample_subsets=None, feature_subsets=None, difference_threshold=(0.001, 1.0), colors=['red', 'green', 'blue', 'purple', 'yellow'], savefile='')
interactive_pca(data_types=('expression', 'splicing'), sample_subsets=None, feature_subsets=None, featurewise=False, x_pc=(1, 10), y_pc=(1, 10), show_point_labels=False, list_link='', plot_violins=False, savefile='')
interactive_reset_outliers()

User selects from columns that start with ‘outlier_’ to merge multiple outlier classifications.

jsd()[source]

Performs Jensen-Shannon Divergence on both splicing and expression study_data

Jensen-Shannon divergence is a method of quantifying the amount of change in distribution of one measurement (e.g. a splicing event or a gene expression) from one celltype to another.
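For reference, the Jensen-Shannon divergence between two discrete distributions P and Q is 0.5 * KL(P || M) + 0.5 * KL(Q || M), where M is the average of P and Q and KL is the Kullback-Leibler divergence. The sketch below computes it for two already-binned distributions using base-2 logarithms, so the result lies between 0 and 1. How flotilla bins the measurements before this step is not shown here, and the binned values and function names are made up for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence of two distributions over the same bins."""
    # m is the pointwise average; it is positive wherever p or q is positive
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Made-up binned distributions of one feature in two celltypes
celltype_a = [0.7, 0.2, 0.1]
celltype_b = [0.1, 0.2, 0.7]

print(jensen_shannon_divergence(celltype_a, celltype_b))
print(jensen_shannon_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Identical distributions give 0 and non-overlapping ones give 1, which is why the measure is convenient for ranking features by how much they change between celltypes.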

static load_species_data(species, readers, species_datapackage_base_url='http://sauron.ucsd.edu/flotilla_projects')[source]
static maybe_make_directory(filename)[source]
modalities(sample_subset=None, feature_subset=None)[source]

Get modality assignments of

normalize_to_spikein()[source]
percent_pooled_inconsistent(sample_subset=None, feature_ids=None, fraction_diff_thresh=0.1)[source]
percent_unique_celltype_events(n=1)[source]
plot_big_nmf_space_transitions(data_type='expression')[source]
plot_classifier(trait, sample_subset=None, feature_subset='all_genes', data_type='expression', title='', show_point_labels=False, **kwargs)[source]

Plot a predictor for the specified data type and trait(s)

Parameters:

data_type : str

One of the names of the data types, e.g. “expression” or “splicing”

trait : str

Column name in the metadata data that you would like to classify on

plot_event(feature_id, sample_subset=None, nmf_space=False)[source]

Plot the violinplot and DataFrameNMF transitions of a splicing event

plot_gene(feature_id, sample_subset=None, nmf_space=False)[source]
plot_graph(data_type='expression', sample_subset=None, feature_subset=None, featurewise=False, **kwargs)[source]

Plot the graph (network) of these data

Parameters:

data_type : str

One of the names of the data types, e.g. “expression” or “splicing”

sample_subset : str or None

Which subset of the samples to use, based on some phenotype column in the experiment design data. If None, all samples are used.

feature_subset : str or None

Which subset of the features to use, based on some feature type in the expression data (e.g. “variant”). If None, all features are used.

plot_lavalamp_pooled_inconsistent(sample_subset=None, feature_ids=None, fraction_diff_thresh=0.1)[source]
plot_modalities(sample_subset=None, feature_subset=None, normed=True)[source]
plot_modalities_lavalamps(sample_subset=None, bootstrapped=False, bootstrapped_kws=None)[source]
plot_pca(data_type='expression', x_pc=1, y_pc=2, sample_subset=None, feature_subset=None, title='', featurewise=False, plot_violins=True, show_point_labels=False, reduce_kwargs=None, **kwargs)[source]

Performs DataFramePCA on both expression and splicing study_data

Parameters:

data_type : str

One of the names of the data types, e.g. “expression” or “splicing”

x_pc : int

Which principal component to plot on the x-axis

y_pc : int

Which principal component to plot on the y-axis

sample_subset : str or None

Which subset of the samples to use, based on some phenotype column in the experiment design data. If None, all samples are used.

feature_subset : str or None

Which subset of the features to use, based on some feature type in the expression data (e.g. “variant”). If None, all features are used.

title : str

The title of the plot

plot_violins : bool

Whether or not to make the violinplots of the top features. This can take a long time, so to save time you can turn it off if you just want a quick look at the PCA.

show_point_labels : bool

Whether or not to show the labels of the points. If this is samplewise (default), then this labels the samples. If this is featurewise, then this labels the features.

plot_regressor(data_type='expression', **kwargs)[source]
plot_study_sample_legend()[source]
plot_two_features(feature1, feature2, data_type='expression', **kwargs)[source]

Make a scatterplot of two features’ data

Parameters:

feature1 : str

Name of the feature to plot on the x-axis. If you have a feature_data dataframe for this data type, will attempt to map the common name, e.g. “RBFOX2” back to the crazy name, e.g. “ENSG00000100320”

feature2 : str

Name of the feature to plot on the y-axis. If you have a feature_data dataframe for this data type, will attempt to map the common name, e.g. “RBFOX2” back to the crazy name, e.g. “ENSG00000100320”

plot_two_samples(sample1, sample2, data_type='expression', **kwargs)[source]

Plot a scatterplot of two samples’ data

Parameters:

sample1 : str

Name of the sample to plot on the x-axis

sample2 : str

Name of the sample to plot on the y-axis

data_type : “expression” | “splicing”

Type of data to plot. Default “expression”

Any other keyword arguments valid for seaborn.jointplot

Returns:

jointgrid : seaborn.axisgrid.JointGrid

Returns a JointGrid instance

See Also

seaborn.jointplot

readers = {'tsv': <function load_tsv at 0x2ba7e5fda578>, 'json': <function load_json at 0x2ba7e5fda5f0>, 'csv': <function load_csv at 0x2ba7e5fda6e0>, 'gzip_pickle_df': <function load_gzip_pickle_df at 0x2ba7e5fda488>, 'pickle_df': <function load_pickle_df at 0x2ba7e5fda398>}
sample_subset_to_sample_ids(phenotype_subset=None)[source]

Convert a string naming a subset of phenotypes in the data into sample ids

Parameters:

phenotype_subset : str

A valid string describing a boolean phenotype described in the metadata data

Returns:

sample_ids : list of strings

List of sample ids in the data
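The idea of turning a boolean metadata column into a list of sample ids can be sketched without flotilla. In this dependency-free illustration a plain dict keyed by sample id stands in for the sample metadata DataFrame. The function name, sample ids, and column values are all hypothetical, and the real method may accept a richer subset syntax than a bare column name.

```python
def boolean_column_to_sample_ids(metadata, column):
    """Return the sorted ids of samples whose value in `column` is True.

    `metadata` maps sample id -> dict of column values. Hypothetical
    stand-in for a samples-as-index DataFrame.
    """
    return sorted(sample for sample, row in metadata.items()
                  if row.get(column) is True)


# Made-up sample metadata with the boolean "pooled" column
metadata = {
    'cell_1': {'phenotype': 'A', 'pooled': False},
    'cell_2': {'phenotype': 'B', 'pooled': False},
    'pool_1': {'phenotype': 'A', 'pooled': True},
}

print(boolean_column_to_sample_ids(metadata, 'pooled'))  # ['pool_1']
```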

save(name, flotilla_dir='/home/travis/flotilla_projects')[source]
unique_celltype_event_counts(n=1)[source]
Olga B. Botvinnik is funded by the NDSEG fellowship and is a NumFOCUS John Hunter Technology Fellow.
Michael T. Lovci was partially funded by a fellowship from Genentech.
Partially funded by NIH grants NS075449 and HG004659 and CIRM grants RB4-06045 and TR3-05676 to Gene Yeo.