flotilla.data_model.splicing module

class flotilla.data_model.splicing.DownsampledSplicingData(df, sample_descriptors)[source]

Bases: flotilla.data_model.base.BaseData

Instantiate an object of downsampled splicing data

Parameters:

df : pandas.DataFrame

A “tall” dataframe of all MISO summary events, with the usual MISO summary columns plus these required ones: ‘splice_type’, ‘probability’, and ‘iteration’. Here “probability” is the random sampling probability applied to the bam file used to generate these reads, and “iteration” is the integer index of the resampling iteration, e.g. if multiple resamplings were performed.

experiment_design_data : pandas.DataFrame

Notes

Warning: this data is usually HUGE (we’re talking ~10 GB raw .tsv files), so make sure you have enough memory available to work with it.
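The required “tall” layout can be sketched with a toy pandas DataFrame (event names and values below are made up for illustration):

```python
import pandas as pd

# Toy "tall" MISO summary table: one row per event, per sampling
# probability, per resampling iteration (hypothetical values).
df = pd.DataFrame({
    'event_name': ['exon1@chr1', 'exon1@chr1', 'exon2@chr2'],
    'splice_type': ['SE', 'SE', 'MXE'],
    'probability': [0.1, 0.5, 0.1],  # random sampling probability of the bam reads
    'iteration': [1, 2, 1],          # which resampling iteration
})

# The three columns the constructor requires
required = {'splice_type', 'probability', 'iteration'}
assert required.issubset(df.columns)
```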

binned_reducer = None
n_components = 2
raw_reducer = None
shared_events[source]
Returns:

event_count_df : pandas.DataFrame

Splicing events on the rows, splice types and probability as column MultiIndex. Values are the number of iterations which share this splicing event at that probability and splice type.
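The counting itself can be sketched as a pivot over a toy tall table (a hypothetical reimplementation of the idea, not the actual method body):

```python
import pandas as pd

# Toy tall table: which events were seen in which resampling iterations
df = pd.DataFrame({
    'event_name':  ['e1', 'e1', 'e1', 'e2'],
    'splice_type': ['SE', 'SE', 'SE', 'MXE'],
    'probability': [0.1, 0.1, 0.5, 0.1],
    'iteration':   [1, 2, 1, 1],
})

# Events on the rows, (splice_type, probability) as a column MultiIndex;
# values count the iterations sharing that event at that setting
event_count_df = df.pivot_table(index='event_name',
                                columns=['splice_type', 'probability'],
                                values='iteration', aggfunc='nunique')
```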

shared_events_barplot(figure_dir='./')[source]

Plot a “histogram”, via colored bars, of the number of events shared by different iterations at a particular sampling probability

Parameters:

figure_dir : str

Where to save the pdf figures created

shared_events_percentage(min_iter_shared=5, figure_dir='./')[source]

Plot the percentage of all events detected at each iteration that are shared by at least ‘min_iter_shared’ iterations

Parameters:

min_iter_shared : int

Minimum number of iterations sharing an event

figure_dir : str

Where to save the pdf figures created

class flotilla.data_model.splicing.SpliceJunctionData(df, phenotype_data)[source]

Bases: flotilla.data_model.splicing.SplicingData

Class to hold splice junction information from SJ.out.tab files from STAR

Constructor for SpliceJunctionData

Parameters:

data, experiment_design_data
class flotilla.data_model.splicing.SplicingData(data, feature_data=None, binsize=0.1, outliers=None, feature_rename_col=None, feature_ignore_subset_cols=None, excluded_max=0.2, included_min=0.8, pooled=None, predictor_config_manager=None, technical_outliers=None, minimum_samples=0)[source]

Bases: flotilla.data_model.base.BaseData

Instantiate an object for percent spliced-in (PSI) scores

Parameters:

data : pandas.DataFrame

A [n_events, n_samples] dataframe of PSI scores for splicing events

n_components : int

Number of components to use in the reducer

binsize : float

Value between 0 and 1, the bin size for binning the PSI scores

excluded_max : float

Maximum value for the “excluded” bin of psi scores. Default 0.2.

included_min : float

Minimum value for the “included” bin of psi scores. Default 0.8.

Notes

‘thresh’ from BaseData is not used.
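A minimal sketch of the expected input and the default bin edges (event and sample names are hypothetical, and whether flotilla treats the edges as inclusive is an assumption here):

```python
import pandas as pd

# Toy [n_events, n_samples] PSI matrix (names are hypothetical)
psi = pd.DataFrame([[0.05, 0.10, 0.95],
                    [0.90, 0.85, 0.15]],
                   index=['event1', 'event2'],
                   columns=['sample1', 'sample2', 'sample3'])

excluded_max, included_min = 0.2, 0.8  # default bin edges
excluded = psi <= excluded_max         # PSI scores in the "excluded" bin
included = psi >= included_min         # PSI scores in the "included" bin
```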

binify(data)[source]
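binify can be sketched as fixed-width histogram binning of each event’s PSI scores on [0, 1] (a hypothetical reimplementation; the real method’s output layout may differ):

```python
import numpy as np
import pandas as pd

def binify(data, binsize=0.1):
    """Bin each event's PSI scores into fixed-width bins on [0, 1].

    A minimal sketch of the idea, not flotilla's actual implementation."""
    bins = np.arange(0, 1 + binsize, binsize)
    binned = data.apply(
        lambda row: pd.Series(np.histogram(row.dropna(), bins=bins)[0]),
        axis=1)
    binned.columns = bins[:-1]  # label each bin by its left edge
    return binned

psi = pd.DataFrame([[0.05, 0.15, 0.95]], index=['event1'],
                   columns=['s1', 's2', 's3'])
counts = binify(psi)
```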
binned_reducer = None
modalities(*args, **kwargs)[source]

Assigned modalities for these samples and features.

Parameters:

sample_ids : list of str

Which samples to use. If None, use all. Default None.

feature_ids : list of str

Which features to use. If None, use all. Default None.

bootstrapped : bool

Whether or not to use bootstrapping, i.e. resample each splicing event several times to get a better estimate of its true modality.

bootstrapped_kws : dict

Valid arguments to _bootstrapped_fit_transform. If None, default is dict(n_iter=100, thresh=0.6, minimum_samples=10)

Returns:

modality_assignments : pandas.Series

The modality assignments of each feature given these samples
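The bootstrapping idea can be sketched as resampling each event’s PSI values with replacement and taking the majority call. assign_modality below is a deliberately simplified, hypothetical rule, not flotilla’s model-based estimator:

```python
import numpy as np
import pandas as pd

def assign_modality(psi, excluded_max=0.2, included_min=0.8):
    """Hypothetical, simplified modality rule: call an event by where
    most of its PSI mass lies (flotilla's estimator is model-based)."""
    psi = psi.dropna()
    frac_excluded = (psi <= excluded_max).mean()
    frac_included = (psi >= included_min).mean()
    if frac_excluded > 0.5:
        return 'excluded'
    if frac_included > 0.5:
        return 'included'
    if frac_excluded + frac_included > 0.5:
        return 'bimodal'
    return 'middle'

def bootstrapped_modality(psi, n_iter=100, seed=0):
    # Resample the samples with replacement, then take the majority call
    rng = np.random.default_rng(seed)
    calls = [assign_modality(psi.sample(n=len(psi), replace=True,
                                        random_state=rng))
             for _ in range(n_iter)]
    return pd.Series(calls).mode()[0]

event = pd.Series([0.05, 0.10, 0.00, 0.15, 0.10])  # all low PSI
modality = bootstrapped_modality(event)
```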

modalities_counts(*args, **kwargs)[source]

Count the number of events in each modality for these samples and features

Parameters:

sample_ids : list of str

Which samples to use. If None, use all. Default None.

feature_ids : list of str

Which features to use. If None, use all. Default None.

bootstrapped : bool

Whether or not to use bootstrapping, i.e. resample each splicing event several times to get a better estimate of its true modality. Default False.

bootstrapped_kws : dict

Valid arguments to _bootstrapped_fit_transform. If None, default is dict(n_iter=100, thresh=0.6, minimum_samples=10)

Returns:

modalities_counts : pandas.Series

The number of events detected in each modality
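Given per-feature assignments like those returned by modalities, the counting step amounts to a value_counts over the assignment Series (names below are hypothetical):

```python
import pandas as pd

# Hypothetical per-event modality assignments, as modalities() would return
modality_assignments = pd.Series({'event1': 'included',
                                  'event2': 'excluded',
                                  'event3': 'included'})

# Number of events detected in each modality
modalities_counts = modality_assignments.value_counts()
```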

n_components = 2
percent_pooled_inconsistent(*args, **kwargs)[source]

The percentage of splicing events for which the pooled samples are inconsistent with the single cells

plot_feature(feature_id, sample_ids=None, phenotype_groupby=None, phenotype_order=None, color=None, phenotype_to_color=None, phenotype_to_marker=None, xlabel=None, ylabel=None, nmf_space=False)[source]
plot_hist_single_vs_pooled_diff(sample_ids, feature_ids=None, color=None, title='', hist_kws=None)[source]
plot_lavalamp_pooled_inconsistent(sample_ids, feature_ids=None, fraction_diff_thresh=0.1, color=None)[source]
plot_modalities_bar(sample_ids=None, feature_ids=None, ax=None, i=0, normed=True, legend=True, bootstrapped=False, bootstrapped_kws=None)[source]

Plot stacked bar graph of each modality

Parameters:

bootstrapped : bool

Whether or not to use bootstrapping, i.e. resample each splicing event several times to get a better estimate of its true modality. Default False.

bootstrapped_kws : dict

Valid arguments to _bootstrapped_fit_transform. If None, default is dict(n_iter=100, thresh=0.6, minimum_samples=10)

plot_modalities_lavalamps(sample_ids=None, feature_ids=None, color=None, x_offset=0, use_these_modalities=True, bootstrapped=False, bootstrapped_kws=None, ax=None)[source]

Plot “lavalamp” scatterplot of each event

Parameters:

sample_ids : None or list of str

Which samples to use. If None, use all

feature_ids : None or list of str

Which features to use. If None, use all

color : None or matplotlib color

Which color to use for plotting the lavalamps of these features and samples

x_offset : numeric

How much to offset the x-axis of each event. Useful if you want to plot the same event, but in several iterations with different cell types or colors

ax : None or matplotlib.axes.Axes

Which axes to plot these on

use_these_modalities : bool

If True, then use these sample ids to calculate modalities. Otherwise, use the modalities assigned using ALL samples and features

bootstrapped : bool

Whether or not to use bootstrapping, i.e. resample each splicing event several times to get a better estimate of its true modality. Default False.

bootstrapped_kws : dict

Valid arguments to _bootstrapped_fit_transform. If None, default is dict(n_iter=100, thresh=0.6, minimum_samples=10)

plot_modalities_reduced(sample_ids=None, feature_ids=None, ax=None, title=None, bootstrapped=False, bootstrapped_kws=None)[source]

Plot modality assignments in DataFrameNMF space

Parameters:

bootstrapped : bool

Whether or not to use bootstrapping, i.e. resample each splicing event several times to get a better estimate of its true modality. Default False.

bootstrapped_kws : dict

Valid arguments to _bootstrapped_fit_transform. If None, default is dict(n_iter=100, thresh=0.6, minimum_samples=10)

pooled_inconsistent(*args, **kwargs)[source]

Return splicing events for which pooled samples are consistently different from the single cells.

Parameters:

singles_ids : list-like

List of sample ids of single cells (in the main “.data” DataFrame)

pooled_ids : list-like

List of sample ids of pooled cells (in the other “.pooled” DataFrame)

feature_ids : None or list-like

List of feature ids. If None, use all

fraction_diff_thresh : float

Returns:

large_diff : pandas.DataFrame

All splicing events which have a scaled difference larger than the fraction diff thresh
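One plausible reading of the “scaled difference” test, sketched on toy data (the actual scaling flotilla uses may differ):

```python
import pandas as pd

# Toy PSI for single cells and a pooled sample (hypothetical ids and values)
singles = pd.DataFrame({'s1': [0.1, 0.5], 's2': [0.2, 0.4]},
                       index=['event1', 'event2'])
pooled = pd.Series([0.9, 0.45], index=['event1', 'event2'])

# One plausible "scaled difference": |pooled - mean(singles)| per event
diff = (pooled - singles.mean(axis=1)).abs()
fraction_diff_thresh = 0.1
large_diff = singles.loc[diff > fraction_diff_thresh]
```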

raw_reducer = None
reduce(sample_ids=None, feature_ids=None, featurewise=False, reducer=None, standardize=False, reducer_kwargs=None, bins=None)[source]
Parameters:

sample_ids : list-like

List of sample ids

feature_ids : list-like

List of feature ids

featurewise : bool

If True, reduce the transpose (feature x sample) instead of sample x feature

reducer : DataFrameReducer

DataFrameReducer object, defaults to DataFramePCA

standardize : bool

If True, standardize columns before reduction

reducer_kwargs : dict

Keyword arguments passed to the reducer

bins : array-like

Bins to use for binify
Returns:

reducer object
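The sample-space reduction can be sketched with a plain PCA via SVD (a hypothetical stand-in for DataFramePCA; the real method returns the fitted reducer object rather than the projected data):

```python
import numpy as np
import pandas as pd

def reduce_sketch(data, n_components=2, standardize=False, featurewise=False):
    """Minimal PCA-style reduction; a simplified stand-in for the real
    method, which wraps a DataFrameReducer (DataFramePCA by default)."""
    if featurewise:
        data = data.T  # reduce (feature x sample) instead of (sample x feature)
    X = data.to_numpy(dtype=float)
    X = X - X.mean(axis=0)   # center columns
    if standardize:
        X = X / X.std(axis=0)
    # PCA via SVD: project rows onto the top principal components
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return pd.DataFrame(X @ Vt[:n_components].T, index=data.index)

psi = pd.DataFrame(np.random.default_rng(0).random((5, 4)))
reduced = reduce_sketch(psi, n_components=2)
```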
