flotilla.compute.generic module

flotilla.compute.generic.apply_calc_robust(*args, **kwargs)[source]

Calculate robust regression between the columns of X and y

Parameters:

X : pandas.DataFrame

A (n_samples, n_features) DataFrame of predictor variable values

y : pandas.DataFrame

A (n_samples, m_features) DataFrame of response variable values

verbose : bool, optional

If True, output status messages as the calculation is happening

Returns:

out_I : pandas.Series

Intercept of regressions

out_S : pandas.Series

Slope of regressions

out_T : pandas.Series

t-statistic of regressions

out_P : pandas.Series

p-values of regressions

See also

get_robust_values
This is the underlying function which calculates the slope, intercept, t-value, and p-value of the fit
flotilla.compute.generic.apply_calc_rs(*args, **kwargs)[source]

Apply a correlation (“R-value”) calculation method to each column of X versus the values of y

Parameters:

X : pandas.DataFrame

A (n_samples, n_features) sized DataFrame, assumed to be of log-normal expression values

y : pandas.Series

A (n_samples,) sized Series, assumed to be of percent spliced-in alternative splicing scores

method : function, optional

Which correlation method to use on each feature in X versus the values in y

Returns:

r_coefficients : pandas.Series

Correlation coefficients

p_values : pandas.Series

p-values of the correlations (smaller means more significant)

See also

do_r
This is the underlying function which calculates correlation
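
The column-wise pattern described above can be sketched as follows. This is a minimal illustration, not flotilla's implementation: the helper name correlate_columns is hypothetical, the method is assumed to return an (r, p) pair as scipy.stats.pearsonr does, and the inputs are assumed to contain no missing values.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def correlate_columns(X, y, method=pearsonr):
    """Correlate every column of X with y (hypothetical helper)."""
    r_coefficients = {}
    p_values = {}
    for feature, column in X.items():
        # Assumption: method returns an (r, p) pair, like scipy.stats.pearsonr
        r, p = method(column.to_numpy(), y.to_numpy())
        r_coefficients[feature] = float(r)
        p_values[feature] = float(p)
    # One entry per feature (column) of X
    return pd.Series(r_coefficients), pd.Series(p_values)


# Toy data: one perfectly correlated and one perfectly anti-correlated feature
y = pd.Series(np.arange(10, dtype=float))
X = pd.DataFrame({"up": 2 * y + 1, "down": -y})
r_coefficients, p_values = correlate_columns(X, y)
```

Both returned Series are indexed by the column names of X, matching the r_coefficients and p_values described under Returns.
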
flotilla.compute.generic.apply_calc_slope(*args, **kwargs)[source]

Calculate the slope of the linear regression between the columns of DataFrames X and y

Parameters:

X : pandas.DataFrame

A (n_samples, n_features) DataFrame of predictor variable values

y : pandas.DataFrame

A (n_samples, m_features) DataFrame of response variable values

verbose : bool, optional

If True, output status messages

Returns:

slope : pandas.Series

Slopes of the linear regression

See also

get_slope
This is the underlying function which calculates the slope
flotilla.compute.generic.apply_dcor(*args, **kwargs)[source]

Calculate distance correlation between the columns of two DataFrames

Parameters:

X : pandas.DataFrame

A (n_samples, n_features) DataFrame of predictor variable values

y : pandas.DataFrame

A (n_samples, m_features) DataFrame of response variable values

verbose : bool, optional

If True, output status messages

Returns:

dc : pandas.Series

Distance covariance

dr : pandas.Series

Distance correlation

dvx : pandas.Series

Distance variance of x

dvy : pandas.Series

Distance variance of y

See also

get_dcor
This is the underlying function that gets called to calculate the distance correlation
flotilla.compute.generic.do_r(*args, **kwargs)[source]

Calculate correlation (“R-value”) between two vectors

Parameters:

s_1 : pandas.Series

Predictor vector

s_2 : pandas.Series

Target vector

method : function, optional

Which correlation method to use (default scipy.stats.pearsonr)

min_items : int, optional

Minimum number of items occurring in both s_1 and s_2 (default 12)

Returns:

r_value : float

R-value of the correlation, i.e. how correlated the two inputs are

p_value : float

p-value of the correlation, i.e. the probability of observing a correlation at least this strong under the null hypothesis that the two inputs are uncorrelated

Notes

If fewer than min_items items occur in both s_1 and s_2, returns (np.nan, np.nan)
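
The overlap check can be sketched roughly as below. The helper name correlate_overlap is hypothetical, and two details are assumptions rather than documented behavior: that "overlap" means index labels that are non-null in both series, and that the cutoff is strictly "fewer than min_items".

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def correlate_overlap(s_1, s_2, method=pearsonr, min_items=12):
    """Correlate two Series on their shared, non-null labels (hypothetical helper)."""
    # Keep only the labels that are present and non-null in both series
    s_1, s_2 = s_1.dropna().align(s_2.dropna(), join="inner")
    if len(s_1) < min_items:
        # Assumption: too little overlap yields a pair of NaNs
        return np.nan, np.nan
    r_value, p_value = method(s_1.to_numpy(), s_2.to_numpy())
    return float(r_value), float(p_value)


s_1 = pd.Series(np.arange(20, dtype=float))
s_2 = 3 * s_1 - 2

full_result = correlate_overlap(s_1, s_2)            # 20 shared items
short_result = correlate_overlap(s_1.iloc[:5], s_2)  # only 5 shared items
```

With only five shared items the second call falls below the default min_items of 12, so it returns the NaN pair instead of a correlation.
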

flotilla.compute.generic.dropna_mean(x)[source]

Drop NA values and return the mean
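
A one-line sketch of the behavior described, assuming x is a pandas.Series (the type of x is not documented here) and using the hypothetical name mean_ignoring_na:

```python
import numpy as np
import pandas as pd


def mean_ignoring_na(x):
    """Drop missing values, then average what is left (hypothetical helper)."""
    return x.dropna().mean()


values = pd.Series([1.0, np.nan, 3.0])
result = mean_ignoring_na(values)  # averages 1.0 and 3.0 only
```
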

flotilla.compute.generic.get_boosting_regressor(x, y, verbose=False)[source]

Calculate a GradientBoostingRegressor on predictor and target variables

Parameters:

x : numpy.array

Predictor variable

y : numpy.array

Target variable

verbose : bool, optional

If True, output status messages

Returns:

classifier : sklearn.ensemble.GradientBoostingRegressor

A regressor fitted to the predictor and target variables

flotilla.compute.generic.get_dcor(*args, **kwargs)[source]

Calculate distance correlation between two vectors

Uses the distance correlation package from: https://github.com/andrewdyates/dcor

Parameters:

x : numpy.array

1-dimensional array (aka a vector) of the independent, predictor variable

y : numpy.array

1-dimensional array (aka a vector) of the dependent, target variable

Returns:

dc : float

Distance covariance

dr : float

Distance correlation

dvx : float

Distance variance on x

dvy : float

Distance variance on y
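
For readers unfamiliar with these four quantities, here is a NumPy sketch of the standard sample (V-statistic) definitions of distance covariance, correlation, and variance for two vectors. It is not the code of the dcor package linked above, and the function names are hypothetical.

```python
import numpy as np


def _double_centered_distances(v):
    """Pairwise |v_i - v_j| matrix with row, column, and grand means removed."""
    v = np.asarray(v, dtype=float)
    d = np.abs(v[:, None] - v[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Return (dc, dr, dvx, dvy) for two 1-dimensional arrays (hypothetical helper)."""
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    # Squared sample statistics are means of element-wise products
    dc = np.sqrt(max((a * b).mean(), 0.0))  # distance covariance
    dvx = np.sqrt((a * a).mean())           # distance variance of x
    dvy = np.sqrt((b * b).mean())           # distance variance of y
    denominator = np.sqrt(dvx * dvy)
    # Distance correlation is defined as 0 when either variance is 0
    dr = dc / denominator if denominator > 0 else 0.0
    return dc, dr, dvx, dvy


x = np.arange(10, dtype=float)
y = 2 * x + 1  # exact linear relationship
dc, dr, dvx, dvy = distance_correlation(x, y)
```

For an exact linear relationship such as this one, the distance correlation is 1, and scaling y by 2 doubles its distance variance relative to x.
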

flotilla.compute.generic.get_regressor(x, y, n_estimators=1500, n_tries=5, verbose=False)[source]

Calculate an ExtraTreesRegressor on predictor and target variables

Parameters:

x : numpy.array

Predictor vector

y : numpy.array

Target vector

n_estimators : int, optional

Number of estimators to use

n_tries : int, optional

Number of attempts to calculate regression

verbose : bool, optional

If True, output progress statements

Returns:

classifier : sklearn.ensemble.ExtraTreesRegressor

The regressor with the highest out-of-bag score among all the attempted “tries”

oob_scores : numpy.array

Out-of-bag scores of the regressor

flotilla.compute.generic.get_robust_values(*args, **kwargs)[source]

Calculate robust linear regression

Parameters:

x : numpy.array

Predictor vector

y : numpy.array

Target vector

Returns:

intercept : float

Intercept of the fitted line

slope : float

Slope of the fitted line

t_statistic : float

T-statistic of the fit

p_value : float

p-value of the fit
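
This page does not say which robust estimator is used, so the following is only an illustration of what "robust" buys over ordinary least squares. It assumes Huber weighting solved by iteratively reweighted least squares; the name huber_line_fit and the tuning constant k=1.345 are assumptions, and the t-statistic and p-value are not computed.

```python
import numpy as np


def huber_line_fit(x, y, k=1.345, n_iter=50, tol=1e-8):
    """Fit y = intercept + slope * x with Huber weights (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    # Start from the ordinary least-squares solution
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    for _ in range(n_iter):
        resid = y - design @ beta
        # Robust residual scale from the median absolute deviation
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        if scale == 0:
            break
        u = np.abs(resid) / scale
        # Huber weights: 1 for small residuals, k / u for large ones
        weights = k / np.maximum(u, k)
        root_w = np.sqrt(weights)
        new_beta = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)[0]
        converged = np.allclose(new_beta, beta, rtol=0, atol=tol)
        beta = new_beta
        if converged:
            break
    return float(beta[0]), float(beta[1])


x = np.arange(20, dtype=float)
y = 3 + 2 * x
y[5] += 100  # a single gross outlier

intercept, slope = huber_line_fit(x, y)
ols_slope = float(np.polyfit(x, y, 1)[0])  # ordinary least squares, for comparison
```

On this toy data the outlier drags the ordinary least-squares slope well away from 2, while the reweighted fit stays close to the line followed by the other nineteen points.
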

flotilla.compute.generic.get_slope(*args, **kwargs)[source]

Get the linear regression slope of x and y

Parameters:

x : numpy.array

X-values of data

y : numpy.array

Y-values of data

Returns:

slope : float

Slope of the fit, as computed by scipy.stats.linregress
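
A short sketch of obtaining a slope from scipy.stats.linregress, which the return value above refers to. The wrapper name slope_of is hypothetical, and any input cleaning flotilla may perform is not shown.

```python
import numpy as np
from scipy.stats import linregress


def slope_of(x, y):
    """Return only the slope of the least-squares line (hypothetical helper)."""
    return float(linregress(x, y).slope)


x = np.arange(10, dtype=float)
y = 3 * x + 1
slope = slope_of(x, y)
```
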

flotilla.compute.generic.get_unstarted_events(mongodb)[source]

Get events that have not been started yet. This is a generator; it sets each event's started flag to True before returning the event

Parameters:

mongodb : pymongo.Database

A MongoDB database object
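
The "mark as started, then hand out" pattern can be illustrated without a database. In this sketch a plain list of dicts stands in for the MongoDB collection; the function name iter_unstarted and the event layout are hypothetical, and only the started field comes from the description above.

```python
def iter_unstarted(events):
    """Yield events not yet started, claiming each one first (hypothetical helper)."""
    for event in events:
        if not event.get("started", False):
            # Flip the flag before yielding so the event is not handed out twice
            event["started"] = True
            yield event


# In-memory stand-in for a collection of event documents
events = [
    {"id": 1, "started": False},
    {"id": 2, "started": True},
    {"id": 3},
]
claimed_ids = [event["id"] for event in iter_unstarted(events)]
```

Because the flag is set before the event is yielded, a second pass over the same list yields nothing.
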

flotilla.compute.generic.spearmanr_dataframe(A, B, axis=0)[source]

Calculate spearman correlations between dataframes A and B

Parameters:

A : pandas.DataFrame

An n_samples x n_features1 DataFrame. Must have the same number of rows as “B”

B : pandas.DataFrame

An n_samples x n_features2 DataFrame. Must have the same number of rows as “A”

axis : int

Which axis to compare. If 0, calculate correlations between all the columns of A and all the columns of B. If 1, calculate between rows (default 0)

Returns:

correlations : pandas.DataFrame

An n_features2 x n_features1 DataFrame of (spearman_r, spearman_p) tuples

Notes

Each cell of the result holds a (spearman_r, spearman_p) tuple. Use “applymap” to extract just the R-values or just the p-values:

>>> import pandas as pd
>>> import numpy as np
>>> from flotilla.compute.generic import spearmanr_dataframe
>>> A = pd.DataFrame(np.random.randn(100).reshape(5, 20))
>>> B = pd.DataFrame(np.random.randn(55).reshape(5, 11))
>>> correls = spearmanr_dataframe(A, B)
>>> correls.shape
(11, 20)
>>> spearman_r = correls.applymap(lambda x: x[0])
>>> spearman_p = correls.applymap(lambda x: x[1])
flotilla.compute.generic.spearmanr_series(x, y)[source]

Calculate spearman r (with p-values) between two pandas series

Parameters:

x : pandas.Series

One of the two series you’d like to correlate

y : pandas.Series

The other series you’d like to correlate

Returns:

r_value : float

The R-value of the correlation. 1 for perfect positive correlation, and -1 for perfect negative correlation

p_value : float

The p-value of the correlation.
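
A minimal sketch of correlating two Series with scipy.stats.spearmanr. Aligning on the shared index and dropping missing values first is an assumption about sensible preprocessing, not documented behavior, and the name spearman_on_shared_index is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def spearman_on_shared_index(x, y):
    """Spearman correlation over labels non-null in both Series (hypothetical helper)."""
    # Assumption: only labels present and non-null in both inputs are compared
    x, y = x.dropna().align(y.dropna(), join="inner")
    r_value, p_value = spearmanr(x.to_numpy(), y.to_numpy())
    return float(r_value), float(p_value)


x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=list("abcde"))
y = pd.Series([10.0, 20.0, np.nan, 40.0, 50.0, 60.0], index=list("abcdef"))
r_value, p_value = spearman_on_shared_index(x, y)  # compares labels a, b, d, e
```

The four shared labels are perfectly monotonic here, so the R-value is 1.
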

Olga B. Botvinnik is funded by the NDSEG fellowship and is a NumFOCUS John Hunter Technology Fellow.
Michael T. Lovci was partially funded by a fellowship from Genentech.
Partially funded by NIH grants NS075449 and HG004659 and CIRM grants RB4-06045 and TR3-05676 to Gene Yeo.