flotilla.compute.generic module

flotilla.compute.generic.apply_calc_robust(*args, **kwargs)

Calculate robust regression between the columns of X and y

Parameters:

X : pandas.DataFrame

A (n_samples, n_features) DataFrame of the predictor variable

y : pandas.DataFrame

A (n_samples, m_features) DataFrame of the response variable

verbose : bool, optional

If True, output status messages as the calculation is happening

Returns:

out_I : pandas.Series

Intercept of regressions

out_S : pandas.Series

Slope of regressions

out_T : pandas.Series

t-statistic of regressions

out_P : pandas.Series

p-values of regressions

See also

get_robust_values: This is the underlying function which calculates the slope, intercept, t-value, and p-value of the fit

flotilla.compute.generic.apply_calc_rs(*args, **kwargs)

Apply a correlation (“R-value”) method to each column of X versus the values of y

Parameters:

X : pandas.DataFrame

A (n_samples, n_features) sized DataFrame, assumed to be of log-normal expression values

y : pandas.Series

A (n_samples,) sized Series, assumed to be of percent spliced-in alternative splicing scores

method : function, optional

Which correlation method to use on each feature in X versus the values in y

Returns:

r_coefficients : pandas.Series

Correlation coefficients

p_values : pandas.Series

Correlation significances (smaller is better)

See also

do_r: This is the underlying function which calculates correlation
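
The column-wise pattern described above can be sketched as follows. This is only an illustration, not flotilla's implementation: the helper name correlate_columns and the toy data are hypothetical, and the sketch assumes X and y list their samples in the same order.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def correlate_columns(X, y, method=pearsonr):
    """Hypothetical helper: correlate every column of X with y."""
    r_values, p_values = {}, {}
    for name, column in X.items():
        # `method` is expected to return an (r, p) pair, as pearsonr does
        r, p = method(column.to_numpy(), y.to_numpy())
        r_values[name] = float(r)
        p_values[name] = float(p)
    return pd.Series(r_values), pd.Series(p_values)


# Toy data (assumed for illustration only)
rng = np.random.default_rng(0)
y = pd.Series(rng.normal(size=30))
X = pd.DataFrame({
    "same": y * 2.0 + 1.0,         # exact linear function of y
    "noise": rng.normal(size=30),  # unrelated to y
})
r_coefficients, p_values = correlate_columns(X, y)
```

Each returned Series is indexed by the column names of X, so a feature's coefficient and significance can be looked up by name.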

flotilla.compute.generic.apply_calc_slope(*args, **kwargs)

Calculate the linear regression slope between the columns of X and y

Parameters:

X : pandas.DataFrame

A (n_samples, n_features) DataFrame of predictor variable values

y : pandas.DataFrame

A (n_samples, m_features) DataFrame of response variable values

verbose : bool, optional

If True, output status messages

Returns:

slope : pandas.Series

Slopes of the linear regression

See also

get_slope: This is the underlying function which calculates the slope

flotilla.compute.generic.apply_dcor(*args, **kwargs)

Calculate distance correlation between the columns of two DataFrames

Parameters:

X : pandas.DataFrame

A (n_samples, n_features) DataFrame of predictor variable values

y : pandas.DataFrame

A (n_samples, m_features) DataFrame of response variable values

verbose : bool, optional

If True, output status messages

Returns:

dc : pandas.Series

Distance covariance

dr : pandas.Series

Distance correlation

dvx : pandas.Series

Distance variance of x

dvy : pandas.Series

Distance variance of y

See also

get_dcor: This is the underlying function that gets called to calculate the distance correlation

flotilla.compute.generic.do_r(*args, **kwargs)

Calculate correlation (“R-value”) between two vectors

Parameters:

s_1 : pandas.Series

Predictor vector

s_2 : pandas.Series

Target vector

method : function, optional

Which correlation method to use. (default scipy.stats.pearsonr)

min_items : int, optional

Minimum number of items occurring in both s_1 and s_2 (default 12)

Returns:

r_value : float

R-value of the correlation, i.e. how correlated the two inputs are

p_value : float

p-value of the correlation, i.e. the probability of observing a correlation at least this strong under the null hypothesis that the two inputs are not correlated

Notes

If fewer than min_items items occur in both s_1 and s_2, returns (np.nan, np.nan)
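
A minimal sketch of the overlap check described in the note above. The helper correlate_overlap is a hypothetical stand-in, not the actual do_r; it assumes that “overlap” means index labels present in both Series with non-missing values in each.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def correlate_overlap(s_1, s_2, method=pearsonr, min_items=12):
    """Hypothetical stand-in for do_r: correlate only the shared samples."""
    # Keep labels present in both Series, then drop rows with a missing value
    paired = pd.concat([s_1, s_2], axis=1, join="inner").dropna()
    if len(paired) < min_items:
        return np.nan, np.nan
    r_value, p_value = method(paired.iloc[:, 0].to_numpy(),
                              paired.iloc[:, 1].to_numpy())
    return float(r_value), float(p_value)


# Toy data (assumed for illustration only)
index = [f"sample_{i}" for i in range(20)]
s_1 = pd.Series(np.arange(20, dtype=float), index=index)
s_2 = 3.0 * s_1 + 2.0
s_2.iloc[:10] = np.nan  # only 10 samples are observed in both Series

too_few = correlate_overlap(s_1, s_2)              # 10 shared items < 12
enough = correlate_overlap(s_1, s_2, min_items=5)  # 10 shared items >= 5
```

Returning a pair of NaNs rather than raising lets a caller apply the function across many features and filter out the under-sampled ones afterwards.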

flotilla.compute.generic.dropna_mean(x)

Drop NA values and return the mean
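
The type of x is not stated here. Assuming it is a pandas Series, the behaviour described is equivalent to the following sketch, which uses plain pandas calls rather than flotilla itself:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, np.nan, 3.0])
mean_without_na = x.dropna().mean()  # the NaN entry is dropped before averaging
```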

flotilla.compute.generic.get_boosting_regressor(x, y, verbose=False)

Calculate a GradientBoostingRegressor on predictor and target variables

Parameters:

x : numpy.array

Predictor variable

y : numpy.array

Target variable

verbose : bool, optional

If True, output status messages

Returns:

classifier : sklearn.ensemble.GradientBoostingRegressor

A regressor fitted to the predictor and target variables

flotilla.compute.generic.get_dcor(*args, **kwargs)

Calculate distance correlation between two vectors

Uses the distance correlation package from: https://github.com/andrewdyates/dcor

Parameters:

x : numpy.array

1-dimensional array (aka a vector) of the independent, predictor variable

y : numpy.array

1-dimensional array (aka a vector) of the dependent, target variable

Returns:

dc : float

Distance covariance

dr : float

Distance correlation

dvx : float

Distance variance on x

dvy : float

Distance variance on y
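
For orientation, the four quantities can be computed from double-centered pairwise distance matrices. The sketch below is a plain NumPy re-implementation of the standard sample definitions, not the code of the dcor package linked above. The function name distance_stats is hypothetical, and the package's normalization conventions may differ.

```python
import numpy as np


def distance_stats(x, y):
    """Hypothetical sketch of sample distance statistics for two 1-D vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def centered_distances(v):
        d = np.abs(v[:, None] - v[None, :])  # pairwise |v_i - v_j|
        # Double centering: subtract column and row means, add the grand mean
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

    a, b = centered_distances(x), centered_distances(y)
    dc = float(np.sqrt(max((a * b).mean(), 0.0)))  # distance covariance
    dvx = float(np.sqrt((a * a).mean()))           # distance variance of x
    dvy = float(np.sqrt((b * b).mean()))           # distance variance of y
    denom = np.sqrt(dvx * dvy)
    dr = float(dc / denom) if denom > 0 else 0.0   # distance correlation
    return dc, dr, dvx, dvy


# An exact linear relationship gives a distance correlation of 1
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
dc, dr, dvx, dvy = distance_stats(x, y)
```

Unlike Pearson's R, the distance correlation lies between 0 and 1 and also responds to non-linear dependence.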

flotilla.compute.generic.get_regressor(x, y, n_estimators=1500, n_tries=5, verbose=False)

Calculate an ExtraTreesRegressor on predictor and target variables

Parameters:

x : numpy.array

Predictor vector

y : numpy.array

Target vector

n_estimators : int, optional

Number of estimators to use

n_tries : int, optional

Number of attempts to calculate regression

verbose : bool, optional

If True, output progress statements

Returns:

classifier : sklearn.ensemble.ExtraTreesRegressor

The fitted regressor with the highest out-of-bag score among all the attempted “tries”

oob_scores : numpy.array

Out of bag scores of the classifier
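
The “best of several tries” idea can be sketched with scikit-learn as below. This is an assumption-laden illustration, not flotilla's code: the helper name best_of_n_regressors is hypothetical, returning one out-of-bag score per try is a guess at what oob_scores holds, and the defaults are reduced so the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor


def best_of_n_regressors(x, y, n_estimators=100, n_tries=3):
    """Hypothetical sketch: fit several forests, keep the best by OOB score."""
    # scikit-learn expects a 2-D (n_samples, n_features) predictor array
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    best_model, oob_scores = None, []
    for seed in range(n_tries):
        model = ExtraTreesRegressor(
            n_estimators=n_estimators,
            bootstrap=True,   # out-of-bag scoring requires bootstrap sampling
            oob_score=True,
            random_state=seed,
        )
        model.fit(x, y)
        oob_scores.append(model.oob_score_)
        if best_model is None or model.oob_score_ > best_model.oob_score_:
            best_model = model
    return best_model, np.array(oob_scores)


# Toy data (assumed for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)
regressor, oob_scores = best_of_n_regressors(x, y)
```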

flotilla.compute.generic.get_robust_values(*args, **kwargs)

Calculate robust linear regression

Parameters:

x : numpy.array

Predictor vector

y : numpy.array

Target vector

Returns:

intercept : float

Intercept of the fitted line

slope : float

Slope of the fitted line

t_statistic : float

T-statistic of the fit

p_value : float

p-value of the fit

flotilla.compute.generic.get_slope(*args, **kwargs)

Get the linear regression slope of x and y

Parameters:

x : numpy.array

X-values of data

y : numpy.array

Y-values of data

Returns:

slope : float

Slope returned by scipy.stats.linregress
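
As a quick illustration of the returned quantity, the sketch below calls scipy.stats.linregress directly on toy data that lies on an exact line. It does not call get_slope itself.

```python
import numpy as np
from scipy.stats import linregress

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 1.0               # points on a line with slope 3
slope = linregress(x, y).slope  # the fitted slope recovers that value
```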

flotilla.compute.generic.get_unstarted_events(mongodb)

Get events that have not been started yet. The generator sets “started” to True before returning each event

Parameters:

mongodb : pymongo.Database

A MongoDB database object

flotilla.compute.generic.spearmanr_dataframe(A, B, axis=0)

Calculate Spearman correlations between DataFrames A and B

Parameters:

A : pandas.DataFrame

An n_samples x n_features1 DataFrame. Must have the same number of rows as “B”

B : pandas.DataFrame

An n_samples x n_features2 DataFrame. Must have the same number of rows as “A”

axis : int

Which axis to compare. If 0, calculate correlations between all the columns of A vs the columns of B. If 1, calculate between rows. (default 0)

Returns:

correlations : pandas.DataFrame

An n_features2 x n_features1 DataFrame of (spearman_r, spearman_p) tuples

Notes

Use “applymap” to extract just the R-values or just the p-values from the resulting DataFrame (in pandas 2.1 and later, “DataFrame.map” is the non-deprecated equivalent)

>>> import pandas as pd
>>> import numpy as np
>>> from flotilla.compute.generic import spearmanr_dataframe
>>> A = pd.DataFrame(np.random.randn(100).reshape(5, 20))
>>> B = pd.DataFrame(np.random.randn(55).reshape(5, 11))
>>> correls = spearmanr_dataframe(A, B)
>>> correls.shape
(11, 20)
>>> spearman_r = correls.applymap(lambda x: x[0])
>>> spearman_p = correls.applymap(lambda x: x[1])

flotilla.compute.generic.spearmanr_series(x, y)

Calculate Spearman R (with p-value) between two pandas Series

Parameters:

x : pandas.Series

One of the two series you’d like to correlate

y : pandas.Series

The other series you’d like to correlate

Returns:

r_value : float

The R-value of the correlation. 1 for perfect positive correlation, and -1 for perfect negative correlation

p_value : float

The p-value of the correlation.
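
To illustrate the range of the R-value, the sketch below calls scipy.stats.spearmanr directly on toy Series. Whether spearmanr_series wraps this exact function is an assumption. Because Spearman correlation works on ranks, any strictly increasing relationship gives 1 and any strictly decreasing one gives -1, even when the relationship is not linear.

```python
import pandas as pd
from scipy.stats import spearmanr

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3  # strictly increasing in x, but not linear

r_increasing, p_increasing = spearmanr(x, y)   # ranks agree exactly
r_decreasing, p_decreasing = spearmanr(x, -y)  # ranks are exactly reversed
```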