flotilla.compute.infotheory module

Information-theoretic calculations

flotilla.compute.infotheory.bin_range_strings(bins)[source]

Given a list of bins, make a list of strings of those bin ranges

Parameters:

bins : list_like

List of anything, usually values of bin edges

Returns:

bin_ranges : list

List of bin ranges

>>> bin_range_strings((0, 0.5, 1))

['0-0.5', '0.5-1']
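A minimal sketch of how such range strings could be produced, assuming each string simply joins consecutive bin edges with a hyphen. The name `bin_range_strings_sketch` is hypothetical and this is not necessarily how flotilla implements it.

```python
def bin_range_strings_sketch(bins):
    """Join each bin edge with the next one, e.g. 0 and 0.5 become '0-0.5'.

    Hypothetical helper for illustration, not flotilla's own code.
    """
    # zip(bins, bins[1:]) pairs every edge with its right-hand neighbour,
    # so n edges yield n - 1 range strings.
    return ["{}-{}".format(left, right) for left, right in zip(bins, bins[1:])]


print(bin_range_strings_sketch((0, 0.5, 1)))
```

Three edges give two ranges, which matches the nbins = len(bins) - 1 convention used by binify.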

flotilla.compute.infotheory.binify(df, bins)[source]

Makes a histogram of each column using the provided bins

Parameters:

df : pandas.DataFrame

A samples x features dataframe. Each feature (column) will be binned into the provided bins

bins : iterable

Bins you would like to use for this data. Must include the final bin value, e.g. (0, 0.5, 1) for the two bins (0, 0.5) and (0.5, 1). nbins = len(bins) - 1

Returns:

binned : pandas.DataFrame

An nbins x features DataFrame of each column binned across rows
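The binning described above can be sketched with `numpy.histogram` applied to each column. This sketch assumes that each column's counts are normalized to sum to 1 (so the result can be used as a probability distribution) and that missing values are dropped; both are assumptions rather than documented behaviour. `binify_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def binify_sketch(df, bins):
    """Histogram each column of a samples x features DataFrame.

    Hypothetical illustration, not flotilla's own code.
    Assumption: counts are normalized so that each column sums to 1.
    """
    def column_histogram(column):
        # np.histogram returns (counts, edges); only the counts are needed.
        counts, _ = np.histogram(column.dropna(), bins=bins)
        # A column with no values inside the bins would give NaN here.
        return pd.Series(counts / counts.sum())

    # apply() runs column by column; the returned Series become the rows,
    # giving an nbins x features DataFrame.
    return df.apply(column_histogram)


samples = pd.DataFrame({"a": [0.1, 0.2, 0.9, 1.0],
                        "b": [0.6, 0.7, 0.8, 0.4]})
binned = binify_sketch(samples, bins=(0, 0.5, 1))
print(binned)
```

With the bins (0, 0.5, 1), the four samples in each column are reduced to two rows, one per bin.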

flotilla.compute.infotheory.binify_and_jsd(df1, df2, pair, bins)[source]

flotilla.compute.infotheory.cross_phenotype_jsd(data, groupby, bins, n_iter=100)[source]

Jensen-Shannon divergence of features across phenotypes

Parameters:

data : pandas.DataFrame

A (n_samples, n_features) DataFrame

groupby : mappable

A samples to phenotypes mapping

n_iter : int

Number of bootstrap resampling iterations to perform for the within-group comparisons

n_bins : int

Number of bins to binify the singles data on

Returns:

jsd_df : pandas.DataFrame

A (n_features, n_phenotypes^2) dataframe of the JSD of each feature, between and within phenotypes

flotilla.compute.infotheory.entropy(binned, base=2)[source]

Find the entropy of each column of a dataframe

Parameters:

binned : pandas.DataFrame

An nbins x features DataFrame of probability distributions, where each column sums to 1

base : numeric

The log-base of the entropy. Default is 2, so the resulting entropy is in bits.

Returns:

entropy : pandas.Series

Entropy values for each column of the dataframe.

Raises:

ValueError

Raised if the data provided is not a probability distribution, i.e. it has negative values or its columns do not sum to 1.
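To make the arithmetic concrete, here is a sketch of the Shannon entropy of a single binned column, given as a plain list of probabilities rather than a DataFrame. The function name `entropy_sketch` and the tolerance used for the sum-to-1 check are assumptions made for illustration.

```python
import math


def entropy_sketch(probabilities, base=2):
    """Shannon entropy H = -sum(p * log(p)) of one probability distribution.

    Hypothetical single-column illustration, not flotilla's own code.
    """
    # Mirror the documented check: no negative values, and the sum must be 1.
    if any(p < 0 for p in probabilities) or not math.isclose(sum(probabilities), 1.0):
        raise ValueError("Input is not a probability distribution")
    # Empty bins (p == 0) contribute nothing, by the convention 0 * log(0) = 0.
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)


print(entropy_sketch([0.5, 0.5]))    # two equally likely bins
print(entropy_sketch([0.25, 0.75]))  # a skewed distribution has lower entropy
```

With base=2, two equally likely bins carry one bit of entropy, and any skew away from uniform lowers the value.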

flotilla.compute.infotheory.jsd(p, q)[source]

Finds the per-column JSD between dataframes p and q

Jensen-Shannon divergence of two pandas dataframes of probability distributions, p and q. These distributions are usually created by running binify() on the dataframe.

Parameters:

p : pandas.DataFrame

An nbins x features DataFrame.

q : pandas.DataFrame

An nbins x features DataFrame.

Returns:

jsd : pandas.Series

Jensen-Shannon divergence between each pair of columns that share a name in p and q

Raises:

ValueError

Raised if the data provided is not a probability distribution, i.e. it has negative values or its columns do not sum to 1.
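For a single pair of columns, the textbook Jensen-Shannon divergence averages the Kullback-Leibler divergence of each distribution from their midpoint m = (p + q) / 2. The sketch below works on plain lists and assumes base-2 logarithms; whether flotilla returns this quantity directly or a transformed version (such as its square root) is not stated here. Both function names are hypothetical.

```python
import math


def kld_sketch(p, q, base=2):
    """Kullback-Leibler divergence sum(p * log(p / q)) for two lists."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)


def jsd_sketch(p, q, base=2):
    """Textbook Jensen-Shannon divergence of two probability distributions.

    Hypothetical single-column illustration, not flotilla's own code.
    """
    # The midpoint is positive wherever p or q is positive, so the
    # divisions inside kld_sketch never hit zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld_sketch(p, m, base) + 0.5 * kld_sketch(q, m, base)


print(jsd_sketch([0.5, 0.5], [0.5, 0.5]))  # identical distributions
print(jsd_sketch([1.0, 0.0], [0.0, 1.0]))  # completely disjoint distributions
```

Unlike the Kullback-Leibler divergence, this quantity is symmetric in p and q, and with base-2 logarithms it is bounded between 0 and 1.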

flotilla.compute.infotheory.jsd_df_to_2d(jsd_df)[source]

Transform a tall JSD dataframe to a square matrix of mean JSDs

Parameters:

jsd_df : pandas.DataFrame

A (n_features, n_phenotypes^2) dataframe of the JSD of each feature, between and within phenotypes

Returns:

jsd_2d : pandas.DataFrame

A (n_phenotypes, n_phenotypes) symmetric dataframe of the mean JSD between and within phenotypes

flotilla.compute.infotheory.kld(p, q)[source]

Kullback-Leibler divergence of two pandas dataframes of probability distributions, p and q

Parameters:

p : pandas.DataFrame

An nbins x features DataFrame, or (nbins,) Series

q : pandas.DataFrame

An nbins x features DataFrame, or (nbins,) Series

Returns:

kld : pandas.Series

Kullback-Leibler divergence of the common columns between the dataframes. E.g. between 1st column in p and 1st column in q, and 2nd column in p and 2nd column in q.

Raises:

ValueError

Raised if the data provided is not a probability distribution, i.e. it has negative values or its columns do not sum to 1.

Notes

The input to this function must be probability distributions, not raw values. Otherwise, the output makes no sense.
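The requirement above can be illustrated with a single-column sketch that validates its inputs before computing the divergence. It operates on plain lists, assumes base-2 logarithms, and returns infinity when q has an empty bin where p does not. All of these choices, and the names `check_distribution` and `kld_sketch`, are assumptions made for illustration.

```python
import math


def check_distribution(values):
    """Raise ValueError unless values are non-negative and sum to 1."""
    if any(v < 0 for v in values) or not math.isclose(sum(values), 1.0):
        raise ValueError("Input is not a probability distribution")


def kld_sketch(p, q, base=2):
    """Kullback-Leibler divergence sum(p * log(p / q)) of p from q.

    Hypothetical single-column illustration, not flotilla's own code.
    """
    check_distribution(p)
    check_distribution(q)
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken to be 0
        if qi == 0:
            return math.inf  # p has mass where q has none
        total += pi * math.log(pi / qi, base)
    return total


p = [0.5, 0.5]
q = [0.25, 0.75]
print(kld_sketch(p, q))
print(kld_sketch(q, p))  # differs from the line above: KLD is not symmetric
```

Passing raw counts such as `[2, 3]` raises ValueError in this sketch, which is the failure mode the note above warns about.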

Olga B. Botvinnik is funded by the NDSEG fellowship and is a NumFOCUS John Hunter Technology Fellow.
Michael T. Lovci was partially funded by a fellowship from Genentech.
Partially funded by NIH grants NS075449 and HG004659 and CIRM grants RB4-06045 and TR3-05676 to Gene Yeo.