flotilla.compute.infotheory module

Information-theoretic calculations

flotilla.compute.infotheory.bin_range_strings(bins)[source]

Given a list of bins, make a list of strings of those bin ranges

Parameters:

bins : list_like

List of anything, usually values of bin edges

Returns:

bin_ranges : list

List of bin ranges

>>> bin_range_strings((0, 0.5, 1))

['0-0.5', '0.5-1']
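The doctest above can be reproduced with a minimal sketch of the documented behavior (an illustration, not the flotilla source):

```python
def bin_range_strings(bins):
    # Pair each bin edge with the next one and join as "start-end"
    # (sketch of the documented behavior, not the flotilla source)
    return ['{}-{}'.format(start, end) for start, end in zip(bins, bins[1:])]

print(bin_range_strings((0, 0.5, 1)))  # ['0-0.5', '0.5-1']
```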

flotilla.compute.infotheory.binify(df, bins)[source]

Makes a histogram of each column using the provided bins

Parameters:

df : pandas.DataFrame

A samples x features dataframe. Each feature (column) will be binned into the provided bins.

bins : iterable

Bins you would like to use for this data. Must include the final bin value, e.g. (0, 0.5, 1) for the two bins (0, 0.5) and (0.5, 1). nbins = len(bins) - 1

Returns:

binned : pandas.DataFrame

An nbins x features DataFrame of each column binned across rows
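Assuming binify also normalizes the histogram counts so each column sums to 1 (consistent with the entropy and jsd docstrings below, which expect probability distributions), a minimal pandas sketch might look like:

```python
import numpy as np
import pandas as pd

def binify(df, bins):
    # Histogram each column into the given bin edges, then normalize
    # each column to sum to 1 (assumption: flotilla's binify returns
    # probability distributions, as the entropy/jsd docs require)
    binned = df.apply(lambda col: pd.Series(
        np.histogram(col.dropna(), bins=bins)[0], dtype=float))
    binned.index = ['{}-{}'.format(i, j) for i, j in zip(bins, bins[1:])]
    return binned / binned.sum()

df = pd.DataFrame({'a': [0.1, 0.2, 0.9], 'b': [0.6, 0.7, 0.8]})
print(binify(df, bins=(0, 0.5, 1)))
```

Note that numpy's histogram treats the last bin as closed on the right, so with bins (0, 0.5, 1) a value of exactly 0.5 lands in the second bin.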

flotilla.compute.infotheory.entropy(binned, base=2)[source]

Find the entropy of each column of a dataframe

Parameters:

binned : pandas.DataFrame

An nbins x features DataFrame of probability distributions, where each column sums to 1

base : numeric

The log-base of the entropy. Default is 2, so the resulting entropy is in bits.

Returns:

entropy : pandas.Series

Entropy values for each column of the dataframe.
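A sketch of column-wise Shannon entropy under these assumptions (each column already sums to 1; not necessarily flotilla's implementation):

```python
import numpy as np
import pandas as pd

def entropy(binned, base=2):
    # Shannon entropy per column: -sum(p * log_base(p)).
    # Mask zero probabilities first, since 0 * log(0) is defined as 0;
    # NaN entries are skipped by the column-wise sum.
    logs = np.log(binned.where(binned > 0)) / np.log(base)
    return -(binned * logs).sum()

probs = pd.DataFrame({'uniform': [0.5, 0.5], 'certain': [1.0, 0.0]})
print(entropy(probs))  # uniform -> 1 bit, certain -> 0 bits
```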

flotilla.compute.infotheory.jsd(p, q)[source]

Finds the per-column JSD between dataframes p and q

Jensen-Shannon divergence of two pandas dataframes of probability distributions, p and q. These distributions are usually created by running binify() on the dataframe.

Parameters:

p : pandas.DataFrame

An nbins x features DataFrame.

q : pandas.DataFrame

An nbins x features DataFrame.

Returns:

jsd : pandas.Series

Jensen-Shannon divergence of each column with the same names between p and q
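The Jensen-Shannon divergence can be computed from the Kullback-Leibler divergence of each distribution against the mixture m = (p + q) / 2. A hedged sketch (the kld helper here is an illustration, not necessarily flotilla's implementation):

```python
import numpy as np
import pandas as pd

def kld(p, q):
    # Per-column Kullback-Leibler divergence D(p || q) in bits (assumption)
    return (p * np.log2(p / q)).sum()

def jsd(p, q):
    # JSD(p, q) = 0.5 * D(p || m) + 0.5 * D(q || m), with m the mixture
    m = (p + q) / 2
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

p = pd.DataFrame({'x': [0.5, 0.5]})
q = pd.DataFrame({'x': [0.5, 0.5]})
print(jsd(p, q))  # identical distributions -> divergence 0
```

Unlike KLD, JSD is symmetric in p and q and always finite for valid probability distributions.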

flotilla.compute.infotheory.kld(p, q)[source]

Kullback-Leibler divergence of two pandas dataframes of probability distributions, p and q

Parameters:

p : pandas.DataFrame

An nbins x features DataFrame

q : pandas.DataFrame

An nbins x features DataFrame

Returns:

kld : pandas.Series

Kullback-Leibler divergence of the common columns between the dataframes, e.g. between the 1st column in p and the 1st column in q, and the 2nd column in p and the 2nd column in q.

Notes

The input to this function must be probability distributions, not raw values. Otherwise, the output makes no sense.
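A minimal sketch of per-column KL divergence in bits, assuming aligned columns of probability distributions (an illustration under those assumptions, not the flotilla source):

```python
import numpy as np
import pandas as pd

def kld(p, q):
    # D(p || q) = sum(p * log2(p / q)), computed column-wise in bits.
    # Inputs must be probability distributions; zero entries in q where
    # p is nonzero make the divergence infinite.
    return (p * np.log2(p / q)).sum()

p = pd.DataFrame({'x': [0.9, 0.1]})
q = pd.DataFrame({'x': [0.5, 0.5]})
print(kld(p, q))  # positive, since p differs from q
```

Note that KLD is not symmetric: kld(p, q) generally differs from kld(q, p), and it is zero only when p and q are identical.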

Olga B. Botvinnik is funded by the NDSEG fellowship and is a NumFOCUS John Hunter Technology Fellow.
Michael T. Lovci was partially funded by a fellowship from Genentech.
Partially funded by NIH grants NS075449 and HG004659 and CIRM grants RB4-06045 and TR3-05676 to Gene Yeo.