flotilla.data_model.gene_ontology module

class flotilla.data_model.gene_ontology.GeneOntologyData(data)

Bases: object

Object to calculate enrichment of Gene Ontology terms

Acceptable Gene Ontology tables can be downloaded from ENSEMBL’s BioMart tool: http://www.ensembl.org/biomart

  1. Choose “Ensembl Genes ##” (## is the release number, e.g. 78)
  2. Click “Attributes”
  3. Expand “EXTERNAL”
  4. Check the boxes for ‘GO Term Accession’, ‘Ensembl Gene ID’, ‘GO Term Name’, and ‘GO domain’
Parameters:

data : pandas.DataFrame

A dataframe with at least the following columns: ‘GO Term Accession’, ‘Ensembl Gene ID’, ‘GO Term Name’, ‘GO domain’
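As a rough illustration of the expected input, the sketch below reads a BioMart-style tab-separated export with pandas and checks for the four required columns before the table is handed to the class. The gene IDs are made-up placeholders, the in-memory file stands in for a real download, and `REQUIRED_COLUMNS` is a name introduced here, not part of flotilla.

```python
import io

import pandas as pd

# Columns the class documentation lists as required.
REQUIRED_COLUMNS = [
    "GO Term Accession",
    "Ensembl Gene ID",
    "GO Term Name",
    "GO domain",
]

# Tiny stand-in for a BioMart TSV export; the gene IDs are placeholders.
biomart_tsv = io.StringIO(
    "Ensembl Gene ID\tGO Term Accession\tGO Term Name\tGO domain\n"
    "ENSG00000000001\tGO:0003723\tRNA binding\tmolecular_function\n"
    "ENSG00000000002\tGO:0005634\tnucleus\tcellular_component\n"
)

data = pd.read_csv(biomart_tsv, sep="\t")

# Fail early if the export is missing any required column.
missing = [col for col in REQUIRED_COLUMNS if col not in data.columns]
if missing:
    raise ValueError("Missing required columns: {}".format(missing))

# With flotilla installed, the table would then be passed to the constructor:
# from flotilla.data_model.gene_ontology import GeneOntologyData
# go = GeneOntologyData(data)
```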

domains = frozenset(['molecular_function', 'cellular_component', 'biological_process'])
enrichment(features_of_interest, background=None, p_value_cutoff=1000000, cross_reference=None, min_feature_size=3, min_background_size=5, domain=None)

Bonferroni-corrected hypergeometric p-values of GO enrichment

Calculates hypergeometric enrichment of the features of interest, in each GO category.
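To make the statistic concrete, here is a minimal sketch of a hypergeometric enrichment p-value with a Bonferroni correction for a single GO term, using only the standard library. It is not flotilla's implementation: the function name `hypergeom_sf`, the counts, and the number of tested terms are all assumptions chosen for illustration, and whether flotilla caps the corrected value at 1 is not stated here, so the sketch leaves it uncapped.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """Return P(X >= k) for X ~ Hypergeometric.

    N: number of background genes
    K: background genes annotated with the GO term
    n: number of features of interest
    k: features of interest annotated with the GO term
    """
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical counts for one GO term.
n_background = 100   # N
n_term = 10          # K
n_features = 20      # n
n_overlap = 5        # k

p_value = hypergeom_sf(n_overlap, n_background, n_term, n_features)

# Bonferroni correction: multiply by the number of GO terms tested
# (50 is an assumed value for this example).
n_terms_tested = 50
bonferroni_p_value = p_value * n_terms_tested
```

A term would be reported only if its corrected p-value does not exceed `p_value_cutoff`.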

Parameters:

features_of_interest : list-like

List of features. Identifiers must match those in the ontology database exactly; for example, if your ontology database uses ENSEMBL IDs, provide ENSEMBL IDs and not common names like “RBFOX2”

background : list-like, optional

Background genes to use. It is best to use a relevant background such as all expressed genes. If None, defaults to all genes.

p_value_cutoff : float, optional

Maximum accepted Bonferroni-corrected p-value

cross_reference : dict-like, optional

A mapping of gene IDs to gene symbols, e.g. a pandas Series mapping ENSEMBL gene IDs (e.g. ENSG00000139675) to gene symbols (e.g. HNRNPA1L2)
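A small sketch of what such a mapping might look like, using a plain dict (any dict-like object with `.get`, including a pandas Series, would behave the same way here). How flotilla applies the mapping internally is not shown in this page; the helper `to_symbols` and the unmapped ID are hypothetical.

```python
# Mapping from Ensembl gene IDs to gene symbols (pair taken from the text above).
cross_reference = {"ENSG00000139675": "HNRNPA1L2"}


def to_symbols(gene_ids, cross_reference):
    """Translate IDs to symbols, keeping the ID when no symbol is known."""
    return [cross_reference.get(gene_id, gene_id) for gene_id in gene_ids]


# "ENSG_UNMAPPED" is a placeholder for an ID absent from the mapping.
symbols = to_symbols(["ENSG00000139675", "ENSG_UNMAPPED"], cross_reference)
```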

min_feature_size : int, optional

Minimum number of features of interest that must overlap a GO Term for enrichment to be calculated

min_background_size : int, optional

Minimum number of features in the background overlapping a GO Term
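The two size thresholds can be pictured as a pre-filter applied to each GO term before any statistics are computed. The sketch below assumes the thresholds are inclusive ("at least this many"); the function `passes_size_filters` and the gene names are hypothetical and not taken from flotilla.

```python
def passes_size_filters(term_genes, features_of_interest, background,
                        min_feature_size=3, min_background_size=5):
    """Return True if a GO term has enough overlapping genes to be tested."""
    term_genes = set(term_genes)
    n_features = len(term_genes & set(features_of_interest))
    n_background = len(term_genes & set(background))
    # Assumption: both thresholds are inclusive lower bounds.
    return n_features >= min_feature_size and n_background >= min_background_size


# Placeholder gene names for illustration.
background = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
features = ["g1", "g2", "g3"]

large_term = ["g1", "g2", "g3", "g4", "g5"]   # 3 features, 5 background genes
small_term = ["g1", "g2", "g6"]               # only 2 features overlap

keep_large = passes_size_filters(large_term, features, background)
keep_small = passes_size_filters(small_term, features, background)
```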

domain : str or list, optional

Only calculate GO enrichment for a particular GO category or subset of categories. Valid domains: ‘biological_process’, ‘molecular_function’, ‘cellular_component’

Returns:

enrichment_df : pandas.DataFrame

A (n_go_categories, columns) DataFrame of the enrichment scores

Raises:

ValueError

If features of interest and background do not overlap, or invalid GO domains are given
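The two error conditions can be sketched as an up-front input check. This is an illustration of the documented behaviour under stated assumptions, not flotilla's code: `check_inputs`, its return value, and the error messages are invented for the example, while the set of valid domains is copied from the `domains` attribute above.

```python
# Valid GO domains, as listed in the class attribute `domains`.
DOMAINS = frozenset(
    ["molecular_function", "cellular_component", "biological_process"]
)


def check_inputs(features_of_interest, background, domain=None):
    """Validate domains and overlap; return the domains and shared genes."""
    if domain is None:
        domains = DOMAINS
    elif isinstance(domain, str):
        domains = frozenset([domain])
    else:
        domains = frozenset(domain)

    invalid = domains - DOMAINS
    if invalid:
        raise ValueError("Invalid GO domains: {}".format(sorted(invalid)))

    overlap = set(features_of_interest) & set(background)
    if not overlap:
        raise ValueError("Features of interest and background do not overlap")

    return domains, overlap


# Placeholder gene names for illustration.
domains, overlap = check_inputs(["a", "b"], ["b", "c"], domain="biological_process")
```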

Olga B. Botvinnik is funded by the NDSEG fellowship and is a NumFOCUS John Hunter Technology Fellow.
Michael T. Lovci was partially funded by a fellowship from Genentech.
Partially funded by NIH grants NS075449 and HG004659 and CIRM grants RB4-06045 and TR3-05676 to Gene Yeo.