Storing supplemental data on Study objects

A recently added feature lets you store any pandas DataFrame on study.supplemental, and it gets reloaded every time you embark on that datapackage. Let’s start with the batch-corrected BrainSpan data from the Allen Brain Institute’s Brain Atlas.

import flotilla
study = flotilla.embark(flotilla._brainspan)
Creating a directory for saving your flotilla projects: /home/travis/flotilla_projects
Creating a directory for saving the data for this project: /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon
https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/datapackage.json has not been downloaded before.
    Downloading now to /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon/datapackage.json
2015-06-09 22:42:57 Parsing datapackage to create a Study object
https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/expression_feature.csv has not been downloaded before.
    Downloading now to /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon/expression_feature.csv
https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/expression.csv has not been downloaded before.
    Downloading now to /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon/expression.csv
https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/metadata.csv has not been downloaded before.
    Downloading now to /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon/metadata.csv
2015-06-09 22:43:12 Initializing Study
2015-06-09 22:43:12 Initializing Predictor configuration manager for Study
2015-06-09 22:43:12 Predictor ExtraTreesClassifier is of type <class 'sklearn.ensemble.forest.ExtraTreesClassifier'>
2015-06-09 22:43:12 Added ExtraTreesClassifier to default predictors
2015-06-09 22:43:12 Predictor ExtraTreesRegressor is of type <class 'sklearn.ensemble.forest.ExtraTreesRegressor'>
2015-06-09 22:43:12 Added ExtraTreesRegressor to default predictors
2015-06-09 22:43:12 Predictor GradientBoostingClassifier is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'>
2015-06-09 22:43:12 Added GradientBoostingClassifier to default predictors
2015-06-09 22:43:12 Predictor GradientBoostingRegressor is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingRegressor'>
2015-06-09 22:43:12 Added GradientBoostingRegressor to default predictors
2015-06-09 22:43:12 Loading metadata
2015-06-09 22:43:12 Loading expression data
2015-06-09 22:43:12 Initializing expression
2015-06-09 22:43:13 Done initializing expression
2015-06-09 22:43:16 Successfully initialized a Study object!

Let’s take a look at how big this expression matrix is.

study.expression.data.shape
(493, 14321)

Yikes, 14,321 features is a lot! Let’s subset on just the most variant genes. By default, these are the genes whose variance is two standard deviations away from the mean variance of all genes.

variant_ids = study.expression.feature_subsets['variant']
variant_expression = study.expression.data.loc[:, variant_ids]
variant_expression.shape
(493, 553)
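
To make the "variant" rule concrete, here is a minimal sketch of a variance cutoff on a toy DataFrame. It is not flotilla's implementation: the helper name variant_columns is hypothetical, and it assumes "two standard deviations away" means above the mean variance, which the text here does not spell out.

```python
import numpy as np
import pandas as pd


def variant_columns(data, n_std=2.0):
    """Return columns whose variance exceeds mean + n_std * std of all column variances.

    Hypothetical helper for illustration; the one-sided (upper) cutoff is an assumption.
    """
    variances = data.var(axis=0)
    cutoff = variances.mean() + n_std * variances.std()
    return variances.index[variances > cutoff]


# Toy data: 200 samples x 100 features, with two features made far noisier
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(200, 100)),
    columns=["gene_{}".format(i) for i in range(100)],
)
toy["gene_3"] *= 10
toy["gene_7"] *= 10

selected = variant_columns(toy)
toy_variant = toy.loc[:, selected]
print(list(selected), toy_variant.shape)
```

Only the two inflated columns clear the cutoff here, which mirrors how a small fraction of the 14,321 genes ends up in the 'variant' subset.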

553 features isn’t so bad. Let’s correlate all features to each other in this subset.

%%time
variant_expression_corr = variant_expression.corr()
variant_expression_corr.head()
CPU times: user 393 ms, sys: 0 ns, total: 393 ms
Wall time: 392 ms

That didn’t take too long, but I’m sure you can imagine it would take a really long time for ALL genes!
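
A quick back-of-the-envelope sketch shows why the full matrix is so much heavier: the number of pairwise correlations grows with the square of the feature count. The helper corr_matrix_cost is a hypothetical name, and the 8 bytes per value assumes the result is stored as float64.

```python
def corr_matrix_cost(n_features, bytes_per_value=8):
    """Return (number of entries, size in GiB) of a dense n x n correlation matrix.

    Hypothetical helper; assumes a dense float64 result.
    """
    n_entries = n_features * n_features
    gib = n_entries * bytes_per_value / 2 ** 30
    return n_entries, gib


# Compare the 'variant' subset with the full feature set from this study
for n in (553, 14321):
    entries, gib = corr_matrix_cost(n)
    print("{:>6} features -> {:>12,} entries, ~{:.3f} GiB".format(n, entries, gib))
```

Going from 553 to 14,321 features multiplies the number of entries by roughly 670 and pushes the dense result to around 1.5 GiB, so computing it once and storing it is attractive.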

Now let’s assign this to the study.supplemental object with a name of our choice. To keep things simple, I’m gonna give it the same name.

study.supplemental.variant_expression_corr = variant_expression_corr

Now let’s save the object and re-embark to make sure it’s there.

study.save('brainspan2')
study2 = flotilla.embark('brainspan2')
Wrote datapackage to /home/travis/flotilla_projects/brainspan2/datapackage.json
2015-06-09 22:44:04 Reading datapackage from /home/travis/flotilla_projects/brainspan2/datapackage.json
2015-06-09 22:44:04 Parsing datapackage to create a Study object
2015-06-09 22:44:10 Initializing Study
2015-06-09 22:44:10 Initializing Predictor configuration manager for Study
2015-06-09 22:44:10 Predictor ExtraTreesClassifier is of type <class 'sklearn.ensemble.forest.ExtraTreesClassifier'>
2015-06-09 22:44:10 Added ExtraTreesClassifier to default predictors
2015-06-09 22:44:10 Predictor ExtraTreesRegressor is of type <class 'sklearn.ensemble.forest.ExtraTreesRegressor'>
2015-06-09 22:44:10 Added ExtraTreesRegressor to default predictors
2015-06-09 22:44:10 Predictor GradientBoostingClassifier is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'>
2015-06-09 22:44:10 Added GradientBoostingClassifier to default predictors
2015-06-09 22:44:10 Predictor GradientBoostingRegressor is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingRegressor'>
2015-06-09 22:44:10 Added GradientBoostingRegressor to default predictors
2015-06-09 22:44:10 Loading metadata
2015-06-09 22:44:11 Loading expression data
2015-06-09 22:44:11 Initializing expression
2015-06-09 22:44:13 Done initializing expression
2015-06-09 22:44:16 Successfully initialized a Study object!

Let’s make sure our variant_expression_corr dataframe is really there.

study2.supplemental.variant_expression_corr.head()
ENSG00000003137 ENSG00000004848 ENSG00000006016 ENSG00000006116 ENSG00000006128 ENSG00000006377 ENSG00000007350 ENSG00000016082 ENSG00000041353 ENSG00000041982 ... ENSG00000258283 ENSG00000258403 ENSG00000258444 ENSG00000258518 ENSG00000258752 ENSG00000259190 ENSG00000259279 ENSG00000259373 ENSG00000259410 ENSG00000259603
ENSG00000003137 1.000000 -0.046835 0.090661 0.053573 -0.047665 -0.155271 0.054222 -0.160111 -0.115487 -0.044074 ... -0.224868 -0.019139 -0.121898 -0.392903 -0.122529 0.273103 0.136641 0.448154 0.445432 0.041884
ENSG00000004848 -0.046835 1.000000 0.612271 0.707699 0.558755 0.671949 0.028770 0.148974 0.448605 0.001084 ... 0.701812 0.328554 0.040970 -0.478688 -0.082581 -0.196485 0.135765 -0.662072 -0.570054 -0.623628
ENSG00000006016 0.090661 0.612271 1.000000 0.652296 0.585003 0.450687 0.037159 0.111290 0.414487 0.022681 ... 0.650395 0.300988 0.233828 -0.391661 0.055337 -0.289575 0.060468 -0.418865 -0.398856 -0.348691
ENSG00000006116 0.053573 0.707699 0.652296 1.000000 0.516889 0.424020 -0.185337 -0.071044 0.469453 -0.299232 ... 0.687966 0.526633 -0.027971 -0.569975 -0.209932 -0.020450 0.220832 -0.560163 -0.465723 -0.655024
ENSG00000006128 -0.047665 0.558755 0.585003 0.516889 1.000000 0.715297 0.000925 0.415798 0.564539 0.053909 ... 0.716306 0.429350 0.264993 -0.256103 0.155573 -0.431776 -0.079680 -0.378022 -0.389113 -0.255839

5 rows × 553 columns

Yay, it’s here!
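
Looking at head() shows the frame exists, but a stricter check is to compare the reloaded frame with the original. The sketch below does that on a toy correlation matrix using a CSV round trip as a stand-in. How flotilla actually writes supplemental data to disk is not stated here, so the file name and the CSV format are assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Toy stand-in for variant_expression_corr
rng = np.random.default_rng(0)
expression = pd.DataFrame(
    rng.normal(size=(30, 5)),
    columns=["gene_{}".format(i) for i in range(5)],
)
corr = expression.corr()

# Write to disk and read back; CSV is an assumed format for this illustration
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "corr.csv")
    corr.to_csv(path)
    reloaded = pd.read_csv(path, index_col=0)

# Raises AssertionError if labels or values differ beyond float tolerance
assert_frame_equal(corr, reloaded, check_exact=False)
print("round trip preserved the correlation matrix")
```

With both studies in the session, the same assert_frame_equal call could presumably be pointed at study.supplemental.variant_expression_corr and study2.supplemental.variant_expression_corr.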

Olga B. Botvinnik is funded by the NDSEG fellowship and is a NumFOCUS John Hunter Technology Fellow.
Michael T. Lovci was partially funded by a fellowship from Genentech.
Partially funded by NIH grants NS075449 and HG004659 and CIRM grants RB4-06045 and TR3-05676 to Gene Yeo.