A recently added feature is the ability to store an arbitrary pandas DataFrame on study.supplemental; once the study is saved, it is re-loaded every time you embark on that datapackage. Let’s start with the batch-corrected BrainSpan data from the Allen Brain Institute’s Brain Atlas.
import flotilla
study = flotilla.embark(flotilla._brainspan)
Creating a directory for saving your flotilla projects: /home/travis/flotilla_projects
Creating a directory for saving the data for this project: /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon
https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/datapackage.json has not been downloaded before.
Downloading now to /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon/datapackage.json
2015-06-09 22:42:57 Parsing datapackage to create a Study object
https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/expression_feature.csv has not been downloaded before.
Downloading now to /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon/expression_feature.csv
https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/expression.csv has not been downloaded before.
Downloading now to /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon/expression.csv
https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/metadata.csv has not been downloaded before.
Downloading now to /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon/metadata.csv
2015-06-09 22:43:12 Initializing Study
2015-06-09 22:43:12 Initializing Predictor configuration manager for Study
2015-06-09 22:43:12 Predictor ExtraTreesClassifier is of type <class 'sklearn.ensemble.forest.ExtraTreesClassifier'>
2015-06-09 22:43:12 Added ExtraTreesClassifier to default predictors
2015-06-09 22:43:12 Predictor ExtraTreesRegressor is of type <class 'sklearn.ensemble.forest.ExtraTreesRegressor'>
2015-06-09 22:43:12 Added ExtraTreesRegressor to default predictors
2015-06-09 22:43:12 Predictor GradientBoostingClassifier is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'>
2015-06-09 22:43:12 Added GradientBoostingClassifier to default predictors
2015-06-09 22:43:12 Predictor GradientBoostingRegressor is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingRegressor'>
2015-06-09 22:43:12 Added GradientBoostingRegressor to default predictors
2015-06-09 22:43:12 Loading metadata
2015-06-09 22:43:12 Loading expression data
2015-06-09 22:43:12 Initializing expression
2015-06-09 22:43:13 Done initializing expression
2015-06-09 22:43:16 Successfully initialized a Study object!
Let’s take a look at how big this expression matrix is.
study.expression.data.shape
(493, 14321)
Yikes, 14,321 features is a lot! Let’s subset on just the most variant genes. By default, these are the genes whose variance is more than two standard deviations above the mean variance of all genes.
variant_ids = study.expression.feature_subsets['variant']
variant_expression = study.expression.data.loc[:, variant_ids]
variant_expression.shape
(493, 553)
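For intuition, here is a minimal sketch of how a "variant" subset like this could be computed by hand with pandas. It assumes the rule is "variance greater than the mean variance plus two standard deviations", which is one reading of the default described above and not necessarily flotilla's exact implementation. The data is synthetic and the `gene_N` names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for study.expression.data: 50 samples x 200 features.
# (Hypothetical data; the real matrix is 493 x 14321.)
rng = np.random.RandomState(0)
scales = np.ones(200)
scales[:5] = 10.0  # make the first five features much more variable
data = pd.DataFrame(
    rng.normal(size=(50, 200)) * scales,
    columns=["gene_%d" % i for i in range(200)],
)

# Variance of each feature (column) across samples
variances = data.var()

# Assumed rule: keep features whose variance exceeds
# mean(variance) + 2 * std(variance)
threshold = variances.mean() + 2 * variances.std()
variant_ids = variances[variances > threshold].index

# Same column-subsetting pattern as in the tutorial
variant_data = data.loc[:, variant_ids]
print(variant_data.shape)
```

Only the handful of deliberately noisy columns should survive the cutoff, which mirrors how the real subset shrinks from thousands of features to a few hundred.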
553 features isn’t so bad. Let’s correlate all features to each other in this subset.
%%time
variant_expression_corr = variant_expression.corr()
variant_expression_corr.head()
CPU times: user 393 ms, sys: 0 ns, total: 393 ms
Wall time: 392 ms
That didn’t take too long, but I’m sure you can imagine it would take a really long time for ALL genes!
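To put rough numbers on that, the sketch below compares the size of the correlation matrix for the variant subset against the one for all genes. It assumes each entry is stored as an 8-byte float (pandas' default `float64`) and only counts matrix entries; it does not try to predict the actual run time.

```python
# Feature counts taken from the shapes printed above
n_all_features = 14321
n_variant_features = 553
bytes_per_entry = 8  # assumption: float64 storage


def corr_matrix_entries(n_features):
    """Number of entries in a feature-by-feature correlation matrix."""
    return n_features ** 2


all_entries = corr_matrix_entries(n_all_features)
variant_entries = corr_matrix_entries(n_variant_features)

# Memory needed just to hold each result, in megabytes (10**6 bytes)
all_mb = all_entries * bytes_per_entry / 1e6
variant_mb = variant_entries * bytes_per_entry / 1e6

print("variant subset: %d entries, %.1f MB" % (variant_entries, variant_mb))
print("all genes:      %d entries, %.1f MB" % (all_entries, all_mb))
print("ratio of entries: %.0fx" % (all_entries / float(variant_entries)))
```

Under these assumptions the all-genes matrix has several hundred times as many entries as the variant one, so both the work and the saved file grow accordingly.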
Now let’s assign this to the study.supplemental object with a name of our choice. To keep things simple, I’m going to give it the same name as the variable.
study.supplemental.variant_expression_corr = variant_expression_corr
Now let’s save the object and re-embark to make sure it’s there.
study.save('brainspan2')
study2 = flotilla.embark('brainspan2')
Wrote datapackage to /home/travis/flotilla_projects/brainspan2/datapackage.json
2015-06-09 22:44:04 Reading datapackage from /home/travis/flotilla_projects/brainspan2/datapackage.json
2015-06-09 22:44:04 Parsing datapackage to create a Study object
2015-06-09 22:44:10 Initializing Study
2015-06-09 22:44:10 Initializing Predictor configuration manager for Study
2015-06-09 22:44:10 Predictor ExtraTreesClassifier is of type <class 'sklearn.ensemble.forest.ExtraTreesClassifier'>
2015-06-09 22:44:10 Added ExtraTreesClassifier to default predictors
2015-06-09 22:44:10 Predictor ExtraTreesRegressor is of type <class 'sklearn.ensemble.forest.ExtraTreesRegressor'>
2015-06-09 22:44:10 Added ExtraTreesRegressor to default predictors
2015-06-09 22:44:10 Predictor GradientBoostingClassifier is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'>
2015-06-09 22:44:10 Added GradientBoostingClassifier to default predictors
2015-06-09 22:44:10 Predictor GradientBoostingRegressor is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingRegressor'>
2015-06-09 22:44:10 Added GradientBoostingRegressor to default predictors
2015-06-09 22:44:10 Loading metadata
2015-06-09 22:44:11 Loading expression data
2015-06-09 22:44:11 Initializing expression
2015-06-09 22:44:13 Done initializing expression
2015-06-09 22:44:16 Successfully initialized a Study object!
Let’s make sure our variant_expression_corr dataframe is really there.
study2.supplemental.variant_expression_corr.head()
| | ENSG00000003137 | ENSG00000004848 | ENSG00000006016 | ENSG00000006116 | ENSG00000006128 | ENSG00000006377 | ENSG00000007350 | ENSG00000016082 | ENSG00000041353 | ENSG00000041982 | ... | ENSG00000258283 | ENSG00000258403 | ENSG00000258444 | ENSG00000258518 | ENSG00000258752 | ENSG00000259190 | ENSG00000259279 | ENSG00000259373 | ENSG00000259410 | ENSG00000259603 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENSG00000003137 | 1.000000 | -0.046835 | 0.090661 | 0.053573 | -0.047665 | -0.155271 | 0.054222 | -0.160111 | -0.115487 | -0.044074 | ... | -0.224868 | -0.019139 | -0.121898 | -0.392903 | -0.122529 | 0.273103 | 0.136641 | 0.448154 | 0.445432 | 0.041884 |
ENSG00000004848 | -0.046835 | 1.000000 | 0.612271 | 0.707699 | 0.558755 | 0.671949 | 0.028770 | 0.148974 | 0.448605 | 0.001084 | ... | 0.701812 | 0.328554 | 0.040970 | -0.478688 | -0.082581 | -0.196485 | 0.135765 | -0.662072 | -0.570054 | -0.623628 |
ENSG00000006016 | 0.090661 | 0.612271 | 1.000000 | 0.652296 | 0.585003 | 0.450687 | 0.037159 | 0.111290 | 0.414487 | 0.022681 | ... | 0.650395 | 0.300988 | 0.233828 | -0.391661 | 0.055337 | -0.289575 | 0.060468 | -0.418865 | -0.398856 | -0.348691 |
ENSG00000006116 | 0.053573 | 0.707699 | 0.652296 | 1.000000 | 0.516889 | 0.424020 | -0.185337 | -0.071044 | 0.469453 | -0.299232 | ... | 0.687966 | 0.526633 | -0.027971 | -0.569975 | -0.209932 | -0.020450 | 0.220832 | -0.560163 | -0.465723 | -0.655024 |
ENSG00000006128 | -0.047665 | 0.558755 | 0.585003 | 0.516889 | 1.000000 | 0.715297 | 0.000925 | 0.415798 | 0.564539 | 0.053909 | ... | 0.716306 | 0.429350 | 0.264993 | -0.256103 | 0.155573 | -0.431776 | -0.079680 | -0.378022 | -0.389113 | -0.255839 |
5 rows × 553 columns
Yay, it’s here!
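Looking at head() is a quick sanity check. If you want something stricter, you can compare the labels and values of the re-loaded dataframe against the original. The sketch below shows that kind of check on a small synthetic correlation matrix that is written to CSV and read back. The CSV round trip is only a stand-in for saving and re-embarking, not a description of how flotilla stores supplemental data, and the `gene_*` names are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small synthetic correlation matrix standing in for variant_expression_corr
rng = np.random.RandomState(0)
expression = pd.DataFrame(
    rng.normal(size=(20, 4)),
    columns=["gene_a", "gene_b", "gene_c", "gene_d"],
)
corr = expression.corr()

# Stand-in for save + re-embark: write to disk, then read it back
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "variant_expression_corr.csv")
    corr.to_csv(path)
    reloaded = pd.read_csv(path, index_col=0)

# Check that both the labels and the numbers survived the round trip
same_labels = (list(reloaded.index) == list(corr.index)
               and list(reloaded.columns) == list(corr.columns))
same_values = np.allclose(reloaded.values, corr.values)
print(same_labels, same_values)
```

With a real study, the same two comparisons could be run between variant_expression_corr and study2.supplemental.variant_expression_corr.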