Storing supplemental data on `Study` objects

A recently added feature lets you store an arbitrary pandas dataframe on study.supplemental. Once the study is saved, that dataframe is reloaded every time you embark on the datapackage. Let’s start with the batch-corrected BrainSpan data from the Allen Brain Institute’s Brain Atlas.

import flotilla
study = flotilla.embark(flotilla._brainspan)

Creating a directory for saving your flotilla projects: /home/travis/flotilla_projects
Creating a directory for saving the data for this project: /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon
https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/datapackage.json has not been downloaded before.
    Downloading now to /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon/datapackage.json
2015-06-09 22:42:57 Parsing datapackage to create a Study object
https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/expression_feature.csv has not been downloaded before.
    Downloading now to /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon/expression_feature.csv
https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/expression.csv has not been downloaded before.
    Downloading now to /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon/expression.csv
https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/metadata.csv has not been downloaded before.
    Downloading now to /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon/metadata.csv
2015-06-09 22:43:12 Initializing Study
2015-06-09 22:43:12 Initializing Predictor configuration manager for Study
2015-06-09 22:43:12 Predictor ExtraTreesClassifier is of type <class 'sklearn.ensemble.forest.ExtraTreesClassifier'>
2015-06-09 22:43:12 Added ExtraTreesClassifier to default predictors
2015-06-09 22:43:12 Predictor ExtraTreesRegressor is of type <class 'sklearn.ensemble.forest.ExtraTreesRegressor'>
2015-06-09 22:43:12 Added ExtraTreesRegressor to default predictors
2015-06-09 22:43:12 Predictor GradientBoostingClassifier is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'>
2015-06-09 22:43:12 Added GradientBoostingClassifier to default predictors
2015-06-09 22:43:12 Predictor GradientBoostingRegressor is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingRegressor'>
2015-06-09 22:43:12 Added GradientBoostingRegressor to default predictors
2015-06-09 22:43:12 Loading metadata
2015-06-09 22:43:12 Loading expression data
2015-06-09 22:43:12 Initializing expression
2015-06-09 22:43:13 Done initializing expression
2015-06-09 22:43:16 Successfully initialized a Study object!

Let’s take a look at how big this expression matrix is.

study.expression.data.shape

(493, 14321)

Yikes, 14,321 features is a lot! Let’s subset on just the most variant genes. By default, these are the genes whose variance is more than two standard deviations above the mean variance of all genes.

variant_ids = study.expression.feature_subsets['variant']
variant_expression = study.expression.data.loc[:, variant_ids]
variant_expression.shape

(493, 553)
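
To make the "two standard deviations" rule concrete, here is a minimal sketch in plain pandas on toy data. It is not flotilla's own code: the helper name variant_columns, the n_std parameter, and the toy gene names are all made up for illustration, and it assumes the cutoff is the mean variance plus two standard deviations.

```python
import numpy as np
import pandas as pd


def variant_columns(data, n_std=2.0):
    """Return labels of columns whose variance exceeds
    mean(variances) + n_std * std(variances).

    Hypothetical helper: an assumed reading of the "variant" rule,
    not flotilla's implementation.
    """
    variances = data.var(axis=0)  # one variance per feature (column)
    cutoff = variances.mean() + n_std * variances.std()
    return variances.index[variances > cutoff]


# Toy data: 50 samples x 20 features of unit-scale noise
rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.normal(size=(50, 20)),
                   columns=["gene_%d" % i for i in range(20)])
# Inflate one feature so it clearly stands out from the rest
toy["gene_3"] *= 10

variant_ids = variant_columns(toy)
variant_toy = toy.loc[:, variant_ids]
print(list(variant_ids), variant_toy.shape)
```

Only the inflated feature survives the cutoff here, which is the same kind of filtering that takes the real matrix down to a few hundred columns.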

553 features isn’t so bad. Let’s correlate all features to each other in this subset.

%%time
variant_expression_corr = variant_expression.corr()
variant_expression_corr.head()

CPU times: user 393 ms, sys: 0 ns, total: 393 ms
Wall time: 392 ms

That didn’t take too long, but I’m sure you can imagine it would take a really long time for ALL genes!
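
To get a rough feel for why, note that a feature-by-feature correlation matrix has one entry per pair of features, so its size grows with the square of the feature count. The sketch below is just arithmetic, assuming 8-byte float64 values; the helper name corr_matrix_bytes is made up for illustration.

```python
def corr_matrix_bytes(n_features, bytes_per_value=8):
    """Size in bytes of a dense n_features x n_features matrix,
    assuming float64 (8-byte) entries."""
    return n_features ** 2 * bytes_per_value


subset_bytes = corr_matrix_bytes(553)    # the variant subset
full_bytes = corr_matrix_bytes(14321)    # every feature in the study
ratio = full_bytes / subset_bytes

print("subset: %.1f MB" % (subset_bytes / 1e6))
print("full:   %.2f GB" % (full_bytes / 1e9))
print("ratio:  %.0fx" % ratio)
```

So the full matrix would hold roughly 670 times as many values as the subset, about 1.6 GB against about 2.4 MB, before counting any of the time needed to compute it.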

Now let’s assign this to the study.supplemental object with a name of our choice. To keep things simple, I’m going to give it the same name as the variable.

study.supplemental.variant_expression_corr = variant_expression_corr

Now let’s save the object and re-embark to make sure it’s there.

study.save('brainspan2')
study2 = flotilla.embark('brainspan2')

Wrote datapackage to /home/travis/flotilla_projects/brainspan2/datapackage.json
2015-06-09 22:44:04 Reading datapackage from /home/travis/flotilla_projects/brainspan2/datapackage.json
2015-06-09 22:44:04 Parsing datapackage to create a Study object
2015-06-09 22:44:10 Initializing Study
2015-06-09 22:44:10 Initializing Predictor configuration manager for Study
2015-06-09 22:44:10 Predictor ExtraTreesClassifier is of type <class 'sklearn.ensemble.forest.ExtraTreesClassifier'>
2015-06-09 22:44:10 Added ExtraTreesClassifier to default predictors
2015-06-09 22:44:10 Predictor ExtraTreesRegressor is of type <class 'sklearn.ensemble.forest.ExtraTreesRegressor'>
2015-06-09 22:44:10 Added ExtraTreesRegressor to default predictors
2015-06-09 22:44:10 Predictor GradientBoostingClassifier is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'>
2015-06-09 22:44:10 Added GradientBoostingClassifier to default predictors
2015-06-09 22:44:10 Predictor GradientBoostingRegressor is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingRegressor'>
2015-06-09 22:44:10 Added GradientBoostingRegressor to default predictors
2015-06-09 22:44:10 Loading metadata
2015-06-09 22:44:11 Loading expression data
2015-06-09 22:44:11 Initializing expression
2015-06-09 22:44:13 Done initializing expression
2015-06-09 22:44:16 Successfully initialized a Study object!

Let’s make sure our variant_expression_corr dataframe is really there.

study2.supplemental.variant_expression_corr.head()

	ENSG00000003137	ENSG00000004848	ENSG00000006016	ENSG00000006116	ENSG00000006128	ENSG00000006377	ENSG00000007350	ENSG00000016082	ENSG00000041353	ENSG00000041982	...	ENSG00000258283	ENSG00000258403	ENSG00000258444	ENSG00000258518	ENSG00000258752	ENSG00000259190	ENSG00000259279	ENSG00000259373	ENSG00000259410	ENSG00000259603
ENSG00000003137	1.000000	-0.046835	0.090661	0.053573	-0.047665	-0.155271	0.054222	-0.160111	-0.115487	-0.044074	...	-0.224868	-0.019139	-0.121898	-0.392903	-0.122529	0.273103	0.136641	0.448154	0.445432	0.041884
ENSG00000004848	-0.046835	1.000000	0.612271	0.707699	0.558755	0.671949	0.028770	0.148974	0.448605	0.001084	...	0.701812	0.328554	0.040970	-0.478688	-0.082581	-0.196485	0.135765	-0.662072	-0.570054	-0.623628
ENSG00000006016	0.090661	0.612271	1.000000	0.652296	0.585003	0.450687	0.037159	0.111290	0.414487	0.022681	...	0.650395	0.300988	0.233828	-0.391661	0.055337	-0.289575	0.060468	-0.418865	-0.398856	-0.348691
ENSG00000006116	0.053573	0.707699	0.652296	1.000000	0.516889	0.424020	-0.185337	-0.071044	0.469453	-0.299232	...	0.687966	0.526633	-0.027971	-0.569975	-0.209932	-0.020450	0.220832	-0.560163	-0.465723	-0.655024
ENSG00000006128	-0.047665	0.558755	0.585003	0.516889	1.000000	0.715297	0.000925	0.415798	0.564539	0.053909	...	0.716306	0.429350	0.264993	-0.256103	0.155573	-0.431776	-0.079680	-0.378022	-0.389113	-0.255839

5 rows × 553 columns

Yay, it’s here!
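
Looking at head() only shows five rows. If you want to compare the whole dataframe, one option is a small check like the sketch below. The helper frames_match is made up for illustration, and the round trip through CSV on toy data is only a stand-in: how flotilla actually stores supplemental data on disk is not shown here, so treat the CSV step as an assumption.

```python
import io

import numpy as np
import pandas as pd


def frames_match(original, reloaded, rtol=1e-9):
    """True if both dataframes have identical row/column labels
    and numerically close values (hypothetical helper)."""
    same_labels = (original.index.equals(reloaded.index)
                   and original.columns.equals(reloaded.columns))
    return bool(same_labels
                and np.allclose(original.values, reloaded.values, rtol=rtol))


# Toy correlation matrix standing in for variant_expression_corr
rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.normal(size=(30, 4)),
                   columns=["gene_a", "gene_b", "gene_c", "gene_d"])
toy_corr = toy.corr()

# Write to an in-memory CSV and read it back (assumed storage format)
buffer = io.StringIO()
toy_corr.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

print(frames_match(toy_corr, reloaded))
```

With the study from this page in memory, the same helper could be called as frames_match(variant_expression_corr, study2.supplemental.variant_expression_corr).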
