Storing supplemental data on ``Study`` objects
==============================================

A recently added feature lets you store any pandas DataFrame on
``study.supplemental``; it is reloaded every time you ``embark`` on that
datapackage.

Let's start with the batch-corrected BrainSpan data from the Allen Brain
Institute's Brain Atlas.

.. code:: python

    import flotilla
    study = flotilla.embark(flotilla._brainspan)

.. parsed-literal::

    Creating a directory for saving your flotilla projects: /home/travis/flotilla_projects
    Creating a directory for saving the data for this project: /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon
    https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/datapackage.json has not been downloaded before.
        Downloading now to /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon/datapackage.json
    2015-06-09 22:42:57 Parsing datapackage to create a Study object
    https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/expression_feature.csv has not been downloaded before.
        Downloading now to /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon/expression_feature.csv
    https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/expression.csv has not been downloaded before.
        Downloading now to /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon/expression.csv
    https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/metadata.csv has not been downloaded before.
        Downloading now to /home/travis/flotilla_projects/brainspan_filtered_and_markers_amazon/metadata.csv
    2015-06-09 22:43:12 Initializing Study
    2015-06-09 22:43:12 Initializing Predictor configuration manager for Study
    2015-06-09 22:43:12 Predictor ExtraTreesClassifier is of type
    2015-06-09 22:43:12 Added ExtraTreesClassifier to default predictors
    2015-06-09 22:43:12 Predictor ExtraTreesRegressor is of type
    2015-06-09 22:43:12 Added ExtraTreesRegressor to default predictors
    2015-06-09 22:43:12 Predictor GradientBoostingClassifier is of type
    2015-06-09 22:43:12 Added GradientBoostingClassifier to default predictors
    2015-06-09 22:43:12 Predictor GradientBoostingRegressor is of type
    2015-06-09 22:43:12 Added GradientBoostingRegressor to default predictors
    2015-06-09 22:43:12 Loading metadata
    2015-06-09 22:43:12 Loading expression data
    2015-06-09 22:43:12 Initializing expression
    2015-06-09 22:43:13 Done initializing expression
    2015-06-09 22:43:16 Successfully initialized a Study object!

Let's take a look at how big this expression matrix is.

.. code:: python

    study.expression.data.shape

.. parsed-literal::

    (493, 14321)

Yikes, 14,321 features is a lot! Let's subset on just the most variant
genes. By default, these are the genes whose variance lies more than two
standard deviations from the mean variance across all genes.

.. code:: python

    variant_ids = study.expression.feature_subsets['variant']
    # .loc selects columns by label (.ix is deprecated in newer pandas)
    variant_expression = study.expression.data.loc[:, variant_ids]
    variant_expression.shape

.. parsed-literal::

    (493, 553)

553 features isn't so bad. Let's correlate all features to each other in
this subset.

.. code:: python

    %%time
    variant_expression_corr = variant_expression.corr()
    variant_expression_corr.head()

.. parsed-literal::

    CPU times: user 393 ms, sys: 0 ns, total: 393 ms
    Wall time: 392 ms

That didn't take *too* long, but I'm sure you can imagine it would take a
really long time for ALL genes!

Now let's assign this to the ``study.supplemental`` object with a name of
our choice. To keep things simple, I'm going to give it the same name as the
variable.

.. code:: python

    study.supplemental.variant_expression_corr = variant_expression_corr

Now let's save the object and re-``embark`` to make sure it's there.

.. code:: python

    study.save('brainspan2')
    study2 = flotilla.embark('brainspan2')

.. parsed-literal::

    Wrote datapackage to /home/travis/flotilla_projects/brainspan2/datapackage.json
    2015-06-09 22:44:04 Reading datapackage from /home/travis/flotilla_projects/brainspan2/datapackage.json
    2015-06-09 22:44:04 Parsing datapackage to create a Study object
    2015-06-09 22:44:10 Initializing Study
    2015-06-09 22:44:10 Initializing Predictor configuration manager for Study
    2015-06-09 22:44:10 Predictor ExtraTreesClassifier is of type
    2015-06-09 22:44:10 Added ExtraTreesClassifier to default predictors
    2015-06-09 22:44:10 Predictor ExtraTreesRegressor is of type
    2015-06-09 22:44:10 Added ExtraTreesRegressor to default predictors
    2015-06-09 22:44:10 Predictor GradientBoostingClassifier is of type
    2015-06-09 22:44:10 Added GradientBoostingClassifier to default predictors
    2015-06-09 22:44:10 Predictor GradientBoostingRegressor is of type
    2015-06-09 22:44:10 Added GradientBoostingRegressor to default predictors
    2015-06-09 22:44:10 Loading metadata
    2015-06-09 22:44:11 Loading expression data
    2015-06-09 22:44:11 Initializing expression
    2015-06-09 22:44:13 Done initializing expression
    2015-06-09 22:44:16 Successfully initialized a Study object!

Let's make sure our ``variant_expression_corr`` dataframe is really there.

.. code:: python

    study2.supplemental.variant_expression_corr.head()

.. parsed-literal::

                     ENSG00000003137  ENSG00000004848  ENSG00000006016  ENSG00000006116  ENSG00000006128  ENSG00000006377  ENSG00000007350  ENSG00000016082  ENSG00000041353  ENSG00000041982  ...  ENSG00000258283  ENSG00000258403  ENSG00000258444  ENSG00000258518  ENSG00000258752  ENSG00000259190  ENSG00000259279  ENSG00000259373  ENSG00000259410  ENSG00000259603
    ENSG00000003137         1.000000        -0.046835         0.090661         0.053573        -0.047665        -0.155271         0.054222        -0.160111        -0.115487        -0.044074  ...        -0.224868        -0.019139        -0.121898        -0.392903        -0.122529         0.273103         0.136641         0.448154         0.445432         0.041884
    ENSG00000004848        -0.046835         1.000000         0.612271         0.707699         0.558755         0.671949         0.028770         0.148974         0.448605         0.001084  ...         0.701812         0.328554         0.040970        -0.478688        -0.082581        -0.196485         0.135765        -0.662072        -0.570054        -0.623628
    ENSG00000006016         0.090661         0.612271         1.000000         0.652296         0.585003         0.450687         0.037159         0.111290         0.414487         0.022681  ...         0.650395         0.300988         0.233828        -0.391661         0.055337        -0.289575         0.060468        -0.418865        -0.398856        -0.348691
    ENSG00000006116         0.053573         0.707699         0.652296         1.000000         0.516889         0.424020        -0.185337        -0.071044         0.469453        -0.299232  ...         0.687966         0.526633        -0.027971        -0.569975        -0.209932        -0.020450         0.220832        -0.560163        -0.465723        -0.655024
    ENSG00000006128        -0.047665         0.558755         0.585003         0.516889         1.000000         0.715297         0.000925         0.415798         0.564539         0.053909  ...         0.716306         0.429350         0.264993        -0.256103         0.155573        -0.431776        -0.079680        -0.378022        -0.389113        -0.255839

    5 rows × 553 columns

Yay, it's here!
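
If you want to try the same workflow without downloading BrainSpan, here is a
minimal pandas-only sketch on synthetic data. It makes several assumptions:
the ``gene_N`` names and the random matrix are made up, the cutoff is taken as
two standard deviations *above* the mean variance (one reading of flotilla's
default), and the CSV round trip is only a stand-in for ``study.save`` and
``embark``, not necessarily how flotilla stores supplemental data on disk.

.. code:: python

    import os
    import tempfile

    import numpy as np
    import pandas as pd

    # Synthetic stand-in for study.expression.data (samples x features).
    # The first five hypothetical "genes" are made deliberately noisy.
    rng = np.random.RandomState(0)
    columns = ['gene_{}'.format(i) for i in range(200)]
    scales = np.ones(200)
    scales[:5] = 10.0
    expression = pd.DataFrame(rng.normal(size=(50, 200)) * scales,
                              columns=columns)

    # Assumed rule: keep features whose variance is more than two
    # standard deviations above the mean variance of all features.
    variance = expression.var()
    cutoff = variance.mean() + 2 * variance.std()
    variant_ids = variance[variance > cutoff].index
    variant_expression = expression.loc[:, variant_ids]

    # Pairwise Pearson correlation between the selected features.
    variant_expression_corr = variant_expression.corr()

    # Stand-in for save + embark: write the table out and read it back.
    path = os.path.join(tempfile.mkdtemp(), 'variant_expression_corr.csv')
    variant_expression_corr.to_csv(path)
    reloaded = pd.read_csv(path, index_col=0)

    # The reloaded table should match what we stored.
    print(np.allclose(reloaded.values, variant_expression_corr.values))

The correlation matrix is square in the number of selected features, which is
why subsetting before correlating keeps both the compute time and the stored
table small.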