In the interest of reproducibility, and to showcase our new package
``flotilla``, I've reproduced many figures from the landmark single-cell
paper, *Single-cell transcriptomics reveals bimodality in expression and
splicing in immune cells* by Shalek and Satija, *et al*., *Nature* (2013).

Before we begin, let's import everything we need.

.. code:: python

    # Turn on inline plots with IPython
    %matplotlib inline

    # Import the flotilla package for biological data analysis
    import flotilla

    # Import "numerical python" library for number crunching
    import numpy as np

    # Import "panel data analysis" library for tabular data
    import pandas as pd

    # Import statistical data visualization package
    # Note: As of November 6th, 2014, you will need the "master" version of
    # seaborn on github (v0.5.dev), installed via
    # "pip install git+ssh://git@github.com/mwaskom/seaborn.git"
    import seaborn as sns

Shalek and Satija, *et al* (2013)
=================================

In the 2013 paper, *Single-cell transcriptomics reveals bimodality in
expression and splicing in immune cells* (Shalek and Satija, *et al*.,
*Nature* (2013)), Regev and colleagues performed single-cell sequencing of 18
bone marrow-derived dendritic cells (BMDCs), in addition to 3 pooled samples.

Expression data
---------------

First, we will read in the expression data. These data were obtained using,

.. code:: python

    %%bash
    wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE41nnn/GSE41265/suppl/GSE41265_allGenesTPM.txt.gz

.. parsed-literal::

    --2015-06-09 22:42:47--  ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE41nnn/GSE41265/suppl/GSE41265_allGenesTPM.txt.gz
               => `GSE41265_allGenesTPM.txt.gz'
    Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 2607:f220:41e:250::13, 130.14.250.7
    Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|2607:f220:41e:250::13|:21... connected.
    Logging in as anonymous ... Logged in!
    ==> RETR GSE41265_allGenesTPM.txt.gz ... done.
    Length: 1099290 (1.0M) (unauthoritative)
    2015-06-09 22:42:47 (9.77 MB/s) - `GSE41265_allGenesTPM.txt.gz' saved [1099290]

We will also compare to the supplementary table 2 data, obtained using

.. code:: python

    %%bash
    wget http://www.nature.com/nature/journal/v498/n7453/extref/nature12172-s1.zip
    unzip nature12172-s1.zip

.. parsed-literal::

    Archive:  nature12172-s1.zip
       creating: nature12172-s1/
      inflating: nature12172-s1/Supplementary_Table1.xls
      inflating: nature12172-s1/Supplementary_Table2.xlsx
      inflating: nature12172-s1/Supplementary_Table3.xls
      inflating: nature12172-s1/Supplementary_Table4.xls
      inflating: nature12172-s1/Supplementary_Table5.xls
      inflating: nature12172-s1/Supplementary_Table6.xls
      inflating: nature12172-s1/Supplementary_Table7.xlsx

.. parsed-literal::

    --2015-06-09 22:42:47--  http://www.nature.com/nature/journal/v498/n7453/extref/nature12172-s1.zip
    Resolving www.nature.com (www.nature.com)... 63.233.110.66, 63.233.110.80
    Connecting to www.nature.com (www.nature.com)|63.233.110.66|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 4634226 (4.4M) [application/zip]
    Saving to: `nature12172-s1.zip'
    2015-06-09 22:42:47 (37.7 MB/s) - `nature12172-s1.zip' saved [4634226/4634226]

.. code:: python

    expression = pd.read_table("GSE41265_allGenesTPM.txt.gz",
                               compression="gzip", index_col=0)
    expression.head()

.. raw:: html
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 ... S12 S13 S14 S15 S16 S17 S18 P1 P2 P3
GENE
XKR4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.019906 0.000000
AB338584 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
B3GAT2 0.000000 0.000000 0.023441 0.000000 0.000000 0.029378 0.000000 0.055452 0.000000 0.029448 ... 0.000000 0.000000 0.031654 0.000000 0.000000 0.000000 42.150208 0.680327 0.022996 0.110236
NPL 72.008590 0.000000 128.062012 0.095082 0.000000 0.000000 112.310234 104.329122 0.119230 0.000000 ... 0.000000 0.116802 0.104200 0.106188 0.229197 0.110582 0.000000 7.109356 6.727028 14.525447
T2 0.109249 0.172009 0.000000 0.000000 0.182703 0.076012 0.078698 0.000000 0.093698 0.076583 ... 0.693459 0.010137 0.081936 0.000000 0.000000 0.086879 0.068174 0.062063 0.000000 0.050605

5 rows × 21 columns

These data are in "transcripts per million" (TPM) units. These data are
formatted with samples on the columns, and genes on the rows. But we want the
opposite, with samples on the rows and genes on the columns. This follows
``scikit-learn``'s standard of data matrices with size (``n_samples``,
``n_features``), as each gene is a feature. So we will simply transpose this.

.. code:: python

    expression = expression.T
    expression.head()

.. raw:: html
GENE XKR4 AB338584 B3GAT2 NPL T2 T PDE10A 1700010I14RIK 6530411M01RIK PABPC6 ... AK085062 DHX9 RNASET2B FGFR1OP CCR6 BRP44L AK014435 AK015714 SFT2D1 PRR18
S1 0 0 0.000000 72.008590 0.109249 0 0 0 0 0 ... 0 0.774638 23.520936 0.000000 0 460.316773 0 0.000000 39.442566 0
S2 0 0 0.000000 0.000000 0.172009 0 0 0 0 0 ... 0 0.367391 1.887873 0.000000 0 823.890290 0 0.000000 4.967412 0
S3 0 0 0.023441 128.062012 0.000000 0 0 0 0 0 ... 0 0.249858 0.313510 0.166772 0 1002.354241 0 0.000000 0.000000 0
S4 0 0 0.000000 0.095082 0.000000 0 0 0 0 0 ... 0 0.354157 0.000000 0.887003 0 1230.766795 0 0.000000 0.131215 0
S5 0 0 0.000000 0.000000 0.182703 0 0 0 0 0 ... 0 0.039263 0.000000 131.077131 0 1614.749122 0 0.242179 95.485743 0

5 rows × 27723 columns

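As a quick sanity check (this step isn't in the original analysis), TPM
values for a sample should sum to roughly one million, so after transposing,
each row should sum to approximately :math:`10^6`:

.. code:: python

    # Each row is now one sample, so sum over the columns (genes);
    # values should be close to 1e6 if these really are TPM
    expression.sum(axis=1).head()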
The authors filtered the expression data, requiring that genes be expressed
at TPM (transcripts per million) > 1 in at least 3 single cells. We can
express this easily using ``pandas`` DataFrames. First, from reading the
paper and looking at the data, I know there are 18 single cells, and there
are 18 samples that start with the letter "S." So I will extract the single
samples from the ``index`` (row names) using a ``lambda``, a tiny function
which, in this case, tells me whether or not that sample id begins with the
letter "S".

.. code:: python

    singles_ids = expression.index[expression.index.map(lambda x: x.startswith('S'))]
    print('number of single cells:', len(singles_ids))
    singles = expression.ix[singles_ids]

    expression_filtered = expression.ix[:, singles[singles > 1].count() >= 3]
    expression_filtered = np.log(expression_filtered + 1)
    expression_filtered.shape

.. parsed-literal::

    ('number of single cells:', 18)

.. parsed-literal::

    (21, 6312)

Hmm, that's strange. The paper states that they had 6313 genes after
filtering, but I get 6312. Even using "``singles >= 1``" doesn't help. (I
also tried this with the expression table provided in the supplementary data
as "``Supplementary_Table2.xlsx``," and got the same results.)

Now that we've taken care of importing and filtering the expression data,
let's take care of the feature data of the expression data.

Expression feature data
-----------------------

This is similar to the ``fData`` from ``Bioconductor``, where there's some
additional data on your features that you want to look at. They uploaded
information about the features in their OTHER expression matrix, uploaded as
a supplementary file, ``Supplementary_Table2.xlsx``.

Notice that the file we read below is a ``csv`` and not an ``xlsx``. This is
because Excel mangled the gene IDs that started with ``201*`` and assumed
they were dates :( The workaround I did was to add another column to the
sheet with the formula ``="'" & A1``, press ``Command``-``Shift``-``End`` to
select the end of the rows, and then do ``Ctrl``-``D`` to "fill down" to the
bottom (thanks to a stackoverflow post for teaching me how to Excel). Then, I
saved the file as a ``csv`` for maximum portability and compatibility.

So sorry, this requires some non-programming editing! But I've posted the csv
to our `github repo <https://github.com/YeoLab/shalek2013>`_ with all the
data, and we'll access it from there.

.. code:: python

    expression2 = pd.read_csv('https://raw.githubusercontent.com/YeoLab/shalek2013/master/Supplementary_Table2.csv',
                              # Need to specify the index column as both the first and the last columns,
                              # Because the last column is the "Gene Category"
                              index_col=[0, -1],
                              parse_dates=False, infer_datetime_format=False)

    # This was also in features x samples format, so we need to transpose
    expression2 = expression2.T
    expression2.head()

.. raw:: html
'GENE '0610007L01RIK '0610007P14RIK '0610007P22RIK '0610008F07RIK '0610009B22RIK '0610009D07RIK '0610009O20RIK '0610010B08RIK '0610010F05RIK '0610010K06RIK ... 'ZWILCH 'ZWINT 'ZXDA 'ZXDB 'ZXDC 'ZYG11A 'ZYG11B 'ZYX 'ZZEF1 'ZZZ3
Gene Category NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S1 27.181570 0.166794 0 0 0.000000 178.852732 0 0.962417 0.000000 143.359550 ... 0.000000 302.361227 0.000000 0 0 0 0.027717 297.918756 37.685501 0.000000
S2 37.682691 0.263962 0 0 0.207921 0.141099 0 0.000000 0.000000 0.255617 ... 0.000000 96.033724 0.020459 0 0 0 0.042430 0.242888 0.000000 0.000000
S3 0.056916 78.622459 0 0 0.145680 0.396363 0 0.000000 0.024692 72.775846 ... 0.000000 427.915555 0.000000 0 0 0 0.040407 6.753530 0.132011 0.017615
S4 55.649250 0.228866 0 0 0.000000 88.798158 0 0.000000 0.000000 93.825442 ... 0.000000 9.788557 0.017787 0 0 0 0.013452 0.274689 9.724890 0.000000
S5 0.000000 0.093117 0 0 131.326008 155.936361 0 0.000000 0.000000 0.031029 ... 0.204522 26.575760 0.000000 0 0 0 1.101589 59.256094 44.430726 0.000000

5 rows × 27723 columns

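Because we read the file with ``index_col=[0, -1]``, the columns of
``expression2`` are a two-level ``MultiIndex`` of (gene id, gene category).
As a quick look (a sketch, not in the original notebook), the quoted gene
names live in level 0:

.. code:: python

    # Level 0 holds the (still-quoted) gene ids, level 1 the "Gene Category"
    expression2.columns.levels[0][:5]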
Now we need to strip the single-quote I added to all the gene names:

.. code:: python

    new_index, indexer = expression2.columns.reindex(map(lambda x: (x[0].lstrip("'"), x[1]),
                                                         expression2.columns.values))
    expression2.columns = new_index
    expression2.head()

.. raw:: html
'GENE 0610007L01RIK 0610007P14RIK 0610007P22RIK 0610008F07RIK 0610009B22RIK 0610009D07RIK 0610009O20RIK 0610010B08RIK 0610010F05RIK 0610010K06RIK ... ZWILCH ZWINT ZXDA ZXDB ZXDC ZYG11A ZYG11B ZYX ZZEF1 ZZZ3
Gene Category NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S1 27.181570 0.166794 0 0 0.000000 178.852732 0 0.962417 0.000000 143.359550 ... 0.000000 302.361227 0.000000 0 0 0 0.027717 297.918756 37.685501 0.000000
S2 37.682691 0.263962 0 0 0.207921 0.141099 0 0.000000 0.000000 0.255617 ... 0.000000 96.033724 0.020459 0 0 0 0.042430 0.242888 0.000000 0.000000
S3 0.056916 78.622459 0 0 0.145680 0.396363 0 0.000000 0.024692 72.775846 ... 0.000000 427.915555 0.000000 0 0 0 0.040407 6.753530 0.132011 0.017615
S4 55.649250 0.228866 0 0 0.000000 88.798158 0 0.000000 0.000000 93.825442 ... 0.000000 9.788557 0.017787 0 0 0 0.013452 0.274689 9.724890 0.000000
S5 0.000000 0.093117 0 0 131.326008 155.936361 0 0.000000 0.000000 0.031029 ... 0.204522 26.575760 0.000000 0 0 0 1.101589 59.256094 44.430726 0.000000

5 rows × 27723 columns

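As a side note, the same quote-stripping can be written with
``MultiIndex.set_levels``, which only touches the gene-name level. This is
just an equivalent sketch, assuming a pandas version that provides
``set_levels``:

.. code:: python

    # Strip the leading quote from level 0 (the gene ids) only
    stripped = expression2.columns.levels[0].map(lambda x: x.lstrip("'"))
    expression2.columns = expression2.columns.set_levels(stripped, level=0)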
We want to create a ``pandas.DataFrame`` from the "Gene Category" row for our
``expression_feature_data``, which we will do via:

.. code:: python

    gene_ids, gene_category = zip(*expression2.columns.values)
    gene_categories = pd.Series(gene_category, index=gene_ids, name='gene_category')
    gene_categories

.. parsed-literal::

    0610007L01RIK             NaN
    0610007P14RIK             NaN
    0610007P22RIK             NaN
    0610008F07RIK             NaN
    0610009B22RIK             NaN
    0610009D07RIK             NaN
    0610009O20RIK             NaN
    0610010B08RIK             NaN
    0610010F05RIK             NaN
    0610010K06RIK             NaN
    0610010K14RIK             NaN
    0610010O12RIK             NaN
    0610011F06RIK             NaN
    0610011L14RIK             NaN
    0610012G03RIK             NaN
    0610012H03RIK             NaN
    0610030E20RIK             NaN
    0610031J06RIK             NaN
    0610037L13RIK             NaN
    0610037P05RIK             NaN
    0610038B21RIK             NaN
    0610039K10RIK             NaN
    0610040B10RIK             NaN
    0610040J01RIK             NaN
    0910001L09RIK             NaN
    100043387                 NaN
    1100001G20RIK             NaN
    1110001A16RIK             NaN
    1110001J03RIK             NaN
    1110002B05RIK             NaN
                         ...
    ZSCAN20                   NaN
    ZSCAN21                   NaN
    ZSCAN22                   NaN
    ZSCAN29                   NaN
    ZSCAN30                   NaN
    ZSCAN4B                   NaN
    ZSCAN4C                   NaN
    ZSCAN4D                   NaN
    ZSCAN4E                   NaN
    ZSCAN4F                   NaN
    ZSCAN5B                   NaN
    ZSWIM1                    NaN
    ZSWIM2                    NaN
    ZSWIM3                    NaN
    ZSWIM4                    NaN
    ZSWIM5                    NaN
    ZSWIM6                    NaN
    ZSWIM7                    NaN
    ZUFSP            LPS Response
    ZW10                      NaN
    ZWILCH                    NaN
    ZWINT                     NaN
    ZXDA                      NaN
    ZXDB                      NaN
    ZXDC                      NaN
    ZYG11A                    NaN
    ZYG11B                    NaN
    ZYX                       NaN
    ZZEF1                     NaN
    ZZZ3                      NaN
    Name: gene_category, dtype: object

.. code:: python

    expression_feature_data = pd.DataFrame(gene_categories)
    expression_feature_data.head()

.. raw:: html
gene_category
0610007L01RIK NaN
0610007P14RIK NaN
0610007P22RIK NaN
0610008F07RIK NaN
0610009B22RIK NaN
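Most genes have no category at all, so a quick tally of the non-null values
shows which annotations are present (a sketch, not part of the original
notebook):

.. code:: python

    # value_counts drops NaN by default, so this counts only annotated
    # genes, e.g. those in the "LPS Response" category
    expression_feature_data['gene_category'].value_counts()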
Splicing Data
-------------

We obtain the splicing data for this study from the supplementary
information, specifically the ``Supplementary_Table4.xls`` file. (Reading
``.xls`` files with ``pandas`` requires the ``xlrd`` package.)

.. code:: python

    splicing = pd.read_excel('nature12172-s1/Supplementary_Table4.xls',
                             'splicingTable.txt', index_col=(0, 1))
    splicing.head()

.. code:: python

    splicing = splicing.T
    splicing

.. raw:: html
Event name chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ chr10:79173370:79173665:+@chr10:79174001:79174029:+@chr10:79174239:79174726:+ chr10:79322526:79322700:+@chr10:79322862:79322939:+@chr10:79323569:79323862:+ chr10:87376364:87376545:+@chr10:87378043:87378094:+@chr10:87393420:87399792:+ chr10:92747514:92747722:-@chr10:92727625:92728425:-@chr10:92717434:92717556:- chr11:101438508:101438565:+@chr11:101439246:101439351:+@chr11:101441899:101443267:+ ... chr8:126022488:126022598:+@chr8:126023892:126024007:+@chr8:126025133:126025333:+ chr14:51455667:51455879:-@chr14:51453589:51453752:-@chr14:51453129:51453242:- chr17:29497858:29498102:+@chr17:29500656:29500887:+@chr17:29501856:29502226:+ chr2:94198908:94199094:-@chr2:94182784:94182954:-@chr2:94172950:94173209:- chr9:21314438:21314697:-@chr9:21313375:21313558:-@chr9:21311823:21312835:- chr9:21314438:21314697:-@chr9:21313375:21313795:-@chr9:21311823:21312835:- chr10:79545360:79545471:-@chr10:79542698:79544127:-@chr10:79533365:79535263:- chr17:5975579:5975881:+@chr17:5985972:5986242:+@chr17:5990136:5990361:+ chr2:29997782:29997941:+@chr2:30002172:30002382:+@chr2:30002882:30003045:+ chr7:119221306:119221473:+@chr7:119223686:119223745:+@chr7:119225944:119226075:+
gene Os9 Vta1 Bclaf1 Bclaf1 P4ha1 Bsg Ptbp1 Igf1 Elk3 Nbr1 ... Afg3l1 Tep1 Fgd2 Ttc17 Tmed1 Tmed1 Sbno2 Synj2 Tbc1d13 Usp47
S1 0.84 0.95 NaN 0.02 0.42 NaN 0.57 0.31 0.93 0.57 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S2 NaN NaN 0.04 0.98 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S3 NaN NaN 0.02 0.55 NaN NaN NaN 0.20 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S4 NaN 0.84 NaN NaN NaN NaN NaN 0.95 NaN 0.04 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S5 NaN 0.95 NaN NaN 0.94 NaN NaN 0.73 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S6 0.01 0.91 0.14 NaN NaN NaN NaN 0.61 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S7 NaN 0.87 NaN NaN NaN 0.62 NaN 0.85 0.73 0.55 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S8 NaN 0.86 0.02 0.98 0.03 NaN NaN 0.89 0.82 0.83 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S9 NaN NaN NaN NaN 0.97 NaN 0.97 NaN 0.90 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S10 NaN NaN NaN NaN NaN NaN 0.06 0.98 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S11 0.03 0.93 NaN NaN NaN NaN NaN NaN 0.97 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S14 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.88 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S15 0.02 0.96 0.01 0.06 NaN NaN NaN 0.44 NaN NaN ... 0.91 NaN NaN NaN NaN NaN NaN NaN NaN NaN
S16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 0.27 0.99 0.99 0.98 0.98 NaN NaN NaN NaN
S17 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN 0.96 NaN NaN NaN 0.99 0.98 0.67 0.07
S18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10,000 cell Rep1 (P1) 0.27 0.83 0.40 0.62 0.43 0.78 NaN 0.60 0.76 0.52 ... 0.92 NaN 0.81 0.77 NaN NaN 0.84 0.50 0.56 NaN
10,000 cell Rep2 (P2) 0.37 0.85 0.49 0.63 0.36 0.72 0.47 0.60 0.73 0.68 ... 0.67 0.15 0.52 0.67 0.63 0.73 0.82 0.90 0.71 0.55
10,000 cell Rep3 (P3) 0.31 0.64 0.59 0.70 0.52 0.79 NaN 0.65 0.42 0.64 ... 0.58 0.79 0.74 0.85 0.73 0.39 0.56 NaN 0.64 NaN

20 rows × 352 columns

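Notice how much sparser the single-cell rows are than the pooled rows. As an
aside (a sketch, not in the original notebook), you can quantify the number
of detected events per sample:

.. code:: python

    # Number of splicing events with a measured (non-NaN) PSI per sample
    splicing.count(axis=1)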
The three pooled samples aren't named consistently with the expression data,
so we have to fix that.

.. code:: python

    splicing.index[splicing.index.map(lambda x: 'P' in x)]

.. parsed-literal::

    Index([u'10,000 cell Rep1 (P1)', u'10,000 cell Rep2 (P2)', u'10,000 cell Rep3 (P3)'], dtype='object')

We can get the "P" and the number after it using regular expressions, via the
``re`` module in the Python standard library, e.g.:

.. code:: python

    import re

    re.search(r'P\d', '10,000 cell Rep1 (P1)').group()

.. parsed-literal::

    'P1'

.. code:: python

    def long_pooled_name_to_short(x):
        if 'P' not in x:
            return x
        else:
            return re.search(r'P\d', x).group()

    splicing.index.map(long_pooled_name_to_short)

.. parsed-literal::

    array([u'S1', u'S2', u'S3', u'S4', u'S5', u'S6', u'S7', u'S8', u'S9',
           u'S10', u'S11', u'S13', u'S14', u'S15', u'S16', u'S17', u'S18',
           u'P1', u'P2', u'P3'], dtype=object)

And now we assign this new index as the index of the ``splicing`` dataframe:

.. code:: python

    splicing.index = splicing.index.map(long_pooled_name_to_short)
    splicing.head()

.. raw:: html
Event name chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ chr10:79173370:79173665:+@chr10:79174001:79174029:+@chr10:79174239:79174726:+ chr10:79322526:79322700:+@chr10:79322862:79322939:+@chr10:79323569:79323862:+ chr10:87376364:87376545:+@chr10:87378043:87378094:+@chr10:87393420:87399792:+ chr10:92747514:92747722:-@chr10:92727625:92728425:-@chr10:92717434:92717556:- chr11:101438508:101438565:+@chr11:101439246:101439351:+@chr11:101441899:101443267:+ ... chr8:126022488:126022598:+@chr8:126023892:126024007:+@chr8:126025133:126025333:+ chr14:51455667:51455879:-@chr14:51453589:51453752:-@chr14:51453129:51453242:- chr17:29497858:29498102:+@chr17:29500656:29500887:+@chr17:29501856:29502226:+ chr2:94198908:94199094:-@chr2:94182784:94182954:-@chr2:94172950:94173209:- chr9:21314438:21314697:-@chr9:21313375:21313558:-@chr9:21311823:21312835:- chr9:21314438:21314697:-@chr9:21313375:21313795:-@chr9:21311823:21312835:- chr10:79545360:79545471:-@chr10:79542698:79544127:-@chr10:79533365:79535263:- chr17:5975579:5975881:+@chr17:5985972:5986242:+@chr17:5990136:5990361:+ chr2:29997782:29997941:+@chr2:30002172:30002382:+@chr2:30002882:30003045:+ chr7:119221306:119221473:+@chr7:119223686:119223745:+@chr7:119225944:119226075:+
gene Os9 Vta1 Bclaf1 Bclaf1 P4ha1 Bsg Ptbp1 Igf1 Elk3 Nbr1 ... Afg3l1 Tep1 Fgd2 Ttc17 Tmed1 Tmed1 Sbno2 Synj2 Tbc1d13 Usp47
S1 0.84 0.95 NaN 0.02 0.42 NaN 0.57 0.31 0.93 0.57 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S2 NaN NaN 0.04 0.98 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S3 NaN NaN 0.02 0.55 NaN NaN NaN 0.20 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S4 NaN 0.84 NaN NaN NaN NaN NaN 0.95 NaN 0.04 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S5 NaN 0.95 NaN NaN 0.94 NaN NaN 0.73 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 352 columns

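Every splicing sample id now matches the ``expression`` naming scheme (note
that the splicing table has no S12). A quick check, as a sketch:

.. code:: python

    # Every splicing sample id should also appear in the expression index
    splicing.index.isin(expression.index).all()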
Remove Multi-index columns
~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently, ``flotilla`` only supports non-multi-index DataFrames. This means
that we need to change the columns of ``splicing`` to just the unique event
name. We'll save the mapping as ``splicing_feature_data``, which will rename
each unwieldy feature id to a readable gene name.

Splicing Feature Data
~~~~~~~~~~~~~~~~~~~~~

First, let's extract the event names and gene names from ``splicing``.

.. code:: python

    event_names, gene_names = zip(*splicing.columns.tolist())

.. code:: python

    event_names[:10]

.. parsed-literal::

    (u'chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:-',
     u'chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:-',
     u'chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+',
     u'chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+',
     u'chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+',
     u'chr10:79173370:79173665:+@chr10:79174001:79174029:+@chr10:79174239:79174726:+',
     u'chr10:79322526:79322700:+@chr10:79322862:79322939:+@chr10:79323569:79323862:+',
     u'chr10:87376364:87376545:+@chr10:87378043:87378094:+@chr10:87393420:87399792:+',
     u'chr10:92747514:92747722:-@chr10:92727625:92728425:-@chr10:92717434:92717556:-',
     u'chr11:101438508:101438565:+@chr11:101439246:101439351:+@chr11:101441899:101443267:+')

.. code:: python

    gene_names[:10]

.. parsed-literal::

    (u'Os9',
     u'Vta1',
     u'Bclaf1',
     u'Bclaf1',
     u'P4ha1',
     u'Bsg',
     u'Ptbp1',
     u'Igf1',
     u'Elk3',
     u'Nbr1')

Now we can rename the columns of ``splicing`` easily:

.. code:: python

    splicing.columns = event_names
    splicing.head()

.. raw:: html
chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ chr10:79173370:79173665:+@chr10:79174001:79174029:+@chr10:79174239:79174726:+ chr10:79322526:79322700:+@chr10:79322862:79322939:+@chr10:79323569:79323862:+ chr10:87376364:87376545:+@chr10:87378043:87378094:+@chr10:87393420:87399792:+ chr10:92747514:92747722:-@chr10:92727625:92728425:-@chr10:92717434:92717556:- chr11:101438508:101438565:+@chr11:101439246:101439351:+@chr11:101441899:101443267:+ ... chr8:126022488:126022598:+@chr8:126023892:126024007:+@chr8:126025133:126025333:+ chr14:51455667:51455879:-@chr14:51453589:51453752:-@chr14:51453129:51453242:- chr17:29497858:29498102:+@chr17:29500656:29500887:+@chr17:29501856:29502226:+ chr2:94198908:94199094:-@chr2:94182784:94182954:-@chr2:94172950:94173209:- chr9:21314438:21314697:-@chr9:21313375:21313558:-@chr9:21311823:21312835:- chr9:21314438:21314697:-@chr9:21313375:21313795:-@chr9:21311823:21312835:- chr10:79545360:79545471:-@chr10:79542698:79544127:-@chr10:79533365:79535263:- chr17:5975579:5975881:+@chr17:5985972:5986242:+@chr17:5990136:5990361:+ chr2:29997782:29997941:+@chr2:30002172:30002382:+@chr2:30002882:30003045:+ chr7:119221306:119221473:+@chr7:119223686:119223745:+@chr7:119225944:119226075:+
S1 0.84 0.95 NaN 0.02 0.42 NaN 0.57 0.31 0.93 0.57 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S2 NaN NaN 0.04 0.98 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S3 NaN NaN 0.02 0.55 NaN NaN NaN 0.20 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S4 NaN 0.84 NaN NaN NaN NaN NaN 0.95 NaN 0.04 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S5 NaN 0.95 NaN NaN 0.94 NaN NaN 0.73 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 352 columns

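Since the whole point was for each column to be a unique event name, it's
worth verifying that no event name is repeated (a sketch):

.. code:: python

    # True if the event-name columns contain no duplicates
    splicing.columns.is_unique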
Now let's create ``splicing_feature_data`` to map these event names to the
gene names, and to the ``gene_category`` from before.

.. code:: python

    splicing_feature_data = pd.DataFrame(index=event_names)
    splicing_feature_data['gene_name'] = gene_names
    splicing_feature_data.head()

.. raw:: html
gene_name
chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- Os9
chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- Vta1
chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ Bclaf1
chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ Bclaf1
chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ P4ha1
One thing we need to keep in mind is that the gene names in the
``expression`` data were uppercase. We can convert our gene names to
uppercase with:

.. code:: python

    splicing_feature_data['gene_name'] = splicing_feature_data['gene_name'].str.upper()
    splicing_feature_data.head()

.. raw:: html
gene_name
chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- OS9
chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- VTA1
chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ BCLAF1
chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ BCLAF1
chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ P4HA1
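Before joining these with the expression feature data, we can check how many
of the uppercased gene names actually appear in the expression feature index
(a sketch, not in the original notebook):

.. code:: python

    # Number of splicing events whose gene name matches an expression gene id
    splicing_feature_data['gene_name'].isin(expression_feature_data.index).sum()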
Now let's get the ``gene_category`` of these genes by doing a ``join`` on the
splicing feature data and the expression feature data. With
``on='gene_name'``, the values in that column are matched against the index
of ``expression_feature_data``.

.. code:: python

    splicing_feature_data = splicing_feature_data.join(expression_feature_data, on='gene_name')
    splicing_feature_data.head()

.. raw:: html
gene_name gene_category
chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- OS9 NaN
chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- VTA1 NaN
chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ BCLAF1 NaN
chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ BCLAF1 NaN
chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ P4HA1 LPS Response
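To see how many splicing events actually picked up an annotation from the
join (a quick sketch):

.. code:: python

    # Events whose gene has a non-null category, e.g. "LPS Response"
    splicing_feature_data['gene_category'].notnull().sum()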
Now we have the **gene\_category** encoded in the splicing data as well!

Metadata
--------

Now let's get into creating a metadata dataframe. We'll use the index from
the ``expression_filtered`` data to create the minimum required column,
``'phenotype'``, which has the name of the phenotype of that cell. And we'll
also add the column ``'pooled'`` to indicate whether each sample is pooled or
not.

.. code:: python

    metadata = pd.DataFrame(index=expression_filtered.index)
    metadata['phenotype'] = 'BDMC'
    metadata['pooled'] = metadata.index.map(lambda x: x.startswith('P'))
    metadata

.. raw:: html
phenotype pooled
S1 BDMC False
S2 BDMC False
S3 BDMC False
S4 BDMC False
S5 BDMC False
S6 BDMC False
S7 BDMC False
S8 BDMC False
S9 BDMC False
S10 BDMC False
S11 BDMC False
S12 BDMC False
S13 BDMC False
S14 BDMC False
S15 BDMC False
S16 BDMC False
S17 BDMC False
S18 BDMC False
P1 BDMC True
P2 BDMC True
P3 BDMC True
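As a quick check (a sketch), exactly the three pooled samples should have
been flagged:

.. code:: python

    # Should be 3, for P1, P2 and P3
    metadata['pooled'].sum()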
Mapping stats data
------------------

.. code:: python

    mapping_stats = pd.read_excel('nature12172-s1/Supplementary_Table1.xls',
                                  sheetname='SuppTable1 2.txt')
    mapping_stats

.. raw:: html
Sample PF_READS PCT_MAPPED_GENOME PCT_RIBOSOMAL_BASES MEDIAN_CV_COVERAGE MEDIAN_5PRIME_BIAS MEDIAN_3PRIME_BIAS MEDIAN_5PRIME_TO_3PRIME_BIAS
0 S1 21326048 0.706590 0.006820 0.509939 0.092679 0.477321 0.247741
1 S2 27434011 0.745385 0.004111 0.565732 0.056583 0.321053 0.244062
2 S3 31142391 0.722087 0.006428 0.540341 0.079551 0.382286 0.267367
3 S4 26231852 0.737854 0.004959 0.530978 0.067041 0.351670 0.279782
4 S5 29977214 0.746466 0.006121 0.525598 0.066543 0.353995 0.274252
5 S6 24148387 0.730079 0.008794 0.529650 0.072095 0.413696 0.225929
6 S7 24078116 0.730638 0.007945 0.540913 0.051991 0.358597 0.201984
7 S8 25032126 0.739989 0.004133 0.512725 0.058783 0.373509 0.212337
8 S9 22257682 0.747427 0.004869 0.521622 0.063566 0.334294 0.240641
9 S10 29436289 0.748795 0.005499 0.560454 0.036219 0.306729 0.187479
10 S11 31130278 0.741882 0.002740 0.558882 0.049581 0.349191 0.211787
11 S12 21161595 0.750782 0.006837 0.756339 0.013878 0.324264 0.195430
12 S13 28612833 0.733976 0.011718 0.598687 0.035392 0.357447 0.198566
13 S14 26351189 0.748323 0.004106 0.517518 0.070293 0.381095 0.259122
14 S15 25739575 0.748421 0.003353 0.526238 0.050938 0.324207 0.212366
15 S16 26802346 0.739833 0.009370 0.520287 0.071503 0.358758 0.240009
16 S17 26343522 0.749358 0.003155 0.673195 0.024121 0.301588 0.245854
17 S18 25290073 0.749358 0.007465 0.562382 0.048528 0.314776 0.215160
18 10k_rep1 28247826 0.688553 0.018993 0.547000 0.056113 0.484393 0.140333
19 10k_rep2 39303876 0.690313 0.017328 0.547621 0.055600 0.474634 0.142889
20 10k_rep3 29831281 0.710875 0.010610 0.518053 0.066053 0.488738 0.168180
21 MB_SC1 13848219 0.545000 0.007000 0.531495 0.127934 0.207841 0.728980
22 MB_SC2 13550218 0.458000 0.010800 0.569271 0.102581 0.179407 0.694747
23 MB_SC3 26765848 0.496000 0.007900 0.535192 0.141893 0.231068 0.722080
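Note that this sheet keeps a plain integer index, with sample names in the
"Sample" column, and the pooled samples are called "10k_rep1" through
"10k_rep3" rather than "P1" through "P3". If you wanted the mapping stats
indexed consistently with the other dataframes, something like the following
would work. This is purely a hypothetical clean-up sketch; the workflow below
passes ``mapping_stats`` through as-is:

.. code:: python

    # Hypothetical: index by sample name and harmonize the pooled names
    mapping_stats_clean = mapping_stats.set_index('Sample')
    mapping_stats_clean = mapping_stats_clean.rename(
        index={'10k_rep1': 'P1', '10k_rep2': 'P2', '10k_rep3': 'P3'})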
Create a ``flotilla`` Study!
----------------------------

.. code:: python

    study = flotilla.Study(
        # The metadata describing phenotype and pooled samples
        metadata,

        # A version for this data
        version='0.1.0',

        # Dataframe of the filtered expression data
        expression_data=expression_filtered,

        # Dataframe of the feature data of the genes
        expression_feature_data=expression_feature_data,

        # Dataframe of the splicing data
        splicing_data=splicing,

        # Dataframe of the feature data of the splicing events
        splicing_feature_data=splicing_feature_data,

        # Specify "gene_name" as the column we want to rename the splicing ids to
        splicing_feature_rename_col="gene_name",

        # Specify "gene_name" as the column that links splicing ids to expression ids
        splicing_feature_expression_id_col="gene_name",

        # Dataframe of the mapping stats data
        mapping_stats_data=mapping_stats,

        # Which column in "mapping_stats" has the number of reads
        mapping_stats_number_mapped_col='PF_READS')

.. parsed-literal::

    2014-12-10 15:36:38 Initializing Study
    2014-12-10 15:36:38 Initializing Predictor configuration manager for Study
    2014-12-10 15:36:38 Predictor ExtraTreesClassifier is of type
    2014-12-10 15:36:38 Added ExtraTreesClassifier to default predictors
    2014-12-10 15:36:38 Predictor ExtraTreesRegressor is of type
    2014-12-10 15:36:38 Added ExtraTreesRegressor to default predictors
    2014-12-10 15:36:38 Predictor GradientBoostingClassifier is of type
    2014-12-10 15:36:38 Added GradientBoostingClassifier to default predictors
    2014-12-10 15:36:38 Predictor GradientBoostingRegressor is of type
    2014-12-10 15:36:38 Added GradientBoostingRegressor to default predictors
    2014-12-10 15:36:38 Loading metadata
    2014-12-10 15:36:38 Loading expression data
    2014-12-10 15:36:38 Initializing expression
    2014-12-10 15:36:38 Done initializing expression
    2014-12-10 15:36:38 Loading splicing data
    2014-12-10 15:36:38 Initializing splicing
    2014-12-10 15:36:38 Done initializing splicing
    2014-12-10 15:36:38 Successfully initialized a Study object!

.. parsed-literal::

    No phenotype to color mapping was provided, so coming up with reasonable defaults
    No phenotype to marker (matplotlib plotting symbol) was provided, so each phenotype will be plotted as a circle in the PCA visualizations.

As a side note, you can save this study to disk now, so you can "``embark``"
later:

.. code:: python

    study.save('shalek2013')

.. parsed-literal::

    Wrote datapackage to /Users/olga/flotilla_projects/shalek2013/datapackage.json

Note that this is saved to my home directory, in
``~/flotilla_projects/shalek2013/`` (the "``~``" stands for my "home
directory", in this case ``/Users/olga``). This will be saved in your home
directory, too. The ``datapackage.json`` file is what holds all the
information relative to the study, and loosely follows the datapackage spec
created by the Open Knowledge Foundation.

.. code:: python

    cat /Users/olga/flotilla_projects/shalek2013/datapackage.json
.. parsed-literal::

    {
        "name": "shalek2013",
        "title": null,
        "datapackage_version": "0.1.1",
        "sources": null,
        "licenses": null,
        "resources": [
            {
                "path": "splicing.csv.gz",
                "format": "csv",
                "name": "splicing",
                "compression": "gzip"
            },
            {
                "number_mapped_col": "PF_READS",
                "path": "mapping_stats.csv.gz",
                "format": "csv",
                "name": "mapping_stats",
                "compression": "gzip"
            },
            {
                "name": "expression_feature",
                "format": "csv",
                "rename_col": null,
                "ignore_subset_cols": [],
                "path": "expression_feature.csv.gz",
                "compression": "gzip"
            },
            {
                "name": "expression",
                "log_base": null,
                "format": "csv",
                "thresh": -Infinity,
                "plus_one": false,
                "path": "expression.csv.gz",
                "compression": "gzip"
            },
            {
                "name": "splicing_feature",
                "format": "csv",
                "rename_col": "gene_name",
                "ignore_subset_cols": [],
                "path": "splicing_feature.csv.gz",
                "expression_id_col": "gene_name",
                "compression": "gzip"
            },
            {
                "pooled_col": "pooled",
                "name": "metadata",
                "phenotype_to_marker": {
                    "BDMC": "o"
                },
                "format": "csv",
                "minimum_samples": 0,
                "phenotype_to_color": {
                    "BDMC": "#1b9e77"
                },
                "path": "metadata.csv.gz",
                "phenotype_col": "phenotype",
                "phenotype_order": [
                    "BDMC"
                ],
                "compression": "gzip"
            }
        ]
    }

One thing to note is that when you save, the version number is bumped up.
``study.version`` (the one we just made) is ``0.1.0``, but the one we saved
is ``0.1.1``, since we could have made some changes to the data.

Let's look at what else is in this folder:

.. code:: python

    ls /Users/olga/flotilla_projects/shalek2013

.. parsed-literal::

    datapackage.json           expression_feature.csv     mapping_stats.csv.gz    splicing.csv            splicing_feature.csv.gz
    expression.csv             expression_feature.csv.gz  metadata.csv            splicing.csv.gz
    expression.csv.gz          mapping_stats.csv          metadata.csv.gz         splicing_feature.csv

So this is where all the other files are. Good to know! We can "embark" on
this newly-saved study now very painlessly, without having to open and
process all those files again:

.. code:: python

    study2 = flotilla.embark('shalek2013')

.. parsed-literal::

    2014-12-10 15:34:27 Reading datapackage from /Users/olga/flotilla_projects/shalek2013/datapackage.json
    2014-12-10 15:34:27 Parsing datapackage to create a Study object
    2014-12-10 15:34:27 Initializing Study
    2014-12-10 15:34:27 Initializing Predictor configuration manager for Study
    2014-12-10 15:34:27 Predictor ExtraTreesClassifier is of type
    2014-12-10 15:34:27 Added ExtraTreesClassifier to default predictors
    2014-12-10 15:34:27 Predictor ExtraTreesRegressor is of type
    2014-12-10 15:34:27 Added ExtraTreesRegressor to default predictors
    2014-12-10 15:34:27 Predictor GradientBoostingClassifier is of type
    2014-12-10 15:34:27 Added GradientBoostingClassifier to default predictors
    2014-12-10 15:34:27 Predictor GradientBoostingRegressor is of type
    2014-12-10 15:34:27 Added GradientBoostingRegressor to default predictors
    2014-12-10 15:34:27 Loading metadata
    2014-12-10 15:34:27 Loading expression data
    2014-12-10 15:34:27 Initializing expression
    2014-12-10 15:34:27 Done initializing expression
    2014-12-10 15:34:27 Loading splicing data
    2014-12-10 15:34:27 Initializing splicing
    2014-12-10 15:34:27 Done initializing splicing
    2014-12-10 15:34:27 Successfully initialized a Study object!

Now we can start creating figures!

Figure 1
--------

Here, we will attempt to re-create the sub-panels in Figure 1, where the
original is:

.. figure:: http://www.nature.com/nature/journal/v498/n7453/images/nature12172-f1.2.jpg
   :align: center
   :alt: Original Figure 1

   Original Figure 1

Figure 1a
~~~~~~~~~

.. code:: python

    study.plot_two_samples('P1', 'P2')
.. parsed-literal::

    /usr/local/lib/python2.7/site-packages/matplotlib/figure.py:1644: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect.
      warnings.warn("This figure includes Axes that are not "

.. image:: shalek2013_files/shalek2013_72_1.png

Without flotilla, you would do
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    import seaborn as sns
    sns.set_style('ticks')

    x = expression_filtered.ix['P1']
    y = expression_filtered.ix['P2']

    jointgrid = sns.jointplot(x, y, joint_kws=dict(alpha=0.5))
    xmin, xmax, ymin, ymax = jointgrid.ax_joint.axis()
    jointgrid.ax_joint.set_xlim(0, xmax)
    jointgrid.ax_joint.set_ylim(0, ymax);

.. image:: shalek2013_files/shalek2013_74_0.png

Figure 1b
~~~~~~~~~

The paper reports :math:`r = 0.54` for this panel, though it's not at all
clear to me how that number was computed.

.. code:: python

    study.plot_two_samples('S1', 'S2')

.. image:: shalek2013_files/shalek2013_77_0.png

Without ``flotilla``
^^^^^^^^^^^^^^^^^^^^

.. code:: python

    import seaborn as sns
    sns.set_style('ticks')

    x = expression_filtered.ix['S1']
    y = expression_filtered.ix['S2']

    jointgrid = sns.jointplot(x, y, joint_kws=dict(alpha=0.5))

    # Adjust xmin, ymin to 0
    xmin, xmax, ymin, ymax = jointgrid.ax_joint.axis()
    jointgrid.ax_joint.set_xlim(0, xmax)
    jointgrid.ax_joint.set_ylim(0, ymax);

.. image:: shalek2013_files/shalek2013_79_0.png

By the way, you can do other kinds of plots with ``flotilla``, like a kernel
density estimate ("``kde``") plot:

.. code:: python

    study.plot_two_samples('S1', 'S2', kind='kde')

.. image:: shalek2013_files/shalek2013_81_0.png

Or a binned hexagon plot ("``hexbin``"):

.. code:: python

    study.plot_two_samples('S1', 'S2', kind='hexbin')

.. image:: shalek2013_files/shalek2013_83_0.png

Any arguments accepted by ``seaborn``'s ``jointplot`` are valid here.

Figure 1c
~~~~~~~~~

.. code:: python

    x = study.expression.data.ix['P1']
    y = study.expression.singles.mean()
    y.name = "Average singles"

    jointgrid = sns.jointplot(x, y, joint_kws=dict(alpha=0.5))

    # Adjust xmin, ymin to 0
    xmin, xmax, ymin, ymax = jointgrid.ax_joint.axis()
    jointgrid.ax_joint.set_xlim(0, xmax)
    jointgrid.ax_joint.set_ylim(0, ymax);

.. image:: shalek2013_files/shalek2013_86_0.png

Figure 2
--------

Next, we will attempt to recreate the figures from Figure 2:

.. figure:: http://www.nature.com/nature/journal/v498/n7453/images/nature12172-f2.2.jpg
   :align: center
   :alt: Original figure 2

   Original figure 2

Figure 2a
~~~~~~~~~

For this figure, we will need the "LPS Response" and "Housekeeping" gene
annotations, from the ``expression_feature_data`` that we created.
.. code:: python

    # Get colors for plotting the gene categories
    dark2 = sns.color_palette('Dark2')

    singles = study.expression.singles

    # Get only gene categories for genes in the singles data
    singles, gene_categories = singles.align(study.expression.feature_data.gene_category,
                                             join='left', axis=1)

    mean = singles.mean()
    std = singles.std()

    jointgrid = sns.jointplot(mean, std, color='#262626',
                              joint_kws=dict(alpha=0.5))

    for i, (category, s) in enumerate(gene_categories.groupby(gene_categories)):
        jointgrid.ax_joint.plot(mean[s.index], std[s.index], 'o',
                                color=dark2[i], markersize=5)

    # The mean is on the x-axis and the standard deviation is on the y-axis,
    # so label the axes accordingly
    jointgrid.ax_joint.set_xlabel('Average expression in single cells $\mu$')
    jointgrid.ax_joint.set_ylabel('Standard deviation in single cells $\sigma$')

    xmin, xmax, ymin, ymax = jointgrid.ax_joint.axis()
    vmax = max(xmax, ymax)
    vmin = min(xmin, ymin)
    jointgrid.ax_joint.plot([0, vmax], [0, vmax], color='steelblue')
    jointgrid.ax_joint.plot([0, vmax], [0, .25*vmax], color='grey')
    jointgrid.ax_joint.set_xlim(0, xmax)
    jointgrid.ax_joint.set_ylim(0, ymax)
    jointgrid.ax_joint.fill_betweenx((ymin, ymax), 0, np.log(250), alpha=0.5, zorder=-1);

.. image:: shalek2013_files/shalek2013_91_0.png

I couldn't find the data for the ``hESC``\ s for the right-side panel of Fig.
2a, so I couldn't remake that figure.

Figure 2b
~~~~~~~~~

In the paper, they use the *"522 most highly expressed genes (single-cell
average TPM > 250)"*, but I wasn't able to replicate their numbers. If I use
the pre-filtered expression data that I fed into flotilla, then I get 297
genes:

.. code:: python

    mean = study.expression.singles.mean()
    highly_expressed_genes = mean.index[mean > np.log(250 + 1)]
    len(highly_expressed_genes)

.. parsed-literal::

    297

Which is far fewer. If I use the original, unfiltered data, then I get the
*"522"* number, but this seems strange, because they did the filtering step
of *"discarded genes not appreciably expressed (transcripts per million
(TPM) > 1) in at least three individual cells, retaining 6,313 genes for
further analysis"*, and yet they went back to the original data to get this
new subset.

.. code:: python

    expression.ix[:, expression.ix[singles_ids].mean() > 250].shape

.. parsed-literal::

    (21, 522)

.. code:: python

    expression_highly_expressed = np.log(expression.ix[singles_ids,
                                                       expression.ix[singles_ids].mean() > 250] + 1)

    mean = expression_highly_expressed.mean()
    std = expression_highly_expressed.std()

    mean_bins = pd.cut(mean, bins=np.arange(0, 11, 1))

    # Coefficient of variation
    cv = std/mean
    cv.sort()

    genes = mean.index

    # for name, df in shalek2013.expression.singles.groupby(dict(zip(genes, mean_bins)), axis=1):

    def calculate_cells_per_tpm_per_cv(df, cv):
        df = df[df > 1]
        df_aligned, cv_aligned = df.align(cv, join='inner', axis=1)
        cv_aligned.sort()
        n_cells = pd.Series(0, index=cv.index)
        n_cells[cv_aligned.index] = df_aligned.ix[:, cv_aligned.index].count()
        return n_cells

    grouped = expression_highly_expressed.groupby(dict(zip(genes, mean_bins)), axis=1)
    cells_per_tpm_per_cv = grouped.apply(calculate_cells_per_tpm_per_cv, cv=cv)

Here's how you would make the original figure from the paper:

.. code:: python

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 10))
    sns.heatmap(cells_per_tpm_per_cv, linewidth=0, ax=ax, yticklabels=False)
    ax.set_yticks([])
    ax.set_xlabel('ln(TPM, binned)');

.. image:: shalek2013_files/shalek2013_100_0.png

Doesn't quite look the same. Maybe the y-axis labels were opposite, and
higher up on the y-axis was less variant?
I ask because I see a blob of color in the (1, 2] TPM bin (by the way, the
figure in the paper is plotted in TPM, not TPM+1 as the previous figures
were).

This is how you would make a modified version of the figure, which also
plots the coefficient of variation on a side-plot, which I like because it
shows the CV changes directly alongside the heatmap. Also, technically this
is :math:`\ln`\ (TPM+1).

.. code:: python

    from matplotlib import gridspec

    fig = plt.figure(figsize=(12, 10))
    gs = gridspec.GridSpec(1, 2, wspace=0.01, hspace=0.01, width_ratios=[.2, 1])

    cv_ax = fig.add_subplot(gs[0, 0])
    heatmap_ax = fig.add_subplot(gs[0, 1])

    sns.heatmap(cells_per_tpm_per_cv, linewidth=0, ax=heatmap_ax)
    heatmap_ax.set_yticks([])
    heatmap_ax.set_xlabel('$\ln$(TPM+1), binned')

    y = np.arange(cv.shape[0])
    cv_ax.set_xscale('log')
    cv_ax.plot(cv, y, color='#262626')
    cv_ax.fill_betweenx(cv, np.zeros(cv.shape), y, color='#262626', alpha=0.5)
    cv_ax.set_ylim(0, y.max())

    # The coefficient of variation is sigma/mu (the code above computes
    # cv = std/mean), so label the axis accordingly
    cv_ax.set_xlabel('CV = $\sigma/\mu$')
    cv_ax.set_yticks([])
    sns.despine(ax=cv_ax, left=True, right=False)

.. image:: shalek2013_files/shalek2013_102_0.png

Figure 3
--------

We will attempt to re-create the sub-panel figures from Figure 3:

.. figure:: http://www.nature.com/nature/journal/v498/n7453/images/nature12172-f3.2.jpg
   :align: center
   :alt: Original Figure 3

   Original Figure 3

Since we can't re-do the microscopy (Figure 3a) or the RNA-FISH counts
(Figure 3c), we will make Figure 3b. These histograms are simple to make
outside of ``flotilla``, so we do not have them within flotilla.

Figure 3b, top panel
~~~~~~~~~~~~~~~~~~~~

.. code:: python

    fig, ax = plt.subplots()
    sns.distplot(study.splicing.singles.values.flat, bins=np.arange(0, 1.05, 0.05), ax=ax)
    ax.set_xlim(0, 1)
    sns.despine()

.. image:: shalek2013_files/shalek2013_106_0.png

Figure 3b, bottom panel
~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    fig, ax = plt.subplots()
    sns.distplot(study.splicing.pooled.values.flat, bins=np.arange(0, 1.05, 0.05), ax=ax,
                 color='grey')
    ax.set_xlim(0, 1)
    sns.despine()

.. image:: shalek2013_files/shalek2013_108_0.png

Figure 4
--------

We will attempt to re-create the sub-panel figures from Figure 4:

.. figure:: http://www.nature.com/nature/journal/v498/n7453/images/nature12172-f4.2.jpg
   :align: center
   :alt: Original Figure 4

   Original Figure 4

Figure 4a
~~~~~~~~~

Here, we can use the "``interactive_pca``" function to explore different
dimensionality reductions of the data.

.. code:: python

    study.interactive_pca()

.. parsed-literal::

    featurewise : False
    y_pc : 2
    data_type : expression
    show_point_labels : False
    sample_subset : all_samples
    feature_subset : variant
    plot_violins : False
    x_pc : 1
    list_link :

.. image:: shalek2013_files/shalek2013_113_2.png

A shortened version of this interaction is available as a gif:

.. figure:: http://i.imgur.com/fJKPQ7W.gif
   :align: center
   :alt: Imgur

   Imgur

Equivalently, I could have written out the plotting command by hand, instead
of using ``study.interactive_pca``:

.. code:: python

    study.plot_pca(feature_subset='gene_category: LPS Response',
                   sample_subset='not (pooled)',
                   plot_violins=False,
                   show_point_labels=True)

.. image:: shalek2013_files/shalek2013_116_1.png

Mark immature cells as a new subset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As in the paper, the cells S12, S13, and S16 appear in a cluster that is
separate from the remaining cells. From the paper, these were the "matured"
bone marrow-derived dendritic cells, after stimulation with
lipopolysaccharide. We can mark these as mature in our metadata:
.. code:: python

    mature = ['S12', 'S13', 'S16']
    study.metadata.data['maturity'] = metadata.index.map(
        lambda x: 'mature' if x in mature else 'immature')
    study.metadata.data.head()

.. raw:: html
phenotype pooled outlier maturity
S1 BDMC False False immature
S2 BDMC False False immature
S3 BDMC False False immature
S4 BDMC False False immature
S5 BDMC False False immature
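As a quick check (a sketch), we can count how many cells were marked with
each maturity label:

.. code:: python

    # Should show 3 mature cells (S12, S13, S16); everything else immature
    study.metadata.data['maturity'].value_counts()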
Then, we can set **maturity** as the column we use for coloring the PCA,
since before it was the "phenotype" column.

.. code:: python

    study.metadata.phenotype_col = 'maturity'
    study.save('shalek2013')
    study = flotilla.embark('shalek2013')

.. parsed-literal::

    Wrote datapackage to /Users/olga/flotilla_projects/shalek2013/datapackage.json
    2014-12-10 15:41:07 Reading datapackage from /Users/olga/flotilla_projects/shalek2013/datapackage.json
    2014-12-10 15:41:07 Parsing datapackage to create a Study object
    2014-12-10 15:41:07 Initializing Study
    2014-12-10 15:41:07 Initializing Predictor configuration manager for Study
    2014-12-10 15:41:07 Predictor ExtraTreesClassifier is of type
    2014-12-10 15:41:07 Added ExtraTreesClassifier to default predictors
    2014-12-10 15:41:07 Predictor ExtraTreesRegressor is of type
    2014-12-10 15:41:07 Added ExtraTreesRegressor to default predictors
    2014-12-10 15:41:07 Predictor GradientBoostingClassifier is of type
    2014-12-10 15:41:07 Added GradientBoostingClassifier to default predictors
    2014-12-10 15:41:07 Predictor GradientBoostingRegressor is of type
    2014-12-10 15:41:07 Added GradientBoostingRegressor to default predictors
    2014-12-10 15:41:07 Loading metadata
    2014-12-10 15:41:07 Loading expression data
    2014-12-10 15:41:07 Initializing expression
    2014-12-10 15:41:07 Done initializing expression
    2014-12-10 15:41:07 Loading splicing data
    2014-12-10 15:41:07 Initializing splicing
    2014-12-10 15:41:07 Done initializing splicing
    2014-12-10 15:41:07 Successfully initialized a Study object!

.. parsed-literal::

    No color was assigned to the phenotype immature, assigning a random color
    No color was assigned to the phenotype mature, assigning a random color
    immature does not have marker style, falling back on "o" (circle)
    mature does not have marker style, falling back on "o" (circle)

.. code:: python

    study.plot_pca(feature_subset='gene_category: LPS Response',
                   sample_subset='not (pooled)',
                   plot_violins=False,
                   show_point_labels=True)

.. image:: shalek2013_files/shalek2013_122_1.png

.. code:: python

    study.save('shalek2013')

.. parsed-literal::

    Wrote datapackage to /Users/olga/flotilla_projects/shalek2013/datapackage.json

Without ``flotilla``, ``plot_pca`` is quite a bit of code:

.. code:: python

    import sys
    from collections import defaultdict
    from itertools import cycle
    import math

    from sklearn import decomposition
    from sklearn.preprocessing import StandardScaler
    from matplotlib.gridspec import GridSpec, GridSpecFromSubplotSpec
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns

    from flotilla.visualize.color import dark2
    from flotilla.visualize.generic import violinplot


    class DataFrameReducerBase(object):
        """
        Just like scikit-learn's reducers, but with prettied up DataFrames.
""" def __init__(self, df, n_components=None, **decomposer_kwargs): # This magically initializes the reducer like DataFramePCA or DataFrameNMF if df.shape[1] <= 3: raise ValueError( "Too few features (n={}) to reduce".format(df.shape[1])) super(DataFrameReducerBase, self).__init__(n_components=n_components, **decomposer_kwargs) self.reduced_space = self.fit_transform(df) def relabel_pcs(self, x): return "pc_" + str(int(x) + 1) def fit(self, X): try: assert type(X) == pd.DataFrame except AssertionError: sys.stdout.write("Try again as a pandas DataFrame") raise ValueError('Input X was not a pandas DataFrame, ' 'was of type {} instead'.format(str(type(X)))) self.X = X super(DataFrameReducerBase, self).fit(X) self.components_ = pd.DataFrame(self.components_, columns=self.X.columns).rename_axis( self.relabel_pcs, 0) try: self.explained_variance_ = pd.Series( self.explained_variance_).rename_axis(self.relabel_pcs, 0) self.explained_variance_ratio_ = pd.Series( self.explained_variance_ratio_).rename_axis(self.relabel_pcs, 0) except AttributeError: pass return self def transform(self, X): component_space = super(DataFrameReducerBase, self).transform(X) if type(self.X) == pd.DataFrame: component_space = pd.DataFrame(component_space, index=X.index).rename_axis( self.relabel_pcs, 1) return component_space def fit_transform(self, X): try: assert type(X) == pd.DataFrame except: sys.stdout.write("Try again as a pandas DataFrame") raise ValueError('Input X was not a pandas DataFrame, ' 'was of type {} instead'.format(str(type(X)))) self.fit(X) return self.transform(X) class DataFramePCA(DataFrameReducerBase, decomposition.PCA): pass class DataFrameNMF(DataFrameReducerBase, decomposition.NMF): def fit(self, X): """ duplicated fit code for DataFrameNMF because sklearn's NMF cheats for efficiency and calls fit_transform. MRO resolves the closest (in this package) _single_fit_transform first and so there's a recursion error: def fit(self, X, y=None, **params): self._single_fit_transform(X, **params) return self """ try: assert type(X) == pd.DataFrame except: sys.stdout.write("Try again as a pandas DataFrame") raise ValueError('Input X was not a pandas DataFrame, ' 'was of type {} instead'.format(str(type(X)))) self.X = X # notice this is fit_transform, not fit super(decomposition.NMF, self).fit_transform(X) self.components_ = pd.DataFrame(self.components_, columns=self.X.columns).rename_axis( self.relabel_pcs, 0) return self class DataFrameICA(DataFrameReducerBase, decomposition.FastICA): pass class DecompositionViz(object): """ Plots the reduced space from a decomposed dataset. Does not perform any reductions of its own """ def __init__(self, reduced_space, components_, explained_variance_ratio_, feature_renamer=None, groupby=None, singles=None, pooled=None, outliers=None, featurewise=False, order=None, violinplot_kws=None, data_type='expression', label_to_color=None, label_to_marker=None, scale_by_variance=True, x_pc='pc_1', y_pc='pc_2', n_vectors=20, distance='L1', n_top_pc_features=50, max_char_width=30): """Plot the results of a decomposition visualization Parameters ---------- reduced_space : pandas.DataFrame A (n_samples, n_dimensions) DataFrame of the post-dimensionality reduction data components_ : pandas.DataFrame A (n_features, n_dimensions) DataFrame of how much each feature contributes to the components (trailing underscore to be consistent with scikit-learn) explained_variance_ratio_ : pandas.Series A (n_dimensions,) Series of how much variance each component explains. 
(trailing underscore to be consistent with scikit-learn) feature_renamer : function, optional A function which takes the name of the feature and renames it, e.g. from an ENSEMBL ID to a HUGO known gene symbol. If not provided, the original name is used. groupby : mapping function | dict, optional A mapping of the samples to a label, e.g. sample IDs to phenotype, for the violinplots. If None, all samples are treated the same and are colored the same. singles : pandas.DataFrame, optional For violinplots only. If provided and 'plot_violins' is True, will plot the raw (not reduced) measurement values as violin plots. pooled : pandas.DataFrame, optional For violinplots only. If provided, pooled samples are plotted as black dots within their label. outliers : pandas.DataFrame, optional For violinplots only. If provided, outlier samples are plotted as a grey shadow within their label. featurewise : bool, optional If True, then the "samples" are features, e.g. genes instead of samples, and the "features" are the samples, e.g. the cells instead of the gene ids. Essentially, the transpose of the original matrix. If True, then violins aren't plotted. (default False) order : list-like The order of the labels for the violinplots, e.g. if the data is from a differentiation timecourse, then this would be the labels of the phenotypes, in the differentiation order. violinplot_kws : dict Any additional parameters to violinplot data_type : 'expression' | 'splicing', optional For violinplots only. The kind of data that was originally used for the reduction. (default 'expression') label_to_color : dict, optional A mapping of the label, e.g. the phenotype, to the desired plotting color (default None, auto-assigned with the groupby) label_to_marker : dict, optional A mapping of the label, e.g. the phenotype, to the desired plotting symbol (default None, auto-assigned with the groupby) scale_by_variance : bool, optional If True, scale the x- and y-axes by their explained_variance_ratio_ (default True) {x,y}_pc : str, optional Principal component to plot on the x- and y-axis. (default "pc_1" and "pc_2") n_vectors : int, optional Number of vectors to plot of the principal components. (default 20) distance : 'L1' | 'L2', optional The distance metric to use to plot the vector lengths. L1 is "Cityblock", i.e. the sum of the x and y coordinates, and L2 is the traditional Euclidean distance. (default "L1") n_top_pc_features : int, optional THe number of top features from the principal components to plot. (default 50) max_char_width : int, optional Maximum character width of a feature name. 
Useful for crazy long feature IDs like MISO IDs """ self.reduced_space = reduced_space self.components_ = components_ self.explained_variance_ratio_ = explained_variance_ratio_ self.singles = singles self.pooled = pooled self.outliers = outliers self.groupby = groupby self.order = order self.violinplot_kws = violinplot_kws if violinplot_kws is not None \ else {} self.data_type = data_type self.label_to_color = label_to_color self.label_to_marker = label_to_marker self.n_vectors = n_vectors self.x_pc = x_pc self.y_pc = y_pc self.pcs = (self.x_pc, self.y_pc) self.distance = distance self.n_top_pc_features = n_top_pc_features self.featurewise = featurewise self.feature_renamer = feature_renamer self.max_char_width = max_char_width if self.label_to_color is None: colors = cycle(dark2) def color_factory(): return colors.next() self.label_to_color = defaultdict(color_factory) if self.label_to_marker is None: markers = cycle(['o', '^', 's', 'v', '*', 'D', 'h']) def marker_factory(): return markers.next() self.label_to_marker = defaultdict(marker_factory) if self.groupby is None: self.groupby = dict.fromkeys(self.reduced_space.index, 'all') self.grouped = self.reduced_space.groupby(self.groupby, axis=0) if order is not None: self.color_ordered = [self.label_to_color[x] for x in self.order] else: self.color_ordered = [self.label_to_color[x] for x in self.grouped.groups] self.loadings = self.components_.ix[[self.x_pc, self.y_pc]] # Get the explained variance if explained_variance_ratio_ is not None: self.vars = explained_variance_ratio_[[self.x_pc, self.y_pc]] else: self.vars = pd.Series([1., 1.], index=[self.x_pc, self.y_pc]) if scale_by_variance: self.loadings = self.loadings.multiply(self.vars, axis=0) # sort features by magnitude/contribution to transformation reduced_space = self.reduced_space[[self.x_pc, self.y_pc]] farthest_sample = reduced_space.apply(np.linalg.norm, axis=0).max() whole_space = self.loadings.apply(np.linalg.norm).max() scale = .25 * farthest_sample / whole_space self.loadings *= scale ord = 2 if self.distance == 'L2' else 1 self.magnitudes = self.loadings.apply(np.linalg.norm, ord=ord) self.magnitudes.sort(ascending=False) self.top_features = set([]) self.pc_loadings_labels = {} self.pc_loadings = {} for pc in self.pcs: x = self.components_.ix[pc].copy() x.sort(ascending=True) half_features = int(self.n_top_pc_features / 2) if len(x) > self.n_top_pc_features: a = x[:half_features] b = x[-half_features:] labels = np.r_[a.index, b.index] self.pc_loadings[pc] = np.r_[a, b] else: labels = x.index self.pc_loadings[pc] = x self.pc_loadings_labels[pc] = labels self.top_features.update(labels) def __call__(self, ax=None, title='', plot_violins=True, show_point_labels=False, show_vectors=True, show_vector_labels=True, markersize=10, legend=True): gs_x = 14 gs_y = 12 if ax is None: self.reduced_fig, ax = plt.subplots(1, 1, figsize=(20, 10)) gs = GridSpec(gs_x, gs_y) else: gs = GridSpecFromSubplotSpec(gs_x, gs_y, ax.get_subplotspec()) self.reduced_fig = plt.gcf() ax_components = plt.subplot(gs[:, :5]) ax_loading1 = plt.subplot(gs[:, 6:8]) ax_loading2 = plt.subplot(gs[:, 10:14]) self.plot_samples(show_point_labels=show_point_labels, title=title, show_vectors=show_vectors, show_vector_labels=show_vector_labels, markersize=markersize, legend=legend, ax=ax_components) self.plot_loadings(pc=self.x_pc, ax=ax_loading1) self.plot_loadings(pc=self.y_pc, ax=ax_loading2) sns.despine() self.reduced_fig.tight_layout() if plot_violins and not self.featurewise and self.singles is not None: 
self.plot_violins() return self def shorten(self, x): if len(x) > self.max_char_width: return '{}...'.format(x[:self.max_char_width]) else: return x def plot_samples(self, show_point_labels=True, title='DataFramePCA', show_vectors=True, show_vector_labels=True, markersize=10, three_d=False, legend=True, ax=None): """ Given a pandas dataframe, performs DataFramePCA and plots the results in a convenient single function. Parameters ---------- groupby : groupby How to group the samples by color/label label_to_color : dict Group labels to a matplotlib color E.g. if you've already chosen specific colors to indicate a particular group. Otherwise will auto-assign colors label_to_marker : dict Group labels to matplotlib marker title : str title of the plot show_vectors : bool Whether or not to draw the vectors indicating the supporting principal components show_vector_labels : bool whether or not to draw the names of the vectors show_point_labels : bool Whether or not to label the scatter points markersize : int size of the scatter markers on the plot text_group : list of str Group names that you want labeled with text three_d : bool if you want hte plot in 3d (need to set up the axes beforehand) Returns ------- For each vector in data: x, y, marker, distance """ if ax is None: ax = plt.gca() # Plot the samples for name, df in self.grouped: color = self.label_to_color[name] marker = self.label_to_marker[name] x = df[self.x_pc] y = df[self.y_pc] ax.plot(x, y, color=color, marker=marker, linestyle='None', label=name, markersize=markersize, alpha=0.75, markeredgewidth=.1) try: if not self.pooled.empty: pooled_ids = x.index.intersection(self.pooled.index) pooled_x, pooled_y = x[pooled_ids], y[pooled_ids] ax.plot(pooled_x, pooled_y, 'o', color=color, marker=marker, markeredgecolor='k', markeredgewidth=2, label='{} pooled'.format(name), markersize=markersize, alpha=0.75) except AttributeError: pass try: if not self.outliers.empty: outlier_ids = x.index.intersection(self.outliers.index) outlier_x, outlier_y = x[outlier_ids], y[outlier_ids] ax.plot(outlier_x, outlier_y, 'o', color=color, marker=marker, markeredgecolor='lightgrey', markeredgewidth=5, label='{} outlier'.format(name), markersize=markersize, alpha=0.75) except AttributeError: pass if show_point_labels: for args in zip(x, y, df.index): ax.text(*args) # Plot vectors, if asked if show_vectors: for vector_label in self.magnitudes[:self.n_vectors].index: x, y = self.loadings[vector_label] ax.plot([0, x], [0, y], color='k', linewidth=1) if show_vector_labels: x_offset = math.copysign(5, x) y_offset = math.copysign(5, y) horizontalalignment = 'left' if x > 0 else 'right' if self.feature_renamer is not None: renamed = self.feature_renamer(vector_label) else: renamed = vector_label ax.annotate(renamed, (x, y), textcoords='offset points', xytext=(x_offset, y_offset), horizontalalignment=horizontalalignment) # Label x and y axes ax.set_xlabel( 'Principal Component {} (Explains {:.2f}% Of Variance)'.format( str(self.x_pc), 100 * self.vars[self.x_pc])) ax.set_ylabel( 'Principal Component {} (Explains {:.2f}% Of Variance)'.format( str(self.y_pc), 100 * self.vars[self.y_pc])) ax.set_title(title) if legend: ax.legend() sns.despine() def plot_loadings(self, pc='pc_1', n_features=50, ax=None): loadings = self.pc_loadings[pc] labels = self.pc_loadings_labels[pc] if ax is None: ax = plt.gca() ax.plot(loadings, np.arange(loadings.shape[0]), 'o') ax.set_yticks(np.arange(max(loadings.shape[0], n_features))) ax.set_title("Component " + pc) x_offset = max(loadings) * 
.05 ax.set_xlim(left=loadings.min() - x_offset, right=loadings.max() + x_offset) if self.feature_renamer is not None: labels = map(self.feature_renamer, labels) else: labels = labels labels = map(self.shorten, labels) # ax.set_yticklabels(map(shorten, labels)) ax.set_yticklabels(labels) for lab in ax.get_xticklabels(): lab.set_rotation(90) sns.despine(ax=ax) def plot_explained_variance(self, title="PCA explained variance"): """If the reducer is a form of PCA, then plot the explained variance ratio by the components. """ # Plot the explained variance ratio assert self.explained_variance_ratio_ is not None import matplotlib.pyplot as plt import seaborn as sns fig, ax = plt.subplots() ax.plot(self.explained_variance_ratio_, 'o-') xticks = np.arange(len(self.explained_variance_ratio_)) ax.set_xticks(xticks) ax.set_xticklabels(xticks + 1) ax.set_xlabel('Principal component') ax.set_ylabel('Fraction explained variance') ax.set_title(title) sns.despine() def plot_violins(self): """Make violinplots of each feature Must be called after plot_samples because it depends on the existence of the "self.magnitudes" attribute. """ ncols = 4 nrows = 1 vector_labels = list(set(self.magnitudes[:self.n_vectors].index.union( pd.Index(self.top_features)))) while ncols * nrows < len(vector_labels): nrows += 1 self.violins_fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(4 * ncols, 4 * nrows)) if self.feature_renamer is not None: renamed_vectors = map(self.feature_renamer, vector_labels) else: renamed_vectors = vector_labels labels = [(y, x) for (y, x) in sorted(zip(renamed_vectors, vector_labels))] for (renamed, feature_id), ax in zip(labels, axes.flat): singles = self.singles[feature_id] if self.singles is not None \ else None pooled = self.pooled[feature_id] if self.pooled is not None else \ None outliers = self.outliers[feature_id] if self.outliers is not None \ else None title = '{}\n{}'.format(feature_id, renamed) violinplot(singles, pooled_data=pooled, outliers=outliers, groupby=self.groupby, color_ordered=self.color_ordered, order=self.order, title=title, ax=ax, data_type=self.data_type, **self.violinplot_kws) # Clear any unused axes for ax in axes.flat: # Check if the plotting space is empty if len(ax.collections) == 0 or len(ax.lines) == 0: ax.axis('off') self.violins_fig.tight_layout() # Notice we're using the original data, nothing from "study" lps_response_genes = expression_feature_data.index[expression_feature_data.gene_category == 'LPS Response'] subset = expression_filtered.ix[singles_ids, lps_response_genes].dropna(how='all', axis=1) subset_standardized = pd.DataFrame(StandardScaler().fit_transform(subset), index=subset.index, columns=subset.columns) pca = DataFramePCA(subset_standardized) visualizer = DecompositionViz(pca.reduced_space, pca.components_, pca.explained_variance_ratio_) visualizer(); .. image:: shalek2013_files/shalek2013_125_0.png Figure 4b ~~~~~~~~~ .. code:: python lps_response_genes = study.expression.feature_subsets['gene_category: LPS Response'] lps_response = study.expression.singles.ix[:, lps_response_genes].dropna(how='all', axis=1) lps_response.head() .. raw:: html
GENE 1110018G07RIK 1110038F14RIK 1200009I06RIK 1600014C10RIK 1810029B16RIK 2210009G21RIK 2810474O19RIK 3110001I22RIK 4921513D23RIK 4930523C07RIK ... ZC3H12C ZC3HAV1 ZCCHC2 ZCCHC6 ZDHHC21 ZFP36 ZFP800 ZHX2 ZNFX1 ZUFSP
S1 3.711442 0.000000 3.275468 0.000000 5.609305 0 0.000000 3.828860 1.314573 3.778275 ... 3.972904 3.509979 0.035344 3.042277 4.425735 4.092559 4.025124 0.779382 2.998800 0.000000
S2 4.361671 0.147643 0.000000 0.000000 5.478071 0 3.407342 0.000000 1.531443 0.000000 ... 4.794306 4.984262 2.251330 1.018315 4.955713 0.356008 4.297776 0.032569 3.091207 5.000843
S3 0.000000 3.737014 2.987093 0.063526 5.320993 0 3.372359 0.058163 1.105115 0.025043 ... 4.882749 0.807258 0.094925 0.126673 3.952273 1.956983 0.000000 0.000000 3.794063 2.928699
S4 2.719587 0.000000 0.045823 0.000000 0.488049 0 5.127847 0.000000 2.303969 0.000000 ... 4.833354 4.538699 0.137427 2.025546 4.193989 2.372572 0.121924 0.000000 0.230278 0.430168
S5 2.982073 0.000000 2.829152 0.000000 5.093188 0 0.065122 4.635671 1.015640 0.461296 ... 4.446634 0.157178 0.616401 0.000000 4.039816 0.000000 4.714087 1.565475 0.860254 4.866979

5 rows × 630 columns

.. code:: python

    lps_response_corr = lps_response.corr()

"Elbow method" for determining number of clusters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The authors state that they used the "Elbow method" to determine the number
of cluster centers. Essentially, you try a bunch of different :math:`k`\ ,
and see where the metric flattens out, like an elbow. There are a few
different variations on which metric to use, such as the average distance to
the cluster center, or the explained variance (a sketch of the
explained-variance variant appears below, after the clustered correlation
table). Let's try the distance to the cluster center first, because
``scikit-learn`` makes it easy.

.. code:: python

    from sklearn.cluster import KMeans

    ##### cluster data into K=1..10 clusters #####
    ks = np.arange(1, 11).astype(int)

    X = lps_response_corr.values
    kmeans = [KMeans(n_clusters=k).fit(X) for k in ks]

    # scikit-learn exposes this as "inertia_": the sum of squared distances
    # of the samples to their nearest cluster center
    dist_to_center = [km.inertia_ for km in kmeans]

    fig, ax = plt.subplots()
    ax.plot(ks, dist_to_center, 'o-')
    ax.set_ylabel('Sum of distance to nearest cluster center')
    sns.despine()

.. image:: shalek2013_files/shalek2013_131_0.png

I'm not quite sure where the elbow is here. It looks like there's a big
drop-off after :math:`k=1`\ , but that could just be an illusion. Since they
didn't specify which version of the elbow method they used, I'm not going to
investigate this further, and will just see if we can see what they saw with
the :math:`k=5` clusters that they found was optimal.

.. code:: python

    kmeans = KMeans(n_clusters=5)
    lps_response_corr_clusters = kmeans.fit_predict(lps_response_corr.values)
    lps_response_corr_clusters

.. parsed-literal::

    array([3, 0, 4, 4, 1, 0, 3, 4, 2, 4, 1, 3, 2, 4, 3, 3, 1, 0, 1, 3, 1, 0,
           2, 1, 1, 3, 3, 2, 4, 4, 1, 4, 4, 1, 4, 1, 3, 4, 2, 0, 2, 4, 2, 3,
           0, 4, 1, 1, 4, 0, 0, 3, 4, 1, 1, 2, 1, 1, 1, 2, 0, 3, 4, 3, 3, 4,
           2, 2, 4, 3, 1, 4, 1, 3, 4, 2, 2, 4, 2, 3, 3, 3, 0, 0, 4, 1, 2, 2,
           2, 0, 0, 3, 0, 0, 4, 3, 3, 3, 3, 0, 0, 2, 1, 2, 1, 1, 2, 1, 2, 4,
           2, 1, 1, 3, 4, 4, 1, 2, 4, 3, 4, 2, 2, 2, 0, 4, 4, 1, 0, 2, 3, 3,
           4, 4, 1, 1, 4, 3, 2, 0, 1, 4, 2, 1, 4, 2, 4, 1, 0, 1, 1, 3, 3, 3,
           3, 0, 0, 3, 1, 2, 2, 3, 4, 0, 0, 4, 2, 2, 2, 3, 3, 3, 3, 1, 3, 3,
           0, 1, 2, 0, 0, 1, 2, 4, 1, 0, 3, 2, 0, 3, 1, 0, 0, 2, 4, 3, 0, 1,
           1, 1, 3, 3, 2, 0, 3, 0, 4, 4, 4, 3, 2, 3, 3, 0, 4, 3, 4, 3, 1, 0,
           3, 3, 3, 3, 3, 0, 4, 0, 1, 3, 3, 2, 4, 3, 4, 1, 1, 3, 0, 0, 2, 4,
           2, 4, 4, 3, 0, 3, 0, 1, 4, 0, 0, 1, 1, 4, 1, 1, 1, 0, 4, 3, 4, 3,
           3, 3, 3, 1, 3, 4, 4, 2, 2, 0, 2, 2, 1, 1, 1, 4, 1, 2, 4, 1, 2, 2,
           1, 4, 1, 3, 0, 3, 2, 3, 1, 3, 3, 3, 2, 0, 2, 2, 2, 2, 4, 2, 3, 2,
           4, 3, 2, 2, 3, 0, 4, 1, 1, 1, 1, 1, 0, 4, 0, 4, 4, 3, 0, 1, 1, 0,
           0, 2, 0, 2, 1, 4, 3, 4, 1, 0, 3, 3, 1, 3, 2, 2, 3, 1, 1, 2, 4, 4,
           1, 0, 0, 3, 4, 2, 1, 3, 3, 1, 0, 1, 1, 3, 3, 2, 3, 0, 1, 2, 3, 3,
           0, 0, 0, 3, 4, 2, 2, 2, 2, 3, 3, 2, 1, 0, 0, 0, 0, 1, 3, 4, 4, 1,
           4, 3, 3, 0, 1, 1, 1, 3, 1, 3, 3, 1, 0, 4, 4, 4, 3, 3, 3, 0, 3, 0,
           2, 4, 0, 4, 1, 0, 1, 0, 0, 1, 0, 0, 2, 4, 0, 1, 3, 1, 3, 3, 0, 0,
           0, 4, 3, 0, 0, 2, 3, 4, 4, 2, 3, 1, 0, 4, 3, 2, 3, 3, 0, 0, 2, 3,
           0, 2, 0, 1, 1, 4, 3, 3, 0, 3, 4, 1, 0, 1, 4, 1, 4, 0, 4, 1, 0, 3,
           1, 3, 1, 4, 3, 2, 2, 3, 3, 0, 1, 4, 4, 0, 0, 4, 1, 2, 2, 3, 2, 4,
           0, 1, 3, 4, 2, 0, 0, 3, 3, 1, 1, 1, 3, 3, 0, 1, 3, 2, 3, 3, 1, 2,
           1, 0, 3, 1, 3, 4, 4, 0, 2, 4, 2, 3, 4, 3, 4, 3, 4, 2, 4, 0, 0, 4,
           3, 2, 2, 4, 1, 0, 2, 1, 3, 1, 3, 1, 2, 0, 0, 3, 1, 2, 0, 3, 3, 1,
           2, 1, 1, 4, 2, 1, 3, 4, 3, 2, 1, 0, 4, 0, 3, 1, 4, 2, 2, 1, 3, 4,
           3, 0, 4, 3, 4, 2, 2, 3, 2, 1, 1, 4, 2, 0, 0, 0, 3, 3, 3, 2, 2, 1,
           0, 2, 3, 1, 4, 4, 3, 2, 2, 2, 0, 2, 0, 2], dtype=int32)

Now let's create a dataframe with these genes in their cluster order.

.. code:: python

    gene_to_cluster = dict(zip(lps_response_corr.columns,
                               lps_response_corr_clusters))

    dfs = []
    for name, df in lps_response_corr.groupby(gene_to_cluster):
        dfs.append(df)
    lps_response_corr_ordered_by_clusters = pd.concat(dfs)

    # Make the matrix symmetric again: we built it by stacking rows on top of
    # each other, so the columns need to be reordered to match the rows
    lps_response_corr_ordered_by_clusters = lps_response_corr_ordered_by_clusters.ix[
        :, lps_response_corr_ordered_by_clusters.index]
    lps_response_corr_ordered_by_clusters.head()
GENE 1110038F14RIK 2210009G21RIK A430084P05RIK AA960436 AK141659 AK163103 ALCAM ALPK2 ARMC8 BC147527 ... TNFAIP2 TNFSF4 TOR1AIP1 TRA2A TRIM26 TRIM34 TTC39C USP12 ZC3H12C ZC3HAV1
GENE
1110038F14RIK 1.000000 0.175230 0.043846 0.240304 0.150073 -0.007459 -0.075510 0.001210 0.078638 -0.073983 ... -0.053588 -0.085191 0.322774 0.096905 -0.370932 0.451829 0.387727 -0.257008 0.274102 -0.163423
2210009G21RIK 0.175230 1.000000 0.301786 0.454579 -0.106546 -0.122179 0.177472 0.215454 0.540303 0.078574 ... 0.230309 -0.158622 0.019694 0.142045 0.053967 0.483106 -0.085604 0.279262 0.153934 0.160710
A430084P05RIK 0.043846 0.301786 1.000000 0.001150 0.060210 -0.173020 0.150884 0.429134 0.131837 -0.069652 ... -0.341757 -0.296639 -0.192074 -0.360383 0.025340 0.033636 -0.227960 -0.166541 0.200579 0.064736
AA960436 0.240304 0.454579 0.001150 1.000000 -0.361780 0.206889 0.174208 0.075687 0.394432 0.163830 ... 0.175022 -0.271395 0.272221 -0.222182 0.181522 -0.094028 0.218182 0.396040 -0.159072 0.048122
AK141659 0.150073 -0.106546 0.060210 -0.361780 1.000000 -0.287830 -0.370827 0.143026 -0.019682 -0.157671 ... -0.295535 0.194073 -0.232992 0.061276 -0.032583 0.411637 -0.182131 0.018036 0.175434 -0.288042

5 rows × 630 columns
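As promised above, here is a sketch of the explained-variance flavor of the
elbow method. This is my own illustration (not from the paper or
``flotilla``), reusing the ``X`` and ``ks`` defined in the elbow-method cell.
The fraction of variance explained is :math:`1 - W/T`, where :math:`W` is the
within-cluster sum of squares (``inertia_``) and :math:`T` is the total sum
of squares.

.. code:: python

    # Refit k-means for each candidate k (the "kmeans" name was reused above
    # for the k=5 fit, so keep a separate list here)
    fits = [KMeans(n_clusters=k).fit(X) for k in ks]

    # Total sum of squares around the global mean
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()

    # Fraction of variance explained: 1 - (within-cluster SS / total SS)
    explained = [1 - fit.inertia_ / total_ss for fit in fits]

    fig, ax = plt.subplots()
    ax.plot(ks, explained, 'o-')
    ax.set_xlabel('Number of clusters $k$')
    ax.set_ylabel('Fraction of variance explained')
    sns.despine()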

The next step is to get the principal-component-reduced data, using only the
LPS response genes. We can do this in ``flotilla`` with
``study.expression.reduce``.

.. code:: python

    reduced = study.expression.reduce(singles_ids,
                                      feature_ids=lps_response_genes)

We can get the principal components using ``reduced.components_`` (a similar
interface to ``scikit-learn``).

.. code:: python

    reduced.components_.head()
MOV10 PPAP2B LASS6 TMCO3 CPD AK138792 TARM1 P4HA1 CD180 SMG7 ... OAS1B OAS1G AK151815 GTPBP2 PRPF38A SLC7A11 PCDH7 GNA13 PTPRJ ATF3
pc_1 0.035299 0.038725 -0.006343 0.014219 0.033734 -0.079831 0.032886 0.034783 0.033719 -0.048453 ... -0.022490 0.031091 -0.021397 0.034917 0.001745 0.058000 0.007748 0.000767 0.016012 0.018020
pc_2 0.055310 0.002925 -0.043986 -0.024020 -0.061957 -0.016327 0.002882 -0.003178 0.050055 0.038601 ... 0.012240 0.052127 0.009120 0.077015 0.072064 -0.080902 -0.056607 0.068444 -0.072533 0.068088
pc_3 0.000374 0.099514 -0.039636 0.003997 -0.000575 -0.042212 -0.056827 0.015571 -0.039811 0.005398 ... -0.010524 -0.009277 -0.102462 -0.043913 -0.052513 -0.030622 0.022607 -0.002503 0.023997 -0.054205
pc_4 0.022491 0.002342 0.009422 -0.034725 0.025866 -0.009656 -0.027689 -0.089803 -0.046888 0.002274 ... -0.003404 -0.070307 -0.007025 0.003407 -0.048078 0.028099 0.032970 -0.066284 0.010371 -0.006108
pc_5 -0.025743 -0.009200 -0.030187 -0.061283 0.010464 0.032668 0.012223 -0.047623 -0.047351 0.045909 ... -0.074817 0.044218 -0.000884 -0.000597 -0.033893 -0.018108 -0.012669 -0.025833 -0.044248 -0.001995

5 rows × 630 columns
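For reference, here is roughly what happens under the hood: standardize each
gene, then fit a PCA with plain ``scikit-learn``. This is a sketch of the
general recipe (the same one used in the long-form ``plot_pca`` code above),
not ``flotilla``'s exact internals; the ``fillna(0)`` is my assumption for
handling missing values.

.. code:: python

    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Standardize each gene to zero mean and unit variance, then fit PCA
    standardized = StandardScaler().fit_transform(lps_response.fillna(0))
    pca = PCA()
    pca.fit(standardized)

    # components_ is (n_components, n_genes), like reduced.components_ above
    components = pd.DataFrame(pca.components_, columns=lps_response.columns)
    components.head()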

.. code:: python

    pc_components = reduced.components_.ix[
        :2, lps_response_corr_ordered_by_clusters.index].T
    pc_components.head()
pc_1 pc_2
GENE
1110038F14RIK -0.007729 -0.005858
2210009G21RIK -0.006981 0.002916
A430084P05RIK 0.014550 0.022191
AA960436 0.002159 0.014470
AK141659 0.016902 -0.009151
.. code:: python

    import matplotlib as mpl

    fig = plt.figure(figsize=(12, 10))
    gs = gridspec.GridSpec(2, 2, wspace=0.1, hspace=0.1,
                           width_ratios=[1, .2], height_ratios=[1, .1])

    corr_ax = fig.add_subplot(gs[0, 0])
    corr_cbar_ax = fig.add_subplot(gs[1, 0])
    pc_ax = fig.add_subplot(gs[0, 1:])
    pc_cbar_ax = fig.add_subplot(gs[1:, 1:])

    sns.heatmap(lps_response_corr_ordered_by_clusters, linewidth=0, ax=corr_ax,
                cbar_ax=corr_cbar_ax, cbar_kws=dict(orientation='horizontal'))
    sns.heatmap(pc_components, cmap=mpl.cm.PRGn, linewidth=0, ax=pc_ax,
                cbar_ax=pc_cbar_ax, cbar_kws=dict(orientation='horizontal'))

    corr_ax.set_xlabel('')
    corr_ax.set_ylabel('')
    corr_ax.set_xticks([])
    corr_ax.set_yticks([])
    pc_ax.set_yticks([])
    pc_ax.set_ylabel('')

.. image:: shalek2013_files/shalek2013_141_1.png

This looks pretty similar, maybe with the clusters in a different order.
Let's check what their data looks like when we plot it.

Their PC scores and clusters for the genes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    gene_pc_clusters = pd.read_excel('nature12172-s1/Supplementary_Table5.xls',
                                     index_col=0)
    gene_pc_clusters.head()
Annotation Cluster PC1 Score PC2 Score
Gene
LNPEP NaN 1 0.232368 0.677266
TOR1AIP2 Antiv 1 -0.075934 1.485877
TNFSF4 NaN 1 0.497893 -0.562412
CFB Inflam 1 -0.394318 1.277749
H2-T10 NaN 1 0.514947 -0.698538
.. code:: python

    data = lps_response_corr.ix[gene_pc_clusters.index,
                                gene_pc_clusters.index].dropna(
        how='all', axis=0).dropna(how='all', axis=1)

    fig = plt.figure(figsize=(12, 10))
    gs = gridspec.GridSpec(2, 2, wspace=0.1, hspace=0.1,
                           width_ratios=[1, .2], height_ratios=[1, .1])

    corr_ax = fig.add_subplot(gs[0, 0])
    corr_cbar_ax = fig.add_subplot(gs[1, 0])
    pc_ax = fig.add_subplot(gs[0, 1:])
    pc_cbar_ax = fig.add_subplot(gs[1:, 1:])

    sns.heatmap(data, linewidth=0, square=True, vmin=-1, vmax=1, ax=corr_ax,
                cbar_ax=corr_cbar_ax, cbar_kws=dict(orientation='horizontal'))
    sns.heatmap(gene_pc_clusters.ix[:, ['PC1 Score', 'PC2 Score']],
                linewidth=0, cmap=mpl.cm.PRGn, ax=pc_ax, cbar_ax=pc_cbar_ax,
                cbar_kws=dict(orientation='horizontal'),
                xticklabels=False, yticklabels=False)

    corr_ax.set_xlabel('')
    corr_ax.set_ylabel('')
    corr_ax.set_xticks([])
    corr_ax.set_yticks([])
    pc_ax.set_yticks([])
    pc_ax.set_ylabel('');

.. image:: shalek2013_files/shalek2013_145_0.png

Sure enough, if I use their annotations, I get exactly that. Though there
were two genes in their file that I didn't have in the ``lps_response_corr``
data:

.. code:: python

    gene_pc_clusters.index.difference(lps_response_corr.index)

::

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    ----> 1 gene_pc_clusters.index.difference(lps_response_corr.index)

    /usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in difference(self, other)
       1325         result_name = self.name if self.name == other.name else None
       1326
    -> 1327         theDiff = sorted(set(self) - set(other))
       1328         return Index(theDiff, name=result_name)
       1329

    TypeError: can't compare datetime.datetime to unicode

Oh joy, another ``datetime`` error, just like we had with ``expression2``...
Looking back at the original Excel file, there is one gene that Excel mangled
into a date:

.. figure:: https://raw.githubusercontent.com/olgabot/olgabot.github.io-source/master/content/images/shalek2013_supplementary_table_5_datetime_error.png
   :align: center
   :alt: Excel mangles gene names that look like dates

   Please, can we start using just plain ole ``.csv``\ s for supplementary
   data! Excel does NOT preserve strings if they start with numbers, and
   instead thinks they are dates.

.. code:: python

    import collections

    collections.Counter(gene_pc_clusters.index.map(type))

.. parsed-literal::

    Counter({<type 'unicode'>: 631, <type 'datetime.datetime'>: 1})

Yep, it's just that one that got mangled... oh well.

.. code:: python

    gene_pc_clusters_genes = set(filter(lambda x: isinstance(x, unicode),
                                        gene_pc_clusters.index))
    gene_pc_clusters_genes.difference(lps_response_corr.index)

.. parsed-literal::

    {u'RPS6KA2'}

So "``RPS6KA2``" is the only gene that was in their list of genes and not in
mine.

Supplementary figures
---------------------

Now we get to have even more fun by plotting the supplementary figures! :D
Ironically, the supplementary figures are usually way easier to access (as
in, not behind a paywall), and yet they're usually the documents that really
have the crucial information about the experiments.

Supplementary Figure 1
~~~~~~~~~~~~~~~~~~~~~~

.. figure:: https://raw.githubusercontent.com/olgabot/olgabot.github.io-source/master/content/images/shalek2013_sfig1.png
   :align: center
   :alt: Supplementary figure 1, a correlation plot

   Supplementary figure 1, a correlation plot

.. code:: python

    singles_mean = study.expression.singles.mean()
    singles_mean.name = 'Single cell average'

    # Convert "singles_mean" from a Series to a one-column DataFrame so it
    # can be concatenated with the other expression matrices
    singles_mean = pd.DataFrame(singles_mean)
    singles_mean.head()
Single cell average
GENE
NPL 1.075740
QK 2.019888
AK163153 1.429369
PARK2 0.596479
AGPAT4 2.021294
.. code:: python

    data_for_correlations = pd.concat([study.expression.singles,
                                       singles_mean.T,
                                       study.expression.pooled])

    # Take the transpose of the data, because the plotting algorithm
    # calculates correlations between columns, and we want the correlations
    # between samples, not features
    data_for_correlations = data_for_correlations.T
    data_for_correlations.head()

    # %time sns.corrplot(data_for_correlations)
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 ... S13 S14 S15 S16 S17 S18 Single cell average P1 P2 P3
GENE
NPL 4.290577 0.000000 4.860293 0.090829 0.000000 0.000000 4.730129 4.657090 0.112641 0.000000 ... 0.110470 0.099121 0.100920 0.206361 0.104884 0.000000 1.075740 2.093019 2.044724 2.742480
QK 5.038477 4.183371 3.847854 0.066797 3.305915 0.114225 3.730270 2.750103 0.134389 0.760353 ... 3.395885 2.294456 0.301120 3.547688 2.185832 0.040923 2.019888 3.869102 3.690982 3.671838
AK163153 1.249363 1.947622 1.082463 1.119633 1.267464 0.901824 1.033401 0.978591 1.220720 1.035237 ... 2.103135 1.110511 1.202271 4.446612 1.367261 0.428320 1.429369 0.605094 0.392494 0.284990
PARK2 0.540694 0.500426 0.604097 0.418703 0.000000 0.601280 0.404931 0.552874 0.343271 0.844120 ... 0.755072 1.109400 0.807534 0.586962 0.485122 0.091469 0.596479 0.815242 0.267032 0.645365
AGPAT4 0.095072 5.868557 4.137252 0.066015 0.000000 4.750107 0.069345 4.130618 3.328758 0.000000 ... 0.000000 4.430612 0.000000 0.000000 4.219120 0.171028 2.021294 2.854144 2.139655 2.806291

5 rows × 22 columns
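Before plotting, it can be reassuring to cross-check the numbers that
``corrplot`` will visualize; ``DataFrame.corr()`` computes the same pairwise
correlations between columns (Pearson, by default). A quick sketch of my own,
peeking at the pooled samples:

.. code:: python

    # Pairwise correlations between samples; the pooled samples (P1-P3)
    # should be highly correlated with each other
    sample_correlations = data_for_correlations.corr()
    sample_correlations.ix[['P1', 'P2', 'P3'], ['P1', 'P2', 'P3']]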

.. code:: python

    fig, ax = plt.subplots(figsize=(10, 10))
    sns.corrplot(data_for_correlations, ax=ax)
    sns.despine()

.. image:: shalek2013_files/shalek2013_159_0.png

Notice that this is mostly red, while the figure from the paper was both blue
and red. This is because their colormap started at 0.2 (not at a negative
value), and was centered with white at about 0.6. I see that they're trying
to emphasize how much more correlated the pooled samples are with each other,
but I think a simple sequential colormap would have been more effective.

Supplementary Figures 2 and 3
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Supplementary Figures 2 and 3 are from FISH and raw sequence data, and are
out of the scope of this computational reproduction.

Supplementary Figure 4
~~~~~~~~~~~~~~~~~~~~~~

Supplementary Figure 4 was from published data; however, the citation in the
Supplementary Information (#23) was a machine-learning book, and #23 in the
main-text citations was a review of probabilistic graphical models, neither
of which has the mouse embryonic stem cells or mouse embryonic fibroblasts
used in the figure.

Supplementary Figure 5
~~~~~~~~~~~~~~~~~~~~~~

For this figure, we can only plot 5d, since it's derived directly from a
table in their dataset. Warning: these data are going to require some serious
cleaning. Yay data janitorial duties!

.. figure:: https://raw.githubusercontent.com/olgabot/olgabot.github.io-source/master/content/images/shalek2013_sfig5.png
   :align: center
   :alt: Supplementary Figure 5

Supplementary Figure 5d
^^^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    barcoded = pd.read_excel('nature12172-s1/Supplementary_Table7.xlsx')
    barcoded.head()
TPM Unnamed: 1 Unnamed: 2 Unnamed: 3 Unique Barcodes Unnamed: 5 Unnamed: 6
GENE MB_S1 MB_S2 MB_S3 NaN MB_S1 MB_S2 MB_S3
0610007L01RIK 0 0 5.595054 NaN 0 0 0
0610007P14RIK 76.25091 38.77614 0.1823286 NaN 23 8 0
0610007P22RIK 24.26729 50.24694 17.74422 NaN 14 5 6
0610008F07RIK 0 0 0 NaN 0 0 0
The first three columns are TPM calculated from the three samples that have
molecular barcodes, and the last three columns are the integer counts of
unique molecular barcodes from those same samples. Let's remove the
"Unnamed: 3" column, which is all NaNs. We'll do that with the ``.dropna``
method, specifying ``axis=1`` for columns and ``how="all"`` to make sure only
columns that are ALL NaNs are removed.

.. code:: python

    barcoded = barcoded.dropna(how='all', axis=1)
    barcoded.head()
TPM Unnamed: 1 Unnamed: 2 Unique Barcodes Unnamed: 5 Unnamed: 6
GENE MB_S1 MB_S2 MB_S3 MB_S1 MB_S2 MB_S3
0610007L01RIK 0 0 5.595054 0 0 0
0610007P14RIK 76.25091 38.77614 0.1823286 23 8 0
0610007P22RIK 24.26729 50.24694 17.74422 14 5 6
0610008F07RIK 0 0 0 0 0 0
Next, let's drop that pesky "GENE" row. Don't worry, we'll get the sample ID
names back next.

.. code:: python

    barcoded = barcoded.drop('GENE', axis=0)
    barcoded.head()
TPM Unnamed: 1 Unnamed: 2 Unique Barcodes Unnamed: 5 Unnamed: 6
0610007L01RIK 0 0 5.595054 0 0 0
0610007P14RIK 76.25091 38.77614 0.1823286 23 8 0
0610007P22RIK 24.26729 50.24694 17.74422 14 5 6
0610008F07RIK 0 0 0 0 0 0
0610009B22RIK 67.12981 115.1393 55.98812 11 18 8
We'll create a ``pandas.MultiIndex`` from ``(sample_id, measurement_type)``
tuples, then sort the columns so each sample's measurements sit together.

.. code:: python

    columns = pd.MultiIndex.from_tuples(
        [('MB_S1', 'TPM'), ('MB_S2', 'TPM'), ('MB_S3', 'TPM'),
         ('MB_S1', 'Unique Barcodes'), ('MB_S2', 'Unique Barcodes'),
         ('MB_S3', 'Unique Barcodes')])
    barcoded.columns = columns
    barcoded = barcoded.sort_index(axis=1)
    barcoded.head()
MB_S1 MB_S2 MB_S3
TPM Unique Barcodes TPM Unique Barcodes TPM Unique Barcodes
0610007L01RIK 0 0 0 0 5.595054 0
0610007P14RIK 76.25091 23 38.77614 8 0.1823286 0
0610007P22RIK 24.26729 14 50.24694 5 17.74422 6
0610008F07RIK 0 0 0 0 0 0
0610009B22RIK 67.12981 11 115.1393 18 55.98812 8
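Incidentally, the same columns could be built a little more compactly with
``pd.MultiIndex.from_product`` plus a level swap; a sketch of an equivalent
construction (my variation, not in the original notebook):

.. code:: python

    # from_product yields (measurement, sample) pairs in the same order as
    # the raw columns (all TPM, then all Unique Barcodes); swaplevel then
    # puts sample_id first, matching the tuples above
    columns = pd.MultiIndex.from_product(
        [['TPM', 'Unique Barcodes'],
         ['MB_S1', 'MB_S2', 'MB_S3']]).swaplevel(0, 1)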
For the next move, we're going to do some crazy ``pandas``-fu. First we're
going to transpose, then ``reset_index`` the transpose. Just so you know what
this looks like, here it is:

.. code:: python

    barcoded.T.reset_index().head()
level_0 level_1 0610007L01RIK 0610007P14RIK 0610007P22RIK 0610008F07RIK 0610009B22RIK 0610009D07RIK 0610009O20RIK 0610010B08RIK ... ZWILCH ZWINT ZXDA ZXDB ZXDC ZYG11A ZYG11B ZYX ZZEF1 ZZZ3
0 MB_S1 TPM 0 76.25091 24.26729 0 67.12981 132.2392 17.03907 0.01375923 ... 0 206.8494 0 0 0 0 0.01985733 55.28996 0.09482778 0
1 MB_S1 Unique Barcodes 0 23 14 0 11 29 3 1 ... 0 33 0 0 0 0 0 6 0 0
2 MB_S2 TPM 0 38.77614 50.24694 0 115.1393 49.16287 0 0 ... 0 48.7729 0 0 0 0 7.894789 135.1977 0 4.272594
3 MB_S2 Unique Barcodes 0 8 5 0 18 11 0 0 ... 0 10 0 0 0 0 0 7 0 0
4 MB_S3 TPM 5.595054 0.1823286 17.74422 0 55.98812 203.6302 0 0.4914763 ... 0 54.51386 1.120081 0 0 0 0.1238624 340.7358 0.6677646 0

5 rows × 27725 columns

Next, we're going to transform the data into a tidy format, with separate
columns for the sample ID, the measurement type, the gene that was measured,
and its measurement value.

.. code:: python

    barcoded_tidy = pd.melt(barcoded.T.reset_index(),
                            id_vars=['level_0', 'level_1'])
    barcoded_tidy.head()
level_0 level_1 variable value
0 MB_S1 TPM 0610007L01RIK 0
1 MB_S1 Unique Barcodes 0610007L01RIK 0
2 MB_S2 TPM 0610007L01RIK 0
3 MB_S2 Unique Barcodes 0610007L01RIK 0
4 MB_S3 TPM 0610007L01RIK 5.595054
Now let's rename these columns to something more useful than "level\_0":

.. code:: python

    barcoded_tidy = barcoded_tidy.rename(columns={'level_0': 'sample_id',
                                                  'level_1': 'measurement',
                                                  'variable': 'gene_name'})
    barcoded_tidy.head()
sample_id measurement gene_name value
0 MB_S1 TPM 0610007L01RIK 0
1 MB_S1 Unique Barcodes 0610007L01RIK 0
2 MB_S2 TPM 0610007L01RIK 0
3 MB_S2 Unique Barcodes 0610007L01RIK 0
4 MB_S3 TPM 0610007L01RIK 5.595054
Next, we're going to take some seemingly duplicative steps, but trust me,
it'll make the data easier to work with. (See the pivoting sketch a few steps
below for a more direct alternative.)

.. code:: python

    barcoded_tidy['TPM'] = barcoded_tidy.value[
        barcoded_tidy.measurement == 'TPM']
    barcoded_tidy['Unique Barcodes'] = barcoded_tidy.value[
        barcoded_tidy.measurement == 'Unique Barcodes']

Fill the values of "**TPM**" forwards, since those rows appear first, and
fill the values of "**Unique Barcodes**" backwards, since they come second.

.. code:: python

    barcoded_tidy.TPM = barcoded_tidy.TPM.ffill()
    barcoded_tidy['Unique Barcodes'] = barcoded_tidy['Unique Barcodes'].bfill()
    barcoded_tidy.head()
sample_id measurement gene_name value TPM Unique Barcodes
0 MB_S1 TPM 0610007L01RIK 0 0.000000 0
1 MB_S1 Unique Barcodes 0610007L01RIK 0 0.000000 0
2 MB_S2 TPM 0610007L01RIK 0 0.000000 0
3 MB_S2 Unique Barcodes 0610007L01RIK 0 0.000000 0
4 MB_S3 TPM 0610007L01RIK 5.595054 5.595054 0
Drop the "**measurement**" column, then drop duplicate rows.

.. code:: python

    barcoded_tidy = barcoded_tidy.drop('measurement', axis=1)
    barcoded_tidy = barcoded_tidy.drop_duplicates()
    barcoded_tidy.head()
sample_id gene_name value TPM Unique Barcodes
0 MB_S1 0610007L01RIK 0 0.000000 0
2 MB_S2 0610007L01RIK 0 0.000000 0
4 MB_S3 0610007L01RIK 5.595054 5.595054 0
5 MB_S3 0610007L01RIK 0 5.595054 0
6 MB_S1 0610007P14RIK 76.25091 76.250913 23
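As an aside: notice in the table above that 0610007L01RIK still appears twice
for MB_S3, because the leftover raw ``value`` column differs between its two
rows. The ffill/bfill shuffle can be skipped entirely by pivoting the melted
frame instead. A sketch, where ``melted`` is assumed to be the DataFrame as
it was right after the ``pd.melt``/``rename`` steps (before the TPM and
Unique Barcodes columns were added):

.. code:: python

    # One row per (sample_id, gene_name), one column per measurement type;
    # no forward/backward filling or deduplication needed, because each
    # (sample, gene, measurement) triple occurs exactly once
    barcoded_wide = (melted
                     .set_index(['sample_id', 'gene_name', 'measurement'])['value']
                     .unstack('measurement')
                     .reset_index())
    barcoded_wide.head()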
.. code:: python

    # Note: genes with zero TPM or zero barcodes become -inf here (log of 0)
    barcoded_tidy['log TPM'] = np.log(barcoded_tidy.TPM)
    barcoded_tidy['log Unique Barcodes'] = np.log(barcoded_tidy['Unique Barcodes'])

Now we can use the convenient linear-model plot (``lmplot``) in ``seaborn``
to plot these three samples together!

.. code:: python

    sns.lmplot('log TPM', 'log Unique Barcodes', barcoded_tidy,
               col='sample_id')

.. image:: shalek2013_files/shalek2013_189_1.png

Supplementary Figures 6-20
~~~~~~~~~~~~~~~~~~~~~~~~~~

Supplementary Figures 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, and
20 deal with splicing data from the molecular barcodes, RNA-FISH, flow-sorted
cells, and single-cell RT-PCR, and are out of the scope of this reproduction.

Conclusions
-----------

While there may be minor, undocumented differences between the methods
presented in the manuscript and the figures here, the application of
``flotilla`` presents an opportunity to avoid these kinds of inconsistencies
by strictly documenting every change to the code and every transformation of
the data. The biology the authors found is clearly real: they performed the
knockout experiment in *Ifnr-/-* cells and saw that the maturation process
was indeed affected, with *Stat2* and *Irf7* showing much lower expression,
as in the "maturing" cells in the data.