In the interest of reproducibility, and to showcase our new package ```flotilla`` `_, I've reproduced many figures from the landmark single-cell paper, `Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells `_ by Shalek and Sujita, *et al*. *Nature* (2013). Before we begin, let's import everything we need. .. code:: python # Import the flotilla package for biological data analysis import flotilla # Import "numerical python" library for number crunching import numpy as np # Import "panel data analysis" library for tabular data import pandas as pd # Import statistical data visualization package # Note: As of November 6th, 2014, you will need the "master" version of # seaborn on github (v0.5.dev), installed via # "pip install git+ssh://git@github.com/mwaskom/seaborn.git import seaborn as sns # Turn on inline plots with IPython %matplotlib inline .. parsed-literal:: Couldn't import dot_parser, loading of dot files will not be possible. Shalek and Sujita, *et al* (2013) ================================= In the 2013 paper, `Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells `_ (Shalek and Sujita, *et al*. *Nature* (2013)), Regev and colleagues performed single-cell sequencing 18 bone marrow-derived dendritic cells (BMDCs), in addition to 3 pooled samples. Expression data --------------- First, we will read in the expression data. These data were obtained using, .. code:: python ! wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE41nnn/GSE41265/suppl/GSE41265_allGenesTPM.txt.gz .. parsed-literal:: --2014-11-10 12:35:20-- ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE41nnn/GSE41265/suppl/GSE41265_allGenesTPM.txt.gz => 'GSE41265_allGenesTPM.txt.gz.1' Resolving ftp.ncbi.nlm.nih.gov... 130.14.250.7, 2607:f220:41e:250::11 Connecting to ftp.ncbi.nlm.nih.gov|130.14.250.7|:21... connected. Logging in as anonymous ... Logged in! ==> SYST ... done. ==> PWD ... done. ==> TYPE I ... done. ==> CWD (1) /geo/series/GSE41nnn/GSE41265/suppl ... done. ==> SIZE GSE41265_allGenesTPM.txt.gz ... 1099290 ==> PASV ... done. ==> RETR GSE41265_allGenesTPM.txt.gz ... done. Length: 1099290 (1.0M) (unauthoritative) 100%[======================================>] 1,099,290 6.28MB/s in 0.2s 2014-11-10 12:35:21 (6.28 MB/s) - 'GSE41265_allGenesTPM.txt.gz.1' saved [1099290] We will also compare to the supplementary table 2 data, obtained using .. code:: python ! wget http://www.nature.com/nature/journal/v498/n7453/extref/nature12172-s1.zip ! unzip nature12172-s1.zip .. parsed-literal:: --2014-11-10 12:35:22-- http://www.nature.com/nature/journal/v498/n7453/extref/nature12172-s1.zip Resolving www.nature.com... 23.62.97.233, 23.62.97.227 Connecting to www.nature.com|23.62.97.233|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 4634226 (4.4M) [application/zip] Saving to: 'nature12172-s1.zip' 100%[======================================>] 4,634,226 1.17MB/s in 3.8s 2014-11-10 12:35:26 (1.16 MB/s) - 'nature12172-s1.zip' saved [4634226/4634226] Archive: nature12172-s1.zip creating: nature12172-s1/ inflating: nature12172-s1/Supplementary_Table1.xls inflating: nature12172-s1/Supplementary_Table2.xlsx inflating: nature12172-s1/Supplementary_Table3.xls inflating: nature12172-s1/Supplementary_Table4.xls inflating: nature12172-s1/Supplementary_Table5.xls inflating: nature12172-s1/Supplementary_Table6.xls inflating: nature12172-s1/Supplementary_Table7.xlsx .. code:: python expression = pd.read_table("GSE41265_allGenesTPM.txt.gz", compression="gzip", index_col=0) expression.head() .. raw:: html
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 ... S12 S13 S14 S15 S16 S17 S18 P1 P2 P3
GENE
XKR4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.019906 0.000000
AB338584 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
B3GAT2 0.000000 0.000000 0.023441 0.000000 0.000000 0.029378 0.000000 0.055452 0.000000 0.029448 ... 0.000000 0.000000 0.031654 0.000000 0.000000 0.000000 42.150208 0.680327 0.022996 0.110236
NPL 72.008590 0.000000 128.062012 0.095082 0.000000 0.000000 112.310234 104.329122 0.119230 0.000000 ... 0.000000 0.116802 0.104200 0.106188 0.229197 0.110582 0.000000 7.109356 6.727028 14.525447
T2 0.109249 0.172009 0.000000 0.000000 0.182703 0.076012 0.078698 0.000000 0.093698 0.076583 ... 0.693459 0.010137 0.081936 0.000000 0.000000 0.086879 0.068174 0.062063 0.000000 0.050605

5 rows × 21 columns

These data are in the "transcripts per million," aka TPM unit. See `this `_ blog post if that sounds weird to you. These data are formatted with samples on the columns, and genes on the rows. But we want the opposite, with samples on the rows and genes on the columns. This follows ```scikit-learn`` `_'s standard of data matrices with size (``n_samples``, ``n_features``) as each gene is a feature. So we will simply transpose this. .. code:: python expression = expression.T expression.head() .. raw:: html
GENE XKR4 AB338584 B3GAT2 NPL T2 T PDE10A 1700010I14RIK 6530411M01RIK PABPC6 ... AK085062 DHX9 RNASET2B FGFR1OP CCR6 BRP44L AK014435 AK015714 SFT2D1 PRR18
S1 0 0 0.000000 72.008590 0.109249 0 0 0 0 0 ... 0 0.774638 23.520936 0.000000 0 460.316773 0 0.000000 39.442566 0
S2 0 0 0.000000 0.000000 0.172009 0 0 0 0 0 ... 0 0.367391 1.887873 0.000000 0 823.890290 0 0.000000 4.967412 0
S3 0 0 0.023441 128.062012 0.000000 0 0 0 0 0 ... 0 0.249858 0.313510 0.166772 0 1002.354241 0 0.000000 0.000000 0
S4 0 0 0.000000 0.095082 0.000000 0 0 0 0 0 ... 0 0.354157 0.000000 0.887003 0 1230.766795 0 0.000000 0.131215 0
S5 0 0 0.000000 0.000000 0.182703 0 0 0 0 0 ... 0 0.039263 0.000000 131.077131 0 1614.749122 0 0.242179 95.485743 0

5 rows × 27723 columns

The authors filtered the expression data based on having at least 3 single cells express genes with at TPM (transcripts per million, ) > 1. We can express this in using the ```pandas`` `_ DataFrames easily. First, from reading the paper and looking at the data, I know there are 18 single cells, and there are 18 samples that start with the letter "S." So I will extract the single samples from the ``index`` (row names) using a ``lambda``, a tiny function which in this case, tells me whether or not that sample id begins with the letter "S". .. code:: python singles_ids = expression.index[expression.index.map(lambda x: x.startswith('S'))] print('number of single cells:', len(singles_ids)) singles = expression.ix[singles_ids] expression_filtered = expression.ix[:, singles[singles > 1].count() >= 3] expression_filtered = np.log(expression_filtered + 1) expression_filtered.shape .. parsed-literal:: ('number of single cells:', 18) .. parsed-literal:: (21, 6312) Hmm, that's strange. The paper states that they had 6313 genes after filtering, but I get 6312. Even using "``singles >= 1``" doesn't help. (I also tried this with the expression table provided in the supplementary data as "``SupplementaryTable2.xlsx``," and got the same results.) Now that we've taken care of importing and filtering the expression data, let's do the feature data of the expression data. Expression feature data ----------------------- This is similar to the ``fData`` from ``BioconductoR``, where there's some additional data on your features that you want to look at. They uploaded information about the features in their OTHER expression matrix, uploaded as a supplementary file, ``Supplementary_Table2.xlsx``. Notice that this is a ``csv`` and not an ``xlsx``. This is because Excel mangled the gene IDS that started with ``201*`` and assumed they were dates :( The workaround I did was to add another column to the sheet with the formula ``="'" & A1``, press ``Command``-``Shift``-``End`` to select the end of the rows, and then do ``Ctrl``-``D`` to "fill down" to the bottom (thanks to `this `_ stackoverflow post for teaching me how to Excel). Then, I saved the file as a ``csv`` for maximum portability and compatibility. .. code:: python expression2 = pd.read_csv('nature12172-s1/Supplementary_Table2.csv', # Need to specify the index column as both the first and the last columns, # Because the last column is the "Gene Category" index_col=[0, -1], parse_dates=False, infer_datetime_format=False) # This was also in features x samples format, so we need to transpose expression2 = expression2.T expression2.head() .. raw:: html
'GENE '0610007L01RIK '0610007P14RIK '0610007P22RIK '0610008F07RIK '0610009B22RIK '0610009D07RIK '0610009O20RIK '0610010B08RIK '0610010F05RIK '0610010K06RIK ... 'ZWILCH 'ZWINT 'ZXDA 'ZXDB 'ZXDC 'ZYG11A 'ZYG11B 'ZYX 'ZZEF1 'ZZZ3
Gene Category NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S1 27.181570 0.166794 0 0 0.000000 178.852732 0 0.962417 0.000000 143.359550 ... 0.000000 302.361227 0.000000 0 0 0 0.027717 297.918756 37.685501 0.000000
S2 37.682691 0.263962 0 0 0.207921 0.141099 0 0.000000 0.000000 0.255617 ... 0.000000 96.033724 0.020459 0 0 0 0.042430 0.242888 0.000000 0.000000
S3 0.056916 78.622459 0 0 0.145680 0.396363 0 0.000000 0.024692 72.775846 ... 0.000000 427.915555 0.000000 0 0 0 0.040407 6.753530 0.132011 0.017615
S4 55.649250 0.228866 0 0 0.000000 88.798158 0 0.000000 0.000000 93.825442 ... 0.000000 9.788557 0.017787 0 0 0 0.013452 0.274689 9.724890 0.000000
S5 0.000000 0.093117 0 0 131.326008 155.936361 0 0.000000 0.000000 0.031029 ... 0.204522 26.575760 0.000000 0 0 0 1.101589 59.256094 44.430726 0.000000

5 rows × 27723 columns

Now we need to strip the single-quote I added to all the gene names: .. code:: python new_index, indexer = expression2.columns.reindex(map(lambda x: (x[0].lstrip("'"), x[1]), expression2.columns.values)) expression2.columns = new_index expression2.head() .. raw:: html
'GENE 0610007L01RIK 0610007P14RIK 0610007P22RIK 0610008F07RIK 0610009B22RIK 0610009D07RIK 0610009O20RIK 0610010B08RIK 0610010F05RIK 0610010K06RIK ... ZWILCH ZWINT ZXDA ZXDB ZXDC ZYG11A ZYG11B ZYX ZZEF1 ZZZ3
Gene Category NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S1 27.181570 0.166794 0 0 0.000000 178.852732 0 0.962417 0.000000 143.359550 ... 0.000000 302.361227 0.000000 0 0 0 0.027717 297.918756 37.685501 0.000000
S2 37.682691 0.263962 0 0 0.207921 0.141099 0 0.000000 0.000000 0.255617 ... 0.000000 96.033724 0.020459 0 0 0 0.042430 0.242888 0.000000 0.000000
S3 0.056916 78.622459 0 0 0.145680 0.396363 0 0.000000 0.024692 72.775846 ... 0.000000 427.915555 0.000000 0 0 0 0.040407 6.753530 0.132011 0.017615
S4 55.649250 0.228866 0 0 0.000000 88.798158 0 0.000000 0.000000 93.825442 ... 0.000000 9.788557 0.017787 0 0 0 0.013452 0.274689 9.724890 0.000000
S5 0.000000 0.093117 0 0 131.326008 155.936361 0 0.000000 0.000000 0.031029 ... 0.204522 26.575760 0.000000 0 0 0 1.101589 59.256094 44.430726 0.000000

5 rows × 27723 columns

We want to create a ``pandas.DataFrame`` from the "Gene Category" row for our ``expression_feature_data``, which we will do via: .. code:: python gene_ids, gene_category = zip(*expression2.columns.values) gene_categories = pd.Series(gene_category, index=gene_ids, name='gene_category') gene_categories .. parsed-literal:: 0610007L01RIK NaN 0610007P14RIK NaN 0610007P22RIK NaN 0610008F07RIK NaN 0610009B22RIK NaN 0610009D07RIK NaN 0610009O20RIK NaN 0610010B08RIK NaN 0610010F05RIK NaN 0610010K06RIK NaN 0610010K14RIK NaN 0610010O12RIK NaN 0610011F06RIK NaN 0610011L14RIK NaN 0610012G03RIK NaN ... ZSWIM5 NaN ZSWIM6 NaN ZSWIM7 NaN ZUFSP LPS Response ZW10 NaN ZWILCH NaN ZWINT NaN ZXDA NaN ZXDB NaN ZXDC NaN ZYG11A NaN ZYG11B NaN ZYX NaN ZZEF1 NaN ZZZ3 NaN Name: gene_category, Length: 27723, dtype: object .. code:: python expression_feature_data = pd.DataFrame(gene_categories) expression_feature_data.head() .. raw:: html
gene_category
0610007L01RIK NaN
0610007P14RIK NaN
0610007P22RIK NaN
0610008F07RIK NaN
0610009B22RIK NaN
Splicing Data ------------- We obtain the splicing data from this study from the supplementary information, specifically the ``Supplementary_Table4.xls`` .. code:: python splicing = pd.read_excel('nature12172-s1/Supplementary_Table4.xls', 'splicingTable.txt', index_col=(0,1)) splicing.head() .. raw:: html
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S13 S14 S15 S16 S17 S18 10,000 cell Rep1 (P1) 10,000 cell Rep2 (P2) 10,000 cell Rep3 (P3)
Event name gene
chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- Os9 0.84 NaN NaN NaN NaN 0.01 NaN NaN NaN NaN 0.03 NaN NaN 0.02 NaN 0.01 NaN 0.27 0.37 0.31
chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- Vta1 0.95 NaN NaN 0.84 0.95 0.91 0.87 0.86 NaN NaN 0.93 NaN NaN 0.96 NaN NaN NaN 0.83 0.85 0.64
chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ Bclaf1 NaN 0.04 0.02 NaN NaN 0.14 NaN 0.02 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN 0.40 0.49 0.59
chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ Bclaf1 0.02 0.98 0.55 NaN NaN NaN NaN 0.98 NaN NaN NaN NaN NaN 0.06 NaN NaN NaN 0.62 0.63 0.70
chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ P4ha1 0.42 NaN NaN NaN 0.94 NaN NaN 0.03 0.97 NaN NaN NaN NaN NaN NaN NaN NaN 0.43 0.36 0.52
.. code:: python splicing = splicing.T splicing .. raw:: html
Event name chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ chr10:79173370:79173665:+@chr10:79174001:79174029:+@chr10:79174239:79174726:+ chr10:79322526:79322700:+@chr10:79322862:79322939:+@chr10:79323569:79323862:+ chr10:87376364:87376545:+@chr10:87378043:87378094:+@chr10:87393420:87399792:+ chr10:92747514:92747722:-@chr10:92727625:92728425:-@chr10:92717434:92717556:- chr11:101438508:101438565:+@chr11:101439246:101439351:+@chr11:101441899:101443267:+ ... chr8:126022488:126022598:+@chr8:126023892:126024007:+@chr8:126025133:126025333:+ chr14:51455667:51455879:-@chr14:51453589:51453752:-@chr14:51453129:51453242:- chr17:29497858:29498102:+@chr17:29500656:29500887:+@chr17:29501856:29502226:+ chr2:94198908:94199094:-@chr2:94182784:94182954:-@chr2:94172950:94173209:- chr9:21314438:21314697:-@chr9:21313375:21313558:-@chr9:21311823:21312835:- chr9:21314438:21314697:-@chr9:21313375:21313795:-@chr9:21311823:21312835:- chr10:79545360:79545471:-@chr10:79542698:79544127:-@chr10:79533365:79535263:- chr17:5975579:5975881:+@chr17:5985972:5986242:+@chr17:5990136:5990361:+ chr2:29997782:29997941:+@chr2:30002172:30002382:+@chr2:30002882:30003045:+ chr7:119221306:119221473:+@chr7:119223686:119223745:+@chr7:119225944:119226075:+
gene Os9 Vta1 Bclaf1 Bclaf1 P4ha1 Bsg Ptbp1 Igf1 Elk3 Nbr1 ... Afg3l1 Tep1 Fgd2 Ttc17 Tmed1 Tmed1 Sbno2 Synj2 Tbc1d13 Usp47
S1 0.84 0.95 NaN 0.02 0.42 NaN 0.57 0.31 0.93 0.57 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S2 NaN NaN 0.04 0.98 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S3 NaN NaN 0.02 0.55 NaN NaN NaN 0.20 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S4 NaN 0.84 NaN NaN NaN NaN NaN 0.95 NaN 0.04 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S5 NaN 0.95 NaN NaN 0.94 NaN NaN 0.73 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S6 0.01 0.91 0.14 NaN NaN NaN NaN 0.61 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S7 NaN 0.87 NaN NaN NaN 0.62 NaN 0.85 0.73 0.55 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S8 NaN 0.86 0.02 0.98 0.03 NaN NaN 0.89 0.82 0.83 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S9 NaN NaN NaN NaN 0.97 NaN 0.97 NaN 0.90 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S10 NaN NaN NaN NaN NaN NaN 0.06 0.98 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S11 0.03 0.93 NaN NaN NaN NaN NaN NaN 0.97 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S14 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.88 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S15 0.02 0.96 0.01 0.06 NaN NaN NaN 0.44 NaN NaN ... 0.91 NaN NaN NaN NaN NaN NaN NaN NaN NaN
S16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 0.27 0.99 0.99 0.98 0.98 NaN NaN NaN NaN
S17 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN 0.96 NaN NaN NaN 0.99 0.98 0.67 0.07
S18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10,000 cell Rep1 (P1) 0.27 0.83 0.40 0.62 0.43 0.78 NaN 0.60 0.76 0.52 ... 0.92 NaN 0.81 0.77 NaN NaN 0.84 0.50 0.56 NaN
10,000 cell Rep2 (P2) 0.37 0.85 0.49 0.63 0.36 0.72 0.47 0.60 0.73 0.68 ... 0.67 0.15 0.52 0.67 0.63 0.73 0.82 0.90 0.71 0.55
10,000 cell Rep3 (P3) 0.31 0.64 0.59 0.70 0.52 0.79 NaN 0.65 0.42 0.64 ... 0.58 0.79 0.74 0.85 0.73 0.39 0.56 NaN 0.64 NaN

20 rows × 352 columns

The three pooled samples aren't named consistently with the expression data, so we have to fix that. .. code:: python splicing.index[splicing.index.map(lambda x: 'P' in x)] .. parsed-literal:: Index([u'10,000 cell Rep1 (P1)', u'10,000 cell Rep2 (P2)', u'10,000 cell Rep3 (P3)'], dtype='object') Since the pooled sample IDs are inconsistent with the ``expression`` data, we have to change them. We can get the "P" and the number after that using regular expressions, called ``re`` in the Python standard library, e.g.: .. code:: python import re re.search(r'P\d', '10,000 cell Rep1 (P1)').group() .. parsed-literal:: 'P1' .. code:: python def long_pooled_name_to_short(x): if 'P' not in x: return x else: return re.search(r'P\d', x).group() splicing.index.map(long_pooled_name_to_short) .. parsed-literal:: array([u'S1', u'S2', u'S3', u'S4', u'S5', u'S6', u'S7', u'S8', u'S9', u'S10', u'S11', u'S13', u'S14', u'S15', u'S16', u'S17', u'S18', u'P1', u'P2', u'P3'], dtype=object) And now we assign this new index as our index to the ``splicing`` dataframe .. code:: python splicing.index = splicing.index.map(long_pooled_name_to_short) splicing.head() .. raw:: html
Event name chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ chr10:79173370:79173665:+@chr10:79174001:79174029:+@chr10:79174239:79174726:+ chr10:79322526:79322700:+@chr10:79322862:79322939:+@chr10:79323569:79323862:+ chr10:87376364:87376545:+@chr10:87378043:87378094:+@chr10:87393420:87399792:+ chr10:92747514:92747722:-@chr10:92727625:92728425:-@chr10:92717434:92717556:- chr11:101438508:101438565:+@chr11:101439246:101439351:+@chr11:101441899:101443267:+ ... chr8:126022488:126022598:+@chr8:126023892:126024007:+@chr8:126025133:126025333:+ chr14:51455667:51455879:-@chr14:51453589:51453752:-@chr14:51453129:51453242:- chr17:29497858:29498102:+@chr17:29500656:29500887:+@chr17:29501856:29502226:+ chr2:94198908:94199094:-@chr2:94182784:94182954:-@chr2:94172950:94173209:- chr9:21314438:21314697:-@chr9:21313375:21313558:-@chr9:21311823:21312835:- chr9:21314438:21314697:-@chr9:21313375:21313795:-@chr9:21311823:21312835:- chr10:79545360:79545471:-@chr10:79542698:79544127:-@chr10:79533365:79535263:- chr17:5975579:5975881:+@chr17:5985972:5986242:+@chr17:5990136:5990361:+ chr2:29997782:29997941:+@chr2:30002172:30002382:+@chr2:30002882:30003045:+ chr7:119221306:119221473:+@chr7:119223686:119223745:+@chr7:119225944:119226075:+
gene Os9 Vta1 Bclaf1 Bclaf1 P4ha1 Bsg Ptbp1 Igf1 Elk3 Nbr1 ... Afg3l1 Tep1 Fgd2 Ttc17 Tmed1 Tmed1 Sbno2 Synj2 Tbc1d13 Usp47
S1 0.84 0.95 NaN 0.02 0.42 NaN 0.57 0.31 0.93 0.57 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S2 NaN NaN 0.04 0.98 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S3 NaN NaN 0.02 0.55 NaN NaN NaN 0.20 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S4 NaN 0.84 NaN NaN NaN NaN NaN 0.95 NaN 0.04 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S5 NaN 0.95 NaN NaN 0.94 NaN NaN 0.73 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 352 columns

Metadata -------- Now let's get into creating a metadata dataframe. We'll use the index from the ``expression_filtered`` data to create the minimum required column, ``'phenotype'``, which has the name of the phenotype of that cell. And we'll also add the column ``'pooled'`` to indicate whether this sample is pooled or not. .. code:: python metadata = pd.DataFrame(index=expression_filtered.index) metadata['phenotype'] = 'BDMC' metadata['pooled'] = metadata.index.map(lambda x: x.startswith('P')) metadata .. raw:: html
phenotype pooled
S1 BDMC False
S2 BDMC False
S3 BDMC False
S4 BDMC False
S5 BDMC False
S6 BDMC False
S7 BDMC False
S8 BDMC False
S9 BDMC False
S10 BDMC False
S11 BDMC False
S12 BDMC False
S13 BDMC False
S14 BDMC False
S15 BDMC False
S16 BDMC False
S17 BDMC False
S18 BDMC False
P1 BDMC True
P2 BDMC True
P3 BDMC True
Mapping stats data ------------------ .. code:: python mapping_stats = pd.read_excel('nature12172-s1/Supplementary_Table1.xls', sheetname='SuppTable1 2.txt') mapping_stats .. raw:: html
Sample PF_READS PCT_MAPPED_GENOME PCT_RIBOSOMAL_BASES MEDIAN_CV_COVERAGE MEDIAN_5PRIME_BIAS MEDIAN_3PRIME_BIAS MEDIAN_5PRIME_TO_3PRIME_BIAS
0 S1 21326048 0.706590 0.006820 0.509939 0.092679 0.477321 0.247741
1 S2 27434011 0.745385 0.004111 0.565732 0.056583 0.321053 0.244062
2 S3 31142391 0.722087 0.006428 0.540341 0.079551 0.382286 0.267367
3 S4 26231852 0.737854 0.004959 0.530978 0.067041 0.351670 0.279782
4 S5 29977214 0.746466 0.006121 0.525598 0.066543 0.353995 0.274252
5 S6 24148387 0.730079 0.008794 0.529650 0.072095 0.413696 0.225929
6 S7 24078116 0.730638 0.007945 0.540913 0.051991 0.358597 0.201984
7 S8 25032126 0.739989 0.004133 0.512725 0.058783 0.373509 0.212337
8 S9 22257682 0.747427 0.004869 0.521622 0.063566 0.334294 0.240641
9 S10 29436289 0.748795 0.005499 0.560454 0.036219 0.306729 0.187479
10 S11 31130278 0.741882 0.002740 0.558882 0.049581 0.349191 0.211787
11 S12 21161595 0.750782 0.006837 0.756339 0.013878 0.324264 0.195430
12 S13 28612833 0.733976 0.011718 0.598687 0.035392 0.357447 0.198566
13 S14 26351189 0.748323 0.004106 0.517518 0.070293 0.381095 0.259122
14 S15 25739575 0.748421 0.003353 0.526238 0.050938 0.324207 0.212366
15 S16 26802346 0.739833 0.009370 0.520287 0.071503 0.358758 0.240009
16 S17 26343522 0.749358 0.003155 0.673195 0.024121 0.301588 0.245854
17 S18 25290073 0.749358 0.007465 0.562382 0.048528 0.314776 0.215160
18 10k_rep1 28247826 0.688553 0.018993 0.547000 0.056113 0.484393 0.140333
19 10k_rep2 39303876 0.690313 0.017328 0.547621 0.055600 0.474634 0.142889
20 10k_rep3 29831281 0.710875 0.010610 0.518053 0.066053 0.488738 0.168180
21 MB_SC1 13848219 0.545000 0.007000 0.531495 0.127934 0.207841 0.728980
22 MB_SC2 13550218 0.458000 0.010800 0.569271 0.102581 0.179407 0.694747
23 MB_SC3 26765848 0.496000 0.007900 0.535192 0.141893 0.231068 0.722080
Create a ``flotilla`` Study! ---------------------------- .. code:: python study = flotilla.Study(# The metadata describing phenotype and pooled samples metadata, # A version for this data version='0.1.0', # Dataframe of the filtered expression data expression_data=expression_filtered, # Dataframe of the feature data of the genes expression_feature_data=expression_feature_data, # Dataframe of the splicing data splicing_data=splicing, # Dataframe of the mapping stats data mapping_stats_data=mapping_stats, # Which column in "mapping_stats" has the number of reads mapping_stats_number_mapped_col='PF_READS') .. parsed-literal:: 2014-11-10 12:36:35 Initializing Study 2014-11-10 12:36:35 Initializing Predictor configuration manager for Study 2014-11-10 12:36:35 Predictor ExtraTreesClassifier is of type 2014-11-10 12:36:35 Added ExtraTreesClassifier to default predictors 2014-11-10 12:36:35 Predictor ExtraTreesRegressor is of type 2014-11-10 12:36:35 Added ExtraTreesRegressor to default predictors 2014-11-10 12:36:35 Predictor GradientBoostingClassifier is of type 2014-11-10 12:36:35 Added GradientBoostingClassifier to default predictors 2014-11-10 12:36:35 Predictor GradientBoostingRegressor is of type 2014-11-10 12:36:35 Added GradientBoostingRegressor to default predictors 2014-11-10 12:36:35 Loading metadata 2014-11-10 12:36:35 Loading expression data 2014-11-10 12:36:35 Initializing expression 2014-11-10 12:36:35 Done initializing expression 2014-11-10 12:36:35 Loading splicing data 2014-11-10 12:36:35 Initializing splicing 2014-11-10 12:36:35 Done initializing splicing 2014-11-10 12:36:35 Successfully initialized a Study object! .. parsed-literal:: No phenotype to color mapping was provided, so coming up with reasonable defaults No phenotype to marker (matplotlib plotting symbol) was provided, so each phenotype will be plotted as a circle in the PCA visualizations. As a side note, you can save this study to disk now, so you can "``embark``" later: .. code:: python study.save('shalek2013') .. parsed-literal:: Wrote datapackage to /Users/olga/flotilla_projects/shalek2013/datapackage.json Note that this is saved to my home directory, in ``~/flotilla_projects//``. This will be saved in your home directory, too. The ``datapackage.json`` file is what holds all the information relative to the study, and loosely follows the `datapackage spec `_ created by the Open Knowledge Foundation. .. code:: python cat /Users/olga/flotilla_projects/shalek2013/datapackage.json .. parsed-literal:: { "name": "shalek2013", "title": null, "datapackage_version": "0.1.1", "sources": null, "licenses": null, "resources": [ { "header": [ 0, 1 ], "format": "csv", "compression": "gzip", "name": "splicing", "path": "/Users/olga/flotilla_projects/shalek2013/splicing.csv.gz" }, { "number_mapped_col": "PF_READS", "path": "/Users/olga/flotilla_projects/shalek2013/mapping_stats.csv.gz", "format": "csv", "name": "mapping_stats", "compression": "gzip" }, { "name": "expression_feature", "format": "csv", "rename_col": null, "ignore_subset_cols": [], "path": "/Users/olga/flotilla_projects/shalek2013/expression_feature.csv.gz", "compression": "gzip" }, { "name": "expression", "log_base": null, "format": "csv", "thresh": -Infinity, "path": "/Users/olga/flotilla_projects/shalek2013/expression.csv.gz", "compression": "gzip" }, { "name": "splicing_feature", "format": "csv", "rename_col": "gene_name", "ignore_subset_cols": [], "path": "/Users/olga/flotilla_projects/shalek2013/splicing_feature.csv.gz", "compression": "gzip" }, { "pooled_col": "pooled", "name": "metadata", "phenotype_to_marker": { "BDMC": "o" }, "format": "csv", "minimum_samples": 0, "phenotype_to_color": { "BDMC": "#1b9e77" }, "path": "/Users/olga/flotilla_projects/shalek2013/metadata.csv.gz", "phenotype_col": "phenotype", "phenotype_order": [ "BDMC" ], "compression": "gzip" } ] } One thing to note is that when you save, the version number is bumped up. ``study.version`` (the one we just made) is ``0.1.0``, but the one we saved is ``0.1.1``, since we could have made some changes to the data. Let's look at what else is in this folder: .. code:: python ls /Users/olga/flotilla_projects/shalek2013 .. parsed-literal:: datapackage.json expression_feature.csv.gz metadata.csv.gz splicing_feature.csv.gz expression.csv.gz mapping_stats.csv.gz splicing.csv.gz So this is where all the other files are. Good to know! We can "embark" on this newly-saved study now very painlessly, without having to open and process all those files again: .. code:: python study2 = flotilla.embark('shalek2013') .. parsed-literal:: 2014-11-10 12:36:38 Reading datapackage from /Users/olga/flotilla_projects/shalek2013/datapackage.json 2014-11-10 12:36:38 Parsing datapackage to create a Study object 2014-11-10 12:36:39 Initializing Study 2014-11-10 12:36:39 Initializing Predictor configuration manager for Study 2014-11-10 12:36:39 Predictor ExtraTreesClassifier is of type 2014-11-10 12:36:39 Added ExtraTreesClassifier to default predictors 2014-11-10 12:36:39 Predictor ExtraTreesRegressor is of type 2014-11-10 12:36:39 Added ExtraTreesRegressor to default predictors 2014-11-10 12:36:39 Predictor GradientBoostingClassifier is of type 2014-11-10 12:36:39 Added GradientBoostingClassifier to default predictors 2014-11-10 12:36:39 Predictor GradientBoostingRegressor is of type 2014-11-10 12:36:39 Added GradientBoostingRegressor to default predictors 2014-11-10 12:36:39 Loading metadata 2014-11-10 12:36:39 Loading expression data 2014-11-10 12:36:39 Initializing expression 2014-11-10 12:36:39 Done initializing expression 2014-11-10 12:36:39 Loading splicing data 2014-11-10 12:36:39 Initializing splicing 2014-11-10 12:36:39 Done initializing splicing 2014-11-10 12:36:39 Successfully initialized a Study object! Now we can start creating figures! Figure 1 -------- Here, we will attempt to re-create the sub-panels in `Figure 1 `_, where the original is: .. figure:: http://www.nature.com/nature/journal/v498/n7453/images/nature12172-f1.2.jpg :align: center :alt: Original Figure 1 Original Figure 1 Figure 1a ~~~~~~~~~ .. code:: python study.plot_two_samples('P1', 'P2') .. parsed-literal:: /usr/local/lib/python2.7/site-packages/matplotlib/font_manager.py:1279: UserWarning: findfont: Font family ['Helvetica'] not found. Falling back to Bitstream Vera Sans (prop.get_family(), self.defaultFamily[fontext])) /usr/local/lib/python2.7/site-packages/matplotlib/figure.py:1644: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. warnings.warn("This figure includes Axes that are not " .. image:: shalek2013_files/shalek2013_52_1.png Without flotilla, you would do ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code:: python import seaborn as sns sns.set_style('ticks') x = expression_filtered.ix['P1'] y = expression_filtered.ix['P2'] jointgrid = sns.jointplot(x, y, joint_kws=dict(alpha=0.5)) xmin, xmax, ymin, ymax = jointgrid.ax_joint.axis() jointgrid.ax_joint.set_xlim(0, xmax) jointgrid.ax_joint.set_ylim(0, ymax) .. parsed-literal:: (0, 12.0) .. image:: shalek2013_files/shalek2013_54_1.png Figure 1b ~~~~~~~~~ Paper: :math:`r=0.54`\ . Not sure at all what's going on here. .. code:: python study.plot_two_samples('S1', 'S2') .. image:: shalek2013_files/shalek2013_56_0.png Without ``flotilla`` ~~~~~~~~~~~~~~~~~~~~ .. code:: python import seaborn as sns sns.set_style('ticks') x = expression_filtered.ix['S1'] y = expression_filtered.ix['S2'] jointgrid = sns.jointplot(x, y, joint_kws=dict(alpha=0.5)) # Adjust xmin, ymin to 0 xmin, xmax, ymin, ymax = jointgrid.ax_joint.axis() jointgrid.ax_joint.set_xlim(0, xmax) jointgrid.ax_joint.set_ylim(0, ymax) .. parsed-literal:: (0, 12.0) .. image:: shalek2013_files/shalek2013_58_1.png By the way, you can do other kinds of plots with ``flotilla``, like a kernel density estimate ("``kde``") plot: .. code:: python study.plot_two_samples('S1', 'S2', kind='kde') .. image:: shalek2013_files/shalek2013_60_0.png Or a binned hexagon plot ("``hexbin"``): .. code:: python study.plot_two_samples('S1', 'S2', kind='hexbin') .. image:: shalek2013_files/shalek2013_62_0.png Any inputs that are valid to ``seaborn``'s ```jointplot`` `_ are valid. Figure 1c ~~~~~~~~~ .. code:: python x = study.expression.data.ix['P1'] y = study.expression.singles.mean() y.name = "Average singles" jointgrid = sns.jointplot(x, y, joint_kws=dict(alpha=0.5)) # Adjust xmin, ymin to 0 xmin, xmax, ymin, ymax = jointgrid.ax_joint.axis() jointgrid.ax_joint.set_xlim(0, xmax) jointgrid.ax_joint.set_ylim(0, ymax) .. parsed-literal:: (0, 12.0) .. image:: shalek2013_files/shalek2013_65_1.png Figure 2 -------- Next, we will attempt to recreate the figures from `Figure 2 `_: .. figure:: http://www.nature.com/nature/journal/v498/n7453/images/nature12172-f2.2.jpg :align: center :alt: Original figure 2 Original figure 2 Figure 2a ~~~~~~~~~ For this figure, we will need the "LPS Response" and "Housekeeping" gene annotations, which weren't very trivial to obtain, so I've moved them to the Appendix. .. code:: python # Get colors for plotting the gene categories dark2 = sns.color_palette('Dark2') singles = study.expression.singles # Get only gene categories for genes in the singles data singles, gene_categories = singles.align(study.expression.feature_data.gene_category, join='left', axis=1) mean = singles.mean() std = singles.std() jointgrid = sns.jointplot(mean, std, color='#262626', joint_kws=dict(alpha=0.5)) for i, (category, s) in enumerate(gene_categories.groupby(gene_categories)): jointgrid.ax_joint.plot(mean[s.index], std[s.index], 'o', color=dark2[i], markersize=5) jointgrid.ax_joint.set_xlabel('Standard deviation in single cells $\mu$') jointgrid.ax_joint.set_ylabel('Average expression in single cells $\sigma$') xmin, xmax, ymin, ymax = jointgrid.ax_joint.axis() vmax = max(xmax, ymax) vmin = min(xmin, ymin) jointgrid.ax_joint.plot([0, vmax], [0, vmax], color='steelblue') jointgrid.ax_joint.plot([0, vmax], [0, .25*vmax], color='grey') jointgrid.ax_joint.set_xlim(0, xmax) jointgrid.ax_joint.set_ylim(0, ymax) jointgrid.ax_joint.fill_betweenx((ymin, ymax), 0, np.log(250), alpha=0.5, zorder=-1) .. parsed-literal:: .. image:: shalek2013_files/shalek2013_67_1.png I couldn't find the data for the ``hESC``s for the right-side panel of Fig. 2a, so I couldn't remake the figure. Figure 2b ~~~~~~~~~ In the paper, they use *"522 most highly expressed genes (single-cell average TPM > 250)"*, but I wasn't able to replicate their numbers. If I use the pre-filtered expression data that I fed into flotilla, then I get 297 genes: .. code:: python mean = study.expression.singles.mean() highly_expressed_genes = mean.index[mean > np.log(250 + 1)] len(highly_expressed_genes) .. parsed-literal:: 297 Which is much less. If I use the original, unfiltered data, then I get the *"522"* number, but this seems strange because they did the filtering step of *"discarded genes not appreciably expressed (transcripts per million (TPM) > 1) in at least three individual cells, retaining 6,313 genes for further analysis"*, and yet they went back to the original data to get this new subset. .. code:: python expression.ix[:, expression.ix[singles_ids].mean() > 250].shape .. parsed-literal:: (21, 522) .. code:: python expression_highly_expressed = np.log(expression.ix[singles_ids, expression.ix[singles_ids].mean() > 250] + 1) mean = expression_highly_expressed.mean() std = expression_highly_expressed.std() mean_bins = pd.cut(mean, bins=np.arange(0, 11, 1)) # Coefficient of variation cv = std/mean cv.sort() genes = mean.index # for name, df in shalek2013.expression.singles.groupby(dict(zip(genes, mean_bins)), axis=1): def calculate_cells_per_tpm_per_cv(df, cv): df = df[df > 1] df_aligned, cv_aligned = df.align(cv, join='inner', axis=1) cv_aligned.sort() n_cells = pd.Series(0, index=cv.index) n_cells[cv_aligned.index] = df_aligned.ix[:, cv_aligned.index].count() return n_cells grouped = expression_highly_expressed.groupby(dict(zip(genes, mean_bins)), axis=1) cells_per_tpm_per_cv = grouped.apply(calculate_cells_per_tpm_per_cv, cv=cv) Here's how you would make the original figure from the paper: .. code:: python fig, ax = plt.subplots(figsize=(10, 10)) sns.heatmap(cells_per_tpm_per_cv, linewidth=0, ax=ax, yticklabels=False) ax.set_yticks([]) ax.set_xlabel('ln(TPM, binned)') .. parsed-literal:: .. image:: shalek2013_files/shalek2013_74_1.png Doesn't quite look the same. Maybe the y-axis labels were opposite, and higher up on the y-axis was less variant? Because I see a blob of color for (1,2] TPM (by the way, the figure in the paper is not TPM+1 as previous figures were) This is how you would make a modified version of the figure, which also plots the coefficient of variation on a side-plot, which I like because it shows the CV changes directly on the heatmap. Also, technically this is :math:`\ln`\ (TPM+1). .. code:: python from matplotlib import gridspec fig = plt.figure(figsize=(12, 10)) gs = gridspec.GridSpec(1, 2, wspace=0.01, hspace=0.01, width_ratios=[.2, 1]) cv_ax = fig.add_subplot(gs[0, 0]) heatmap_ax = fig.add_subplot(gs[0, 1]) sns.heatmap(cells_per_tpm_per_cv, linewidth=0, ax=heatmap_ax) heatmap_ax.set_yticks([]) heatmap_ax.set_xlabel('$\ln$(TPM+1), binned') y = np.arange(cv.shape[0]) cv_ax.set_xscale('log') cv_ax.plot(cv, y, color='#262626') cv_ax.fill_betweenx(cv, np.zeros(cv.shape), y, color='#262626', alpha=0.5) cv_ax.set_ylim(0, y.max()) cv_ax.set_xlabel('CV = $\mu/\sigma$') cv_ax.set_yticks([]) sns.despine(ax=cv_ax, left=True, right=False) .. image:: shalek2013_files/shalek2013_76_0.png Figure 3 -------- We will attempt to re-create the sub-panel figures from `Figure 3 `_: .. figure:: http://www.nature.com/nature/journal/v498/n7453/images/nature12172-f3.2.jpg :align: center :alt: Original Figure 3 Original Figure 3 Since we can't re-do the microscopy (Figure 3a) or the RNA-FISH counts (Figure 3c), we will make Figures 3b. These histograms are simple to do outside of ``flotilla``, so we do not have them within flotilla. Figure 3b, top panel ~~~~~~~~~~~~~~~~~~~~ .. code:: python fig, ax = plt.subplots() sns.distplot(study.splicing.singles.values.flat, bins=np.arange(0, 1.05, 0.05), ax=ax) ax.set_xlim(0, 1) sns.despine() .. image:: shalek2013_files/shalek2013_79_0.png Figure 3b, bottom panel ~~~~~~~~~~~~~~~~~~~~~~~ .. code:: python fig, ax = plt.subplots() sns.distplot(study.splicing.pooled.values.flat, bins=np.arange(0, 1.05, 0.05), ax=ax, color='grey') ax.set_xlim(0, 1) sns.despine() .. image:: shalek2013_files/shalek2013_81_0.png Figure 4 -------- We will attempt to re-create the sub-panel figures from `Figure 4 `_: .. figure:: http://www.nature.com/nature/journal/v498/n7453/images/nature12172-f4.2.jpg :align: center :alt: Original Figure 4 Original Figure 4 Figure 4a ~~~~~~~~~ Here, we can use the "``interactive_pca``" function we have to explore different dimensionality reductions on the data. .. code:: python study.interactive_pca() .. parsed-literal:: savefile : data/last.pca.pdf y_pc : 2 data_type : expression featurewise : False show_point_labels : False sample_subset : pooled feature_subset : variant plot_violins : True x_pc : 1 list_link : 1100001G20RIK (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] 1110002B05RIK (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] 1600029D21RIK (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] 2310016C08RIK (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] ACSL1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] AK041746 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] ALDOC (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] AMICA1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] APOOL (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] ARG2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] ASF1B (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] ATP6V0D2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] AW112010 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] BANF1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] C1QB (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CAR4 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CAV1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CCL17 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CCL2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CCL22 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CCL7 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CCR2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CCT3 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CD200 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CD200R1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CD302 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CD69 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CETN2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CHI3L3 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CKS2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CLAST2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] COPS5 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] COX5A (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CXCL1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] CXCL3 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] DHRS1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] DYNLT1B (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] EAR11 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] EMR1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] F10 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] FAM103A1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] FCGR3 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] FLRT3 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] FPR1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] FUCA2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] GLIPR2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] GNGT2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] GPR141 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] H2-AB1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] H2-EB1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] HMGN2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] HP (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] IFI30 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] IFIT1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] IFIT2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] IFITM1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] IFNB1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] IL1R2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] IL23A (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] INSL6 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] IRGM1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] KLK1B1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] LAT2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] LCN2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] LGALS1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] LSM3 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] LY86 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] LY96 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] MAD2L1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] MED21 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] MGST3 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] MMP13 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] MMP8 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] MS4A7 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] N6AMT2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] NAA38 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] NAGK (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] NCF4 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] NDUFA4 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] NUDT9 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] NUPR1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] OASL1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] OAZ1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] PDZD11 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] PFN1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] PMP22 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] PNP2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] PPP1R15A (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] PTGS2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] PTX3 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] QPCT (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] RAB9 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] RFC2 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] ROGDI (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] S100A4 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] SC4MOL (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] SELL (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] SEPP1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] SERPINB6A (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] SERPINB9 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] SLFN1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] SLPI (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] STMN1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] TARM1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] THBS1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] TM7SF4 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] TMEM176A (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] TMEM176B (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] TMEM39A (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] TNFRSF9 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] TNFSF4 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] TRIM13 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] TUBA1A (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] TUBA1B (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] TUBB2C (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] UPP1 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] VAMP4 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] VPS25 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] WDR61 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] ZFP263 (18,) verified_color [u'#1b9e77'] verified_order ['BDMC'] ['BDMC'] .. parsed-literal:: .. image:: shalek2013_files/shalek2013_84_2.png .. image:: shalek2013_files/shalek2013_84_3.png A "sequences shortened" version of this is available as a gif: .. figure:: http://i.imgur.com/fJKPQ7W.gif :align: center :alt: Imgur Imgur Equivalently, I could have written out the plotting command by hand, instead of using ``study.interactive_pca``: .. code:: python study.plot_pca(feature_subset='gene_category: LPS Response', sample_subset='~pooled', plot_violins=False) .. parsed-literal:: .. image:: shalek2013_files/shalek2013_87_1.png Without ``flotilla``, ``plot_pca`` is quite a bit of code: .. code:: python import sys from collections import defaultdict from itertools import cycle import math from sklearn import decomposition from sklearn.preprocessing import StandardScaler import pandas as pd from matplotlib.gridspec import GridSpec, GridSpecFromSubplotSpec import matplotlib.pyplot as plt import numpy as np import pandas as pd import seaborn as sns from flotilla.visualize.color import dark2 from flotilla.visualize.generic import violinplot class DataFrameReducerBase(object): """ Just like scikit-learn's reducers, but with prettied up DataFrames. """ def __init__(self, df, n_components=None, **decomposer_kwargs): # This magically initializes the reducer like DataFramePCA or DataFrameNMF if df.shape[1] <= 3: raise ValueError( "Too few features (n={}) to reduce".format(df.shape[1])) super(DataFrameReducerBase, self).__init__(n_components=n_components, **decomposer_kwargs) self.reduced_space = self.fit_transform(df) def relabel_pcs(self, x): return "pc_" + str(int(x) + 1) def fit(self, X): try: assert type(X) == pd.DataFrame except AssertionError: sys.stdout.write("Try again as a pandas DataFrame") raise ValueError('Input X was not a pandas DataFrame, ' 'was of type {} instead'.format(str(type(X)))) self.X = X super(DataFrameReducerBase, self).fit(X) self.components_ = pd.DataFrame(self.components_, columns=self.X.columns).rename_axis( self.relabel_pcs, 0) try: self.explained_variance_ = pd.Series( self.explained_variance_).rename_axis(self.relabel_pcs, 0) self.explained_variance_ratio_ = pd.Series( self.explained_variance_ratio_).rename_axis(self.relabel_pcs, 0) except AttributeError: pass return self def transform(self, X): component_space = super(DataFrameReducerBase, self).transform(X) if type(self.X) == pd.DataFrame: component_space = pd.DataFrame(component_space, index=X.index).rename_axis( self.relabel_pcs, 1) return component_space def fit_transform(self, X): try: assert type(X) == pd.DataFrame except: sys.stdout.write("Try again as a pandas DataFrame") raise ValueError('Input X was not a pandas DataFrame, ' 'was of type {} instead'.format(str(type(X)))) self.fit(X) return self.transform(X) class DataFramePCA(DataFrameReducerBase, decomposition.PCA): pass class DataFrameNMF(DataFrameReducerBase, decomposition.NMF): def fit(self, X): """ duplicated fit code for DataFrameNMF because sklearn's NMF cheats for efficiency and calls fit_transform. MRO resolves the closest (in this package) _single_fit_transform first and so there's a recursion error: def fit(self, X, y=None, **params): self._single_fit_transform(X, **params) return self """ try: assert type(X) == pd.DataFrame except: sys.stdout.write("Try again as a pandas DataFrame") raise ValueError('Input X was not a pandas DataFrame, ' 'was of type {} instead'.format(str(type(X)))) self.X = X # notice this is fit_transform, not fit super(decomposition.NMF, self).fit_transform(X) self.components_ = pd.DataFrame(self.components_, columns=self.X.columns).rename_axis( self.relabel_pcs, 0) return self class DataFrameICA(DataFrameReducerBase, decomposition.FastICA): pass class DecompositionViz(object): """ Plots the reduced space from a decomposed dataset. Does not perform any reductions of its own """ def __init__(self, reduced_space, components_, explained_variance_ratio_, feature_renamer=None, groupby=None, singles=None, pooled=None, outliers=None, featurewise=False, order=None, violinplot_kws=None, data_type='expression', label_to_color=None, label_to_marker=None, scale_by_variance=True, x_pc='pc_1', y_pc='pc_2', n_vectors=20, distance='L1', n_top_pc_features=50, max_char_width=30): """Plot the results of a decomposition visualization Parameters ---------- reduced_space : pandas.DataFrame A (n_samples, n_dimensions) DataFrame of the post-dimensionality reduction data components_ : pandas.DataFrame A (n_features, n_dimensions) DataFrame of how much each feature contributes to the components (trailing underscore to be consistent with scikit-learn) explained_variance_ratio_ : pandas.Series A (n_dimensions,) Series of how much variance each component explains. (trailing underscore to be consistent with scikit-learn) feature_renamer : function, optional A function which takes the name of the feature and renames it, e.g. from an ENSEMBL ID to a HUGO known gene symbol. If not provided, the original name is used. groupby : mapping function | dict, optional A mapping of the samples to a label, e.g. sample IDs to phenotype, for the violinplots. If None, all samples are treated the same and are colored the same. singles : pandas.DataFrame, optional For violinplots only. If provided and 'plot_violins' is True, will plot the raw (not reduced) measurement values as violin plots. pooled : pandas.DataFrame, optional For violinplots only. If provided, pooled samples are plotted as black dots within their label. outliers : pandas.DataFrame, optional For violinplots only. If provided, outlier samples are plotted as a grey shadow within their label. featurewise : bool, optional If True, then the "samples" are features, e.g. genes instead of samples, and the "features" are the samples, e.g. the cells instead of the gene ids. Essentially, the transpose of the original matrix. If True, then violins aren't plotted. (default False) order : list-like The order of the labels for the violinplots, e.g. if the data is from a differentiation timecourse, then this would be the labels of the phenotypes, in the differentiation order. violinplot_kws : dict Any additional parameters to violinplot data_type : 'expression' | 'splicing', optional For violinplots only. The kind of data that was originally used for the reduction. (default 'expression') label_to_color : dict, optional A mapping of the label, e.g. the phenotype, to the desired plotting color (default None, auto-assigned with the groupby) label_to_marker : dict, optional A mapping of the label, e.g. the phenotype, to the desired plotting symbol (default None, auto-assigned with the groupby) scale_by_variance : bool, optional If True, scale the x- and y-axes by their explained_variance_ratio_ (default True) {x,y}_pc : str, optional Principal component to plot on the x- and y-axis. (default "pc_1" and "pc_2") n_vectors : int, optional Number of vectors to plot of the principal components. (default 20) distance : 'L1' | 'L2', optional The distance metric to use to plot the vector lengths. L1 is "Cityblock", i.e. the sum of the x and y coordinates, and L2 is the traditional Euclidean distance. (default "L1") n_top_pc_features : int, optional THe number of top features from the principal components to plot. (default 50) max_char_width : int, optional Maximum character width of a feature name. Useful for crazy long feature IDs like MISO IDs """ self.reduced_space = reduced_space self.components_ = components_ self.explained_variance_ratio_ = explained_variance_ratio_ self.singles = singles self.pooled = pooled self.outliers = outliers self.groupby = groupby self.order = order self.violinplot_kws = violinplot_kws if violinplot_kws is not None \ else {} self.data_type = data_type self.label_to_color = label_to_color self.label_to_marker = label_to_marker self.n_vectors = n_vectors self.x_pc = x_pc self.y_pc = y_pc self.pcs = (self.x_pc, self.y_pc) self.distance = distance self.n_top_pc_features = n_top_pc_features self.featurewise = featurewise self.feature_renamer = feature_renamer self.max_char_width = max_char_width if self.label_to_color is None: colors = cycle(dark2) def color_factory(): return colors.next() self.label_to_color = defaultdict(color_factory) if self.label_to_marker is None: markers = cycle(['o', '^', 's', 'v', '*', 'D', 'h']) def marker_factory(): return markers.next() self.label_to_marker = defaultdict(marker_factory) if self.groupby is None: self.groupby = dict.fromkeys(self.reduced_space.index, 'all') self.grouped = self.reduced_space.groupby(self.groupby, axis=0) if order is not None: self.color_ordered = [self.label_to_color[x] for x in self.order] else: self.color_ordered = [self.label_to_color[x] for x in self.grouped.groups] self.loadings = self.components_.ix[[self.x_pc, self.y_pc]] # Get the explained variance if explained_variance_ratio_ is not None: self.vars = explained_variance_ratio_[[self.x_pc, self.y_pc]] else: self.vars = pd.Series([1., 1.], index=[self.x_pc, self.y_pc]) if scale_by_variance: self.loadings = self.loadings.multiply(self.vars, axis=0) # sort features by magnitude/contribution to transformation reduced_space = self.reduced_space[[self.x_pc, self.y_pc]] farthest_sample = reduced_space.apply(np.linalg.norm, axis=0).max() whole_space = self.loadings.apply(np.linalg.norm).max() scale = .25 * farthest_sample / whole_space self.loadings *= scale ord = 2 if self.distance == 'L2' else 1 self.magnitudes = self.loadings.apply(np.linalg.norm, ord=ord) self.magnitudes.sort(ascending=False) self.top_features = set([]) self.pc_loadings_labels = {} self.pc_loadings = {} for pc in self.pcs: x = self.components_.ix[pc].copy() x.sort(ascending=True) half_features = int(self.n_top_pc_features / 2) if len(x) > self.n_top_pc_features: a = x[:half_features] b = x[-half_features:] labels = np.r_[a.index, b.index] self.pc_loadings[pc] = np.r_[a, b] else: labels = x.index self.pc_loadings[pc] = x self.pc_loadings_labels[pc] = labels self.top_features.update(labels) def __call__(self, ax=None, title='', plot_violins=True, show_point_labels=False, show_vectors=True, show_vector_labels=True, markersize=10, legend=True): gs_x = 14 gs_y = 12 if ax is None: self.reduced_fig, ax = plt.subplots(1, 1, figsize=(20, 10)) gs = GridSpec(gs_x, gs_y) else: gs = GridSpecFromSubplotSpec(gs_x, gs_y, ax.get_subplotspec()) self.reduced_fig = plt.gcf() ax_components = plt.subplot(gs[:, :5]) ax_loading1 = plt.subplot(gs[:, 6:8]) ax_loading2 = plt.subplot(gs[:, 10:14]) self.plot_samples(show_point_labels=show_point_labels, title=title, show_vectors=show_vectors, show_vector_labels=show_vector_labels, markersize=markersize, legend=legend, ax=ax_components) self.plot_loadings(pc=self.x_pc, ax=ax_loading1) self.plot_loadings(pc=self.y_pc, ax=ax_loading2) sns.despine() self.reduced_fig.tight_layout() if plot_violins and not self.featurewise and self.singles is not None: self.plot_violins() return self def shorten(self, x): if len(x) > self.max_char_width: return '{}...'.format(x[:self.max_char_width]) else: return x def plot_samples(self, show_point_labels=True, title='DataFramePCA', show_vectors=True, show_vector_labels=True, markersize=10, three_d=False, legend=True, ax=None): """ Given a pandas dataframe, performs DataFramePCA and plots the results in a convenient single function. Parameters ---------- groupby : groupby How to group the samples by color/label label_to_color : dict Group labels to a matplotlib color E.g. if you've already chosen specific colors to indicate a particular group. Otherwise will auto-assign colors label_to_marker : dict Group labels to matplotlib marker title : str title of the plot show_vectors : bool Whether or not to draw the vectors indicating the supporting principal components show_vector_labels : bool whether or not to draw the names of the vectors show_point_labels : bool Whether or not to label the scatter points markersize : int size of the scatter markers on the plot text_group : list of str Group names that you want labeled with text three_d : bool if you want hte plot in 3d (need to set up the axes beforehand) Returns ------- For each vector in data: x, y, marker, distance """ if ax is None: ax = plt.gca() # Plot the samples for name, df in self.grouped: color = self.label_to_color[name] marker = self.label_to_marker[name] x = df[self.x_pc] y = df[self.y_pc] ax.plot(x, y, color=color, marker=marker, linestyle='None', label=name, markersize=markersize, alpha=0.75, markeredgewidth=.1) try: if not self.pooled.empty: pooled_ids = x.index.intersection(self.pooled.index) pooled_x, pooled_y = x[pooled_ids], y[pooled_ids] ax.plot(pooled_x, pooled_y, 'o', color=color, marker=marker, markeredgecolor='k', markeredgewidth=2, label='{} pooled'.format(name), markersize=markersize, alpha=0.75) except AttributeError: pass try: if not self.outliers.empty: outlier_ids = x.index.intersection(self.outliers.index) outlier_x, outlier_y = x[outlier_ids], y[outlier_ids] ax.plot(outlier_x, outlier_y, 'o', color=color, marker=marker, markeredgecolor='lightgrey', markeredgewidth=5, label='{} outlier'.format(name), markersize=markersize, alpha=0.75) except AttributeError: pass if show_point_labels: for args in zip(x, y, df.index): ax.text(*args) # Plot vectors, if asked if show_vectors: for vector_label in self.magnitudes[:self.n_vectors].index: x, y = self.loadings[vector_label] ax.plot([0, x], [0, y], color='k', linewidth=1) if show_vector_labels: x_offset = math.copysign(5, x) y_offset = math.copysign(5, y) horizontalalignment = 'left' if x > 0 else 'right' if self.feature_renamer is not None: renamed = self.feature_renamer(vector_label) else: renamed = vector_label ax.annotate(renamed, (x, y), textcoords='offset points', xytext=(x_offset, y_offset), horizontalalignment=horizontalalignment) # Label x and y axes ax.set_xlabel( 'Principal Component {} (Explains {:.2f}% Of Variance)'.format( str(self.x_pc), 100 * self.vars[self.x_pc])) ax.set_ylabel( 'Principal Component {} (Explains {:.2f}% Of Variance)'.format( str(self.y_pc), 100 * self.vars[self.y_pc])) ax.set_title(title) if legend: ax.legend() sns.despine() def plot_loadings(self, pc='pc_1', n_features=50, ax=None): loadings = self.pc_loadings[pc] labels = self.pc_loadings_labels[pc] if ax is None: ax = plt.gca() ax.plot(loadings, np.arange(loadings.shape[0]), 'o') ax.set_yticks(np.arange(max(loadings.shape[0], n_features))) ax.set_title("Component " + pc) x_offset = max(loadings) * .05 ax.set_xlim(left=loadings.min() - x_offset, right=loadings.max() + x_offset) if self.feature_renamer is not None: labels = map(self.feature_renamer, labels) else: labels = labels labels = map(self.shorten, labels) # ax.set_yticklabels(map(shorten, labels)) ax.set_yticklabels(labels) for lab in ax.get_xticklabels(): lab.set_rotation(90) sns.despine(ax=ax) def plot_explained_variance(self, title="PCA explained variance"): """If the reducer is a form of PCA, then plot the explained variance ratio by the components. """ # Plot the explained variance ratio assert self.explained_variance_ratio_ is not None import matplotlib.pyplot as plt import seaborn as sns fig, ax = plt.subplots() ax.plot(self.explained_variance_ratio_, 'o-') xticks = np.arange(len(self.explained_variance_ratio_)) ax.set_xticks(xticks) ax.set_xticklabels(xticks + 1) ax.set_xlabel('Principal component') ax.set_ylabel('Fraction explained variance') ax.set_title(title) sns.despine() def plot_violins(self): """Make violinplots of each feature Must be called after plot_samples because it depends on the existence of the "self.magnitudes" attribute. """ ncols = 4 nrows = 1 vector_labels = list(set(self.magnitudes[:self.n_vectors].index.union( pd.Index(self.top_features)))) while ncols * nrows < len(vector_labels): nrows += 1 self.violins_fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(4 * ncols, 4 * nrows)) if self.feature_renamer is not None: renamed_vectors = map(self.feature_renamer, vector_labels) else: renamed_vectors = vector_labels labels = [(y, x) for (y, x) in sorted(zip(renamed_vectors, vector_labels))] for (renamed, feature_id), ax in zip(labels, axes.flat): singles = self.singles[feature_id] if self.singles is not None \ else None pooled = self.pooled[feature_id] if self.pooled is not None else \ None outliers = self.outliers[feature_id] if self.outliers is not None \ else None title = '{}\n{}'.format(feature_id, renamed) violinplot(singles, pooled_data=pooled, outliers=outliers, groupby=self.groupby, color_ordered=self.color_ordered, order=self.order, title=title, ax=ax, data_type=self.data_type, **self.violinplot_kws) # Clear any unused axes for ax in axes.flat: # Check if the plotting space is empty if len(ax.collections) == 0 or len(ax.lines) == 0: ax.axis('off') self.violins_fig.tight_layout() # Notice we're using the original data, nothing from "study" lps_response_genes = expression_feature_data.index[expression_feature_data.gene_category == 'LPS Response'] subset = expression_filtered.ix[singles_ids, lps_response_genes].dropna(how='all', axis=1) subset_standardized = pd.DataFrame(StandardScaler().fit_transform(subset), index=subset.index, columns=subset.columns) pca = DataFramePCA(subset_standardized) visualizer = DecompositionViz(pca.reduced_space, pca.components_, pca.explained_variance_ratio_) visualizer() .. parsed-literal:: <__main__.DecompositionViz at 0x1137f8450> .. image:: shalek2013_files/shalek2013_89_1.png Figure 4b ~~~~~~~~~ .. code:: python lps_response_genes = study.expression.feature_subsets['gene_category: LPS Response'] lps_response = study.expression.singles.ix[:, lps_response_genes].dropna(how='all', axis=1) lps_response.head() .. raw:: html
GENE 1110018G07RIK 1110038F14RIK 1200009I06RIK 1600014C10RIK 1810029B16RIK 2210009G21RIK 2810474O19RIK 3110001I22RIK 4921513D23RIK 4930523C07RIK ... ZC3H12C ZC3HAV1 ZCCHC2 ZCCHC6 ZDHHC21 ZFP36 ZFP800 ZHX2 ZNFX1 ZUFSP
S1 3.711442 0.000000 3.275468 0.000000 5.609305 0 0.000000 3.828860 1.314573 3.778275 ... 3.972904 3.509979 0.035344 3.042277 4.425735 4.092559 4.025124 0.779382 2.998800 0.000000
S2 4.361671 0.147643 0.000000 0.000000 5.478071 0 3.407342 0.000000 1.531443 0.000000 ... 4.794306 4.984262 2.251330 1.018315 4.955713 0.356008 4.297776 0.032569 3.091207 5.000843
S3 0.000000 3.737014 2.987093 0.063526 5.320993 0 3.372359 0.058163 1.105115 0.025043 ... 4.882749 0.807258 0.094925 0.126673 3.952273 1.956983 0.000000 0.000000 3.794063 2.928699
S4 2.719587 0.000000 0.045823 0.000000 0.488049 0 5.127847 0.000000 2.303969 0.000000 ... 4.833354 4.538699 0.137427 2.025546 4.193989 2.372572 0.121924 0.000000 0.230278 0.430168
S5 2.982073 0.000000 2.829152 0.000000 5.093188 0 0.065122 4.635671 1.015640 0.461296 ... 4.446634 0.157178 0.616401 0.000000 4.039816 0.000000 4.714087 1.565475 0.860254 4.866979

5 rows × 630 columns

.. code:: python lps_response_corr = lps_response.corr() "Elbow method" for determining number of clusters ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The authors state that they used the "Elbow method" to determine the `number of cluster centers `_. Essentially, you try a bunch of different :math:`k`\ , and see where there is a flattening out of the metric, like an elbow. There's a few different variations on which metric to use, such as using the average distance to the cluster center, or the explained variance. Let's try the distance to cluster center first, because ``scikit-learn`` makes it easy. .. code:: python from sklearn.cluster import KMeans ##### cluster data into K=1..10 clusters ##### ks = np.arange(1, 11).astype(int) X = lps_response_corr.values kmeans = [KMeans(n_clusters=k).fit(X) for k in ks] # Scikit-learn makes this easy by computing the distance to the nearest center dist_to_center = [km.inertia_ for km in kmeans] fig, ax = plt.subplots() ax.plot(ks, dist_to_center, 'o-') ax.set_ylabel('Sum of distance to nearest cluster center') sns.despine() .. image:: shalek2013_files/shalek2013_95_0.png Not quite sure where the elbow is here. looks like there's a big drop off after :math:`k=1`\ , but that could just be an illusion. Since they didn't specify which version of the elbow method they used, I'm not going to investigate this further, and just see if we can see what they see with the :math:`k=5` clusters that they found was optimal. .. code:: python kmeans = KMeans(n_clusters=5) lps_response_corr_clusters = kmeans.fit_predict(lps_response_corr.values) lps_response_corr_clusters .. parsed-literal:: array([1, 1, 2, 4, 0, 1, 3, 4, 2, 4, 0, 3, 2, 2, 3, 3, 0, 1, 0, 3, 0, 1, 2, 0, 0, 3, 3, 2, 4, 4, 0, 4, 4, 0, 4, 0, 3, 4, 2, 1, 4, 4, 4, 3, 1, 4, 0, 0, 4, 1, 3, 3, 4, 0, 0, 2, 0, 0, 0, 4, 1, 3, 4, 3, 3, 4, 2, 2, 4, 3, 0, 4, 0, 3, 4, 2, 2, 4, 2, 3, 3, 3, 1, 1, 4, 1, 2, 2, 2, 1, 1, 3, 1, 1, 4, 3, 3, 3, 3, 1, 1, 4, 0, 2, 0, 0, 2, 3, 4, 4, 2, 0, 0, 3, 2, 4, 0, 2, 4, 3, 4, 2, 2, 4, 1, 4, 3, 0, 1, 2, 3, 3, 4, 4, 0, 0, 4, 3, 2, 1, 0, 4, 2, 0, 4, 2, 4, 0, 1, 0, 0, 3, 3, 3, 3, 1, 4, 3, 0, 2, 2, 3, 4, 1, 3, 4, 2, 2, 2, 3, 3, 3, 4, 0, 1, 3, 1, 0, 2, 1, 1, 0, 2, 4, 0, 1, 3, 2, 1, 3, 0, 1, 1, 2, 4, 3, 1, 0, 0, 0, 3, 3, 2, 1, 3, 1, 1, 4, 4, 3, 2, 3, 3, 1, 4, 3, 4, 3, 0, 1, 3, 3, 3, 3, 3, 1, 4, 1, 0, 3, 3, 2, 4, 3, 2, 0, 0, 3, 1, 1, 4, 3, 2, 4, 4, 3, 1, 1, 1, 0, 4, 1, 1, 0, 0, 4, 0, 0, 0, 3, 4, 3, 4, 3, 3, 3, 3, 0, 3, 4, 1, 2, 4, 1, 2, 2, 0, 0, 0, 4, 0, 2, 4, 0, 2, 4, 0, 4, 0, 3, 1, 3, 2, 3, 0, 3, 3, 3, 2, 1, 2, 2, 2, 2, 2, 4, 3, 2, 4, 3, 2, 2, 3, 1, 4, 1, 0, 0, 0, 0, 1, 4, 1, 4, 4, 3, 1, 1, 1, 1, 1, 2, 1, 2, 0, 4, 3, 4, 0, 3, 3, 3, 0, 3, 2, 2, 3, 0, 0, 2, 4, 4, 0, 1, 1, 3, 4, 2, 0, 3, 3, 0, 1, 0, 0, 4, 3, 2, 3, 3, 0, 2, 3, 3, 1, 1, 1, 3, 4, 2, 2, 2, 2, 3, 3, 2, 0, 1, 1, 1, 3, 3, 3, 4, 4, 0, 4, 3, 3, 1, 0, 0, 3, 3, 0, 3, 3, 0, 1, 4, 4, 4, 3, 4, 3, 1, 3, 1, 2, 2, 1, 4, 0, 1, 0, 3, 2, 0, 1, 1, 2, 4, 1, 0, 3, 0, 3, 3, 1, 3, 1, 4, 3, 1, 1, 2, 4, 1, 4, 2, 3, 0, 1, 4, 3, 2, 3, 3, 1, 1, 2, 3, 1, 2, 1, 0, 0, 4, 3, 3, 1, 3, 4, 0, 3, 0, 4, 0, 4, 1, 4, 0, 1, 3, 0, 1, 0, 4, 3, 2, 2, 3, 3, 1, 0, 2, 4, 1, 1, 4, 0, 2, 2, 3, 2, 4, 1, 0, 3, 4, 2, 1, 1, 3, 3, 0, 0, 0, 3, 3, 1, 0, 3, 2, 3, 3, 0, 2, 0, 1, 3, 0, 3, 4, 4, 1, 2, 4, 2, 3, 3, 3, 2, 3, 2, 4, 4, 1, 1, 4, 3, 2, 2, 4, 0, 1, 2, 0, 3, 0, 4, 0, 2, 1, 1, 3, 0, 2, 1, 3, 3, 1, 2, 0, 0, 1, 2, 0, 0, 4, 3, 4, 0, 1, 4, 1, 3, 0, 2, 2, 2, 0, 3, 1, 3, 1, 4, 3, 4, 2, 2, 3, 2, 0, 0, 4, 2, 1, 1, 1, 3, 3, 1, 2, 2, 0, 1, 2, 3, 0, 4, 1, 3, 2, 2, 2, 1, 2, 1, 4], dtype=int32) Now let's create a dataframe with these genes in their cluster orders. .. code:: python gene_to_cluster = dict(zip(lps_response_corr.columns, lps_response_corr_clusters)) dfs = [] for name, df in lps_response_corr.groupby(gene_to_cluster): dfs.append(df) lps_response_corr_ordered_by_clusters = pd.concat(dfs) # Make symmetric, since we created this dataframe by smashing rows on top of each other, we need to reorder the columns lps_response_corr_ordered_by_clusters = lps_response_corr_ordered_by_clusters.ix[:, lps_response_corr_ordered_by_clusters.index] lps_response_corr_ordered_by_clusters.head() .. raw:: html
GENE 1810029B16RIK 6330409N04RIK A130040M12RIK A630001G21RIK AA467197 ACPP ACSL1 AI607873 AK035387 AK042010 ... TAPBPL TIMP1 TNFSF4 TNFSF9 TOR1AIP1 TRIM34 TTC39C USP12 ZC3H12C ZUFSP
GENE
1810029B16RIK 1.000000 -0.394008 0.178867 0.271292 0.150375 0.106683 0.182380 0.076501 0.032006 0.335728 ... -0.048557 0.029439 -0.194118 0.098728 0.228197 0.214306 0.204341 -0.300112 0.052816 -0.305531
6330409N04RIK -0.394008 1.000000 0.212542 -0.293282 0.119868 0.468823 0.298642 0.320296 0.158348 0.273508 ... -0.125377 0.027740 0.190240 -0.422914 -0.400675 -0.197247 -0.023442 0.088776 -0.006295 -0.412691
A130040M12RIK 0.178867 0.212542 1.000000 0.135900 0.262295 0.591539 0.101211 0.198424 -0.061268 0.261002 ... 0.131664 0.268447 0.004338 -0.043116 -0.052081 -0.336334 0.322725 0.013706 0.254403 0.048982
A630001G21RIK 0.271292 -0.293282 0.135900 1.000000 0.420200 0.317858 -0.312477 0.030636 0.260572 0.030842 ... 0.121712 -0.313992 -0.128168 0.059445 -0.177031 -0.332522 0.027750 -0.125732 0.071474 0.070166
AA467197 0.150375 0.119868 0.262295 0.420200 1.000000 0.277036 0.020186 0.747215 0.404682 0.342458 ... -0.023484 0.198358 0.452022 -0.467622 -0.325064 -0.029880 0.036769 0.244542 -0.093778 0.108472

5 rows × 630 columns

The next step is to get the principal-component reduced data, using only the LPS response genes. We can do this in ``flotilla`` using ``study.expression.reduce``. .. code:: python reduced = study.expression.reduce(singles_ids, feature_ids=lps_response_genes) We can get the principal components using ``reduced.components_`` (similar interface as ``scikit-learn``). .. code:: python reduced.components_.head() .. raw:: html
MOV10 PPAP2B LASS6 TMCO3 CPD AK138792 TARM1 P4HA1 CD180 SMG7 ... OAS1B OAS1G AK151815 GTPBP2 PRPF38A SLC7A11 PCDH7 GNA13 PTPRJ ATF3
pc_1 0.035299 0.038725 -0.006343 0.014219 0.033734 -0.079831 0.032886 0.034783 0.033719 -0.048453 ... -0.022490 0.031091 -0.021397 0.034917 0.001745 0.058000 0.007748 0.000767 0.016012 0.018020
pc_2 0.055310 0.002925 -0.043986 -0.024020 -0.061957 -0.016327 0.002882 -0.003178 0.050055 0.038601 ... 0.012240 0.052127 0.009120 0.077015 0.072064 -0.080902 -0.056607 0.068444 -0.072533 0.068088
pc_3 0.000374 0.099514 -0.039636 0.003997 -0.000575 -0.042212 -0.056827 0.015571 -0.039811 0.005398 ... -0.010524 -0.009277 -0.102462 -0.043913 -0.052513 -0.030622 0.022607 -0.002503 0.023997 -0.054205
pc_4 0.022491 0.002342 0.009422 -0.034725 0.025866 -0.009656 -0.027689 -0.089803 -0.046888 0.002274 ... -0.003404 -0.070307 -0.007025 0.003407 -0.048078 0.028099 0.032970 -0.066284 0.010371 -0.006108
pc_5 -0.025743 -0.009200 -0.030187 -0.061283 0.010464 0.032668 0.012223 -0.047623 -0.047351 0.045909 ... -0.074817 0.044218 -0.000884 -0.000597 -0.033893 -0.018108 -0.012669 -0.025833 -0.044248 -0.001995

5 rows × 630 columns

.. code:: python pc_components = reduced.components_.ix[:2, lps_response_corr_ordered_by_clusters.index].T pc_components.head() .. raw:: html
pc_1 pc_2
GENE
1810029B16RIK 0.042928 -0.005064
6330409N04RIK 0.033104 -0.015244
A130040M12RIK 0.043078 0.016664
A630001G21RIK 0.023375 -0.007379
AA467197 0.054588 0.018045
.. code:: python import matplotlib as mpl fig = plt.figure(figsize=(12, 10)) gs = gridspec.GridSpec(2, 2, wspace=0.1, hspace=0.1, width_ratios=[1, .2], height_ratios=[1, .1]) corr_ax = fig.add_subplot(gs[0, 0]) corr_cbar_ax = fig.add_subplot(gs[1, 0]) pc_ax = fig.add_subplot(gs[0, 1:]) pc_cbar_ax = fig.add_subplot(gs[1:, 1:]) sns.heatmap(lps_response_corr_ordered_by_clusters, linewidth=0, ax=corr_ax, cbar_ax=corr_cbar_ax, cbar_kws=dict(orientation='horizontal')) sns.heatmap(pc_components, cmap=mpl.cm.PRGn, linewidth=0, ax=pc_ax, cbar_ax=pc_cbar_ax, cbar_kws=dict(orientation='horizontal')) corr_ax.set_xlabel('') corr_ax.set_ylabel('') corr_ax.set_xticks([]) corr_ax.set_yticks([]) pc_ax.set_yticks([]) pc_ax.set_ylabel('') .. parsed-literal:: .. image:: shalek2013_files/shalek2013_105_1.png This looks pretty similar, maybe just rearranged cluster order. Let's check what their data looks like when you plot this. Their PC scores and clusters for the genes ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code:: python gene_pc_clusters = pd.read_excel('nature12172-s1/Supplementary_Table5.xls', index_col=0) gene_pc_clusters.head() .. raw:: html
Annotation Cluster PC1 Score PC2 Score
Gene
LNPEP NaN 1 0.232368 0.677266
TOR1AIP2 Antiv 1 -0.075934 1.485877
TNFSF4 NaN 1 0.497893 -0.562412
CFB Inflam 1 -0.394318 1.277749
H2-T10 NaN 1 0.514947 -0.698538
.. code:: python data = lps_response_corr.ix[gene_pc_clusters.index, gene_pc_clusters.index].dropna(how='all', axis=0).dropna(how='all', axis=1) fig = plt.figure(figsize=(12, 10)) gs = gridspec.GridSpec(2, 2, wspace=0.1, hspace=0.1, width_ratios=[1, .2], height_ratios=[1, .1]) corr_ax = fig.add_subplot(gs[0, 0]) corr_cbar_ax = fig.add_subplot(gs[1, 0]) pc_ax = fig.add_subplot(gs[0, 1:]) pc_cbar_ax = fig.add_subplot(gs[1:, 1:]) sns.heatmap(data, linewidth=0, square=True, vmin=-1, vmax=1, ax=corr_ax, cbar_ax=corr_cbar_ax, cbar_kws=dict(orientation='horizontal')) sns.heatmap(gene_pc_clusters.ix[:, ['PC1 Score', 'PC2 Score']], linewidth=0, cmap=mpl.cm.PRGn, ax=pc_ax, cbar_ax=pc_cbar_ax, cbar_kws=dict(orientation='horizontal'), xticklabels=False, yticklabels=False) corr_ax.set_xlabel('') corr_ax.set_ylabel('') corr_ax.set_xticks([]) corr_ax.set_yticks([]) pc_ax.set_yticks([]) pc_ax.set_ylabel('') .. parsed-literal:: .. image:: shalek2013_files/shalek2013_109_1.png Sure enough, if I use their annotations, I get exactly that. Though there were two genes in their file that I didn't have in the ``lps_response_corr`` data: .. code:: python gene_pc_clusters.index.difference(lps_response_corr.index) :: --------------------------------------------------------------------------- TypeError Traceback (most recent call last) in () ----> 1 gene_pc_clusters.index.difference(lps_response_corr.index) /usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in difference(self, other) 1325 result_name = self.name if self.name == other.name else None 1326 -> 1327 theDiff = sorted(set(self) - set(other)) 1328 return Index(theDiff, name=result_name) 1329 TypeError: can't compare datetime.datetime to unicode Oh joy, another ``datetime`` error, just like we had with ``expression2``... Looking back at the original Excel file, there is one gene that Excel mangled to be a date: .. figure:: https://raw.githubusercontent.com/olgabot/olgabot.github.io-source/master/content/images/shalek2013_supplementary_table_5_datetime_error.png :align: center :alt: Please, can we start using just plain ole ``.csv``s for supplementary data! Excel does NOT preserve strings if they start with numbers, and instead thinks they are dates. .. code:: python import collections collections.Counter(gene_pc_clusters.index.map(type)) .. parsed-literal:: Counter({: 631, : 1}) Yep, it's just that one that got mangled.... oh well. .. code:: python gene_pc_clusters_genes = set(filter(lambda x: isinstance(x, unicode), gene_pc_clusters.index)) gene_pc_clusters_genes.difference(lps_response_corr.index) .. parsed-literal:: {u'RPS6KA2'} So, "``RPS6KA2``" is the only gene that was in their list of genes and not in mine. Supplementary figures --------------------- Now we get to have even more fun by plotting the Supplementary figures! :D Ironically, the supplementary figures are usually way easier to access (like not behind a paywall), and yet they're usually the documents that really have the crucial information about the experiments. Supplementary Figure 1 ~~~~~~~~~~~~~~~~~~~~~~ .. figure:: https://raw.githubusercontent.com/olgabot/olgabot.github.io-source/master/content/images/shalek2013_sfig1.png :align: center :alt: Supplementary figure 1, a correlation plot Supplementary figure 1, a correlation plot .. code:: python singles_mean = study.expression.singles.mean() singles_mean.name = 'Single cell average' # Need to convert "average_singles" to a DataFrame instead of a single-row Series singles_mean = pd.DataFrame(singles_mean) singles_mean.head() .. raw:: html
Single cell average
GENE
NPL 1.075740
QK 2.019888
AK163153 1.429369
PARK2 0.596479
AGPAT4 2.021294
.. code:: python data_for_correlations = pd.concat([study.expression.singles, singles_mean.T, study.expression.pooled]) # Take the transpose of the data, because the plotting algorithm calculates correlations between columns, # And we want the correlations between samples, not features data_for_correlations = data_for_correlations.T data_for_correlations.head() # %time sns.corrplot(data_for_correlations) .. raw:: html
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 ... S13 S14 S15 S16 S17 S18 Single cell average P1 P2 P3
GENE
NPL 4.290577 0.000000 4.860293 0.090829 0.000000 0.000000 4.730129 4.657090 0.112641 0.000000 ... 0.110470 0.099121 0.100920 0.206361 0.104884 0.000000 1.075740 2.093019 2.044724 2.742480
QK 5.038477 4.183371 3.847854 0.066797 3.305915 0.114225 3.730270 2.750103 0.134389 0.760353 ... 3.395885 2.294456 0.301120 3.547688 2.185832 0.040923 2.019888 3.869102 3.690982 3.671838
AK163153 1.249363 1.947622 1.082463 1.119633 1.267464 0.901824 1.033401 0.978591 1.220720 1.035237 ... 2.103135 1.110511 1.202271 4.446612 1.367261 0.428320 1.429369 0.605094 0.392494 0.284990
PARK2 0.540694 0.500426 0.604097 0.418703 0.000000 0.601280 0.404931 0.552874 0.343271 0.844120 ... 0.755072 1.109400 0.807534 0.586962 0.485122 0.091469 0.596479 0.815242 0.267032 0.645365
AGPAT4 0.095072 5.868557 4.137252 0.066015 0.000000 4.750107 0.069345 4.130618 3.328758 0.000000 ... 0.000000 4.430612 0.000000 0.000000 4.219120 0.171028 2.021294 2.854144 2.139655 2.806291

5 rows × 22 columns

.. code:: python fig, ax = plt.subplots(figsize=(10, 10)) sns.corrplot(data_for_correlations, ax=ax) sns.despine() .. image:: shalek2013_files/shalek2013_123_0.png Notice that this is mostly red, while in the figure from the paper, it was both blue and red. This is because the colormap started at 0.2 (not negative), and was centered with white at about 0.6. I see that they're trying to emphasize how much more correlated the pooled samples are to each other, but I think a simple sequential map would have been more effective. Supplementary Figures 2 and 3 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Supplementary `Figure 2 `_ and `Figure 3 `_ are from FISH and raw sequence data, and are out of the scope of this computational reproduction. Supplementary Figure 4 ~~~~~~~~~~~~~~~~~~~~~~ `Supplementary Figure 4 `_ was from published data, however the citation in the Supplementary Information (#23) was a `machine-learning book `_, and #23 in the main text citations was a `review of probabilistic graphical models `_, neither of which have the mouse embryonic stem cells or mouse embryonic fibroblasts used in the figure. Supplementary Figure 5 ~~~~~~~~~~~~~~~~~~~~~~ For this figure, we can only plot 5d, since it's derived directly from a table in their dataset. Warning: these data are going to require some serious cleaning. Yay data janitorial duties! .. figure:: https://raw.githubusercontent.com/olgabot/olgabot.github.io-source/master/content/images/shalek2013_sfig5.png :align: center :alt: Supplementary Figure 5d ^^^^^^^^^^^^^^^^^^^^^^^ .. code:: python barcoded = pd.read_excel('nature12172-s1/Supplementary_Table7.xlsx') barcoded.head() .. raw:: html
TPM Unnamed: 1 Unnamed: 2 Unnamed: 3 Unique Barcodes Unnamed: 5 Unnamed: 6
GENE MB_S1 MB_S2 MB_S3 NaN MB_S1 MB_S2 MB_S3
0610007L01RIK 0 0 5.595054 NaN 0 0 0
0610007P14RIK 76.25091 38.77614 0.1823286 NaN 23 8 0
0610007P22RIK 24.26729 50.24694 17.74422 NaN 14 5 6
0610008F07RIK 0 0 0 NaN 0 0 0
The first three columns are TPM calculated from the three samples that have molecular barcodes, and the last three columns are the integer counts of molecular barcodes from the three molecular barcode samples. Let's remove the "Unnamed: 3" column which is all NaNs. We'll do that with the ``.dropna`` method, specifying ``axis=1`` for columns and ``how="all"`` to make sure only columns that have ALL NaNs are removed. .. code:: python barcoded = barcoded.dropna(how='all', axis=1) barcoded.head() .. raw:: html
TPM Unnamed: 1 Unnamed: 2 Unique Barcodes Unnamed: 5 Unnamed: 6
GENE MB_S1 MB_S2 MB_S3 MB_S1 MB_S2 MB_S3
0610007L01RIK 0 0 5.595054 0 0 0
0610007P14RIK 76.25091 38.77614 0.1823286 23 8 0
0610007P22RIK 24.26729 50.24694 17.74422 14 5 6
0610008F07RIK 0 0 0 0 0 0
Next, let's drop that pesky "GENE" row. Don't worry, we'll get the sample ID names back next. .. code:: python barcoded = barcoded.drop('GENE', axis=0) barcoded.head() .. raw:: html
TPM Unnamed: 1 Unnamed: 2 Unique Barcodes Unnamed: 5 Unnamed: 6
0610007L01RIK 0 0 5.595054 0 0 0
0610007P14RIK 76.25091 38.77614 0.1823286 23 8 0
0610007P22RIK 24.26729 50.24694 17.74422 14 5 6
0610008F07RIK 0 0 0 0 0 0
0610009B22RIK 67.12981 115.1393 55.98812 11 18 8
We'll create a ``pandas.MultiIndex`` from the tuples of ``(sample_id, measurement_type)`` pair. .. code:: python columns = pd.MultiIndex.from_tuples([('MB_S1', 'TPM'), ('MB_S2', 'TPM'), ('MB_S3', 'TPM'), ('MB_S1', 'Unique Barcodes'), ('MB_S2', 'Unique Barcodes'), ('MB_S3', 'Unique Barcodes')]) barcoded.columns = columns barcoded = barcoded.sort_index(axis=1) barcoded.head() .. raw:: html
MB_S1 MB_S2 MB_S3
TPM Unique Barcodes TPM Unique Barcodes TPM Unique Barcodes
0610007L01RIK 0 0 0 0 5.595054 0
0610007P14RIK 76.25091 23 38.77614 8 0.1823286 0
0610007P22RIK 24.26729 14 50.24694 5 17.74422 6
0610008F07RIK 0 0 0 0 0 0
0610009B22RIK 67.12981 11 115.1393 18 55.98812 8
For the next move, we're going to do some crazy ``pandas``-fu. First we're going to transpose, then ``reset_index`` of the transpose. Just so you know what this looks like, it's this. .. code:: python barcoded.T.reset_index().head() .. raw:: html
level_0 level_1 0610007L01RIK 0610007P14RIK 0610007P22RIK 0610008F07RIK 0610009B22RIK 0610009D07RIK 0610009O20RIK 0610010B08RIK ... ZWILCH ZWINT ZXDA ZXDB ZXDC ZYG11A ZYG11B ZYX ZZEF1 ZZZ3
0 MB_S1 TPM 0 76.25091 24.26729 0 67.12981 132.2392 17.03907 0.01375923 ... 0 206.8494 0 0 0 0 0.01985733 55.28996 0.09482778 0
1 MB_S1 Unique Barcodes 0 23 14 0 11 29 3 1 ... 0 33 0 0 0 0 0 6 0 0
2 MB_S2 TPM 0 38.77614 50.24694 0 115.1393 49.16287 0 0 ... 0 48.7729 0 0 0 0 7.894789 135.1977 0 4.272594
3 MB_S2 Unique Barcodes 0 8 5 0 18 11 0 0 ... 0 10 0 0 0 0 0 7 0 0
4 MB_S3 TPM 5.595054 0.1823286 17.74422 0 55.98812 203.6302 0 0.4914763 ... 0 54.51386 1.120081 0 0 0 0.1238624 340.7358 0.6677646 0

5 rows × 27725 columns

Next, we're going to transform the data into a `tidy `_ format, with separate columns for sample ids, measurement types, the gene that was measured, and its measurement value. .. code:: python barcoded_tidy = pd.melt(barcoded.T.reset_index(), id_vars=['level_0', 'level_1']) barcoded_tidy.head() .. raw:: html
level_0 level_1 variable value
0 MB_S1 TPM 0610007L01RIK 0
1 MB_S1 Unique Barcodes 0610007L01RIK 0
2 MB_S2 TPM 0610007L01RIK 0
3 MB_S2 Unique Barcodes 0610007L01RIK 0
4 MB_S3 TPM 0610007L01RIK 5.595054
Now let's rename these columns into something more useful, instead of "level\_0" .. code:: python barcoded_tidy = barcoded_tidy.rename(columns={'level_0': 'sample_id', 'level_1': 'measurement', 'variable': 'gene_name'}) barcoded_tidy.head() .. raw:: html
sample_id measurement gene_name value
0 MB_S1 TPM 0610007L01RIK 0
1 MB_S1 Unique Barcodes 0610007L01RIK 0
2 MB_S2 TPM 0610007L01RIK 0
3 MB_S2 Unique Barcodes 0610007L01RIK 0
4 MB_S3 TPM 0610007L01RIK 5.595054
Next, we're going to take some seemingly-duplicating steps, but trust me, it'll make the data easier. .. code:: python barcoded_tidy['TPM'] = barcoded_tidy.value[barcoded_tidy.measurement == 'TPM'] barcoded_tidy['Unique Barcodes'] = barcoded_tidy.value[barcoded_tidy.measurement == 'Unique Barcodes'] Fill the values of the "**TPM**"'s forwards, since they appear first, and fill the values of the "**Unique Barcodes**" backwards, since they're second .. code:: python barcoded_tidy.TPM = barcoded_tidy.TPM.ffill() barcoded_tidy['Unique Barcodes'] = barcoded_tidy['Unique Barcodes'].bfill() barcoded_tidy.head() .. raw:: html
sample_id measurement gene_name value TPM Unique Barcodes
0 MB_S1 TPM 0610007L01RIK 0 0.000000 0
1 MB_S1 Unique Barcodes 0610007L01RIK 0 0.000000 0
2 MB_S2 TPM 0610007L01RIK 0 0.000000 0
3 MB_S2 Unique Barcodes 0610007L01RIK 0 0.000000 0
4 MB_S3 TPM 0610007L01RIK 5.595054 5.595054 0
Drop the "**measurement**" column and drop duplicate rows. .. code:: python barcoded_tidy = barcoded_tidy.drop('measurement', axis=1) barcoded_tidy = barcoded_tidy.drop_duplicates() barcoded_tidy.head() .. raw:: html
sample_id gene_name value TPM Unique Barcodes
0 MB_S1 0610007L01RIK 0 0.000000 0
2 MB_S2 0610007L01RIK 0 0.000000 0
4 MB_S3 0610007L01RIK 5.595054 5.595054 0
5 MB_S3 0610007L01RIK 0 5.595054 0
6 MB_S1 0610007P14RIK 76.25091 76.250913 23
.. code:: python barcoded_tidy['log TPM'] = np.log(barcoded_tidy.TPM) barcoded_tidy['log Unique Barcodes'] = np.log(barcoded_tidy['Unique Barcodes']) Now we can use the convenient linear model plot (``lmplot``) in ``seaborn`` to plot these three samples together! .. code:: python sns.lmplot('log TPM', 'log Unique Barcodes', barcoded_tidy, col='sample_id') .. parsed-literal:: .. image:: shalek2013_files/shalek2013_153_1.png Supplementary Figures 6-20 ~~~~~~~~~~~~~~~~~~~~~~~~~~ Supplementary Figures `6 `_, `7 `_, `8 `_, `9 `_, `10 `_, `11 `_, `12 `_, `13 `_, `14 `_, `15 `_, `16 `_, `17 `_, `18 `_, `19 `_, and `20 `_, deal with splicing data from the molecular barcodes, RNA-FISH, flow-sorted cells, and single-cell RT-PCR and are out of the scope of this reproduction. Conclusions ----------- While there may be minor, undocumented, differences between the methods presented in the manuscript and the figures, the application of ```flotilla`` `_ presents an opportunity to avoid these types of inconsistencies by strictly documenting every change to code and every transformation of the data. The biology the authors found is clearly real, as they did the knockout experiment of *Ifnr-/-* and saw that indeed the maturation process was affected, and *Stat2* and *Irf7* had much lower expression, as with the "maturing" cells in the data.