Create a barebones datapackage

Before we begin, let’s import everything we need.

# Import the flotilla package for biological data analysis
import flotilla

# Import "numerical python" library for number crunching
import numpy as np

# Import "panel data analysis" library for tabular data
import pandas as pd

Shalek and Satija, et al. (2013)

In the 2013 paper, Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells (Shalek and Satija, et al. Nature (2013)), Regev and colleagues performed single-cell RNA sequencing on 18 bone marrow-derived dendritic cells (BMDCs), in addition to 3 pooled samples.

Expression data

First, we will read in the expression data. These data were obtained using:

wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE41nnn/GSE41265/suppl/GSE41265_allGenesTPM.txt.gz

We will also compare to the Supplementary Table 2 data, obtained using:

wget http://www.nature.com/nature/journal/v498/n7453/extref/nature12172-s1.zip
unzip nature12172-s1.zip
expression = pd.read_table("GSE41265_allGenesTPM.txt.gz", compression="gzip", index_col=0)
expression.head()

These data are in “transcripts per million” (TPM) units. See this blog post if that unit sounds unfamiliar to you.
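As a quick refresher, TPM normalizes raw read counts first by transcript length and then by sequencing depth, so the values for each sample sum to one million. Here is a minimal sketch of that calculation, using made-up counts and gene lengths (not part of this dataset):

# Hypothetical inputs: raw read counts and gene lengths in kilobases
counts = np.array([100., 200., 300.])
gene_lengths_kb = np.array([1.5, 0.5, 2.0])

# Normalize by gene length, then rescale so the values sum to one million
reads_per_kb = counts / gene_lengths_kb
tpm = 1e6 * reads_per_kb / reads_per_kb.sum()
tpm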

These data are formatted with samples as columns and genes as rows, but we want the opposite: samples as rows and genes as columns. This follows `scikit-learn <http://scikit-learn.org/stable/tutorial/basic/tutorial.html#loading-an-example-dataset>`_’s standard of data matrices of size (n_samples, n_features), since each gene is a feature. So we will simply transpose this.

expression = expression.T
expression.head()
GENE XKR4 AB338584 B3GAT2 NPL T2 T PDE10A 1700010I14RIK 6530411M01RIK PABPC6 ... AK085062 DHX9 RNASET2B FGFR1OP CCR6 BRP44L AK014435 AK015714 SFT2D1 PRR18
S1 0 0 0.000000 72.008590 0.109249 0 0 0 0 0 ... 0 0.774638 23.520936 0.000000 0 460.316773 0 0.000000 39.442566 0
S2 0 0 0.000000 0.000000 0.172009 0 0 0 0 0 ... 0 0.367391 1.887873 0.000000 0 823.890290 0 0.000000 4.967412 0
S3 0 0 0.023441 128.062012 0.000000 0 0 0 0 0 ... 0 0.249858 0.313510 0.166772 0 1002.354241 0 0.000000 0.000000 0
S4 0 0 0.000000 0.095082 0.000000 0 0 0 0 0 ... 0 0.354157 0.000000 0.887003 0 1230.766795 0 0.000000 0.131215 0
S5 0 0 0.000000 0.000000 0.182703 0 0 0 0 0 ... 0 0.039263 0.000000 131.077131 0 1614.749122 0 0.242179 95.485743 0

5 rows × 27723 columns

The authors filtered the expression data, keeping only genes with TPM (transcripts per million) > 1 in at least 3 single cells. We can express this easily using `pandas <http://pandas.pydata.org>`_ DataFrames.

First, from reading the paper and looking at the data, I know there are 18 single cells, and 18 samples whose IDs start with the letter “S.” So I will extract the single-cell samples from the index (row names) using a lambda, a tiny function which, in this case, tells me whether or not a sample ID begins with the letter “S”.

singles_ids = expression.index[expression.index.map(lambda x: x.startswith('S'))]
print('number of single cells:', len(singles_ids))
singles = expression.ix[singles_ids]

expression_filtered = expression.ix[:, singles[singles > 1].count() >= 3]
expression_filtered = np.log(expression_filtered + 1)
expression_filtered.shape
('number of single cells:', 18)
(21, 6312)

Hmm, that’s strange. The paper states that they had 6313 genes after filtering, but I get 6312. Even using “singles >= 1” doesn’t help.

(I also tried this with the expression table provided in the supplementary data as “SupplementaryTable2.xlsx,” and got the same results.)
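If you want to poke at the discrepancy yourself, here is a quick sanity check (my own, not from the paper) that counts the genes passing both the strict and the non-strict versions of the filter:

# Count genes expressed above the threshold in at least 3 single cells,
# with the strict (> 1) and non-strict (>= 1) thresholds
n_strict = (singles[singles > 1].count() >= 3).sum()
n_loose = (singles[singles >= 1].count() >= 3).sum()
print('> 1:', n_strict, '>= 1:', n_loose)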

Now that we’ve taken care of importing and filtering the expression data, let’s create the feature data for the expression data.

Expression feature data

This is similar to the fData from Bioconductor, where there is additional data about your features that you want to look at. The authors provided information about the features in their other expression matrix, available as the supplementary file Supplementary_Table2.xlsx.

Notice that we read a csv and not the xlsx. This is because Excel mangled the gene IDs that start with 201* and assumed they were dates :(

The workaround was to add another column to the sheet with the formula ="'" & A1, press Command-Shift-End to select to the end of the rows, and then Ctrl-D to “fill down” to the bottom (thanks to this stackoverflow post for teaching me how to Excel). Then I saved the file as a csv for maximum portability and compatibility.
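If you would rather skip the Excel round trip, a hypothetical alternative (assuming the gene IDs are stored as text in the original xlsx) is to force pandas to parse the gene ID column as strings:

# Hypothetical alternative to the Excel workaround: parse the gene ID column
# as plain strings so IDs like "2010..." are never interpreted as dates
expression2_from_xlsx = pd.read_excel('nature12172-s1/Supplementary_Table2.xlsx',
                                      converters={0: str})

For this tutorial, though, we will stick with the csv created above: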

expression2 = pd.read_csv('nature12172-s1/Supplementary_Table2.csv',
                            # Need to specify the index column as both the first and the last columns,
                            # Because the last column is the "Gene Category"
                            index_col=[0, -1], parse_dates=False, infer_datetime_format=False)

# This was also in features x samples format, so we need to transpose
expression2 = expression2.T
expression2.head()
'GENE '0610007L01RIK '0610007P14RIK '0610007P22RIK '0610008F07RIK '0610009B22RIK '0610009D07RIK '0610009O20RIK '0610010B08RIK '0610010F05RIK '0610010K06RIK ... 'ZWILCH 'ZWINT 'ZXDA 'ZXDB 'ZXDC 'ZYG11A 'ZYG11B 'ZYX 'ZZEF1 'ZZZ3
Gene Category NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S1 27.181570 0.166794 0 0 0.000000 178.852732 0 0.962417 0.000000 143.359550 ... 0.000000 302.361227 0.000000 0 0 0 0.027717 297.918756 37.685501 0.000000
S2 37.682691 0.263962 0 0 0.207921 0.141099 0 0.000000 0.000000 0.255617 ... 0.000000 96.033724 0.020459 0 0 0 0.042430 0.242888 0.000000 0.000000
S3 0.056916 78.622459 0 0 0.145680 0.396363 0 0.000000 0.024692 72.775846 ... 0.000000 427.915555 0.000000 0 0 0 0.040407 6.753530 0.132011 0.017615
S4 55.649250 0.228866 0 0 0.000000 88.798158 0 0.000000 0.000000 93.825442 ... 0.000000 9.788557 0.017787 0 0 0 0.013452 0.274689 9.724890 0.000000
S5 0.000000 0.093117 0 0 131.326008 155.936361 0 0.000000 0.000000 0.031029 ... 0.204522 26.575760 0.000000 0 0 0 1.101589 59.256094 44.430726 0.000000

5 rows × 27723 columns

Now we need to strip the single-quote I added to all the gene names:

new_index, indexer = expression2.columns.reindex(map(lambda x: (x[0].lstrip("'"), x[1]), expression2.columns.values))
expression2.columns = new_index
expression2.head()
'GENE 0610007L01RIK 0610007P14RIK 0610007P22RIK 0610008F07RIK 0610009B22RIK 0610009D07RIK 0610009O20RIK 0610010B08RIK 0610010F05RIK 0610010K06RIK ... ZWILCH ZWINT ZXDA ZXDB ZXDC ZYG11A ZYG11B ZYX ZZEF1 ZZZ3
Gene Category NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S1 27.181570 0.166794 0 0 0.000000 178.852732 0 0.962417 0.000000 143.359550 ... 0.000000 302.361227 0.000000 0 0 0 0.027717 297.918756 37.685501 0.000000
S2 37.682691 0.263962 0 0 0.207921 0.141099 0 0.000000 0.000000 0.255617 ... 0.000000 96.033724 0.020459 0 0 0 0.042430 0.242888 0.000000 0.000000
S3 0.056916 78.622459 0 0 0.145680 0.396363 0 0.000000 0.024692 72.775846 ... 0.000000 427.915555 0.000000 0 0 0 0.040407 6.753530 0.132011 0.017615
S4 55.649250 0.228866 0 0 0.000000 88.798158 0 0.000000 0.000000 93.825442 ... 0.000000 9.788557 0.017787 0 0 0 0.013452 0.274689 9.724890 0.000000
S5 0.000000 0.093117 0 0 131.326008 155.936361 0 0.000000 0.000000 0.031029 ... 0.204522 26.575760 0.000000 0 0 0 1.101589 59.256094 44.430726 0.000000

5 rows × 27723 columns

We want to create a pandas.DataFrame from the “Gene Category” row for our expression_feature_data, which we will do via:

gene_ids, gene_category = zip(*expression2.columns.values)
gene_categories = pd.Series(gene_category, index=gene_ids, name='gene_category')
gene_categories
0610007L01RIK    NaN
0610007P14RIK    NaN
0610007P22RIK    NaN
0610008F07RIK    NaN
0610009B22RIK    NaN
0610009D07RIK    NaN
0610009O20RIK    NaN
0610010B08RIK    NaN
0610010F05RIK    NaN
0610010K06RIK    NaN
0610010K14RIK    NaN
0610010O12RIK    NaN
0610011F06RIK    NaN
0610011L14RIK    NaN
0610012G03RIK    NaN
...
ZSWIM5             NaN
ZSWIM6             NaN
ZSWIM7             NaN
ZUFSP     LPS Response
ZW10               NaN
ZWILCH             NaN
ZWINT              NaN
ZXDA               NaN
ZXDB               NaN
ZXDC               NaN
ZYG11A             NaN
ZYG11B             NaN
ZYX                NaN
ZZEF1              NaN
ZZZ3               NaN
Name: gene_category, Length: 27723, dtype: object
expression_feature_data = pd.DataFrame(gene_categories)
expression_feature_data.head()
gene_category
0610007L01RIK NaN
0610007P14RIK NaN
0610007P22RIK NaN
0610008F07RIK NaN
0610009B22RIK NaN

Splicing Data

We obtain the splicing data for this study from the supplementary information, specifically Supplementary_Table4.xls.

splicing = pd.read_excel('nature12172-s1/Supplementary_Table4.xls', 'splicingTable.txt', index_col=(0,1))
splicing.head()
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S13 S14 S15 S16 S17 S18 10,000 cell Rep1 (P1) 10,000 cell Rep2 (P2) 10,000 cell Rep3 (P3)
Event name gene
chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- Os9 0.84 NaN NaN NaN NaN 0.01 NaN NaN NaN NaN 0.03 NaN NaN 0.02 NaN 0.01 NaN 0.27 0.37 0.31
chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- Vta1 0.95 NaN NaN 0.84 0.95 0.91 0.87 0.86 NaN NaN 0.93 NaN NaN 0.96 NaN NaN NaN 0.83 0.85 0.64
chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ Bclaf1 NaN 0.04 0.02 NaN NaN 0.14 NaN 0.02 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN 0.40 0.49 0.59
chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ Bclaf1 0.02 0.98 0.55 NaN NaN NaN NaN 0.98 NaN NaN NaN NaN NaN 0.06 NaN NaN NaN 0.62 0.63 0.70
chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ P4ha1 0.42 NaN NaN NaN 0.94 NaN NaN 0.03 0.97 NaN NaN NaN NaN NaN NaN NaN NaN 0.43 0.36 0.52
splicing = splicing.T
splicing
Event name chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ chr10:79173370:79173665:+@chr10:79174001:79174029:+@chr10:79174239:79174726:+ chr10:79322526:79322700:+@chr10:79322862:79322939:+@chr10:79323569:79323862:+ chr10:87376364:87376545:+@chr10:87378043:87378094:+@chr10:87393420:87399792:+ chr10:92747514:92747722:-@chr10:92727625:92728425:-@chr10:92717434:92717556:- chr11:101438508:101438565:+@chr11:101439246:101439351:+@chr11:101441899:101443267:+ ... chr8:126022488:126022598:+@chr8:126023892:126024007:+@chr8:126025133:126025333:+ chr14:51455667:51455879:-@chr14:51453589:51453752:-@chr14:51453129:51453242:- chr17:29497858:29498102:+@chr17:29500656:29500887:+@chr17:29501856:29502226:+ chr2:94198908:94199094:-@chr2:94182784:94182954:-@chr2:94172950:94173209:- chr9:21314438:21314697:-@chr9:21313375:21313558:-@chr9:21311823:21312835:- chr9:21314438:21314697:-@chr9:21313375:21313795:-@chr9:21311823:21312835:- chr10:79545360:79545471:-@chr10:79542698:79544127:-@chr10:79533365:79535263:- chr17:5975579:5975881:+@chr17:5985972:5986242:+@chr17:5990136:5990361:+ chr2:29997782:29997941:+@chr2:30002172:30002382:+@chr2:30002882:30003045:+ chr7:119221306:119221473:+@chr7:119223686:119223745:+@chr7:119225944:119226075:+
gene Os9 Vta1 Bclaf1 Bclaf1 P4ha1 Bsg Ptbp1 Igf1 Elk3 Nbr1 ... Afg3l1 Tep1 Fgd2 Ttc17 Tmed1 Tmed1 Sbno2 Synj2 Tbc1d13 Usp47
S1 0.84 0.95 NaN 0.02 0.42 NaN 0.57 0.31 0.93 0.57 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S2 NaN NaN 0.04 0.98 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S3 NaN NaN 0.02 0.55 NaN NaN NaN 0.20 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S4 NaN 0.84 NaN NaN NaN NaN NaN 0.95 NaN 0.04 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S5 NaN 0.95 NaN NaN 0.94 NaN NaN 0.73 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S6 0.01 0.91 0.14 NaN NaN NaN NaN 0.61 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S7 NaN 0.87 NaN NaN NaN 0.62 NaN 0.85 0.73 0.55 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S8 NaN 0.86 0.02 0.98 0.03 NaN NaN 0.89 0.82 0.83 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S9 NaN NaN NaN NaN 0.97 NaN 0.97 NaN 0.90 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S10 NaN NaN NaN NaN NaN NaN 0.06 0.98 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S11 0.03 0.93 NaN NaN NaN NaN NaN NaN 0.97 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S14 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.88 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S15 0.02 0.96 0.01 0.06 NaN NaN NaN 0.44 NaN NaN ... 0.91 NaN NaN NaN NaN NaN NaN NaN NaN NaN
S16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 0.27 0.99 0.99 0.98 0.98 NaN NaN NaN NaN
S17 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN 0.96 NaN NaN NaN 0.99 0.98 0.67 0.07
S18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10,000 cell Rep1 (P1) 0.27 0.83 0.40 0.62 0.43 0.78 NaN 0.60 0.76 0.52 ... 0.92 NaN 0.81 0.77 NaN NaN 0.84 0.50 0.56 NaN
10,000 cell Rep2 (P2) 0.37 0.85 0.49 0.63 0.36 0.72 0.47 0.60 0.73 0.68 ... 0.67 0.15 0.52 0.67 0.63 0.73 0.82 0.90 0.71 0.55
10,000 cell Rep3 (P3) 0.31 0.64 0.59 0.70 0.52 0.79 NaN 0.65 0.42 0.64 ... 0.58 0.79 0.74 0.85 0.73 0.39 0.56 NaN 0.64 NaN

20 rows × 352 columns

The three pooled samples aren’t named consistently with the expression data, so we have to fix that.

splicing.index[splicing.index.map(lambda x: 'P' in x)]
Index([u'10,000 cell Rep1 (P1)', u'10,000 cell Rep2 (P2)', u'10,000 cell Rep3 (P3)'], dtype='object')

We can extract the “P” and the number after it using regular expressions, via the re module in the Python standard library, e.g.:

import re
re.search(r'P\d', '10,000 cell Rep1 (P1)').group()
'P1'
def long_pooled_name_to_short(x):
    if 'P' not in x:
        return x
    else:
        return re.search(r'P\d', x).group()


splicing.index.map(long_pooled_name_to_short)
array([u'S1', u'S2', u'S3', u'S4', u'S5', u'S6', u'S7', u'S8', u'S9',
       u'S10', u'S11', u'S13', u'S14', u'S15', u'S16', u'S17', u'S18',
       u'P1', u'P2', u'P3'], dtype=object)

And now we assign this new index to the splicing dataframe:

splicing.index = splicing.index.map(long_pooled_name_to_short)
splicing.head()
Event name chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ chr10:79173370:79173665:+@chr10:79174001:79174029:+@chr10:79174239:79174726:+ chr10:79322526:79322700:+@chr10:79322862:79322939:+@chr10:79323569:79323862:+ chr10:87376364:87376545:+@chr10:87378043:87378094:+@chr10:87393420:87399792:+ chr10:92747514:92747722:-@chr10:92727625:92728425:-@chr10:92717434:92717556:- chr11:101438508:101438565:+@chr11:101439246:101439351:+@chr11:101441899:101443267:+ ... chr8:126022488:126022598:+@chr8:126023892:126024007:+@chr8:126025133:126025333:+ chr14:51455667:51455879:-@chr14:51453589:51453752:-@chr14:51453129:51453242:- chr17:29497858:29498102:+@chr17:29500656:29500887:+@chr17:29501856:29502226:+ chr2:94198908:94199094:-@chr2:94182784:94182954:-@chr2:94172950:94173209:- chr9:21314438:21314697:-@chr9:21313375:21313558:-@chr9:21311823:21312835:- chr9:21314438:21314697:-@chr9:21313375:21313795:-@chr9:21311823:21312835:- chr10:79545360:79545471:-@chr10:79542698:79544127:-@chr10:79533365:79535263:- chr17:5975579:5975881:+@chr17:5985972:5986242:+@chr17:5990136:5990361:+ chr2:29997782:29997941:+@chr2:30002172:30002382:+@chr2:30002882:30003045:+ chr7:119221306:119221473:+@chr7:119223686:119223745:+@chr7:119225944:119226075:+
gene Os9 Vta1 Bclaf1 Bclaf1 P4ha1 Bsg Ptbp1 Igf1 Elk3 Nbr1 ... Afg3l1 Tep1 Fgd2 Ttc17 Tmed1 Tmed1 Sbno2 Synj2 Tbc1d13 Usp47
S1 0.84 0.95 NaN 0.02 0.42 NaN 0.57 0.31 0.93 0.57 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S2 NaN NaN 0.04 0.98 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S3 NaN NaN 0.02 0.55 NaN NaN NaN 0.20 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S4 NaN 0.84 NaN NaN NaN NaN NaN 0.95 NaN 0.04 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S5 NaN 0.95 NaN NaN 0.94 NaN NaN 0.73 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 352 columns

Metadata

Now let’s create a metadata dataframe. We’ll use the index from the expression_filtered data and create the minimum required column, 'phenotype', which holds the name of each cell’s phenotype. We’ll also add the column 'pooled' to indicate whether each sample is pooled or not.

metadata = pd.DataFrame(index=expression_filtered.index)
metadata['phenotype'] = 'BDMC'
metadata['pooled'] = metadata.index.map(lambda x: x.startswith('P'))

metadata
phenotype pooled
S1 BDMC False
S2 BDMC False
S3 BDMC False
S4 BDMC False
S5 BDMC False
S6 BDMC False
S7 BDMC False
S8 BDMC False
S9 BDMC False
S10 BDMC False
S11 BDMC False
S12 BDMC False
S13 BDMC False
S14 BDMC False
S15 BDMC False
S16 BDMC False
S17 BDMC False
S18 BDMC False
P1 BDMC True
P2 BDMC True
P3 BDMC True

Mapping stats data

mapping_stats = pd.read_excel('nature12172-s1/Supplementary_Table1.xls', sheetname='SuppTable1 2.txt')
mapping_stats
Sample PF_READS PCT_MAPPED_GENOME PCT_RIBOSOMAL_BASES MEDIAN_CV_COVERAGE MEDIAN_5PRIME_BIAS MEDIAN_3PRIME_BIAS MEDIAN_5PRIME_TO_3PRIME_BIAS
0 S1 21326048 0.706590 0.006820 0.509939 0.092679 0.477321 0.247741
1 S2 27434011 0.745385 0.004111 0.565732 0.056583 0.321053 0.244062
2 S3 31142391 0.722087 0.006428 0.540341 0.079551 0.382286 0.267367
3 S4 26231852 0.737854 0.004959 0.530978 0.067041 0.351670 0.279782
4 S5 29977214 0.746466 0.006121 0.525598 0.066543 0.353995 0.274252
5 S6 24148387 0.730079 0.008794 0.529650 0.072095 0.413696 0.225929
6 S7 24078116 0.730638 0.007945 0.540913 0.051991 0.358597 0.201984
7 S8 25032126 0.739989 0.004133 0.512725 0.058783 0.373509 0.212337
8 S9 22257682 0.747427 0.004869 0.521622 0.063566 0.334294 0.240641
9 S10 29436289 0.748795 0.005499 0.560454 0.036219 0.306729 0.187479
10 S11 31130278 0.741882 0.002740 0.558882 0.049581 0.349191 0.211787
11 S12 21161595 0.750782 0.006837 0.756339 0.013878 0.324264 0.195430
12 S13 28612833 0.733976 0.011718 0.598687 0.035392 0.357447 0.198566
13 S14 26351189 0.748323 0.004106 0.517518 0.070293 0.381095 0.259122
14 S15 25739575 0.748421 0.003353 0.526238 0.050938 0.324207 0.212366
15 S16 26802346 0.739833 0.009370 0.520287 0.071503 0.358758 0.240009
16 S17 26343522 0.749358 0.003155 0.673195 0.024121 0.301588 0.245854
17 S18 25290073 0.749358 0.007465 0.562382 0.048528 0.314776 0.215160
18 10k_rep1 28247826 0.688553 0.018993 0.547000 0.056113 0.484393 0.140333
19 10k_rep2 39303876 0.690313 0.017328 0.547621 0.055600 0.474634 0.142889
20 10k_rep3 29831281 0.710875 0.010610 0.518053 0.066053 0.488738 0.168180
21 MB_SC1 13848219 0.545000 0.007000 0.531495 0.127934 0.207841 0.728980
22 MB_SC2 13550218 0.458000 0.010800 0.569271 0.102581 0.179407 0.694747
23 MB_SC3 26765848 0.496000 0.007900 0.535192 0.141893 0.231068 0.722080

Create a flotilla Study!

study = flotilla.Study(# The metadata describing phenotype and pooled samples
                       metadata,

                       # A version for this data
                       version='0.1.0',

                       # Dataframe of the filtered expression data
                       expression_data=expression_filtered,

                       # Dataframe of the feature data of the genes
                       expression_feature_data=expression_feature_data,

                       # Dataframe of the splicing data
                       splicing_data=splicing,

                       # Dataframe of the mapping stats data
                       mapping_stats_data=mapping_stats,

                       # Which column in "mapping_stats" has the number of reads
                       mapping_stats_number_mapped_col='PF_READS')
2014-11-04 14:19:33 Initializing Study
2014-11-04 14:19:33 Initializing Predictor configuration manager for Study
2014-11-04 14:19:33 Predictor ExtraTreesClassifier is of type <class 'sklearn.ensemble.forest.ExtraTreesClassifier'>
2014-11-04 14:19:33 Added ExtraTreesClassifier to default predictors
2014-11-04 14:19:33 Predictor ExtraTreesRegressor is of type <class 'sklearn.ensemble.forest.ExtraTreesRegressor'>
2014-11-04 14:19:33 Added ExtraTreesRegressor to default predictors
2014-11-04 14:19:33 Predictor GradientBoostingClassifier is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'>
2014-11-04 14:19:33 Added GradientBoostingClassifier to default predictors
2014-11-04 14:19:33 Predictor GradientBoostingRegressor is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingRegressor'>
2014-11-04 14:19:33 Added GradientBoostingRegressor to default predictors
2014-11-04 14:19:33 Loading metadata
2014-11-04 14:19:33 Loading expression data
2014-11-04 14:19:33 Initializing expression
2014-11-04 14:19:33 Done initializing expression
2014-11-04 14:19:33 Loading splicing data
2014-11-04 14:19:33 Initializing splicing
2014-11-04 14:19:33 Done initializing splicing
2014-11-04 14:19:33 Successfully initialized a Study object!
No phenotype to color mapping was provided, so coming up with reasonable defaults
No phenotype to marker (matplotlib plotting symbol) was provided, so each phenotype will be plotted as a circle in the PCA visualizations.
samples had too few mapped reads (<5.0e+05 reads):

Removing technical outliers from consideration in expression:

Removing technical outliers from consideration in splicing:

As a side note, you can save this study to disk now, so you can “embark” later:

study.save('shalek2013')
Wrote datapackage to /Users/olga/flotilla_projects/shalek2013/datapackage.json

Note that this is saved under my home directory, in ~/flotilla_projects/<study_name>/. It will be saved under your home directory, too.

The datapackage.json file holds all the information related to the study, and loosely follows the datapackage spec created by the Open Knowledge Foundation.

cat /Users/olga/flotilla_projects/shalek2013/datapackage.json
{
  "name": "shalek2013",
  "title": null,
  "datapackage_version": "0.1.1",
  "sources": null,
  "licenses": null,
  "resources": [
    {
      "path": "/Users/olga/flotilla_projects/shalek2013/splicing.csv.gz",
      "format": "csv",
      "name": "splicing",
      "compression": "gzip"
    },
    {
      "number_mapped_col": "PF_READS",
      "path": "/Users/olga/flotilla_projects/shalek2013/mapping_stats.csv.gz",
      "format": "csv",
      "name": "mapping_stats",
      "compression": "gzip"
    },
    {
      "name": "expression_feature",
      "format": "csv",
      "rename_col": null,
      "ignore_subset_cols": [],
      "path": "/Users/olga/flotilla_projects/shalek2013/expression_feature.csv.gz",
      "compression": "gzip"
    },
    {
      "name": "expression",
      "log_base": null,
      "format": "csv",
      "thresh": -Infinity,
      "path": "/Users/olga/flotilla_projects/shalek2013/expression.csv.gz",
      "compression": "gzip"
    },
    {
      "name": "splicing_feature",
      "format": "csv",
      "rename_col": "gene_name",
      "ignore_subset_cols": [],
      "path": "/Users/olga/flotilla_projects/shalek2013/splicing_feature.csv.gz",
      "compression": "gzip"
    },
    {
      "pooled_col": "pooled",
      "name": "metadata",
      "phenotype_to_marker": {
        "BDMC": "o"
      },
      "format": "csv",
      "minimum_samples": 0,
      "phenotype_to_color": {
        "BDMC": "#1b9e77"
      },
      "path": "/Users/olga/flotilla_projects/shalek2013/metadata.csv.gz",
      "phenotype_col": "phenotype",
      "phenotype_order": [
        "BDMC"
      ],
      "compression": "gzip"
    }
  ]
}
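
You can also inspect the datapackage programmatically with the standard library’s json module. A quick sketch, using the path shown above:

import json

# Load the datapackage and list each resource's name and on-disk location
with open('/Users/olga/flotilla_projects/shalek2013/datapackage.json') as f:
    datapackage = json.load(f)

for resource in datapackage['resources']:
    print(resource['name'], resource['path'])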

One thing to note is that when you save, the version number is bumped up. study.version (the one we just made) is 0.1.0, but the one we saved is 0.1.1, since we could have made some changes to the data.
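
A quick way to see this for yourself (a sketch, assuming the Study object is still in memory) is to compare its version attribute with the datapackage_version recorded in the json above:

# The in-memory Study still carries the version we passed in ('0.1.0'),
# while the saved datapackage.json above records the bumped '0.1.1'
print(study.version)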

Let’s look at what else is in this folder:

ls /Users/olga/flotilla_projects/shalek2013
datapackage.json           expression_feature.csv.gz  metadata.csv.gz            splicing_feature.csv.gz
expression.csv.gz          mapping_stats.csv.gz       splicing.csv.gz

So this is where all the other files are. Good to know!

We can “embark” on this newly-saved study now very painlessly, without having to open and process all those files again:

study2 = flotilla.embark('shalek2013')
2014-10-30 12:16:56 Reading datapackage from /Users/olga/flotilla_projects/shalek2013/datapackage.json
2014-10-30 12:16:56 Parsing datapackage to create a Study object
2014-10-30 12:16:56 Initializing Study
2014-10-30 12:16:56 Initializing Predictor configuration manager for Study
2014-10-30 12:16:56 Predictor ExtraTreesClassifier is of type <class 'sklearn.ensemble.forest.ExtraTreesClassifier'>
2014-10-30 12:16:56 Added ExtraTreesClassifier to default predictors
2014-10-30 12:16:56 Predictor ExtraTreesRegressor is of type <class 'sklearn.ensemble.forest.ExtraTreesRegressor'>
2014-10-30 12:16:56 Added ExtraTreesRegressor to default predictors
2014-10-30 12:16:56 Predictor GradientBoostingClassifier is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'>
2014-10-30 12:16:56 Added GradientBoostingClassifier to default predictors
2014-10-30 12:16:56 Predictor GradientBoostingRegressor is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingRegressor'>
2014-10-30 12:16:56 Added GradientBoostingRegressor to default predictors
2014-10-30 12:16:56 Loading metadata
2014-10-30 12:16:56 Loading expression data
2014-10-30 12:16:56 Initializing expression
2014-10-30 12:16:56 Done initializing expression
2014-10-30 12:16:56 Loading splicing data
2014-10-30 12:16:56 Initializing splicing
2014-10-30 12:16:56 Done initializing splicing
2014-10-30 12:16:56 Successfully initialized a Study object!
Olga B. Botvinnik is funded by the NDSEG fellowship and is a NumFOCUS John Hunter Technology Fellow.
Michael T. Lovci was partially funded by a fellowship from Genentech.
Partially funded by NIH grants NS075449 and HG004659 and CIRM grants RB4-06045 and TR3-05676 to Gene Yeo.