Create a barebones datapackage — flotilla 0.2.8dev documentation

Shalek and Sujita, et al (2013)¶

In the 2013 paper, Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells (Shalek and Sujita, et al. Nature (2013)), Regev and colleagues performed single-cell sequencing 18 bone marrow-derived dendritic cells (BMDCs), in addition to 3 pooled samples.

Expression data¶

First, we will read in the expression data. These data were obtained using,

wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE41nnn/GSE41265/suppl/GSE41265_allGenesTPM.txt.gz

We will also compare to the supplementary table 2 data, obtained using

wget http://www.nature.com/nature/journal/v498/n7453/extref/nature12172-s1.zip
unzip nature12172-s1.zip

expression = pd.read_table("GSE41265_allGenesTPM.txt.gz", compression="gzip", index_col=0)
expression.head()

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)

<ipython-input-2-b45e9120e279> in <module>()
----> 1 expression = pd.read_table("GSE41265_allGenesTPM.txt.gz", compression="gzip", index_col=0)
      2 expression.head()


/home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    472                     skip_blank_lines=skip_blank_lines)
    473
--> 474         return _read(filepath_or_buffer, kwds)
    475
    476     parser_f.__name__ = name


/home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    248
    249     # Create the parser.
--> 250     parser = TextFileReader(filepath_or_buffer, **kwds)
    251
    252     if (nrows is not None) and (chunksize is not None):


/home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    564             self.options['has_index_names'] = kwds['has_index_names']
    565
--> 566         self._make_engine(self.engine)
    567
    568     def _get_options_with_defaults(self, engine):


/home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/io/parsers.pyc in _make_engine(self, engine)
    703     def _make_engine(self, engine='c'):
    704         if engine == 'c':
--> 705             self._engine = CParserWrapper(self.f, **self.options)
    706         else:
    707             if engine == 'python':


/home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, src, **kwds)
   1070         kwds['allow_leading_cols'] = self.index_col is not False
   1071
-> 1072         self._reader = _parser.TextReader(src, **kwds)
   1073
   1074         # XXX


/home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3173)()


/home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:5468)()


/home/travis/miniconda/envs/testenv/lib/python2.7/gzip.pyc in __init__(self, filename, mode, compresslevel, fileobj, mtime)
     92             mode += 'b'
     93         if fileobj is None:
---> 94             fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
     95         if filename is None:
     96             # Issue #13781: os.fdopen() creates a fileobj with a bogus name


IOError: [Errno 2] No such file or directory: 'GSE41265_allGenesTPM.txt.gz'

These data are in the “transcripts per million,” aka TPM unit. See this blog post if that sounds weird to you.

These data are formatted with samples on the columns, and genes on the rows. But we want the opposite, with samples on the rows and genes on the columns. This follows `scikit-learn <http://scikit-learn.org/stable/tutorial/basic/tutorial.html#loading-an-example-dataset>`_’s standard of data matrices with size (n_samples, n_features) as each gene is a feature. So we will simply transpose this.

expression = expression.T
expression.head()

GENE	B3GAT2	NPL	T2	...	DHX9	RNASET2B	FGFR1OP	BRP44L	AK015714	SFT2D1
S1	0.000000	72.008590	0.109249	...	0.774638	23.520936	0.000000	460.316773	0.000000	39.442566
S2	0.000000	0.000000	0.172009	...	0.367391	1.887873	0.000000	823.890290	0.000000	4.967412
S3	0.023441	128.062012	0.000000	...	0.249858	0.313510	0.166772	1002.354241	0.000000	0.000000
S4	0.000000	0.095082	0.000000	...	0.354157	0.000000	0.887003	1230.766795	0.000000	0.131215
S5	0.000000	0.000000	0.182703	...	0.039263	0.000000	131.077131	1614.749122	0.242179	95.485743

5 rows × 27723 columns

The authors filtered the expression data based on having at least 3 single cells express genes with at TPM (transcripts per million, ) > 1. We can express this in using the `pandas <http://pandas.pydata.org>`_ DataFrames easily.

First, from reading the paper and looking at the data, I know there are 18 single cells, and there are 18 samples that start with the letter “S.” So I will extract the single samples from the index (row names) using a lambda, a tiny function which in this case, tells me whether or not that sample id begins with the letter “S”.

singles_ids = expression.index[expression.index.map(lambda x: x.startswith('S'))]
print('number of single cells:', len(singles_ids))
singles = expression.ix[singles_ids]

expression_filtered = expression.ix[:, singles[singles > 1].count() >= 3]
expression_filtered = np.log(expression_filtered + 1)
expression_filtered.shape

('number of single cells:', 18)

(21, 6312)

Hmm, that’s strange. The paper states that they had 6313 genes after filtering, but I get 6312. Even using “singles >= 1” doesn’t help.

(I also tried this with the expression table provided in the supplementary data as “SupplementaryTable2.xlsx,” and got the same results.)

Now that we’ve taken care of importing and filtering the expression data, let’s do the feature data of the expression data.

Expression feature data¶

This is similar to the fData from BioconductoR, where there’s some additional data on your features that you want to look at. They uploaded information about the features in their OTHER expression matrix, uploaded as a supplementary file, Supplementary_Table2.xlsx.

Notice that this is a csv and not an xlsx. This is because Excel mangled the gene IDS that started with 201* and assumed they were dates :(

The workaround I did was to add another column to the sheet with the formula ="'" & A1, press Command-Shift-End to select the end of the rows, and then do Ctrl-D to “fill down” to the bottom (thanks to this stackoverflow post for teaching me how to Excel). Then, I saved the file as a csv for maximum portability and compatibility.

expression2 = pd.read_csv('nature12172-s1/Supplementary_Table2.csv',
                            # Need to specify the index column as both the first and the last columns,
                            # Because the last column is the "Gene Category"
                            index_col=[0, -1], parse_dates=False, infer_datetime_format=False)

# This was also in features x samples format, so we need to transpose
expression2 = expression2.T
expression2.head()

'GENE	'0610007L01RIK	'0610007P14RIK	'0610007P22RIK	'0610008F07RIK	'0610009B22RIK	'0610009D07RIK	'0610009O20RIK	'0610010B08RIK	'0610010F05RIK	'0610010K06RIK	...	'ZWILCH	'ZWINT	'ZXDA	'ZXDB	'ZXDC	'ZYG11A	'ZYG11B	'ZYX	'ZZEF1	'ZZZ3
Gene Category	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S1	27.181570	0.166794	0	0	0.000000	178.852732	0	0.962417	0.000000	143.359550	...	0.000000	302.361227	0.000000	0	0	0	0.027717	297.918756	37.685501	0.000000
S2	37.682691	0.263962	0	0	0.207921	0.141099	0	0.000000	0.000000	0.255617	...	0.000000	96.033724	0.020459	0	0	0	0.042430	0.242888	0.000000	0.000000
S3	0.056916	78.622459	0	0	0.145680	0.396363	0	0.000000	0.024692	72.775846	...	0.000000	427.915555	0.000000	0	0	0	0.040407	6.753530	0.132011	0.017615
S4	55.649250	0.228866	0	0	0.000000	88.798158	0	0.000000	0.000000	93.825442	...	0.000000	9.788557	0.017787	0	0	0	0.013452	0.274689	9.724890	0.000000
S5	0.000000	0.093117	0	0	131.326008	155.936361	0	0.000000	0.000000	0.031029	...	0.204522	26.575760	0.000000	0	0	0	1.101589	59.256094	44.430726	0.000000

5 rows × 27723 columns

Now we need to strip the single-quote I added to all the gene names:

new_index, indexer = expression2.columns.reindex(map(lambda x: (x[0].lstrip("'"), x[1]), expression2.columns.values))
expression2.columns = new_index
expression2.head()

'GENE	0610007L01RIK	0610007P14RIK	0610007P22RIK	0610008F07RIK	0610009B22RIK	0610009D07RIK	0610009O20RIK	0610010B08RIK	0610010F05RIK	0610010K06RIK	...	ZWILCH	ZWINT	ZXDA	ZXDB	ZXDC	ZYG11A	ZYG11B	ZYX	ZZEF1	ZZZ3
Gene Category	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S1	27.181570	0.166794	0	0	0.000000	178.852732	0	0.962417	0.000000	143.359550	...	0.000000	302.361227	0.000000	0	0	0	0.027717	297.918756	37.685501	0.000000
S2	37.682691	0.263962	0	0	0.207921	0.141099	0	0.000000	0.000000	0.255617	...	0.000000	96.033724	0.020459	0	0	0	0.042430	0.242888	0.000000	0.000000
S3	0.056916	78.622459	0	0	0.145680	0.396363	0	0.000000	0.024692	72.775846	...	0.000000	427.915555	0.000000	0	0	0	0.040407	6.753530	0.132011	0.017615
S4	55.649250	0.228866	0	0	0.000000	88.798158	0	0.000000	0.000000	93.825442	...	0.000000	9.788557	0.017787	0	0	0	0.013452	0.274689	9.724890	0.000000
S5	0.000000	0.093117	0	0	131.326008	155.936361	0	0.000000	0.000000	0.031029	...	0.204522	26.575760	0.000000	0	0	0	1.101589	59.256094	44.430726	0.000000

5 rows × 27723 columns

We want to create a pandas.DataFrame from the “Gene Category” row for our expression_feature_data, which we will do via:

gene_ids, gene_category = zip(*expression2.columns.values)
gene_categories = pd.Series(gene_category, index=gene_ids, name='gene_category')
gene_categories

0610007L01RIK    NaN
0610007P14RIK    NaN
0610007P22RIK    NaN
0610008F07RIK    NaN
0610009B22RIK    NaN
0610009D07RIK    NaN
0610009O20RIK    NaN
0610010B08RIK    NaN
0610010F05RIK    NaN
0610010K06RIK    NaN
0610010K14RIK    NaN
0610010O12RIK    NaN
0610011F06RIK    NaN
0610011L14RIK    NaN
0610012G03RIK    NaN
...
ZSWIM5             NaN
ZSWIM6             NaN
ZSWIM7             NaN
ZUFSP     LPS Response
ZW10               NaN
ZWILCH             NaN
ZWINT              NaN
ZXDA               NaN
ZXDB               NaN
ZXDC               NaN
ZYG11A             NaN
ZYG11B             NaN
ZYX                NaN
ZZEF1              NaN
ZZZ3               NaN
Name: gene_category, Length: 27723, dtype: object

expression_feature_data = pd.DataFrame(gene_categories)
expression_feature_data.head()

	gene_category
0610007L01RIK	NaN
0610007P14RIK	NaN
0610007P22RIK	NaN
0610008F07RIK	NaN
0610009B22RIK	NaN

Splicing Data¶

We obtain the splicing data from this study from the supplementary information, specifically the Supplementary_Table4.xls

splicing = pd.read_excel('nature12172-s1/Supplementary_Table4.xls', 'splicingTable.txt', index_col=(0,1))
splicing.head()

		S1	S2	S3	S4	S5	S6	S7	S8	S9	S10	S11	S13	S14	S15	S16	S17	S18	10,000 cell Rep1 (P1)	10,000 cell Rep2 (P2)	10,000 cell Rep3 (P3)
Event name	gene
chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:-	Os9	0.84	NaN	NaN	NaN	NaN	0.01	NaN	NaN	NaN	NaN	0.03	NaN	NaN	0.02	NaN	0.01	NaN	0.27	0.37	0.31
chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:-	Vta1	0.95	NaN	NaN	0.84	0.95	0.91	0.87	0.86	NaN	NaN	0.93	NaN	NaN	0.96	NaN	NaN	NaN	0.83	0.85	0.64
chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+	Bclaf1	NaN	0.04	0.02	NaN	NaN	0.14	NaN	0.02	NaN	NaN	NaN	NaN	NaN	0.01	NaN	NaN	NaN	0.40	0.49	0.59
chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+	Bclaf1	0.02	0.98	0.55	NaN	NaN	NaN	NaN	0.98	NaN	NaN	NaN	NaN	NaN	0.06	NaN	NaN	NaN	0.62	0.63	0.70
chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+	P4ha1	0.42	NaN	NaN	NaN	0.94	NaN	NaN	0.03	0.97	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.43	0.36	0.52

splicing = splicing.T
splicing

Event name	chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:-	chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:-	chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+	chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+	chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+	chr10:79173370:79173665:+@chr10:79174001:79174029:+@chr10:79174239:79174726:+	chr10:79322526:79322700:+@chr10:79322862:79322939:+@chr10:79323569:79323862:+	chr10:87376364:87376545:+@chr10:87378043:87378094:+@chr10:87393420:87399792:+	chr10:92747514:92747722:-@chr10:92727625:92728425:-@chr10:92717434:92717556:-	chr11:101438508:101438565:+@chr11:101439246:101439351:+@chr11:101441899:101443267:+	...	chr8:126022488:126022598:+@chr8:126023892:126024007:+@chr8:126025133:126025333:+	chr14:51455667:51455879:-@chr14:51453589:51453752:-@chr14:51453129:51453242:-	chr17:29497858:29498102:+@chr17:29500656:29500887:+@chr17:29501856:29502226:+	chr2:94198908:94199094:-@chr2:94182784:94182954:-@chr2:94172950:94173209:-	chr9:21314438:21314697:-@chr9:21313375:21313558:-@chr9:21311823:21312835:-	chr9:21314438:21314697:-@chr9:21313375:21313795:-@chr9:21311823:21312835:-	chr10:79545360:79545471:-@chr10:79542698:79544127:-@chr10:79533365:79535263:-	chr17:5975579:5975881:+@chr17:5985972:5986242:+@chr17:5990136:5990361:+	chr2:29997782:29997941:+@chr2:30002172:30002382:+@chr2:30002882:30003045:+	chr7:119221306:119221473:+@chr7:119223686:119223745:+@chr7:119225944:119226075:+
gene	Os9	Vta1	Bclaf1	Bclaf1	P4ha1	Bsg	Ptbp1	Igf1	Elk3	Nbr1	...	Afg3l1	Tep1	Fgd2	Ttc17	Tmed1	Tmed1	Sbno2	Synj2	Tbc1d13	Usp47
S1	0.84	0.95	NaN	0.02	0.42	NaN	0.57	0.31	0.93	0.57	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S2	NaN	NaN	0.04	0.98	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S3	NaN	NaN	0.02	0.55	NaN	NaN	NaN	0.20	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S4	NaN	0.84	NaN	NaN	NaN	NaN	NaN	0.95	NaN	0.04	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S5	NaN	0.95	NaN	NaN	0.94	NaN	NaN	0.73	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S6	0.01	0.91	0.14	NaN	NaN	NaN	NaN	0.61	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S7	NaN	0.87	NaN	NaN	NaN	0.62	NaN	0.85	0.73	0.55	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S8	NaN	0.86	0.02	0.98	0.03	NaN	NaN	0.89	0.82	0.83	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S9	NaN	NaN	NaN	NaN	0.97	NaN	0.97	NaN	0.90	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S10	NaN	NaN	NaN	NaN	NaN	NaN	0.06	0.98	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S11	0.03	0.93	NaN	NaN	NaN	NaN	NaN	NaN	0.97	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S13	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S14	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.88	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S15	0.02	0.96	0.01	0.06	NaN	NaN	NaN	0.44	NaN	NaN	...	0.91	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S16	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	0.27	0.99	0.99	0.98	0.98	NaN	NaN	NaN	NaN
S17	0.01	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	0.96	NaN	NaN	NaN	0.99	0.98	0.67	0.07
S18	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10,000 cell Rep1 (P1)	0.27	0.83	0.40	0.62	0.43	0.78	NaN	0.60	0.76	0.52	...	0.92	NaN	0.81	0.77	NaN	NaN	0.84	0.50	0.56	NaN
10,000 cell Rep2 (P2)	0.37	0.85	0.49	0.63	0.36	0.72	0.47	0.60	0.73	0.68	...	0.67	0.15	0.52	0.67	0.63	0.73	0.82	0.90	0.71	0.55
10,000 cell Rep3 (P3)	0.31	0.64	0.59	0.70	0.52	0.79	NaN	0.65	0.42	0.64	...	0.58	0.79	0.74	0.85	0.73	0.39	0.56	NaN	0.64	NaN

20 rows × 352 columns

The three pooled samples aren’t named consistently with the expression data, so we have to fix that.

splicing.index[splicing.index.map(lambda x: 'P' in x)]

Index([u'10,000 cell Rep1 (P1)', u'10,000 cell Rep2 (P2)', u'10,000 cell Rep3 (P3)'], dtype='object')

Since the pooled sample IDs are inconsistent with the expression data, we have to change them. We can get the “P” and the number after that using regular expressions, called re in the Python standard library, e.g.:

import re
re.search(r'P\d', '10,000 cell Rep1 (P1)').group()

'P1'

def long_pooled_name_to_short(x):
    if 'P' not in x:
        return x
    else:
        return re.search(r'P\d', x).group()


splicing.index.map(long_pooled_name_to_short)

array([u'S1', u'S2', u'S3', u'S4', u'S5', u'S6', u'S7', u'S8', u'S9',
       u'S10', u'S11', u'S13', u'S14', u'S15', u'S16', u'S17', u'S18',
       u'P1', u'P2', u'P3'], dtype=object)

And now we assign this new index as our index to the splicing dataframe

splicing.index = splicing.index.map(long_pooled_name_to_short)
splicing.head()

Event name	chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:-	chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:-	chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+	chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+	chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+	chr10:79173370:79173665:+@chr10:79174001:79174029:+@chr10:79174239:79174726:+	chr10:79322526:79322700:+@chr10:79322862:79322939:+@chr10:79323569:79323862:+	chr10:87376364:87376545:+@chr10:87378043:87378094:+@chr10:87393420:87399792:+	chr10:92747514:92747722:-@chr10:92727625:92728425:-@chr10:92717434:92717556:-	chr11:101438508:101438565:+@chr11:101439246:101439351:+@chr11:101441899:101443267:+	...	chr8:126022488:126022598:+@chr8:126023892:126024007:+@chr8:126025133:126025333:+	chr14:51455667:51455879:-@chr14:51453589:51453752:-@chr14:51453129:51453242:-	chr17:29497858:29498102:+@chr17:29500656:29500887:+@chr17:29501856:29502226:+	chr2:94198908:94199094:-@chr2:94182784:94182954:-@chr2:94172950:94173209:-	chr9:21314438:21314697:-@chr9:21313375:21313558:-@chr9:21311823:21312835:-	chr9:21314438:21314697:-@chr9:21313375:21313795:-@chr9:21311823:21312835:-	chr10:79545360:79545471:-@chr10:79542698:79544127:-@chr10:79533365:79535263:-	chr17:5975579:5975881:+@chr17:5985972:5986242:+@chr17:5990136:5990361:+	chr2:29997782:29997941:+@chr2:30002172:30002382:+@chr2:30002882:30003045:+	chr7:119221306:119221473:+@chr7:119223686:119223745:+@chr7:119225944:119226075:+
gene	Os9	Vta1	Bclaf1	Bclaf1	P4ha1	Bsg	Ptbp1	Igf1	Elk3	Nbr1	...	Afg3l1	Tep1	Fgd2	Ttc17	Tmed1	Tmed1	Sbno2	Synj2	Tbc1d13	Usp47
S1	0.84	0.95	NaN	0.02	0.42	NaN	0.57	0.31	0.93	0.57	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S2	NaN	NaN	0.04	0.98	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S3	NaN	NaN	0.02	0.55	NaN	NaN	NaN	0.20	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S4	NaN	0.84	NaN	NaN	NaN	NaN	NaN	0.95	NaN	0.04	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
S5	NaN	0.95	NaN	NaN	0.94	NaN	NaN	0.73	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 352 columns

Metadata¶

Now let’s get into creating a metadata dataframe. We’ll use the index from the expression_filtered data to create the minimum required column, 'phenotype', which has the name of the phenotype of that cell. And we’ll also add the column 'pooled' to indicate whether this sample is pooled or not.

metadata = pd.DataFrame(index=expression_filtered.index)
metadata['phenotype'] = 'BDMC'
metadata['pooled'] = metadata.index.map(lambda x: x.startswith('P'))

metadata

	phenotype	pooled
S1	BDMC	False
S2	BDMC	False
S3	BDMC	False
S4	BDMC	False
S5	BDMC	False
S6	BDMC	False
S7	BDMC	False
S8	BDMC	False
S9	BDMC	False
S10	BDMC	False
S11	BDMC	False
S12	BDMC	False
S13	BDMC	False
S14	BDMC	False
S15	BDMC	False
S16	BDMC	False
S17	BDMC	False
S18	BDMC	False
P1	BDMC	True
P2	BDMC	True
P3	BDMC	True

Mapping stats data¶

mapping_stats = pd.read_excel('nature12172-s1/Supplementary_Table1.xls', sheetname='SuppTable1 2.txt')
mapping_stats

	Sample	PF_READS	PCT_MAPPED_GENOME	PCT_RIBOSOMAL_BASES	MEDIAN_CV_COVERAGE	MEDIAN_5PRIME_BIAS	MEDIAN_3PRIME_BIAS	MEDIAN_5PRIME_TO_3PRIME_BIAS
0	S1	21326048	0.706590	0.006820	0.509939	0.092679	0.477321	0.247741
1	S2	27434011	0.745385	0.004111	0.565732	0.056583	0.321053	0.244062
2	S3	31142391	0.722087	0.006428	0.540341	0.079551	0.382286	0.267367
3	S4	26231852	0.737854	0.004959	0.530978	0.067041	0.351670	0.279782
4	S5	29977214	0.746466	0.006121	0.525598	0.066543	0.353995	0.274252
5	S6	24148387	0.730079	0.008794	0.529650	0.072095	0.413696	0.225929
6	S7	24078116	0.730638	0.007945	0.540913	0.051991	0.358597	0.201984
7	S8	25032126	0.739989	0.004133	0.512725	0.058783	0.373509	0.212337
8	S9	22257682	0.747427	0.004869	0.521622	0.063566	0.334294	0.240641
9	S10	29436289	0.748795	0.005499	0.560454	0.036219	0.306729	0.187479
10	S11	31130278	0.741882	0.002740	0.558882	0.049581	0.349191	0.211787
11	S12	21161595	0.750782	0.006837	0.756339	0.013878	0.324264	0.195430
12	S13	28612833	0.733976	0.011718	0.598687	0.035392	0.357447	0.198566
13	S14	26351189	0.748323	0.004106	0.517518	0.070293	0.381095	0.259122
14	S15	25739575	0.748421	0.003353	0.526238	0.050938	0.324207	0.212366
15	S16	26802346	0.739833	0.009370	0.520287	0.071503	0.358758	0.240009
16	S17	26343522	0.749358	0.003155	0.673195	0.024121	0.301588	0.245854
17	S18	25290073	0.749358	0.007465	0.562382	0.048528	0.314776	0.215160
18	10k_rep1	28247826	0.688553	0.018993	0.547000	0.056113	0.484393	0.140333
19	10k_rep2	39303876	0.690313	0.017328	0.547621	0.055600	0.474634	0.142889
20	10k_rep3	29831281	0.710875	0.010610	0.518053	0.066053	0.488738	0.168180
21	MB_SC1	13848219	0.545000	0.007000	0.531495	0.127934	0.207841	0.728980
22	MB_SC2	13550218	0.458000	0.010800	0.569271	0.102581	0.179407	0.694747
23	MB_SC3	26765848	0.496000	0.007900	0.535192	0.141893	0.231068	0.722080

Create a `flotilla` Study!¶

study = flotilla.Study(# The metadata describing phenotype and pooled samples
                       metadata,

                       # A version for this data
                       version='0.1.0',

                       # Dataframe of the filtered expression data
                       expression_data=expression_filtered,

                       # Dataframe of the feature data of the genes
                       expression_feature_data=expression_feature_data,

                       # Dataframe of the splicing data
                       splicing_data=splicing,

                       # Dataframe of the mapping stats data
                       mapping_stats_data=mapping_stats,

                       # Which column in "mapping_stats" has the number of reads
                       mapping_stats_number_mapped_col='PF_READS')

2014-11-04 14:19:33 Initializing Study
2014-11-04 14:19:33 Initializing Predictor configuration manager for Study
2014-11-04 14:19:33 Predictor ExtraTreesClassifier is of type <class 'sklearn.ensemble.forest.ExtraTreesClassifier'>
2014-11-04 14:19:33 Added ExtraTreesClassifier to default predictors
2014-11-04 14:19:33 Predictor ExtraTreesRegressor is of type <class 'sklearn.ensemble.forest.ExtraTreesRegressor'>
2014-11-04 14:19:33 Added ExtraTreesRegressor to default predictors
2014-11-04 14:19:33 Predictor GradientBoostingClassifier is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'>
2014-11-04 14:19:33 Added GradientBoostingClassifier to default predictors
2014-11-04 14:19:33 Predictor GradientBoostingRegressor is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingRegressor'>
2014-11-04 14:19:33 Added GradientBoostingRegressor to default predictors
2014-11-04 14:19:33 Loading metadata
2014-11-04 14:19:33 Loading expression data
2014-11-04 14:19:33 Initializing expression
2014-11-04 14:19:33 Done initializing expression
2014-11-04 14:19:33 Loading splicing data
2014-11-04 14:19:33 Initializing splicing
2014-11-04 14:19:33 Done initializing splicing
2014-11-04 14:19:33 Successfully initialized a Study object!

No phenotype to color mapping was provided, so coming up with reasonable defaults
No phenotype to marker (matplotlib plotting symbol) was provided, so each phenotype will be plotted as a circle in the PCA visualizations.
samples had too few mapped reads (<5.0e+05 reads):

Removing technical outliers from consideration in expression:

Removing technical outliers from consideration in splicing:

As a side note, you can save this study to disk now, so you can “embark” later:

study.save('shalek2013')

Wrote datapackage to /Users/olga/flotilla_projects/shalek2013/datapackage.json

Note that this is saved to my home directory, in ~/flotilla_projects/<study_name>/. This will be saved in your home directory, too.

The datapackage.json file is what holds all the information relative to the study, and loosely follows the datapackage spec created by the Open Knowledge Foundation.

cat /Users/olga/flotilla_projects/shalek2013/datapackage.json

{
  "name": "shalek2013",
  "title": null,
  "datapackage_version": "0.1.1",
  "sources": null,
  "licenses": null,
  "resources": [
    {
      "path": "/Users/olga/flotilla_projects/shalek2013/splicing.csv.gz",
      "format": "csv",
      "name": "splicing",
      "compression": "gzip"
    },
    {
      "number_mapped_col": "PF_READS",
      "path": "/Users/olga/flotilla_projects/shalek2013/mapping_stats.csv.gz",
      "format": "csv",
      "name": "mapping_stats",
      "compression": "gzip"
    },
    {
      "name": "expression_feature",
      "format": "csv",
      "rename_col": null,
      "ignore_subset_cols": [],
      "path": "/Users/olga/flotilla_projects/shalek2013/expression_feature.csv.gz",
      "compression": "gzip"
    },
    {
      "name": "expression",
      "log_base": null,
      "format": "csv",
      "thresh": -Infinity,
      "path": "/Users/olga/flotilla_projects/shalek2013/expression.csv.gz",
      "compression": "gzip"
    },
    {
      "name": "splicing_feature",
      "format": "csv",
      "rename_col": "gene_name",
      "ignore_subset_cols": [],
      "path": "/Users/olga/flotilla_projects/shalek2013/splicing_feature.csv.gz",
      "compression": "gzip"
    },
    {
      "pooled_col": "pooled",
      "name": "metadata",
      "phenotype_to_marker": {
        "BDMC": "o"
      },
      "format": "csv",
      "minimum_samples": 0,
      "phenotype_to_color": {
        "BDMC": "#1b9e77"
      },
      "path": "/Users/olga/flotilla_projects/shalek2013/metadata.csv.gz",
      "phenotype_col": "phenotype",
      "phenotype_order": [
        "BDMC"
      ],
      "compression": "gzip"
    }
  ]
}

One thing to note is that when you save, the version number is bumped up. study.version (the one we just made) is 0.1.0, but the one we saved is 0.1.1, since we could have made some changes to the data.

Let’s look at what else is in this folder:

ls /Users/olga/flotilla_projects/shalek2013

datapackage.json           expression_feature.csv.gz  metadata.csv.gz            splicing_feature.csv.gz
expression.csv.gz          mapping_stats.csv.gz       splicing.csv.gz

So this is where all the other files are. Good to know!

We can “embark” on this newly-saved study now very painlessly, without having to open and process all those files again:

study2 = flotilla.embark('shalek2013')

2014-10-30 12:16:56 Reading datapackage from /Users/olga/flotilla_projects/shalek2013/datapackage.json
2014-10-30 12:16:56 Parsing datapackage to create a Study object
2014-10-30 12:16:56 Initializing Study
2014-10-30 12:16:56 Initializing Predictor configuration manager for Study
2014-10-30 12:16:56 Predictor ExtraTreesClassifier is of type <class 'sklearn.ensemble.forest.ExtraTreesClassifier'>
2014-10-30 12:16:56 Added ExtraTreesClassifier to default predictors
2014-10-30 12:16:56 Predictor ExtraTreesRegressor is of type <class 'sklearn.ensemble.forest.ExtraTreesRegressor'>
2014-10-30 12:16:56 Added ExtraTreesRegressor to default predictors
2014-10-30 12:16:56 Predictor GradientBoostingClassifier is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'>
2014-10-30 12:16:56 Added GradientBoostingClassifier to default predictors
2014-10-30 12:16:56 Predictor GradientBoostingRegressor is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingRegressor'>
2014-10-30 12:16:56 Added GradientBoostingRegressor to default predictors
2014-10-30 12:16:56 Loading metadata
2014-10-30 12:16:56 Loading expression data
2014-10-30 12:16:56 Initializing expression
2014-10-30 12:16:56 Done initializing expression
2014-10-30 12:16:56 Loading splicing data
2014-10-30 12:16:56 Initializing splicing
2014-10-30 12:16:56 Done initializing splicing
2014-10-30 12:16:56 Successfully initialized a Study object!

Create a barebones datapackage¶

Shalek and Sujita, et al (2013)¶

Expression data¶

Expression feature data¶

Splicing Data¶

Metadata¶

Mapping stats data¶

Create a flotilla Study!¶

Create a `flotilla` Study!¶