Before we begin, let’s import everything we need.
# Import the flotilla package for biological data analysis
import flotilla
# Import "numerical python" library for number crunching
import numpy as np
# Import "panel data analysis" library for tabular data
import pandas as pd
In the 2013 paper, Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells (Shalek and Sujita, et al. Nature (2013)), Regev and colleagues performed single-cell sequencing 18 bone marrow-derived dendritic cells (BMDCs), in addition to 3 pooled samples.
First, we will read in the expression data. These data were obtained using,
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE41nnn/GSE41265/suppl/GSE41265_allGenesTPM.txt.gz
We will also compare to the supplementary table 2 data, obtained using
wget http://www.nature.com/nature/journal/v498/n7453/extref/nature12172-s1.zip
unzip nature12172-s1.zip
expression = pd.read_table("GSE41265_allGenesTPM.txt.gz", compression="gzip", index_col=0)
expression.head()
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-2-b45e9120e279> in <module>()
----> 1 expression = pd.read_table("GSE41265_allGenesTPM.txt.gz", compression="gzip", index_col=0)
2 expression.head()
/home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
472 skip_blank_lines=skip_blank_lines)
473
--> 474 return _read(filepath_or_buffer, kwds)
475
476 parser_f.__name__ = name
/home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
248
249 # Create the parser.
--> 250 parser = TextFileReader(filepath_or_buffer, **kwds)
251
252 if (nrows is not None) and (chunksize is not None):
/home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
564 self.options['has_index_names'] = kwds['has_index_names']
565
--> 566 self._make_engine(self.engine)
567
568 def _get_options_with_defaults(self, engine):
/home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/io/parsers.pyc in _make_engine(self, engine)
703 def _make_engine(self, engine='c'):
704 if engine == 'c':
--> 705 self._engine = CParserWrapper(self.f, **self.options)
706 else:
707 if engine == 'python':
/home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, src, **kwds)
1070 kwds['allow_leading_cols'] = self.index_col is not False
1071
-> 1072 self._reader = _parser.TextReader(src, **kwds)
1073
1074 # XXX
/home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3173)()
/home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:5468)()
/home/travis/miniconda/envs/testenv/lib/python2.7/gzip.pyc in __init__(self, filename, mode, compresslevel, fileobj, mtime)
92 mode += 'b'
93 if fileobj is None:
---> 94 fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
95 if filename is None:
96 # Issue #13781: os.fdopen() creates a fileobj with a bogus name
IOError: [Errno 2] No such file or directory: 'GSE41265_allGenesTPM.txt.gz'
These data are in the “transcripts per million,” aka TPM unit. See this blog post if that sounds weird to you.
These data are formatted with samples on the columns, and genes on the rows. But we want the opposite, with samples on the rows and genes on the columns. This follows `scikit-learn <http://scikit-learn.org/stable/tutorial/basic/tutorial.html#loading-an-example-dataset>`_’s standard of data matrices with size (n_samples, n_features) as each gene is a feature. So we will simply transpose this.
expression = expression.T
expression.head()
GENE | XKR4 | AB338584 | B3GAT2 | NPL | T2 | T | PDE10A | 1700010I14RIK | 6530411M01RIK | PABPC6 | ... | AK085062 | DHX9 | RNASET2B | FGFR1OP | CCR6 | BRP44L | AK014435 | AK015714 | SFT2D1 | PRR18 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
S1 | 0 | 0 | 0.000000 | 72.008590 | 0.109249 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0.774638 | 23.520936 | 0.000000 | 0 | 460.316773 | 0 | 0.000000 | 39.442566 | 0 |
S2 | 0 | 0 | 0.000000 | 0.000000 | 0.172009 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0.367391 | 1.887873 | 0.000000 | 0 | 823.890290 | 0 | 0.000000 | 4.967412 | 0 |
S3 | 0 | 0 | 0.023441 | 128.062012 | 0.000000 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0.249858 | 0.313510 | 0.166772 | 0 | 1002.354241 | 0 | 0.000000 | 0.000000 | 0 |
S4 | 0 | 0 | 0.000000 | 0.095082 | 0.000000 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0.354157 | 0.000000 | 0.887003 | 0 | 1230.766795 | 0 | 0.000000 | 0.131215 | 0 |
S5 | 0 | 0 | 0.000000 | 0.000000 | 0.182703 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0.039263 | 0.000000 | 131.077131 | 0 | 1614.749122 | 0 | 0.242179 | 95.485743 | 0 |
5 rows × 27723 columns
The authors filtered the expression data based on having at least 3 single cells express genes with at TPM (transcripts per million, ) > 1. We can express this in using the `pandas <http://pandas.pydata.org>`_ DataFrames easily.
First, from reading the paper and looking at the data, I know there are 18 single cells, and there are 18 samples that start with the letter “S.” So I will extract the single samples from the index (row names) using a lambda, a tiny function which in this case, tells me whether or not that sample id begins with the letter “S”.
singles_ids = expression.index[expression.index.map(lambda x: x.startswith('S'))]
print('number of single cells:', len(singles_ids))
singles = expression.ix[singles_ids]
expression_filtered = expression.ix[:, singles[singles > 1].count() >= 3]
expression_filtered = np.log(expression_filtered + 1)
expression_filtered.shape
('number of single cells:', 18)
(21, 6312)
Hmm, that’s strange. The paper states that they had 6313 genes after filtering, but I get 6312. Even using “singles >= 1” doesn’t help.
(I also tried this with the expression table provided in the supplementary data as “SupplementaryTable2.xlsx,” and got the same results.)
Now that we’ve taken care of importing and filtering the expression data, let’s do the feature data of the expression data.
This is similar to the fData from BioconductoR, where there’s some additional data on your features that you want to look at. They uploaded information about the features in their OTHER expression matrix, uploaded as a supplementary file, Supplementary_Table2.xlsx.
Notice that this is a csv and not an xlsx. This is because Excel mangled the gene IDS that started with 201* and assumed they were dates :(
The workaround I did was to add another column to the sheet with the formula ="'" & A1, press Command-Shift-End to select the end of the rows, and then do Ctrl-D to “fill down” to the bottom (thanks to this stackoverflow post for teaching me how to Excel). Then, I saved the file as a csv for maximum portability and compatibility.
expression2 = pd.read_csv('nature12172-s1/Supplementary_Table2.csv',
# Need to specify the index column as both the first and the last columns,
# Because the last column is the "Gene Category"
index_col=[0, -1], parse_dates=False, infer_datetime_format=False)
# This was also in features x samples format, so we need to transpose
expression2 = expression2.T
expression2.head()
'GENE | '0610007L01RIK | '0610007P14RIK | '0610007P22RIK | '0610008F07RIK | '0610009B22RIK | '0610009D07RIK | '0610009O20RIK | '0610010B08RIK | '0610010F05RIK | '0610010K06RIK | ... | 'ZWILCH | 'ZWINT | 'ZXDA | 'ZXDB | 'ZXDC | 'ZYG11A | 'ZYG11B | 'ZYX | 'ZZEF1 | 'ZZZ3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gene Category | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S1 | 27.181570 | 0.166794 | 0 | 0 | 0.000000 | 178.852732 | 0 | 0.962417 | 0.000000 | 143.359550 | ... | 0.000000 | 302.361227 | 0.000000 | 0 | 0 | 0 | 0.027717 | 297.918756 | 37.685501 | 0.000000 |
S2 | 37.682691 | 0.263962 | 0 | 0 | 0.207921 | 0.141099 | 0 | 0.000000 | 0.000000 | 0.255617 | ... | 0.000000 | 96.033724 | 0.020459 | 0 | 0 | 0 | 0.042430 | 0.242888 | 0.000000 | 0.000000 |
S3 | 0.056916 | 78.622459 | 0 | 0 | 0.145680 | 0.396363 | 0 | 0.000000 | 0.024692 | 72.775846 | ... | 0.000000 | 427.915555 | 0.000000 | 0 | 0 | 0 | 0.040407 | 6.753530 | 0.132011 | 0.017615 |
S4 | 55.649250 | 0.228866 | 0 | 0 | 0.000000 | 88.798158 | 0 | 0.000000 | 0.000000 | 93.825442 | ... | 0.000000 | 9.788557 | 0.017787 | 0 | 0 | 0 | 0.013452 | 0.274689 | 9.724890 | 0.000000 |
S5 | 0.000000 | 0.093117 | 0 | 0 | 131.326008 | 155.936361 | 0 | 0.000000 | 0.000000 | 0.031029 | ... | 0.204522 | 26.575760 | 0.000000 | 0 | 0 | 0 | 1.101589 | 59.256094 | 44.430726 | 0.000000 |
5 rows × 27723 columns
Now we need to strip the single-quote I added to all the gene names:
new_index, indexer = expression2.columns.reindex(map(lambda x: (x[0].lstrip("'"), x[1]), expression2.columns.values))
expression2.columns = new_index
expression2.head()
'GENE | 0610007L01RIK | 0610007P14RIK | 0610007P22RIK | 0610008F07RIK | 0610009B22RIK | 0610009D07RIK | 0610009O20RIK | 0610010B08RIK | 0610010F05RIK | 0610010K06RIK | ... | ZWILCH | ZWINT | ZXDA | ZXDB | ZXDC | ZYG11A | ZYG11B | ZYX | ZZEF1 | ZZZ3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gene Category | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S1 | 27.181570 | 0.166794 | 0 | 0 | 0.000000 | 178.852732 | 0 | 0.962417 | 0.000000 | 143.359550 | ... | 0.000000 | 302.361227 | 0.000000 | 0 | 0 | 0 | 0.027717 | 297.918756 | 37.685501 | 0.000000 |
S2 | 37.682691 | 0.263962 | 0 | 0 | 0.207921 | 0.141099 | 0 | 0.000000 | 0.000000 | 0.255617 | ... | 0.000000 | 96.033724 | 0.020459 | 0 | 0 | 0 | 0.042430 | 0.242888 | 0.000000 | 0.000000 |
S3 | 0.056916 | 78.622459 | 0 | 0 | 0.145680 | 0.396363 | 0 | 0.000000 | 0.024692 | 72.775846 | ... | 0.000000 | 427.915555 | 0.000000 | 0 | 0 | 0 | 0.040407 | 6.753530 | 0.132011 | 0.017615 |
S4 | 55.649250 | 0.228866 | 0 | 0 | 0.000000 | 88.798158 | 0 | 0.000000 | 0.000000 | 93.825442 | ... | 0.000000 | 9.788557 | 0.017787 | 0 | 0 | 0 | 0.013452 | 0.274689 | 9.724890 | 0.000000 |
S5 | 0.000000 | 0.093117 | 0 | 0 | 131.326008 | 155.936361 | 0 | 0.000000 | 0.000000 | 0.031029 | ... | 0.204522 | 26.575760 | 0.000000 | 0 | 0 | 0 | 1.101589 | 59.256094 | 44.430726 | 0.000000 |
5 rows × 27723 columns
We want to create a pandas.DataFrame from the “Gene Category” row for our expression_feature_data, which we will do via:
gene_ids, gene_category = zip(*expression2.columns.values)
gene_categories = pd.Series(gene_category, index=gene_ids, name='gene_category')
gene_categories
0610007L01RIK NaN
0610007P14RIK NaN
0610007P22RIK NaN
0610008F07RIK NaN
0610009B22RIK NaN
0610009D07RIK NaN
0610009O20RIK NaN
0610010B08RIK NaN
0610010F05RIK NaN
0610010K06RIK NaN
0610010K14RIK NaN
0610010O12RIK NaN
0610011F06RIK NaN
0610011L14RIK NaN
0610012G03RIK NaN
...
ZSWIM5 NaN
ZSWIM6 NaN
ZSWIM7 NaN
ZUFSP LPS Response
ZW10 NaN
ZWILCH NaN
ZWINT NaN
ZXDA NaN
ZXDB NaN
ZXDC NaN
ZYG11A NaN
ZYG11B NaN
ZYX NaN
ZZEF1 NaN
ZZZ3 NaN
Name: gene_category, Length: 27723, dtype: object
expression_feature_data = pd.DataFrame(gene_categories)
expression_feature_data.head()
gene_category | |
---|---|
0610007L01RIK | NaN |
0610007P14RIK | NaN |
0610007P22RIK | NaN |
0610008F07RIK | NaN |
0610009B22RIK | NaN |
We obtain the splicing data from this study from the supplementary information, specifically the Supplementary_Table4.xls
splicing = pd.read_excel('nature12172-s1/Supplementary_Table4.xls', 'splicingTable.txt', index_col=(0,1))
splicing.head()
S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 | S11 | S13 | S14 | S15 | S16 | S17 | S18 | 10,000 cell Rep1 (P1) | 10,000 cell Rep2 (P2) | 10,000 cell Rep3 (P3) | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Event name | gene | ||||||||||||||||||||
chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- | Os9 | 0.84 | NaN | NaN | NaN | NaN | 0.01 | NaN | NaN | NaN | NaN | 0.03 | NaN | NaN | 0.02 | NaN | 0.01 | NaN | 0.27 | 0.37 | 0.31 |
chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- | Vta1 | 0.95 | NaN | NaN | 0.84 | 0.95 | 0.91 | 0.87 | 0.86 | NaN | NaN | 0.93 | NaN | NaN | 0.96 | NaN | NaN | NaN | 0.83 | 0.85 | 0.64 |
chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ | Bclaf1 | NaN | 0.04 | 0.02 | NaN | NaN | 0.14 | NaN | 0.02 | NaN | NaN | NaN | NaN | NaN | 0.01 | NaN | NaN | NaN | 0.40 | 0.49 | 0.59 |
chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ | Bclaf1 | 0.02 | 0.98 | 0.55 | NaN | NaN | NaN | NaN | 0.98 | NaN | NaN | NaN | NaN | NaN | 0.06 | NaN | NaN | NaN | 0.62 | 0.63 | 0.70 |
chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ | P4ha1 | 0.42 | NaN | NaN | NaN | 0.94 | NaN | NaN | 0.03 | 0.97 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.43 | 0.36 | 0.52 |
splicing = splicing.T
splicing
Event name | chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- | chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- | chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ | chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ | chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ | chr10:79173370:79173665:+@chr10:79174001:79174029:+@chr10:79174239:79174726:+ | chr10:79322526:79322700:+@chr10:79322862:79322939:+@chr10:79323569:79323862:+ | chr10:87376364:87376545:+@chr10:87378043:87378094:+@chr10:87393420:87399792:+ | chr10:92747514:92747722:-@chr10:92727625:92728425:-@chr10:92717434:92717556:- | chr11:101438508:101438565:+@chr11:101439246:101439351:+@chr11:101441899:101443267:+ | ... | chr8:126022488:126022598:+@chr8:126023892:126024007:+@chr8:126025133:126025333:+ | chr14:51455667:51455879:-@chr14:51453589:51453752:-@chr14:51453129:51453242:- | chr17:29497858:29498102:+@chr17:29500656:29500887:+@chr17:29501856:29502226:+ | chr2:94198908:94199094:-@chr2:94182784:94182954:-@chr2:94172950:94173209:- | chr9:21314438:21314697:-@chr9:21313375:21313558:-@chr9:21311823:21312835:- | chr9:21314438:21314697:-@chr9:21313375:21313795:-@chr9:21311823:21312835:- | chr10:79545360:79545471:-@chr10:79542698:79544127:-@chr10:79533365:79535263:- | chr17:5975579:5975881:+@chr17:5985972:5986242:+@chr17:5990136:5990361:+ | chr2:29997782:29997941:+@chr2:30002172:30002382:+@chr2:30002882:30003045:+ | chr7:119221306:119221473:+@chr7:119223686:119223745:+@chr7:119225944:119226075:+ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene | Os9 | Vta1 | Bclaf1 | Bclaf1 | P4ha1 | Bsg | Ptbp1 | Igf1 | Elk3 | Nbr1 | ... | Afg3l1 | Tep1 | Fgd2 | Ttc17 | Tmed1 | Tmed1 | Sbno2 | Synj2 | Tbc1d13 | Usp47 |
S1 | 0.84 | 0.95 | NaN | 0.02 | 0.42 | NaN | 0.57 | 0.31 | 0.93 | 0.57 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S2 | NaN | NaN | 0.04 | 0.98 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S3 | NaN | NaN | 0.02 | 0.55 | NaN | NaN | NaN | 0.20 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S4 | NaN | 0.84 | NaN | NaN | NaN | NaN | NaN | 0.95 | NaN | 0.04 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S5 | NaN | 0.95 | NaN | NaN | 0.94 | NaN | NaN | 0.73 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S6 | 0.01 | 0.91 | 0.14 | NaN | NaN | NaN | NaN | 0.61 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S7 | NaN | 0.87 | NaN | NaN | NaN | 0.62 | NaN | 0.85 | 0.73 | 0.55 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S8 | NaN | 0.86 | 0.02 | 0.98 | 0.03 | NaN | NaN | 0.89 | 0.82 | 0.83 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S9 | NaN | NaN | NaN | NaN | 0.97 | NaN | 0.97 | NaN | 0.90 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S10 | NaN | NaN | NaN | NaN | NaN | NaN | 0.06 | 0.98 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S11 | 0.03 | 0.93 | NaN | NaN | NaN | NaN | NaN | NaN | 0.97 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S13 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S14 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.88 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S15 | 0.02 | 0.96 | 0.01 | 0.06 | NaN | NaN | NaN | 0.44 | NaN | NaN | ... | 0.91 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 0.27 | 0.99 | 0.99 | 0.98 | 0.98 | NaN | NaN | NaN | NaN |
S17 | 0.01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0.96 | NaN | NaN | NaN | 0.99 | 0.98 | 0.67 | 0.07 |
S18 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10,000 cell Rep1 (P1) | 0.27 | 0.83 | 0.40 | 0.62 | 0.43 | 0.78 | NaN | 0.60 | 0.76 | 0.52 | ... | 0.92 | NaN | 0.81 | 0.77 | NaN | NaN | 0.84 | 0.50 | 0.56 | NaN |
10,000 cell Rep2 (P2) | 0.37 | 0.85 | 0.49 | 0.63 | 0.36 | 0.72 | 0.47 | 0.60 | 0.73 | 0.68 | ... | 0.67 | 0.15 | 0.52 | 0.67 | 0.63 | 0.73 | 0.82 | 0.90 | 0.71 | 0.55 |
10,000 cell Rep3 (P3) | 0.31 | 0.64 | 0.59 | 0.70 | 0.52 | 0.79 | NaN | 0.65 | 0.42 | 0.64 | ... | 0.58 | 0.79 | 0.74 | 0.85 | 0.73 | 0.39 | 0.56 | NaN | 0.64 | NaN |
20 rows × 352 columns
The three pooled samples aren’t named consistently with the expression data, so we have to fix that.
splicing.index[splicing.index.map(lambda x: 'P' in x)]
Index([u'10,000 cell Rep1 (P1)', u'10,000 cell Rep2 (P2)', u'10,000 cell Rep3 (P3)'], dtype='object')
Since the pooled sample IDs are inconsistent with the expression data, we have to change them. We can get the “P” and the number after that using regular expressions, called re in the Python standard library, e.g.:
import re
re.search(r'P\d', '10,000 cell Rep1 (P1)').group()
'P1'
def long_pooled_name_to_short(x):
if 'P' not in x:
return x
else:
return re.search(r'P\d', x).group()
splicing.index.map(long_pooled_name_to_short)
array([u'S1', u'S2', u'S3', u'S4', u'S5', u'S6', u'S7', u'S8', u'S9',
u'S10', u'S11', u'S13', u'S14', u'S15', u'S16', u'S17', u'S18',
u'P1', u'P2', u'P3'], dtype=object)
And now we assign this new index as our index to the splicing dataframe
splicing.index = splicing.index.map(long_pooled_name_to_short)
splicing.head()
Event name | chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- | chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- | chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ | chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ | chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ | chr10:79173370:79173665:+@chr10:79174001:79174029:+@chr10:79174239:79174726:+ | chr10:79322526:79322700:+@chr10:79322862:79322939:+@chr10:79323569:79323862:+ | chr10:87376364:87376545:+@chr10:87378043:87378094:+@chr10:87393420:87399792:+ | chr10:92747514:92747722:-@chr10:92727625:92728425:-@chr10:92717434:92717556:- | chr11:101438508:101438565:+@chr11:101439246:101439351:+@chr11:101441899:101443267:+ | ... | chr8:126022488:126022598:+@chr8:126023892:126024007:+@chr8:126025133:126025333:+ | chr14:51455667:51455879:-@chr14:51453589:51453752:-@chr14:51453129:51453242:- | chr17:29497858:29498102:+@chr17:29500656:29500887:+@chr17:29501856:29502226:+ | chr2:94198908:94199094:-@chr2:94182784:94182954:-@chr2:94172950:94173209:- | chr9:21314438:21314697:-@chr9:21313375:21313558:-@chr9:21311823:21312835:- | chr9:21314438:21314697:-@chr9:21313375:21313795:-@chr9:21311823:21312835:- | chr10:79545360:79545471:-@chr10:79542698:79544127:-@chr10:79533365:79535263:- | chr17:5975579:5975881:+@chr17:5985972:5986242:+@chr17:5990136:5990361:+ | chr2:29997782:29997941:+@chr2:30002172:30002382:+@chr2:30002882:30003045:+ | chr7:119221306:119221473:+@chr7:119223686:119223745:+@chr7:119225944:119226075:+ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene | Os9 | Vta1 | Bclaf1 | Bclaf1 | P4ha1 | Bsg | Ptbp1 | Igf1 | Elk3 | Nbr1 | ... | Afg3l1 | Tep1 | Fgd2 | Ttc17 | Tmed1 | Tmed1 | Sbno2 | Synj2 | Tbc1d13 | Usp47 |
S1 | 0.84 | 0.95 | NaN | 0.02 | 0.42 | NaN | 0.57 | 0.31 | 0.93 | 0.57 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S2 | NaN | NaN | 0.04 | 0.98 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S3 | NaN | NaN | 0.02 | 0.55 | NaN | NaN | NaN | 0.20 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S4 | NaN | 0.84 | NaN | NaN | NaN | NaN | NaN | 0.95 | NaN | 0.04 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
S5 | NaN | 0.95 | NaN | NaN | 0.94 | NaN | NaN | 0.73 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 352 columns
Now let’s get into creating a metadata dataframe. We’ll use the index from the expression_filtered data to create the minimum required column, 'phenotype', which has the name of the phenotype of that cell. And we’ll also add the column 'pooled' to indicate whether this sample is pooled or not.
metadata = pd.DataFrame(index=expression_filtered.index)
metadata['phenotype'] = 'BDMC'
metadata['pooled'] = metadata.index.map(lambda x: x.startswith('P'))
metadata
phenotype | pooled | |
---|---|---|
S1 | BDMC | False |
S2 | BDMC | False |
S3 | BDMC | False |
S4 | BDMC | False |
S5 | BDMC | False |
S6 | BDMC | False |
S7 | BDMC | False |
S8 | BDMC | False |
S9 | BDMC | False |
S10 | BDMC | False |
S11 | BDMC | False |
S12 | BDMC | False |
S13 | BDMC | False |
S14 | BDMC | False |
S15 | BDMC | False |
S16 | BDMC | False |
S17 | BDMC | False |
S18 | BDMC | False |
P1 | BDMC | True |
P2 | BDMC | True |
P3 | BDMC | True |
mapping_stats = pd.read_excel('nature12172-s1/Supplementary_Table1.xls', sheetname='SuppTable1 2.txt')
mapping_stats
Sample | PF_READS | PCT_MAPPED_GENOME | PCT_RIBOSOMAL_BASES | MEDIAN_CV_COVERAGE | MEDIAN_5PRIME_BIAS | MEDIAN_3PRIME_BIAS | MEDIAN_5PRIME_TO_3PRIME_BIAS | |
---|---|---|---|---|---|---|---|---|
0 | S1 | 21326048 | 0.706590 | 0.006820 | 0.509939 | 0.092679 | 0.477321 | 0.247741 |
1 | S2 | 27434011 | 0.745385 | 0.004111 | 0.565732 | 0.056583 | 0.321053 | 0.244062 |
2 | S3 | 31142391 | 0.722087 | 0.006428 | 0.540341 | 0.079551 | 0.382286 | 0.267367 |
3 | S4 | 26231852 | 0.737854 | 0.004959 | 0.530978 | 0.067041 | 0.351670 | 0.279782 |
4 | S5 | 29977214 | 0.746466 | 0.006121 | 0.525598 | 0.066543 | 0.353995 | 0.274252 |
5 | S6 | 24148387 | 0.730079 | 0.008794 | 0.529650 | 0.072095 | 0.413696 | 0.225929 |
6 | S7 | 24078116 | 0.730638 | 0.007945 | 0.540913 | 0.051991 | 0.358597 | 0.201984 |
7 | S8 | 25032126 | 0.739989 | 0.004133 | 0.512725 | 0.058783 | 0.373509 | 0.212337 |
8 | S9 | 22257682 | 0.747427 | 0.004869 | 0.521622 | 0.063566 | 0.334294 | 0.240641 |
9 | S10 | 29436289 | 0.748795 | 0.005499 | 0.560454 | 0.036219 | 0.306729 | 0.187479 |
10 | S11 | 31130278 | 0.741882 | 0.002740 | 0.558882 | 0.049581 | 0.349191 | 0.211787 |
11 | S12 | 21161595 | 0.750782 | 0.006837 | 0.756339 | 0.013878 | 0.324264 | 0.195430 |
12 | S13 | 28612833 | 0.733976 | 0.011718 | 0.598687 | 0.035392 | 0.357447 | 0.198566 |
13 | S14 | 26351189 | 0.748323 | 0.004106 | 0.517518 | 0.070293 | 0.381095 | 0.259122 |
14 | S15 | 25739575 | 0.748421 | 0.003353 | 0.526238 | 0.050938 | 0.324207 | 0.212366 |
15 | S16 | 26802346 | 0.739833 | 0.009370 | 0.520287 | 0.071503 | 0.358758 | 0.240009 |
16 | S17 | 26343522 | 0.749358 | 0.003155 | 0.673195 | 0.024121 | 0.301588 | 0.245854 |
17 | S18 | 25290073 | 0.749358 | 0.007465 | 0.562382 | 0.048528 | 0.314776 | 0.215160 |
18 | 10k_rep1 | 28247826 | 0.688553 | 0.018993 | 0.547000 | 0.056113 | 0.484393 | 0.140333 |
19 | 10k_rep2 | 39303876 | 0.690313 | 0.017328 | 0.547621 | 0.055600 | 0.474634 | 0.142889 |
20 | 10k_rep3 | 29831281 | 0.710875 | 0.010610 | 0.518053 | 0.066053 | 0.488738 | 0.168180 |
21 | MB_SC1 | 13848219 | 0.545000 | 0.007000 | 0.531495 | 0.127934 | 0.207841 | 0.728980 |
22 | MB_SC2 | 13550218 | 0.458000 | 0.010800 | 0.569271 | 0.102581 | 0.179407 | 0.694747 |
23 | MB_SC3 | 26765848 | 0.496000 | 0.007900 | 0.535192 | 0.141893 | 0.231068 | 0.722080 |
study = flotilla.Study(# The metadata describing phenotype and pooled samples
metadata,
# A version for this data
version='0.1.0',
# Dataframe of the filtered expression data
expression_data=expression_filtered,
# Dataframe of the feature data of the genes
expression_feature_data=expression_feature_data,
# Dataframe of the splicing data
splicing_data=splicing,
# Dataframe of the mapping stats data
mapping_stats_data=mapping_stats,
# Which column in "mapping_stats" has the number of reads
mapping_stats_number_mapped_col='PF_READS')
2014-11-04 14:19:33 Initializing Study
2014-11-04 14:19:33 Initializing Predictor configuration manager for Study
2014-11-04 14:19:33 Predictor ExtraTreesClassifier is of type <class 'sklearn.ensemble.forest.ExtraTreesClassifier'>
2014-11-04 14:19:33 Added ExtraTreesClassifier to default predictors
2014-11-04 14:19:33 Predictor ExtraTreesRegressor is of type <class 'sklearn.ensemble.forest.ExtraTreesRegressor'>
2014-11-04 14:19:33 Added ExtraTreesRegressor to default predictors
2014-11-04 14:19:33 Predictor GradientBoostingClassifier is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'>
2014-11-04 14:19:33 Added GradientBoostingClassifier to default predictors
2014-11-04 14:19:33 Predictor GradientBoostingRegressor is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingRegressor'>
2014-11-04 14:19:33 Added GradientBoostingRegressor to default predictors
2014-11-04 14:19:33 Loading metadata
2014-11-04 14:19:33 Loading expression data
2014-11-04 14:19:33 Initializing expression
2014-11-04 14:19:33 Done initializing expression
2014-11-04 14:19:33 Loading splicing data
2014-11-04 14:19:33 Initializing splicing
2014-11-04 14:19:33 Done initializing splicing
2014-11-04 14:19:33 Successfully initialized a Study object!
No phenotype to color mapping was provided, so coming up with reasonable defaults
No phenotype to marker (matplotlib plotting symbol) was provided, so each phenotype will be plotted as a circle in the PCA visualizations.
samples had too few mapped reads (<5.0e+05 reads):
Removing technical outliers from consideration in expression:
Removing technical outliers from consideration in splicing:
As a side note, you can save this study to disk now, so you can “embark” later:
study.save('shalek2013')
Wrote datapackage to /Users/olga/flotilla_projects/shalek2013/datapackage.json
Note that this is saved to my home directory, in ~/flotilla_projects/<study_name>/. This will be saved in your home directory, too.
The datapackage.json file is what holds all the information relative to the study, and loosely follows the datapackage spec created by the Open Knowledge Foundation.
cat /Users/olga/flotilla_projects/shalek2013/datapackage.json
{
"name": "shalek2013",
"title": null,
"datapackage_version": "0.1.1",
"sources": null,
"licenses": null,
"resources": [
{
"path": "/Users/olga/flotilla_projects/shalek2013/splicing.csv.gz",
"format": "csv",
"name": "splicing",
"compression": "gzip"
},
{
"number_mapped_col": "PF_READS",
"path": "/Users/olga/flotilla_projects/shalek2013/mapping_stats.csv.gz",
"format": "csv",
"name": "mapping_stats",
"compression": "gzip"
},
{
"name": "expression_feature",
"format": "csv",
"rename_col": null,
"ignore_subset_cols": [],
"path": "/Users/olga/flotilla_projects/shalek2013/expression_feature.csv.gz",
"compression": "gzip"
},
{
"name": "expression",
"log_base": null,
"format": "csv",
"thresh": -Infinity,
"path": "/Users/olga/flotilla_projects/shalek2013/expression.csv.gz",
"compression": "gzip"
},
{
"name": "splicing_feature",
"format": "csv",
"rename_col": "gene_name",
"ignore_subset_cols": [],
"path": "/Users/olga/flotilla_projects/shalek2013/splicing_feature.csv.gz",
"compression": "gzip"
},
{
"pooled_col": "pooled",
"name": "metadata",
"phenotype_to_marker": {
"BDMC": "o"
},
"format": "csv",
"minimum_samples": 0,
"phenotype_to_color": {
"BDMC": "#1b9e77"
},
"path": "/Users/olga/flotilla_projects/shalek2013/metadata.csv.gz",
"phenotype_col": "phenotype",
"phenotype_order": [
"BDMC"
],
"compression": "gzip"
}
]
}
One thing to note is that when you save, the version number is bumped up. study.version (the one we just made) is 0.1.0, but the one we saved is 0.1.1, since we could have made some changes to the data.
Let’s look at what else is in this folder:
ls /Users/olga/flotilla_projects/shalek2013
datapackage.json expression_feature.csv.gz metadata.csv.gz splicing_feature.csv.gz
expression.csv.gz mapping_stats.csv.gz splicing.csv.gz
So this is where all the other files are. Good to know!
We can “embark” on this newly-saved study now very painlessly, without having to open and process all those files again:
study2 = flotilla.embark('shalek2013')
2014-10-30 12:16:56 Reading datapackage from /Users/olga/flotilla_projects/shalek2013/datapackage.json
2014-10-30 12:16:56 Parsing datapackage to create a Study object
2014-10-30 12:16:56 Initializing Study
2014-10-30 12:16:56 Initializing Predictor configuration manager for Study
2014-10-30 12:16:56 Predictor ExtraTreesClassifier is of type <class 'sklearn.ensemble.forest.ExtraTreesClassifier'>
2014-10-30 12:16:56 Added ExtraTreesClassifier to default predictors
2014-10-30 12:16:56 Predictor ExtraTreesRegressor is of type <class 'sklearn.ensemble.forest.ExtraTreesRegressor'>
2014-10-30 12:16:56 Added ExtraTreesRegressor to default predictors
2014-10-30 12:16:56 Predictor GradientBoostingClassifier is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'>
2014-10-30 12:16:56 Added GradientBoostingClassifier to default predictors
2014-10-30 12:16:56 Predictor GradientBoostingRegressor is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingRegressor'>
2014-10-30 12:16:56 Added GradientBoostingRegressor to default predictors
2014-10-30 12:16:56 Loading metadata
2014-10-30 12:16:56 Loading expression data
2014-10-30 12:16:56 Initializing expression
2014-10-30 12:16:56 Done initializing expression
2014-10-30 12:16:56 Loading splicing data
2014-10-30 12:16:56 Initializing splicing
2014-10-30 12:16:56 Done initializing splicing
2014-10-30 12:16:56 Successfully initialized a Study object!