Create a barebones datapackage ============================== Before we begin, let's import everything we need. .. code:: python # Import the flotilla package for biological data analysis import flotilla # Import "numerical python" library for number crunching import numpy as np # Import "panel data analysis" library for tabular data import pandas as pd Shalek and Sujita, *et al* (2013) --------------------------------- In the 2013 paper, `Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells `_ (Shalek and Sujita, *et al*. *Nature* (2013)), Regev and colleagues performed single-cell sequencing 18 bone marrow-derived dendritic cells (BMDCs), in addition to 3 pooled samples. Expression data --------------- First, we will read in the expression data. These data were obtained using, :: wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE41nnn/GSE41265/suppl/GSE41265_allGenesTPM.txt.gz We will also compare to the supplementary table 2 data, obtained using :: wget http://www.nature.com/nature/journal/v498/n7453/extref/nature12172-s1.zip unzip nature12172-s1.zip .. code:: python expression = pd.read_table("GSE41265_allGenesTPM.txt.gz", compression="gzip", index_col=0) expression.head() :: --------------------------------------------------------------------------- IOError Traceback (most recent call last) in () ----> 1 expression = pd.read_table("GSE41265_allGenesTPM.txt.gz", compression="gzip", index_col=0) 2 expression.head() /home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines) 472 skip_blank_lines=skip_blank_lines) 473 --> 474 return _read(filepath_or_buffer, kwds) 475 476 parser_f.__name__ = name /home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds) 248 249 # Create the parser. --> 250 parser = TextFileReader(filepath_or_buffer, **kwds) 251 252 if (nrows is not None) and (chunksize is not None): /home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds) 564 self.options['has_index_names'] = kwds['has_index_names'] 565 --> 566 self._make_engine(self.engine) 567 568 def _get_options_with_defaults(self, engine): /home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/io/parsers.pyc in _make_engine(self, engine) 703 def _make_engine(self, engine='c'): 704 if engine == 'c': --> 705 self._engine = CParserWrapper(self.f, **self.options) 706 else: 707 if engine == 'python': /home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, src, **kwds) 1070 kwds['allow_leading_cols'] = self.index_col is not False 1071 -> 1072 self._reader = _parser.TextReader(src, **kwds) 1073 1074 # XXX /home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3173)() /home/travis/miniconda/envs/testenv/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:5468)() /home/travis/miniconda/envs/testenv/lib/python2.7/gzip.pyc in __init__(self, filename, mode, compresslevel, fileobj, mtime) 92 mode += 'b' 93 if fileobj is None: ---> 94 fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb') 95 if filename is None: 96 # Issue #13781: os.fdopen() creates a fileobj with a bogus name IOError: [Errno 2] No such file or directory: 'GSE41265_allGenesTPM.txt.gz' These data are in the "transcripts per million," aka TPM unit. See `this `_ blog post if that sounds weird to you. These data are formatted with samples on the columns, and genes on the rows. But we want the opposite, with samples on the rows and genes on the columns. This follows ```scikit-learn`` `_'s standard of data matrices with size (``n_samples``, ``n_features``) as each gene is a feature. So we will simply transpose this. .. code:: python expression = expression.T expression.head() .. raw:: html
GENE XKR4 AB338584 B3GAT2 NPL T2 T PDE10A 1700010I14RIK 6530411M01RIK PABPC6 ... AK085062 DHX9 RNASET2B FGFR1OP CCR6 BRP44L AK014435 AK015714 SFT2D1 PRR18
S1 0 0 0.000000 72.008590 0.109249 0 0 0 0 0 ... 0 0.774638 23.520936 0.000000 0 460.316773 0 0.000000 39.442566 0
S2 0 0 0.000000 0.000000 0.172009 0 0 0 0 0 ... 0 0.367391 1.887873 0.000000 0 823.890290 0 0.000000 4.967412 0
S3 0 0 0.023441 128.062012 0.000000 0 0 0 0 0 ... 0 0.249858 0.313510 0.166772 0 1002.354241 0 0.000000 0.000000 0
S4 0 0 0.000000 0.095082 0.000000 0 0 0 0 0 ... 0 0.354157 0.000000 0.887003 0 1230.766795 0 0.000000 0.131215 0
S5 0 0 0.000000 0.000000 0.182703 0 0 0 0 0 ... 0 0.039263 0.000000 131.077131 0 1614.749122 0 0.242179 95.485743 0

5 rows × 27723 columns

The authors filtered the expression data based on having at least 3 single cells express genes with at TPM (transcripts per million, ) > 1. We can express this in using the ```pandas`` `_ DataFrames easily. First, from reading the paper and looking at the data, I know there are 18 single cells, and there are 18 samples that start with the letter "S." So I will extract the single samples from the ``index`` (row names) using a ``lambda``, a tiny function which in this case, tells me whether or not that sample id begins with the letter "S". .. code:: python singles_ids = expression.index[expression.index.map(lambda x: x.startswith('S'))] print('number of single cells:', len(singles_ids)) singles = expression.ix[singles_ids] expression_filtered = expression.ix[:, singles[singles > 1].count() >= 3] expression_filtered = np.log(expression_filtered + 1) expression_filtered.shape .. parsed-literal:: ('number of single cells:', 18) .. parsed-literal:: (21, 6312) Hmm, that's strange. The paper states that they had 6313 genes after filtering, but I get 6312. Even using "``singles >= 1``" doesn't help. (I also tried this with the expression table provided in the supplementary data as "``SupplementaryTable2.xlsx``," and got the same results.) Now that we've taken care of importing and filtering the expression data, let's do the feature data of the expression data. Expression feature data ----------------------- This is similar to the ``fData`` from ``BioconductoR``, where there's some additional data on your features that you want to look at. They uploaded information about the features in their OTHER expression matrix, uploaded as a supplementary file, ``Supplementary_Table2.xlsx``. Notice that this is a ``csv`` and not an ``xlsx``. This is because Excel mangled the gene IDS that started with ``201*`` and assumed they were dates :( The workaround I did was to add another column to the sheet with the formula ``="'" & A1``, press ``Command``-``Shift``-``End`` to select the end of the rows, and then do ``Ctrl``-``D`` to "fill down" to the bottom (thanks to `this `_ stackoverflow post for teaching me how to Excel). Then, I saved the file as a ``csv`` for maximum portability and compatibility. .. code:: python expression2 = pd.read_csv('nature12172-s1/Supplementary_Table2.csv', # Need to specify the index column as both the first and the last columns, # Because the last column is the "Gene Category" index_col=[0, -1], parse_dates=False, infer_datetime_format=False) # This was also in features x samples format, so we need to transpose expression2 = expression2.T expression2.head() .. raw:: html
'GENE '0610007L01RIK '0610007P14RIK '0610007P22RIK '0610008F07RIK '0610009B22RIK '0610009D07RIK '0610009O20RIK '0610010B08RIK '0610010F05RIK '0610010K06RIK ... 'ZWILCH 'ZWINT 'ZXDA 'ZXDB 'ZXDC 'ZYG11A 'ZYG11B 'ZYX 'ZZEF1 'ZZZ3
Gene Category NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S1 27.181570 0.166794 0 0 0.000000 178.852732 0 0.962417 0.000000 143.359550 ... 0.000000 302.361227 0.000000 0 0 0 0.027717 297.918756 37.685501 0.000000
S2 37.682691 0.263962 0 0 0.207921 0.141099 0 0.000000 0.000000 0.255617 ... 0.000000 96.033724 0.020459 0 0 0 0.042430 0.242888 0.000000 0.000000
S3 0.056916 78.622459 0 0 0.145680 0.396363 0 0.000000 0.024692 72.775846 ... 0.000000 427.915555 0.000000 0 0 0 0.040407 6.753530 0.132011 0.017615
S4 55.649250 0.228866 0 0 0.000000 88.798158 0 0.000000 0.000000 93.825442 ... 0.000000 9.788557 0.017787 0 0 0 0.013452 0.274689 9.724890 0.000000
S5 0.000000 0.093117 0 0 131.326008 155.936361 0 0.000000 0.000000 0.031029 ... 0.204522 26.575760 0.000000 0 0 0 1.101589 59.256094 44.430726 0.000000

5 rows × 27723 columns

Now we need to strip the single-quote I added to all the gene names: .. code:: python new_index, indexer = expression2.columns.reindex(map(lambda x: (x[0].lstrip("'"), x[1]), expression2.columns.values)) expression2.columns = new_index expression2.head() .. raw:: html
'GENE 0610007L01RIK 0610007P14RIK 0610007P22RIK 0610008F07RIK 0610009B22RIK 0610009D07RIK 0610009O20RIK 0610010B08RIK 0610010F05RIK 0610010K06RIK ... ZWILCH ZWINT ZXDA ZXDB ZXDC ZYG11A ZYG11B ZYX ZZEF1 ZZZ3
Gene Category NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S1 27.181570 0.166794 0 0 0.000000 178.852732 0 0.962417 0.000000 143.359550 ... 0.000000 302.361227 0.000000 0 0 0 0.027717 297.918756 37.685501 0.000000
S2 37.682691 0.263962 0 0 0.207921 0.141099 0 0.000000 0.000000 0.255617 ... 0.000000 96.033724 0.020459 0 0 0 0.042430 0.242888 0.000000 0.000000
S3 0.056916 78.622459 0 0 0.145680 0.396363 0 0.000000 0.024692 72.775846 ... 0.000000 427.915555 0.000000 0 0 0 0.040407 6.753530 0.132011 0.017615
S4 55.649250 0.228866 0 0 0.000000 88.798158 0 0.000000 0.000000 93.825442 ... 0.000000 9.788557 0.017787 0 0 0 0.013452 0.274689 9.724890 0.000000
S5 0.000000 0.093117 0 0 131.326008 155.936361 0 0.000000 0.000000 0.031029 ... 0.204522 26.575760 0.000000 0 0 0 1.101589 59.256094 44.430726 0.000000

5 rows × 27723 columns

We want to create a ``pandas.DataFrame`` from the "Gene Category" row for our ``expression_feature_data``, which we will do via: .. code:: python gene_ids, gene_category = zip(*expression2.columns.values) gene_categories = pd.Series(gene_category, index=gene_ids, name='gene_category') gene_categories .. parsed-literal:: 0610007L01RIK NaN 0610007P14RIK NaN 0610007P22RIK NaN 0610008F07RIK NaN 0610009B22RIK NaN 0610009D07RIK NaN 0610009O20RIK NaN 0610010B08RIK NaN 0610010F05RIK NaN 0610010K06RIK NaN 0610010K14RIK NaN 0610010O12RIK NaN 0610011F06RIK NaN 0610011L14RIK NaN 0610012G03RIK NaN ... ZSWIM5 NaN ZSWIM6 NaN ZSWIM7 NaN ZUFSP LPS Response ZW10 NaN ZWILCH NaN ZWINT NaN ZXDA NaN ZXDB NaN ZXDC NaN ZYG11A NaN ZYG11B NaN ZYX NaN ZZEF1 NaN ZZZ3 NaN Name: gene_category, Length: 27723, dtype: object .. code:: python expression_feature_data = pd.DataFrame(gene_categories) expression_feature_data.head() .. raw:: html
gene_category
0610007L01RIK NaN
0610007P14RIK NaN
0610007P22RIK NaN
0610008F07RIK NaN
0610009B22RIK NaN
Splicing Data ------------- We obtain the splicing data from this study from the supplementary information, specifically the ``Supplementary_Table4.xls`` .. code:: python splicing = pd.read_excel('nature12172-s1/Supplementary_Table4.xls', 'splicingTable.txt', index_col=(0,1)) splicing.head() .. raw:: html
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S13 S14 S15 S16 S17 S18 10,000 cell Rep1 (P1) 10,000 cell Rep2 (P2) 10,000 cell Rep3 (P3)
Event name gene
chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- Os9 0.84 NaN NaN NaN NaN 0.01 NaN NaN NaN NaN 0.03 NaN NaN 0.02 NaN 0.01 NaN 0.27 0.37 0.31
chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- Vta1 0.95 NaN NaN 0.84 0.95 0.91 0.87 0.86 NaN NaN 0.93 NaN NaN 0.96 NaN NaN NaN 0.83 0.85 0.64
chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ Bclaf1 NaN 0.04 0.02 NaN NaN 0.14 NaN 0.02 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN 0.40 0.49 0.59
chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ Bclaf1 0.02 0.98 0.55 NaN NaN NaN NaN 0.98 NaN NaN NaN NaN NaN 0.06 NaN NaN NaN 0.62 0.63 0.70
chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ P4ha1 0.42 NaN NaN NaN 0.94 NaN NaN 0.03 0.97 NaN NaN NaN NaN NaN NaN NaN NaN 0.43 0.36 0.52
.. code:: python splicing = splicing.T splicing .. raw:: html
Event name chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ chr10:79173370:79173665:+@chr10:79174001:79174029:+@chr10:79174239:79174726:+ chr10:79322526:79322700:+@chr10:79322862:79322939:+@chr10:79323569:79323862:+ chr10:87376364:87376545:+@chr10:87378043:87378094:+@chr10:87393420:87399792:+ chr10:92747514:92747722:-@chr10:92727625:92728425:-@chr10:92717434:92717556:- chr11:101438508:101438565:+@chr11:101439246:101439351:+@chr11:101441899:101443267:+ ... chr8:126022488:126022598:+@chr8:126023892:126024007:+@chr8:126025133:126025333:+ chr14:51455667:51455879:-@chr14:51453589:51453752:-@chr14:51453129:51453242:- chr17:29497858:29498102:+@chr17:29500656:29500887:+@chr17:29501856:29502226:+ chr2:94198908:94199094:-@chr2:94182784:94182954:-@chr2:94172950:94173209:- chr9:21314438:21314697:-@chr9:21313375:21313558:-@chr9:21311823:21312835:- chr9:21314438:21314697:-@chr9:21313375:21313795:-@chr9:21311823:21312835:- chr10:79545360:79545471:-@chr10:79542698:79544127:-@chr10:79533365:79535263:- chr17:5975579:5975881:+@chr17:5985972:5986242:+@chr17:5990136:5990361:+ chr2:29997782:29997941:+@chr2:30002172:30002382:+@chr2:30002882:30003045:+ chr7:119221306:119221473:+@chr7:119223686:119223745:+@chr7:119225944:119226075:+
gene Os9 Vta1 Bclaf1 Bclaf1 P4ha1 Bsg Ptbp1 Igf1 Elk3 Nbr1 ... Afg3l1 Tep1 Fgd2 Ttc17 Tmed1 Tmed1 Sbno2 Synj2 Tbc1d13 Usp47
S1 0.84 0.95 NaN 0.02 0.42 NaN 0.57 0.31 0.93 0.57 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S2 NaN NaN 0.04 0.98 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S3 NaN NaN 0.02 0.55 NaN NaN NaN 0.20 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S4 NaN 0.84 NaN NaN NaN NaN NaN 0.95 NaN 0.04 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S5 NaN 0.95 NaN NaN 0.94 NaN NaN 0.73 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S6 0.01 0.91 0.14 NaN NaN NaN NaN 0.61 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S7 NaN 0.87 NaN NaN NaN 0.62 NaN 0.85 0.73 0.55 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S8 NaN 0.86 0.02 0.98 0.03 NaN NaN 0.89 0.82 0.83 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S9 NaN NaN NaN NaN 0.97 NaN 0.97 NaN 0.90 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S10 NaN NaN NaN NaN NaN NaN 0.06 0.98 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S11 0.03 0.93 NaN NaN NaN NaN NaN NaN 0.97 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S14 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.88 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S15 0.02 0.96 0.01 0.06 NaN NaN NaN 0.44 NaN NaN ... 0.91 NaN NaN NaN NaN NaN NaN NaN NaN NaN
S16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 0.27 0.99 0.99 0.98 0.98 NaN NaN NaN NaN
S17 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN 0.96 NaN NaN NaN 0.99 0.98 0.67 0.07
S18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10,000 cell Rep1 (P1) 0.27 0.83 0.40 0.62 0.43 0.78 NaN 0.60 0.76 0.52 ... 0.92 NaN 0.81 0.77 NaN NaN 0.84 0.50 0.56 NaN
10,000 cell Rep2 (P2) 0.37 0.85 0.49 0.63 0.36 0.72 0.47 0.60 0.73 0.68 ... 0.67 0.15 0.52 0.67 0.63 0.73 0.82 0.90 0.71 0.55
10,000 cell Rep3 (P3) 0.31 0.64 0.59 0.70 0.52 0.79 NaN 0.65 0.42 0.64 ... 0.58 0.79 0.74 0.85 0.73 0.39 0.56 NaN 0.64 NaN

20 rows × 352 columns

The three pooled samples aren't named consistently with the expression data, so we have to fix that. .. code:: python splicing.index[splicing.index.map(lambda x: 'P' in x)] .. parsed-literal:: Index([u'10,000 cell Rep1 (P1)', u'10,000 cell Rep2 (P2)', u'10,000 cell Rep3 (P3)'], dtype='object') Since the pooled sample IDs are inconsistent with the ``expression`` data, we have to change them. We can get the "P" and the number after that using regular expressions, called ``re`` in the Python standard library, e.g.: .. code:: python import re re.search(r'P\d', '10,000 cell Rep1 (P1)').group() .. parsed-literal:: 'P1' .. code:: python def long_pooled_name_to_short(x): if 'P' not in x: return x else: return re.search(r'P\d', x).group() splicing.index.map(long_pooled_name_to_short) .. parsed-literal:: array([u'S1', u'S2', u'S3', u'S4', u'S5', u'S6', u'S7', u'S8', u'S9', u'S10', u'S11', u'S13', u'S14', u'S15', u'S16', u'S17', u'S18', u'P1', u'P2', u'P3'], dtype=object) And now we assign this new index as our index to the ``splicing`` dataframe .. code:: python splicing.index = splicing.index.map(long_pooled_name_to_short) splicing.head() .. raw:: html
Event name chr10:126534988:126535177:-@chr10:126533971:126534135:-@chr10:126533686:126533798:- chr10:14403870:14403945:-@chr10:14395740:14395848:-@chr10:14387738:14387914:- chr10:20051892:20052067:+@chr10:20052202:20052363:+@chr10:20053198:20053697:+ chr10:20052864:20053378:+@chr10:20054305:20054451:+@chr10:20059515:20059727:+ chr10:58814831:58815007:+@chr10:58817088:58817158:+@chr10:58818098:58818168:+@chr10:58824609:58824708:+ chr10:79173370:79173665:+@chr10:79174001:79174029:+@chr10:79174239:79174726:+ chr10:79322526:79322700:+@chr10:79322862:79322939:+@chr10:79323569:79323862:+ chr10:87376364:87376545:+@chr10:87378043:87378094:+@chr10:87393420:87399792:+ chr10:92747514:92747722:-@chr10:92727625:92728425:-@chr10:92717434:92717556:- chr11:101438508:101438565:+@chr11:101439246:101439351:+@chr11:101441899:101443267:+ ... chr8:126022488:126022598:+@chr8:126023892:126024007:+@chr8:126025133:126025333:+ chr14:51455667:51455879:-@chr14:51453589:51453752:-@chr14:51453129:51453242:- chr17:29497858:29498102:+@chr17:29500656:29500887:+@chr17:29501856:29502226:+ chr2:94198908:94199094:-@chr2:94182784:94182954:-@chr2:94172950:94173209:- chr9:21314438:21314697:-@chr9:21313375:21313558:-@chr9:21311823:21312835:- chr9:21314438:21314697:-@chr9:21313375:21313795:-@chr9:21311823:21312835:- chr10:79545360:79545471:-@chr10:79542698:79544127:-@chr10:79533365:79535263:- chr17:5975579:5975881:+@chr17:5985972:5986242:+@chr17:5990136:5990361:+ chr2:29997782:29997941:+@chr2:30002172:30002382:+@chr2:30002882:30003045:+ chr7:119221306:119221473:+@chr7:119223686:119223745:+@chr7:119225944:119226075:+
gene Os9 Vta1 Bclaf1 Bclaf1 P4ha1 Bsg Ptbp1 Igf1 Elk3 Nbr1 ... Afg3l1 Tep1 Fgd2 Ttc17 Tmed1 Tmed1 Sbno2 Synj2 Tbc1d13 Usp47
S1 0.84 0.95 NaN 0.02 0.42 NaN 0.57 0.31 0.93 0.57 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S2 NaN NaN 0.04 0.98 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S3 NaN NaN 0.02 0.55 NaN NaN NaN 0.20 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S4 NaN 0.84 NaN NaN NaN NaN NaN 0.95 NaN 0.04 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
S5 NaN 0.95 NaN NaN 0.94 NaN NaN 0.73 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 352 columns

Metadata -------- Now let's get into creating a metadata dataframe. We'll use the index from the ``expression_filtered`` data to create the minimum required column, ``'phenotype'``, which has the name of the phenotype of that cell. And we'll also add the column ``'pooled'`` to indicate whether this sample is pooled or not. .. code:: python metadata = pd.DataFrame(index=expression_filtered.index) metadata['phenotype'] = 'BDMC' metadata['pooled'] = metadata.index.map(lambda x: x.startswith('P')) metadata .. raw:: html
phenotype pooled
S1 BDMC False
S2 BDMC False
S3 BDMC False
S4 BDMC False
S5 BDMC False
S6 BDMC False
S7 BDMC False
S8 BDMC False
S9 BDMC False
S10 BDMC False
S11 BDMC False
S12 BDMC False
S13 BDMC False
S14 BDMC False
S15 BDMC False
S16 BDMC False
S17 BDMC False
S18 BDMC False
P1 BDMC True
P2 BDMC True
P3 BDMC True
Mapping stats data ------------------ .. code:: python mapping_stats = pd.read_excel('nature12172-s1/Supplementary_Table1.xls', sheetname='SuppTable1 2.txt') mapping_stats .. raw:: html
Sample PF_READS PCT_MAPPED_GENOME PCT_RIBOSOMAL_BASES MEDIAN_CV_COVERAGE MEDIAN_5PRIME_BIAS MEDIAN_3PRIME_BIAS MEDIAN_5PRIME_TO_3PRIME_BIAS
0 S1 21326048 0.706590 0.006820 0.509939 0.092679 0.477321 0.247741
1 S2 27434011 0.745385 0.004111 0.565732 0.056583 0.321053 0.244062
2 S3 31142391 0.722087 0.006428 0.540341 0.079551 0.382286 0.267367
3 S4 26231852 0.737854 0.004959 0.530978 0.067041 0.351670 0.279782
4 S5 29977214 0.746466 0.006121 0.525598 0.066543 0.353995 0.274252
5 S6 24148387 0.730079 0.008794 0.529650 0.072095 0.413696 0.225929
6 S7 24078116 0.730638 0.007945 0.540913 0.051991 0.358597 0.201984
7 S8 25032126 0.739989 0.004133 0.512725 0.058783 0.373509 0.212337
8 S9 22257682 0.747427 0.004869 0.521622 0.063566 0.334294 0.240641
9 S10 29436289 0.748795 0.005499 0.560454 0.036219 0.306729 0.187479
10 S11 31130278 0.741882 0.002740 0.558882 0.049581 0.349191 0.211787
11 S12 21161595 0.750782 0.006837 0.756339 0.013878 0.324264 0.195430
12 S13 28612833 0.733976 0.011718 0.598687 0.035392 0.357447 0.198566
13 S14 26351189 0.748323 0.004106 0.517518 0.070293 0.381095 0.259122
14 S15 25739575 0.748421 0.003353 0.526238 0.050938 0.324207 0.212366
15 S16 26802346 0.739833 0.009370 0.520287 0.071503 0.358758 0.240009
16 S17 26343522 0.749358 0.003155 0.673195 0.024121 0.301588 0.245854
17 S18 25290073 0.749358 0.007465 0.562382 0.048528 0.314776 0.215160
18 10k_rep1 28247826 0.688553 0.018993 0.547000 0.056113 0.484393 0.140333
19 10k_rep2 39303876 0.690313 0.017328 0.547621 0.055600 0.474634 0.142889
20 10k_rep3 29831281 0.710875 0.010610 0.518053 0.066053 0.488738 0.168180
21 MB_SC1 13848219 0.545000 0.007000 0.531495 0.127934 0.207841 0.728980
22 MB_SC2 13550218 0.458000 0.010800 0.569271 0.102581 0.179407 0.694747
23 MB_SC3 26765848 0.496000 0.007900 0.535192 0.141893 0.231068 0.722080
Create a ``flotilla`` Study! ---------------------------- .. code:: python study = flotilla.Study(# The metadata describing phenotype and pooled samples metadata, # A version for this data version='0.1.0', # Dataframe of the filtered expression data expression_data=expression_filtered, # Dataframe of the feature data of the genes expression_feature_data=expression_feature_data, # Dataframe of the splicing data splicing_data=splicing, # Dataframe of the mapping stats data mapping_stats_data=mapping_stats, # Which column in "mapping_stats" has the number of reads mapping_stats_number_mapped_col='PF_READS') .. parsed-literal:: 2014-11-04 14:19:33 Initializing Study 2014-11-04 14:19:33 Initializing Predictor configuration manager for Study 2014-11-04 14:19:33 Predictor ExtraTreesClassifier is of type 2014-11-04 14:19:33 Added ExtraTreesClassifier to default predictors 2014-11-04 14:19:33 Predictor ExtraTreesRegressor is of type 2014-11-04 14:19:33 Added ExtraTreesRegressor to default predictors 2014-11-04 14:19:33 Predictor GradientBoostingClassifier is of type 2014-11-04 14:19:33 Added GradientBoostingClassifier to default predictors 2014-11-04 14:19:33 Predictor GradientBoostingRegressor is of type 2014-11-04 14:19:33 Added GradientBoostingRegressor to default predictors 2014-11-04 14:19:33 Loading metadata 2014-11-04 14:19:33 Loading expression data 2014-11-04 14:19:33 Initializing expression 2014-11-04 14:19:33 Done initializing expression 2014-11-04 14:19:33 Loading splicing data 2014-11-04 14:19:33 Initializing splicing 2014-11-04 14:19:33 Done initializing splicing 2014-11-04 14:19:33 Successfully initialized a Study object! .. parsed-literal:: No phenotype to color mapping was provided, so coming up with reasonable defaults No phenotype to marker (matplotlib plotting symbol) was provided, so each phenotype will be plotted as a circle in the PCA visualizations. samples had too few mapped reads (<5.0e+05 reads): Removing technical outliers from consideration in expression: Removing technical outliers from consideration in splicing: As a side note, you can save this study to disk now, so you can "``embark``" later: .. code:: python study.save('shalek2013') .. parsed-literal:: Wrote datapackage to /Users/olga/flotilla_projects/shalek2013/datapackage.json Note that this is saved to my home directory, in ``~/flotilla_projects//``. This will be saved in your home directory, too. The ``datapackage.json`` file is what holds all the information relative to the study, and loosely follows the `datapackage spec `_ created by the Open Knowledge Foundation. .. code:: python cat /Users/olga/flotilla_projects/shalek2013/datapackage.json .. parsed-literal:: { "name": "shalek2013", "title": null, "datapackage_version": "0.1.1", "sources": null, "licenses": null, "resources": [ { "path": "/Users/olga/flotilla_projects/shalek2013/splicing.csv.gz", "format": "csv", "name": "splicing", "compression": "gzip" }, { "number_mapped_col": "PF_READS", "path": "/Users/olga/flotilla_projects/shalek2013/mapping_stats.csv.gz", "format": "csv", "name": "mapping_stats", "compression": "gzip" }, { "name": "expression_feature", "format": "csv", "rename_col": null, "ignore_subset_cols": [], "path": "/Users/olga/flotilla_projects/shalek2013/expression_feature.csv.gz", "compression": "gzip" }, { "name": "expression", "log_base": null, "format": "csv", "thresh": -Infinity, "path": "/Users/olga/flotilla_projects/shalek2013/expression.csv.gz", "compression": "gzip" }, { "name": "splicing_feature", "format": "csv", "rename_col": "gene_name", "ignore_subset_cols": [], "path": "/Users/olga/flotilla_projects/shalek2013/splicing_feature.csv.gz", "compression": "gzip" }, { "pooled_col": "pooled", "name": "metadata", "phenotype_to_marker": { "BDMC": "o" }, "format": "csv", "minimum_samples": 0, "phenotype_to_color": { "BDMC": "#1b9e77" }, "path": "/Users/olga/flotilla_projects/shalek2013/metadata.csv.gz", "phenotype_col": "phenotype", "phenotype_order": [ "BDMC" ], "compression": "gzip" } ] } One thing to note is that when you save, the version number is bumped up. ``study.version`` (the one we just made) is ``0.1.0``, but the one we saved is ``0.1.1``, since we could have made some changes to the data. Let's look at what else is in this folder: .. code:: python ls /Users/olga/flotilla_projects/shalek2013 .. parsed-literal:: datapackage.json expression_feature.csv.gz metadata.csv.gz splicing_feature.csv.gz expression.csv.gz mapping_stats.csv.gz splicing.csv.gz So this is where all the other files are. Good to know! We can "embark" on this newly-saved study now very painlessly, without having to open and process all those files again: .. code:: python study2 = flotilla.embark('shalek2013') .. parsed-literal:: 2014-10-30 12:16:56 Reading datapackage from /Users/olga/flotilla_projects/shalek2013/datapackage.json 2014-10-30 12:16:56 Parsing datapackage to create a Study object 2014-10-30 12:16:56 Initializing Study 2014-10-30 12:16:56 Initializing Predictor configuration manager for Study 2014-10-30 12:16:56 Predictor ExtraTreesClassifier is of type 2014-10-30 12:16:56 Added ExtraTreesClassifier to default predictors 2014-10-30 12:16:56 Predictor ExtraTreesRegressor is of type 2014-10-30 12:16:56 Added ExtraTreesRegressor to default predictors 2014-10-30 12:16:56 Predictor GradientBoostingClassifier is of type 2014-10-30 12:16:56 Added GradientBoostingClassifier to default predictors 2014-10-30 12:16:56 Predictor GradientBoostingRegressor is of type 2014-10-30 12:16:56 Added GradientBoostingRegressor to default predictors 2014-10-30 12:16:56 Loading metadata 2014-10-30 12:16:56 Loading expression data 2014-10-30 12:16:56 Initializing expression 2014-10-30 12:16:56 Done initializing expression 2014-10-30 12:16:56 Loading splicing data 2014-10-30 12:16:56 Initializing splicing 2014-10-30 12:16:56 Done initializing splicing 2014-10-30 12:16:56 Successfully initialized a Study object!