TSCC

TSCC houses our 640-core supercomputer as part of a condo resource-sharing system: other researchers (mostly bioinformaticians from the Ideker, Ren, and Zhang labs) can use our portion for their jobs (with an 8-hour time limit), and in exchange we can use their portions when we need extra computing power. We lead the pack in pure crunching power, with the Ideker lab in close second place at 512 cores. But they have 2x the RAM per core for memory-hungry jobs, so our purchases are complementary and sharing is encouraged.

The main contacts for questions about TSCC are the dry lab and TSCC users mailing lists. The main contact for problems with TSCC is Jim Hayes.

Important rules

Warning

  1. All sequencing data is stored in the /projects/ps-yeolab/seqdata folder
  2. The folder seqdata is intended as permanent storage and no folders or files there should ever be deleted
  3. Do not process data in seqdata. Use the directory structure described in Organize your home directory to create a scratch folder for all data processing (a minimal sketch follows this list).
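
A minimal sketch of that workflow, with hypothetical run and sample names:

# Copy raw data out of permanent storage into scratch, then work on the copy
mkdir -p ~/scratch/my_project/data
cp /projects/ps-yeolab/seqdata/my_run/my_sample_R1.fastq.gz ~/scratch/my_project/data/
cd ~/scratch/my_project/data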

First Steps

Your first login session should include some of the following commands, which will familiarize you with the cluster, teach you how to do some useful tasks on the queue, and help you set up a common directory structure shared by everyone in the lab.

Log on!

First, log in to TSCC!

ssh YOUR_TSCC_USERNAME@tscc-login2.sdsc.edu

TSCC has two login nodes, login1 and login2, for load balancing (i.e. if you just log on to tscc.sdsc.edu, it will pick whichever login node is less busy). We log in to a specific node, here login2, so that our screen session will always be on the same node. You can use login1 instead if you like, to balance it out :)

Start a screen session

NOTE - You can skip this on your initial setup, but you should come back and do this later because it is cool.

Screen is an awesome tool which allows you to have multiple “tabs” open in the same login session, and you can easily transition between screens. Plus, they’re persistent, so you can leave something running in a screen session, log out of TSCC, and it will still be running! Amazing! If you have suggestions of things to add to this .screenrc, feel free to make a pull request on Olga’s rcfiles github repo.

To get a nice status bar at the bottom of your terminal window, get this .screenrc file:

cd
wget https://raw.githubusercontent.com/olgabot/rcfiles/master/.screenrc

Note

The control letter is j, not a as in the standard screen documentation, so for example to create a new window, do Ctrl-j c, and to kill the current window, do Ctrl-j k. Do Ctrl-j j to switch between windows, and Ctrl-j #, where # is some window number, to switch to a numbered tab specifically (e.g. Ctrl-j 2 to switch to tab number 2).
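
For the curious, the remapping in the .screenrc most likely looks something like this (a sketch of screen's escape syntax, not a verbatim quote from the file):

# Remap screen's command key from the default Ctrl-a to Ctrl-j;
# the trailing j is the character for inserting a literal Ctrl-j
escape ^Jj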

This .screenrc adds a status bar at the bottom of your screen, like this:

_images/screen_status.png

Now to start a screen session do:

screen

If you’re re-logging in and you have an old screen session, do this to “re-attach” the screen window.

screen -x

Every time you log in to TSCC, you’ll want to reattach the screens from before, so the first step I always take when I log in to TSCC is exactly that, screen -x.

Get gscripts access to software

  1. Before you’re able to clone the gscripts github repo, you’ll need to add your ssh keys on TSCC to your Github account. Follow Github’s instructions for generating SSH keys.
  2. Next, clone the gscripts github repo to your home directory on TSCC (this assumes you’ve already created a github account).
# <on TSCC>
cd
git clone http://github.com/YeoLab/gscripts
  3. Add this line to the end of your ~/.bashrc file (using either emacs or vi/vim, your choice):
source ~/gscripts/bashrc/tscc_bash_settings_current

Note

Make sure to add source ~/gscripts/bashrc/tscc_bash_settings_current to your ~/.bashrc file so that it always loads up the correct yeolab environment variables!
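
If you’d rather not open an editor, a one-line append from the shell works too (note the >>; a single > would overwrite your whole ~/.bashrc):

echo 'source ~/gscripts/bashrc/tscc_bash_settings_current' >> ~/.bashrc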

  4. Source the .bashrc file so you load all the convenient environment variables we’ve created:

source ~/.bashrc

Download and install anaconda

Download the Anaconda Python/R package manager using wget (web-get). The link below is from the Anaconda downloads page.

wget http://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh

To install Anaconda, run the shell script with bash (this will take some time). It will ask you a bunch of questions; use the defaults for all of them (press enter).

bash Anaconda2-4.1.1-Linux-x86_64.sh

To activate anaconda, source your .bashrc:

source ~/.bashrc

Make sure your Python points to the Anaconda python with:

which python

The output should look something like:

~/anaconda2/bin/python

Make a virtual environment on TSCC

WARNING - this is easy to mess up. While this is a nice tool, it is not absolutely necessary upon initial setup, and it might be best to wait and configure environments after you have a better understanding of how they work.

On TSCC, the easiest way to create a virtual environment (aka virtualenv) is by cloning the base environment, which already has a bunch of modules that we use all the time (numpy, scipy, matplotlib, pandas, scikit-learn, ipython, the list goes on). Here’s how you do it:

Note

The $USER in the command below is meant to be taken literally: you can copy the command exactly as written, and TSCC will create an environment named after your username. If you don’t believe me, compare the output of:

echo USER

to the output of:

echo $USER

The second one should output your TSCC username, because the $ dollar sign indicates to the shell that you’re asking for the variable $USER, not the literal word “USER”.

conda create --clone base --name $USER

Note

You can also create an environment from scratch using conda to install all the Anaconda Python packages, and then using pip in the environment to install the remaining packages, like so:

conda create --yes --name ENVIRONMENT_NAME pip numpy scipy cython matplotlib nose six scikit-learn ipython networkx pandas tornado statsmodels setuptools pytest pyzmq jinja2 pyyaml pymongo biopython markupsafe seaborn joblib semantic_version
source activate ENVIRONMENT_NAME
conda install --yes --channel https://conda.binstar.org/daler pybedtools
conda install --yes --channel https://conda.binstar.org/kyleabeauchamp fastcluster
pip install gspread brewer2mpl husl gffutils matplotlib-venn HTSeq misopy
pip install https://github.com/YeoLab/clipper/tarball/master
pip install https://github.com/YeoLab/gscripts/tarball/master
pip install https://github.com/YeoLab/flotilla/tarball/master

These commands are how the base environment was created.

Then activate your environment with

source activate $USER

You’ll probably stay in this environment all the time.

Warning

Make sure to add source activate $USER to your ~/.bashrc file! Then you will always start in your environment.
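
Again, a one-line append does the trick (the single quotes keep $USER unexpanded in the file, which is fine, since the shell expands it every time ~/.bashrc is read):

echo 'source activate $USER' >> ~/.bashrc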

If you need to switch to another environment, then exit your environment with:

source deactivate

Note

Now that you’ve created your own environment, go to your gscripts folder and install your own personal gscripts, to make sure it’s the most updated version.

cd ~/gscripts
pip install .  # The "." means install "this," as in "this folder where I am"

Add the location of GENOME to your ~/.bashrc

To run the analysis pipeline, you will need to specify where the genomes are on TSCC. You can do this by adding this line to your ~/.bashrc (the export makes the variable visible to scripts you run, not just your interactive shell):

export GENOME=/projects/ps-yeolab/genomes
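
To check that it worked, re-source your ~/.bashrc and echo the variable:

source ~/.bashrc
echo $GENOME   # should print /projects/ps-yeolab/genomes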

Organize your home directory

Create an organized home directory structure following a common template, so others can find your scripts, workflows, and even final results/papers! Do not store actual data in your home directory, as it is limited to 100 GB.

Create workflow and projects folders

Create ~/workflows for your personal scripts (bash, makefiles, queue scripts, and so on) before you add them to gscripts, and ~/projects for your projects, to organize figures, notebooks, final results, and even manuscripts.

mkdir ~/workflows ~/projects

Here’s an example project directory structure:

$ ls -lha /home/gpratt/projects/fox2_iclip/
total 9.5K
drwxr-xr-x  2 gpratt yeo-group  5 Sep 16  2013 .
drwxr-xr-x 40 gpratt yeo-group 40 Nov 24 12:20 ..
lrwxrwxrwx  1 gpratt yeo-group 49 Aug 21  2013 analysis -> /home/gpratt/scratch/projects/fox2_iclip/analysis
lrwxrwxrwx  1 gpratt yeo-group 45 Aug 21  2013 data -> /home/gpratt/scratch/projects/fox2_iclip/data
lrwxrwxrwx  1 gpratt yeo-group 50 Aug 21  2013 scripts -> /home/gpratt/processing_scripts/fox2_iclip/scripts

Note

Notice that all of these are soft links into either ~/scratch or a separate processing-scripts directory.

Let us see your stuff

Make everything readable by other Yeo lab members and restrict access from other users (per HIPAA/HITECH requirements):

chmod -R g+r ~/
chmod -R g+r ~/scratch/
chmod -R o-rwx ~/
chmod -R o-rwx ~/scratch/

But git will get mad at you if your ~/.ssh private keys are visible to others, so make them visible to only you via:

chmod -R go-rwx ~/.ssh/

In the end, your home directory should look something like this:

$ ls -l $HOME
lrwxrwxrwx  1 bkakarad yeo-group    29 Jun 24  2013 scratch -> /oasis/tscc/scratch/bkakarad/
drwxr-x---+ 2 bkakarad yeo-group     2 Jun 24  2013 gscripts
drwxr-x---+ 3 bkakarad yeo-group     3 Jun 24  2013 projects
drwxr-x---+ 2 bkakarad yeo-group     2 Jun 24  2013 workflows

IPython notebooks on TSCC

This has two sections: Setup and Running. They should be done in order :)

Setup IPython notebooks on TSCC

  1. First, on your personal computer, set up passwordless ssh from your laptop to TSCC. For reference, a@A is you on your laptop, and b@B is TSCC, so everywhere you see b@B, replace it with yourusername@tscc.sdsc.edu. Since your laptop likely doesn’t have a fixed IP address or a way to log in to it, you don’t need to replace a@A anywhere; instead, use a@A as a reference point for whether you should be running the command from your laptop (a@A) or TSCC (b@B).
  2. To set up IPython notebooks on TSCC, you will want to add some alias variables to your ~/.bash_profile (for Mac) or ~/.bashrc (for Linux):
IPYNB_PORT=[some number above 1024]
alias tscc='ssh obotvinnik@tscc-login2.sdsc.edu'

This way, I can just type tscc and log on to tscc-login2 specifically. It is important for IPython notebooks that you always log on to the same node. You can use tscc-login1 instead, too; this is just what I have set up. Just replace my login name (“obotvinnik”) with yours.
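
For example, with a (hypothetical) port number and the placeholders filled in, those two lines might look like:

IPYNB_PORT=8899
alias tscc='ssh YOUR_TSCC_USERNAME@tscc-login2.sdsc.edu'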

  3. To activate all the commands you just added, on your laptop, type source ~/.bash_profile. (source is a command that runs all the lines in the file you give it; here it will assign the variable IPYNB_PORT to the value you chose and define the alias, so you only have to type tscc to log in to TSCC.)
  4. Next, type tscc and log on to the server.

  5. On TSCC, add these lines to your ~/.bashrc file:

    IPYNB_PORT=same number as the above IPYNB_PORT from your laptop
    alias ipynb="ipython notebook --no-browser --port $IPYNB_PORT &"
    alias sshtscc="ssh -NR $IPYNB_PORT:localhost:$IPYNB_PORT tscc-login2 &"
    

Notice that in sshtscc, I tunnel to the same login node I always log in to, tscc-login2. The ampersands “&” at the end of the lines tell the computer to run these processes in the background, which is super useful.

  6. You’ll need to run source ~/.bashrc again on TSCC, so the $IPYNB_PORT variable and the ipynb and sshtscc aliases are available.

  7. Set up passwordless ssh between the compute nodes and TSCC with:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
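
If you don’t have a key pair on TSCC yet, generate one first (accept the defaults; leaving the passphrase empty is what makes this passwordless):

ssh-keygen -t rsa
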
  8. Back on your home laptop, edit your ~/.bash_profile (on Macs; ~/.bashrc for other unix machines) to add the line:

    alias tunneltscc="ssh -NL $IPYNB_PORT\:localhost:$IPYNB_PORT obotvinnik@tscc-login2.sdsc.edu &"
    

Make sure to replace “obotvinnik” with your TSCC login :) It is also important that these are double-quotes and not single-quotes, because double-quotes evaluate $IPYNB_PORT to the number you chose, e.g. 4000, whereas single-quotes keep it as the literal letters $IPYNB_PORT.

Run IPython Notebooks on TSCC

Now that you have everything configured, you can run IPython notebooks on TSCC! Here are the steps to follow.

  1. Log on to TSCC.

  2. Now that you have everything set up, start up a screen session, which allows you to have something running continuously, without being logged in.

screen -x

Note

If this gives you an error saying “There is no screen to be attached” then you need to run plain old screen (no -x) first.

If this gives you an error saying you need to pick one session, make life easier for yourself and pick one to kill all the windows in (using Ctrl-j K if you’re using the .screenrc recommended earlier; otherwise the default is Ctrl-a K). Once you’ve killed all screen sessions except one, you can run screen -x with abandon, and it will connect you to the only one you have open.

  3. In this screen session, now request an interactive job, e.g.:
qsub -I -l walltime=2:00:00 -q home-yeo -l nodes=1:ppn=2
  4. Wait for the job to start.
  5. Run your TSCC-specific aliases on the compute node:
ipynb
sshtscc
  6. Back on your laptop, now run your tunneling command:
tunneltscc
  7. Open up http://localhost:[YOUR IPYNB PORT] in your browser.

Share your Dropbox account for easy figure syncing

This is not necessary, but a nice feature to have. For the initial setup, I would skip this step.

Instructions for this can be found here: http://www.dropboxwiki.com/tips-and-tricks/using-the-official-dropbox-command-line-interface-cli

  1. Make a bin directory in your home directory:
mkdir -p ~/bin
  2. Download the CLI interface via wget:
wget -O ~/bin/dropbox.py "https://www.dropbox.com/download?dl=packages/dropbox.py"
  3. Set the permissions so you can execute the CLI interface:
chmod +x ~/bin/dropbox.py
  4. Run the script with help to check that it is working. A help menu should come up:
~/bin/dropbox.py help
  5. Now download the Dropbox daemon for 64-bit Linux and run it using the following two commands (more information here: https://www.dropbox.com/install?os=lnx):
cd ~ && wget -O - "https://www.dropbox.com/download?plat=lnx.x86_64" | tar xzf -

~/.dropbox-dist/dropboxd
  6. This should generate a link. Paste the link into a web browser and log in to your Dropbox account; this will complete the sync. Now try it by running the following dropbox start command on TSCC:
~/bin/dropbox.py start

All done! If everything worked properly, you should now be able to view your Dropbox files on TSCC in the newly created Dropbox directory in your home directory. Files saved to that location can also be viewed in the web browser version of Dropbox.

NOTE: If you have a lot of stuff in your Dropbox, it may be too big to sync without using an interactive job. If you see the error listed below, try starting an interactive job FIRST and then re-running the dropbox start command.

Fatal python error: couldn't create thread state for new thread
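
For example (a sketch; the walltime and core counts here are arbitrary):

qsub -I -l walltime=2:00:00 -q home-yeo -l nodes=1:ppn=2
# once the interactive job starts, on the compute node:
~/bin/dropbox.py start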

Installing and upgrading Python packages

To install Python packages first try conda install:

conda install <package name>

If there is no package in conda, then try bioconda (a google search for your package along with the keyword “bioconda” will tell you if this is available):

conda install -c bioconda <package name>

If there is no package in conda or bioconda, then (and ONLY then) try pip:

pip install <package name>

To upgrade packages, do:

(using conda)

conda update <package name>

(using pip)

pip install -U <package name>

NOTE - if the package provides a command-line tool, you can check that it was installed into your Anaconda environment with:

which <command name>

Alternatively, you can open python on your command line with:

python

And then try to import the package you just installed. If it doesn’t throw an error, it installed successfully!

import <package name>

To get out of python on your command line:

quit()
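
You can also do the whole import check in one line from the shell (numpy here is just an illustration):

python -c "import numpy; print(numpy.__version__)"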

Installing R packages (beta!)

You can also use conda to install R and R packages. This lets you run a Jupyter notebook with an R kernel rather than Python, which can be VERY helpful for analysis software that runs in R.

conda install -c r r-essentials

After the install, load your Jupyter notebooks and start a new notebook. You should see options available to choose between Python2 and R.

Submitting and managing compute jobs on TSCC

Submit jobs

To submit a script that you wrote, in this case called myscript.sh, to TSCC, do:

qsub -q home-yeo -l nodes=1:ppn=2 -l walltime=0:30:00 myscript.sh
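
For reference, here’s a minimal sketch of what such a myscript.sh might contain (the #PBS directives are optional, since the same resources are already requested on the qsub command line):

#!/bin/bash
#PBS -N myjob        # job name shown in qstat
#PBS -o myjob.out    # file for captured stdout
#PBS -e myjob.err    # file for captured stderr

cd $PBS_O_WORKDIR    # start in the directory the job was submitted from
echo "Running on $(hostname)"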

Submit interactive jobs

To submit interactive jobs, do:

qsub -I -q home-yeo -l nodes=1:ppn=2 -l walltime=0:30:00

Submit jobs to home-scrm

To submit to the home-scrm queue, add -W group_list=scrm-group to your qsub command:

qsub -I -l walltime=0:30:00 -q home-scrm -W group_list=scrm-group

Submitting many jobs at once

If you have a bunch of commands you want to run at once, you can use this script to submit them all at once. In the next example, commands.sh is a file that has the commands you want to run, one command per line.

java -Xms512m -Xmx512m -jar /home/yeo-lab/software/gatk/dist/Queue.jar \
-S ~/gscripts/qscripts/do_stuff.scala --input commands.sh -run -qsub \
-jobQueue <queue> -jobLimit <n> --ncores <n> --jobname <name> -startFromScratch

This runs a scala job that submits sub-jobs to the PBS queue under the name you fill in for the <name> placeholder.
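
Here’s what that might look like with the placeholders filled in (hypothetical values; adjust the queue, job limit, core count, and name to your situation):

java -Xms512m -Xmx512m -jar /home/yeo-lab/software/gatk/dist/Queue.jar \
-S ~/gscripts/qscripts/do_stuff.scala --input commands.sh -run -qsub \
-jobQueue home-yeo -jobLimit 10 --ncores 1 --jobname mycommands -startFromScratch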

Check job status, aka “why is my job stuck?”

Check the status of your jobs:

qme

Note

This will only work if you have followed the instructions and sourced ~/gscripts/bashrc/tscc_bash_settings_current :)

For example, qme outputs:

(olga)[obotvinnik@tscc-login2 ~]$ qme

tscc-mgr.sdsc.edu:
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2006527.tscc-mgr.local  obotvinnik  home-yeo STDIN             35367     1     16    --   04:00:00 R  02:35:36
2007542.tscc-mgr.local  obotvinnik  home-yeo STDIN              6168     1      1    --   08:00:00 R  00:28:08
2007621.tscc-mgr.local  obotvinnik  home-yeo STDIN               --      1     16    --   04:00:00 Q       --

Check job status of array jobs

To check the status of your array jobs, you need to specify -t to see the status of the individual array pieces:

qstat -t

Killing jobs

If you have a job you want to stop, kill it with qdel JOBID, e.g.

qdel 2006527

Kill an array job

If the job is an array job, you’ll need to add brackets, like this:

qdel 2006527[]

Kill all your jobs

To kill all the jobs that you’ve submitted, do:

qdel $(qselect -u $USER)

Which queue do I submit to? (check status of queues)

Check the status of the queues (so you know which queues NOT to submit to!):

qstat -q

Example output is,

(olga)[obotvinnik@tscc-login2 ~]$ qstat -q

server: tscc-mgr.local

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
home-dkeres        --      --       --      --    2   0 --   E R
home-komunjer      --      --       --      --    0   0 --   E R
home-ong           --      --       --      --    2   0 --   E R
home-tg            --      --       --      --    0   0 --   E R
home-yeo           --      --       --      --    3   1 --   E R
home-visres        --      --       --      --    0   0 --   E R
home-mccammon      --      --       --      --   15  29 --   E R
home-scrm          --      --       --      --    1   0 --   E R
hotel              --      --    168:00:0   --  232  26 --   E R
home-k4zhang       --      --       --      --    0   0 --   E R
home-kkey          --      --       --      --    0   0 --   E R
home-kyang         --      --       --      --    2   1 --   E R
home-jsebat        --      --       --      --    1   0 --   E R
pdafm              --      --    72:00:00   --    1   0 --   E R
condo              --      --    08:00:00   --   18   6 --   E R
gpu-hotel          --      --    336:00:0   --    0   0 --   E R
glean              --      --       --      --   24  75 --   E R
gpu-condo          --      --    08:00:00   --   16  36 --   E R
home-fpaesani      --      --       --      --    4   2 --   E R
home-builder       --      --       --      --    0   0 --   E R
home               --      --       --      --    0   0 --   E R
home-mgilson       --      --       --      --    0   4 --   E R
home-eallen        --      --       --      --    0   0 --   E R
                                               ----- -----
                                                 321   180

So right now is not a good time to submit to the hotel queue, since it has a bunch of both running and queued jobs!

Show available “Service Units”

List the available Service Units (1 SU = 1 core*hour) ... for a quick ego boost. Also note that our supercomputer allocation is split in two, yeo-group and scrm-group, but the total balance is 5.29 million SU, just enough to secure us the top honors :-)

gbalance | sort -nrk 3 | head

Id Name                 Amount  Reserved Balance CreditLimit Available
-- -------------------- ------- -------- ------- ----------- ---------
19 tideker-group        5211035    27922 5183113           0   5183113
82 yeo-group            3262925        0 3262925           0   3262925
81 scrm-group           2039328        0 2039328           0   2039328
14 mgilson-group         663095   208000  455095           0    455095
73 nanosprings-ucm       650000        0  650000           0    650000
17 kkey-group            635056     7104  627952           0    627952
16 k4zhang-group         534430        0  534430           0    534430

List the available TORQUE queues, for a quick boost in motivation!

qstat -q

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
home-tideker       --      --       --       16   1   0 --   E R
home-visres        --      --       --        1   0   0 --   E R
hotel              --      --    72:00:00   --   25  18 --   E R
home-k4zhang       --      --       --        4  21   0 --   E R
home-kkey          --      --       --        5   0   0 --   E R
pdafm              --      --    72:00:00   --    0   0 --   E R
condo              --      --    08:00:00   --    0   0 --   E R
glean              --      --       --      --    0   0 --   E R
home-builder       --      --       --        8   0   0 --   E R
home               --      --       --      --    0   0 --   E R
home-ewyeo         --      --       --       15   0   0 --   E R
home-mgilson       --      --       --        8   0   0 --   E R
                                           ----- -----
                                              47    18

Show available processors

To show available processors, do

showbf

Show specs of all nodes

pbsnodes -a

Uploading Data to Amazon S3 buckets

  1. Create an AWS IAM user

Sign in to the AWS Console https://console.aws.amazon.com/ and click on Identity & Access Management (IAM).

  • On the left, click Users
  • At the top, click Create New Users
  • Create a username and generate an access key
  • Click Create in the lower right.
  • On the next page, click Download Credentials in the lower right to download the access key for your new user
  • Safeguard the downloaded file. It contains a key which is like a password for the AWS account.
  • Click Close in the lower right.
  • While still in the IAM Users page, click on the newly-created user’s name.
  • Click the Permissions tab
  • Under Managed Policies, click Attach Policy
  • Type “S3” in the filter field and check the AmazonS3FullAccess policy.
  • Click Attach Policy in the lower right.
  • Now your username is set up to use AWS S3 services.
  2. Download and install the AWS Command Line Interface (CLI), then configure it:

aws configure
  • Fill in the AWS Access Key ID and AWS Secret Access Key when prompted using the information downloaded in the credentials file above.
  • Pick an appropriate region name: http://docs.aws.amazon.com/general/latest/gr/rande.html
  • I used the default output format by not entering any information when prompted at that step.
  3. Create a new Bucket
  • Log in to your AWS account. Click on Services -> S3
  • In the upper left hand corner, click “Create Bucket”
  • Name your bucket as you please!
  4. From the command line, upload your data to the newly created bucket:

aws s3 sync --acl bucket-owner-full-control --sse AES256 /home/ecwheele/scratch/mds_splicing_v4/final_bams/ s3://mds-splicing-v4-bams
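
To sanity-check that the files made it up, you can list the bucket contents afterwards:

aws s3 ls s3://mds-splicing-v4-bams/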

Random notes

Software goes in /projects/ps-yeolab/software

Make sure to recursively set group read/write permissions on the software directory so others can use and update the common software, using:

chmod -R ug+rw /projects/ps-yeolab/software

If you’re installing something from source and using ./configure and make and all that, then always set the flag --prefix=/projects/ps-yeolab/software when you run ./configure:

./configure --prefix=/projects/ps-yeolab/software

When possible, install binaries to /projects/ps-yeolab/software/bin

Running RNA-seq, CLIP-Seq, Ribo-Seq, etc qscripts GATK Queue pipelines

We use the Broad Institute’s Genome Analysis Toolkit (GATK) Queue software to run our pipelines. This software solves a lot of problems for us, such as dealing with multiple-stage pipelines that have cross-dependencies (e.g. you can’t calculate splicing until you’ve mapped the reads, and you can’t map the reads until after you’ve removed adapters and repetitive genomic regions from them), and properly scheduling jobs so that one person’s analysis doesn’t completely take over the compute cluster.

Gabe has created a bunch of helpful template scripts for GATK Queue in his folder /home/gpratt/templates:

$ ls -lh /home/gpratt/templates
total 26K
-rwxr-xr-x 1 gpratt yeo-group 660 May  7  2014 bacode_split.sh
-rwxr-xr-x 1 gpratt yeo-group 554 May  7  2014 bacode_split.sh~
-rwxr-xr-x 1 gpratt yeo-group 524 Sep 18 00:08 #clipseq.sh#
-rwxr-xr-x 1 gpratt yeo-group 524 Jul 12  2014 clipseq.sh
-rwxr-xr-x 1 gpratt yeo-group 516 Mar 26  2014 clipseq.sh~
-rwxr-xr-x 1 gpratt yeo-group 473 Aug 21 18:47 riboseq.sh
-rwxr-xr-x 1 gpratt yeo-group 528 Aug 21 18:46 riboseq.sh~
-rwxr-xr-x 1 gpratt yeo-group 530 Sep  5 17:29 rnaseq.sh
-rwxr-xr-x 1 gpratt yeo-group 527 Mar 26  2014 rnaseq.sh~

Each Queue job requires a manifest file with a list of all files to process, and the genome to process them on.

Warning

All further instructions depend on you having followed the directions in Create workflow and projects folders, where for this particular project, you’ve created these folders:

~/projects/PROJECT_NAME
~/processing_scripts/PROJECT_NAME/scripts
~/scratch/PROJECT_NAME/data
~/scratch/PROJECT_NAME/analysis

And that you’ve linked the scratch and home directories correctly. For example, here’s how you can create the project directory structure for a project called singlecell_pnms:

NAME=singlecell_pnms
mkdir -p ~/projects/$NAME ~/scratch/$NAME ~/scratch/$NAME/data ~/scratch/$NAME/analysis ~/processing_scripts/$NAME/scripts
ln -s ~/scratch/$NAME/data ~/projects/$NAME/data
ln -s ~/scratch/$NAME/analysis ~/projects/$NAME/analysis
ln -s ~/processing_scripts/$NAME/scripts ~/projects/$NAME/scripts

Here’s an example queue script for single-end, not strand-specific RNA-seq, from the file singlecell_pnms_se_v3.sh:

#!/bin/bash

NAME=singlecell_pnms_se
VERSION=v3
DIR=singlecell_pnms
java -Xms512m -Xmx512m -jar /projects/ps-yeolab/software/gatk/dist/Queue.jar -S $HOME/gscripts/qscripts/analyze_rna_seq.scala --input ${NAME}_${VERSION}.txt --adapter TCGTATGCCGTCTTCTGCTTG --adapter ATCTCGTATGCCGTCTTCTGCTTG --adapter CGACAGGTTCAGAGTTCTACAGTCCGACGATC --adapter GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -qsub -jobQueue home-yeo -runDir ~/projects/${DIR}/analysis/${NAME}_${VERSION}  -log ${NAME}_${VERSION}.log --location ${NAME}  --strict -keepIntermediates --not_stranded -single_end -run

Notice that the “--input” is the file ${NAME}_${VERSION}.txt, which translates to singlecell_pnms_se_v3.txt in this case, since NAME=singlecell_pnms_se and VERSION=v3 are defined at the beginning of the file. This file is the “manifest” of the sequencing run. In the case of single-end reads, this is a file where each line has /path/to/read1.fastq.gz\tspecies\n, where \t indicates a tab (using the <TAB> character) and \n indicates a newline, created by <ENTER>. Here are the first 10 lines of singlecell_pnms_se_v3.txt (obtained via head singlecell_pnms_se_v3.txt):

/home/obotvinnik/projects/singlecell_pnms/data/CVN_01_R1.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/CVN_02_R1.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/CVN_03_R1.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/CVN_04_R1.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/CVN_05_R1.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/CVN_06_R1.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/CVN_07_R1.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/CVN_08_R1.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/CVN_09_R1.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/CVN_10_R1.fastq.gz       hg19

For paired-end, not strand-specific RNA-seq, here’s the script, from the file singlecell_pnms_pe_v3.sh:

#!/bin/bash

NAME=singlecell_pnms_pe
VERSION=v3
DIR=singlecell_pnms
java -Xms512m -Xmx512m -jar /projects/ps-yeolab/software/gatk/dist/Queue.jar -S $HOME/gscripts/qscripts/analyze_rna_seq.scala --input ${NAME}_${VERSION}.txt --adapter TCGTATGCCGTCTTCTGCTTG --adapter ATCTCGTATGCCGTCTTCTGCTTG --adapter CGACAGGTTCAGAGTTCTACAGTCCGACGATC --adapter GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -qsub -jobQueue home-yeo -runDir ~/projects/${DIR}/analysis/${NAME}_${VERSION}  -log ${NAME}_${VERSION}.log --location ${NAME}  --strict -keepIntermediates --not_stranded -run

Notice that the “--input” is the file ${NAME}_${VERSION}.txt, which translates to singlecell_pnms_pe_v3.txt in this case, since NAME=singlecell_pnms_pe and VERSION=v3 are defined at the beginning of the file. This file is the “manifest” of the sequencing run. In the case of paired-end reads, this is a file where each line has read1.fastq.gz;read2.fastq.gz\tspecies\n, where \t indicates a tab (using the <TAB> character) and \n indicates a newline, created by <ENTER>. Here are the first 10 lines of singlecell_pnms_pe_v3.txt (obtained via head singlecell_pnms_pe_v3.txt):

/home/obotvinnik/projects/singlecell_pnms/data/M1_01_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_01_R2.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/M1_02_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_02_R2.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/M1_03_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_03_R2.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/M1_04_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_04_R2.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/M1_05_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_05_R2.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/M1_06_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_06_R2.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/M1_07_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_07_R2.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/M1_08_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_08_R2.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/M1_09_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_09_R2.fastq.gz       hg19
/home/obotvinnik/projects/singlecell_pnms/data/M1_10_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_10_R2.fastq.gz       hg19

For this project, I had a mix of both paired-end and single-end reads, which is why DIR is the same for both the singlecell_pnms_se_v3.sh and singlecell_pnms_pe_v3.sh scripts, but NAME is different, so their outputs are saved in different folders.

Running GATK Queue pipeline scripts

Now that you’ve created a manifest file called ${NAME}_${VERSION}.txt and a script called ${NAME}_${VERSION}.sh, you are almost ready to run the pipeline.

Note

You should be using screen quite often now. You’ll want to run your pipeline in a screen session, because then even when you log out of TSCC, the pipeline will still be running.

If you’ve already run screen, reattach the session with:

screen -x

If that gives you the error: There is no screen to be attached., then you haven’t run screen before, and you can start a session with:

screen

These scripts take quite a bit of memory to compile, so to be nice to everyone, log into a compute node by requesting an interactive job on TSCC. Also, your script may simply run out of memory and fail if you’re not on a compute node, so that is even more incentive to log into one!

This command will create an interactive job for 40 hours, on the home-scrm queue, and with 1 node and 1 processor (you don’t need more than that for the script, and the script will submit jobs that request more nodes/processors for compute-intensive stuff like STAR or Sailfish). If you have a lot of samples, you may need more time, but try just 40 hours first.

So here’s what you do:

qsub -I -l walltime=40:00:00 -q home-scrm
# Wait for the job to be ready. This may take a while
cd ~/projects/$NAME/scripts
sh ${NAME}_${VERSION}.sh

For example, for the singlecell_pnms project from before, I would do:

qsub -I -l walltime=40:00:00 -q home-scrm
# Waited for job to get scheduled/be ready ....
cd ~/projects/singlecell_pnms/scripts
sh singlecell_pnms_se_v3.sh

This outputs:

INFO  12:24:42,840 QScriptManager - Compiling 1 QScript
INFO  12:24:55,100 QScriptManager - Compilation complete
INFO  12:24:55,359 HelpFormatter - ----------------------------------------------------------------------
INFO  12:24:55,359 HelpFormatter - Queue v2.3-1095-gdb26a3f, Compiled 2015/01/26 15:22:32
INFO  12:24:55,359 HelpFormatter - Copyright (c) 2012 The Broad Institute
INFO  12:24:55,359 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  12:24:55,360 HelpFormatter - Program Args: -S /home/obotvinnik/gscripts/qscripts/analyze_rna_seq.scala --input singlecell_pnms_se_v3.txt --adapter TCGTATGCCGTCTTCTGCTTG --adapter ATCTCGTATGCCGTCTTCTGCTTG --adapter CGACAGGTTCAGAGTTCTACAGTCCGACGATC --adapter GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -qsub -jobQueue home-yeo -runDir /home/obotvinnik/projects/singlecell_pnms/analysis/singlecell_pnms_se_v3 -log singlecell_pnms_se_v3.log --location singlecell_pnms_se --strict -keepIntermediates --not_stranded -single_end -run
INFO  12:24:55,360 HelpFormatter - Date/Time: 2015/01/27 12:24:55
INFO  12:24:55,360 HelpFormatter - ----------------------------------------------------------------------
INFO  12:24:55,361 HelpFormatter - ----------------------------------------------------------------------
INFO  12:24:55,370 QCommandLine - Scripting AnalyzeRNASeq
INFO  12:24:58,436 QCommandLine - Added 773 functions
INFO  12:24:58,438 QGraph - Generating graph.
INFO  12:24:58,664 QGraph - Running jobs.
... more output ...

Pipeline frequently asked questions (FAQ)

How do I ...

... deal with multiple species? Do I have to create different manifest files?
Fortunately, no! You can create a single manifest file.

Looking at /home/gpratt/projects/msi2/scripts, we see the file msi2_v2.txt, which has the contents:

/home/gpratt/projects/msi2/data/msi2/MSI2_ACAGTG_ACAGTG_L008_R1.fastq.gz        hg19
/home/gpratt/projects/msi2/data/msi2/MSI2_CAGATC_CAGATC_L008_R1.fastq.gz        mm9
/home/gpratt/projects/msi2/data/msi2/MSI2_GCCAAT_GCCAAT_L008_R1.fastq.gz        mm9
/home/gpratt/projects/msi2/data/msi2/MSI2_TAGCTT_TAGCTT_L008_R1.fastq.gz        hg19
/home/gpratt/projects/msi2/data/msi2/MSI2_TGACCA_TGACCA_L008_R1.fastq.gz        hg19
/home/gpratt/projects/msi2/data/msi2/MSI2_TTAGGC_TTAGGC_L008_R1.fastq.gz        hg19

So you can reference multiple genomes in a single manifest file!

... deal with both single-end and paired-end reads in one project? Do I need to create separate manifest files?
Yes, unfortunately. :( Check out the singlecell_pnms project above as an example.
... see the documentation for a queue script?
This command will show documentation for analyze_rna_seq.scala. For further documentation, see the GATK Queue website.
java -Xms512m -Xmx512m -jar /projects/ps-yeolab/software/gatk/dist/Queue.jar -S ~/gscripts/qscripts/analyze_rna_seq.scala

analyze_rna_seq

The queue script analyze_rna_seq.scala runs or generates:

  1. RNA-SeQC
  2. cutadapt
  3. miso
  4. OldSplice
  5. Sailfish
  6. A->I editing predictions
  7. bigWig files
  8. Counts of reads mapping to repetitive elements
  9. Estimates of PCR Duplication

Detailed description of analyze_rna_seq.scala outputs.

analyze_rna_seq_gently

The queue script analyze_rna_seq_gently.scala runs:

  1. RNA-SeQC
  2. ...

Combining outputs from the pipeline into matrices

See the rnaseek software for how to combine Sailfish, STAR and MISO outputs.