TSCC
====

TSCC_ houses our 640-core supercomputer as part of a "condo" resource-sharing
system, which allows other researchers (mostly bioinformaticians from the
Ideker, Ren, and Zhang labs) to use our portion for their jobs (with an 8-hour
time limit), and in turn allows us to use their portions when we need extra
computing power. We lead the pack in terms of pure crunching power, with the
Ideker lab in close second place with 512 cores. However, they have 2x the RAM
per core for jobs that require lots of memory, so our purchases are
complementary and sharing is encouraged.

The main contacts for questions about TSCC are the `dry lab`_ and
`TSCC users`_ mailing lists. The main contact for problems with TSCC is
`Jim Hayes`_.

Important rules
---------------

.. warning::

    1. All sequencing data is stored in the ``/projects/ps-yeolab/seqdata``
       folder.
    2. The ``seqdata`` folder is intended as permanent storage, and no folders
       or files there should ever be deleted.
    3. Do not process data in ``seqdata``. Use the directory structure
       described in `Organize your home directory`_ to create a ``scratch``
       folder for all data processing.

First Steps
-----------

Your first login session should include some of the following commands, which
will familiarize you with the cluster, teach you how to do some useful tasks
on the queue, and help you set up a common directory structure shared by
everyone in the lab.

Log on!
~~~~~~~

First, log in to TSCC!

.. code::

    ssh YOUR_TSCC_USERNAME@tscc-login2.sdsc.edu

TSCC has two login nodes, ``login1`` and ``login2``, for load-balancing: if
you just log on to ``tscc.sdsc.edu``, it will choose whichever login node is
less occupied. We log in to a specific node (here, ``login2``) so that our
``screen`` session is always on the same node. You can use ``login1`` instead
if you like, to balance it out :)

Start a screen session
~~~~~~~~~~~~~~~~~~~~~~

NOTE - You can skip this on your initial setup, but you should come back and
do this later because it is cool.

Screen_ is an awesome tool which allows you to have multiple "tabs" open in
the same login session, and you can easily transition between screens. Plus,
they're persistent, so you can leave something running in a screen session,
log out of TSCC, and it will still be running! Amazing!

If you have suggestions of things to add to this ``.screenrc``, feel free to
make a pull request on Olga's rcfiles_ github repo.

To get a nice status bar at the bottom of your terminal window, get this
``.screenrc`` file:

.. code::

    cd
    wget https://raw.githubusercontent.com/olgabot/rcfiles/master/.screenrc

.. note::

    With this ``.screenrc``, the control letter is ``j``, not the ``a`` used
    in the official documentation. For example, to create a new window, do
    ``Ctrl-j c``, and to kill the current window, do ``Ctrl-j k``. Do
    ``Ctrl-j j`` to switch between windows, and ``Ctrl-j #``, where ``#`` is
    some window number, to switch to a specific numbered tab (e.g.
    ``Ctrl-j 2`` to switch to tab number 2).

This ``.screenrc`` adds a status bar at the bottom of your screen, like this:

.. image:: screen_status.png

Now to start a screen session, do:

.. code::

    screen

If you're re-logging in and you have an old screen session, do this to
"re-attach" the screen window:

.. code::

    screen -x

Every time you log in to TSCC, you'll want to reattach the screens from
before, so the first step I always take when I log in to TSCC is exactly
that, ``screen -x``.
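As a quick reference, here is a minimal sketch of a day-to-day ``screen``
workflow (the key bindings assume the ``Ctrl-j`` escape key from the
``.screenrc`` above):

.. code::

    screen            # start a brand-new session on your first login
    # Ctrl-j c        -> create a new window ("tab")
    # Ctrl-j j        -> toggle between windows
    # Ctrl-j d        -> detach, leaving everything running
    screen -ls        # list your existing sessions
    screen -x         # reattach to the running session next time you log in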
Get ``gscripts`` access to software
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

0. Before you're able to clone the ``gscripts`` github repo, you'll need to
   add the ssh keys on your TSCC account to your Github account. Follow
   `Github's instructions for generating SSH keys`_.

1. Clone the ``gscripts`` github repo to your home directory on TSCC (this
   assumes you've already created a github account).

   .. code::

       cd
       git clone http://github.com/YeoLab/gscripts

2. Add this line to the **end** of your ``~/.bashrc`` file (using either
   ``emacs`` or ``vi``/``vim``, your choice):

   .. code::

       source ~/gscripts/bashrc/tscc_bash_settings_current

   .. note::

       Make sure to add
       ``source ~/gscripts/bashrc/tscc_bash_settings_current`` to your
       ``~/.bashrc`` file so that it always loads up the correct Yeo lab
       environment variables!

3. ``source`` the ``.bashrc`` file so you load all the convenient environment
   variables we've created:

   .. code::

       source ~/.bashrc

Download and install anaconda
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Download the Anaconda Python/R package manager using ``wget`` (web-get). The
link below is from the Anaconda downloads page.

.. code::

    wget http://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh

To install Anaconda, run the shell script with ``bash`` (this will take some
time). It will ask you a bunch of questions; use the defaults for all of them
(press enter for all).

.. code::

    bash Anaconda2-4.1.1-Linux-x86_64.sh

To activate anaconda, source your ``.bashrc``:

.. code::

    source ~/.bashrc

Make sure your Python is pointing to the Anaconda python with:

.. code::

    which python

The output should look something like:

.. code::

    ~/anaconda2/bin/python

Make a virtual environment on TSCC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

WARNING - this is easy to get messed up. While this is a nice tool, it is not
absolutely necessary upon initial setup, and it might be best to wait and
configure environments after you have a better understanding of how they work.

On TSCC, the easiest way to create a virtual environment (aka ``virtualenv``)
is by cloning the ``base`` environment, which already has a bunch of modules
that we use all the time (``numpy``, ``scipy``, ``matplotlib``, ``pandas``,
``scikit-learn``, ``ipython``, the list goes on). Here's how you do it:

.. note::

    The ``$USER`` in the command below is meant to be copied literally,
    meaning you can copy the command exactly as written, and TSCC will create
    an environment named after your username. If you don't believe me, compare
    the output of:

    .. code::

        echo USER

    to the output of:

    .. code::

        echo $USER

    The second one should output your TSCC username, because the ``$`` dollar
    sign indicates to the shell that you're asking for the variable ``$USER``,
    not the literal word "USER".

.. code::

    conda create --clone base --name $USER
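Cloning the environment can take a few minutes. As a quick sanity check (a
minimal sketch, assuming the default Anaconda install location), confirm that
the new environment exists and that activating it switches your Python:

.. code::

    conda env list        # your username should appear in the list
    source activate $USER
    which python          # should now point to ~/anaconda2/envs/$USER/bin/python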
.. note::

    You can also create an environment from scratch, using ``conda`` to
    install all the Anaconda Python packages and then using ``pip`` inside the
    environment to install the remaining packages, like so:

    .. code::

        conda create --yes --name ENVIRONMENT_NAME pip numpy scipy cython matplotlib nose six scikit-learn ipython networkx pandas tornado statsmodels setuptools pytest pyzmq jinja2 pyyaml pymongo biopython markupsafe seaborn joblib semantic_version
        source activate ENVIRONMENT_NAME
        conda install --yes --channel https://conda.binstar.org/daler pybedtools
        conda install --yes --channel https://conda.binstar.org/kyleabeauchamp fastcluster
        pip install gspread brewer2mpl husl gffutils matplotlib-venn HTSeq misopy
        pip install https://github.com/YeoLab/clipper/tarball/master
        pip install https://github.com/YeoLab/gscripts/tarball/master
        pip install https://github.com/YeoLab/flotilla/tarball/master

    These commands are how the ``base`` environment was created.

Then activate your environment with:

.. code::

    source activate $USER

You'll probably stay in this environment all the time.

.. warning::

    Make sure to add ``source activate $USER`` to your ``~/.bashrc`` file!
    Then you will always be in your environment.

If you need to switch to another environment, exit your current environment
with:

.. code::

    source deactivate

.. note::

    Now that you've created your own environment, go to your ``gscripts``
    folder and install your own personal ``gscripts``, to make sure it's the
    most updated version:

    .. code::

        cd ~/gscripts
        pip install .  # The "." means install "this," as in "this folder where I am"

Add the location of ``GENOME`` to your ``~/.bashrc``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To run the analysis pipeline, you will need to specify where the genomes are
on TSCC. You can do this by adding this line to your ``~/.bashrc``:

.. code::

    GENOME=/projects/ps-yeolab/genomes
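To check that the variable is set, re-source your ``.bashrc`` and echo it.
The ``ls`` below is just a sketch; the exact contents of the genomes folder
may differ:

.. code::

    source ~/.bashrc
    echo $GENOME       # should print /projects/ps-yeolab/genomes
    ls $GENOME | head  # e.g. hg19, mm9, ...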
Organize your home directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Create an organized home directory structure following a common template, so
others can find your scripts, workflows, and even final results/papers! Do not
store actual data in your home directory, as it is limited to 100 GB.

Link your scratch directory to your home
++++++++++++++++++++++++++++++++++++++++

The "``scratch``" storage on TSCC is for temporary storage (after 90 days it
gets purged). It's very useful for storing intermediate files and outputs from
compute jobs, because the data there is stored on solid-state drives (SSDs,
currently 300TB) which have incredibly fast read-write speeds - perfect for
outputs from alignment algorithms. It can be annoying to go back and forth
between your scratch directory and your home directory, so it's convenient to
have a link to your scratch from home, which you can make like this:

.. code::

    ln -s /oasis/tscc/scratch/$USER $HOME/scratch

.. note::

    This is virtually unlimited temporary storage space, designed for heavy
    I/O. Aside from common reference files (e.g. genomes, GENCODE, etc.), this
    should be the only space that you read/write to from your
    scripts/workflows! The **parallel** throughput of this storage is 100 GB/s
    (e.g. 10 tasks can each read/write at 10 GB/s at the same time).

.. warning::

    Anything saved here is subject to deletion without warning after 3 months
    or less of storage. In particular, the large ``.sam`` and ``.bam`` files
    can get deleted, even though their ``.done`` files (produced by the GATK
    Queue RNA-seq pipeline as a placeholder) will still exist, so they will
    seem done to the pipeline. To avoid lost data, here are a few steps:

    1. Keep your metadata and sample/cell counts in your ``$HOME/projects`` or
       ``/projects/ps-yeolab/$USER`` folder, which don't get purged
       periodically.
    2. Delete ``*.done`` files when re-running a pipeline whose intermediate
       files have been partially purged.
    3. Use this recursive touch command to "refresh" the decay clock on your
       files before important meetings and re-analysis steps:

       .. code::

           cd important_scratch_dir
           find . | xargs touch

Create workflow and projects folders
++++++++++++++++++++++++++++++++++++

Create ``~/workflows`` for your personal bash, makefile, queue, and other
scripts, before you add them to ``gscripts``, and ``~/projects`` for your
projects, to organize figures, notebooks, final results, and even manuscripts.

.. code::

    mkdir ~/workflows ~/projects

Here's an example project directory structure:

.. code::

    $ ls -lha /home/gpratt/projects/fox2_iclip/
    total 9.5K
    drwxr-xr-x  2 gpratt yeo-group  5 Sep 16  2013 .
    drwxr-xr-x 40 gpratt yeo-group 40 Nov 24 12:20 ..
    lrwxrwxrwx  1 gpratt yeo-group 49 Aug 21  2013 analysis -> /home/gpratt/scratch/projects/fox2_iclip/analysis
    lrwxrwxrwx  1 gpratt yeo-group 45 Aug 21  2013 data -> /home/gpratt/scratch/projects/fox2_iclip/data
    lrwxrwxrwx  1 gpratt yeo-group 50 Aug 21  2013 scripts -> /home/gpratt/processing_scripts/fox2_iclip/scripts

.. note::

    Notice that all of these are soft-links to either ``~/scratch`` or some
    other processing-scripts folder.

Let us see your stuff
+++++++++++++++++++++

Make everything readable by other Yeo lab members, and restrict access from
other users (per HIPAA/HITECH requirements):

.. code::

    chmod -R g+r ~/
    chmod -R g+r ~/scratch/
    chmod -R o-rwx ~/
    chmod -R o-rwx ~/scratch/

But ``git`` will get mad at you if your ``~/.ssh`` private keys are visible to
others, so make them visible to only you via:

.. code::

    chmod -R go-rwx ~/.ssh/

In the end, your **home** directory should look something like this:

.. code::

    $ ls -l $HOME
    lrwxrwxrwx  1 bkakarad yeo-group 29 Jun 24  2013 scratch -> /oasis/tscc/scratch/bkakarad/
    drwxr-x---+ 2 bkakarad yeo-group  2 Jun 24  2013 gscripts
    drwxr-x---+ 3 bkakarad yeo-group  3 Jun 24  2013 projects
    drwxr-x---+ 2 bkakarad yeo-group  2 Jun 24  2013 workflows
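A quick way to double-check that the permissions took effect (just a sketch;
the ``+`` in the listing above indicates an ACL is also set, which ``ls``
shows but does not detail):

.. code::

    ls -ld $HOME ~/scratch/   # group should have read access; "others" should have none
    ls -ld ~/.ssh             # only you should have access (drwx------)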
IPython notebooks on TSCC
-------------------------

This has two sections: Setup and Running. They should be done in order :)

Setup IPython notebooks on TSCC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. First, on your personal computer, you will want to set up
   `passwordless ssh`_ from your laptop to TSCC. For reference, ``a@A`` is you
   on your laptop, and ``b@B`` is TSCC. So everywhere you see ``b@B``, replace
   that with ``yourusername@tscc.sdsc.edu``. For ``a@A``, since your laptop
   likely doesn't have a fixed IP address or a way to log in to it, you don't
   need to worry about replacing it. Instead, use ``a@A`` as a reference point
   for whether you should be doing the command from your laptop (``a@A``) or
   TSCC (``b@B``).

2. To set up IPython notebooks on TSCC, you will want to add some ``alias``
   variables to your ``~/.bash_profile`` (for Mac) or ``~/.bashrc`` (for
   Linux):

   .. code::

       IPYNB_PORT=[some number above 1024]
       alias tscc='ssh obotvinnik@tscc-login2.sdsc.edu'

   This way, I can just type ``tscc`` and log onto ``tscc-login2``
   **specifically**. It is important for IPython notebooks that you always log
   on to the same node. You can use ``tscc-login1`` instead, too; this is just
   what I have set up. Just replace my login name ("``obotvinnik``") with
   yours.

3. To activate all the commands you just added, on your laptop, type
   ``source ~/.bash_profile``. (``source`` is a command which will run all the
   lines in the file you gave it, i.e. here it will assign the variable
   ``IPYNB_PORT`` to the value you gave it, and run the ``alias`` command so
   you only have to type ``tscc`` to log in to TSCC.)

4. Next, type ``tscc`` and log on to the server.

5. On TSCC, add these lines to your ``~/.bashrc`` file:

   .. code::

       IPYNB_PORT=same number as the above IPYNB_PORT from your laptop
       alias ipynb="ipython notebook --no-browser --port $IPYNB_PORT &"
       alias sshtscc="ssh -NR $IPYNB_PORT:localhost:$IPYNB_PORT tscc-login2 &"

   Notice that in ``sshtscc``, I use the same node as I logged in to,
   ``tscc-login2``. The ampersands "``&``" at the end of the lines tell the
   computer to run these processes in the background, which is super useful.

6. You'll need to run ``source ~/.bashrc`` again on TSCC, so the
   ``$IPYNB_PORT`` variable and the ``ipynb`` and ``sshtscc`` aliases are
   available.

7. Set up passwordless ssh between the compute nodes and TSCC with:

   .. code::

       cat .ssh/id_rsa.pub >> .ssh/authorized_keys

8. Back on your laptop, edit your ``~/.bash_profile`` (on Macs; ``~/.bashrc``
   for other unix machines) to add the line:

   .. code::

       alias tunneltscc="ssh -NL $IPYNB_PORT\:localhost:$IPYNB_PORT obotvinnik@tscc-login2.sdsc.edu &"

   Make sure to replace "``obotvinnik``" with your TSCC login :) It is also
   important that these are double-quotes and not single-quotes, because
   double-quotes evaluate ``$IPYNB_PORT`` to the number you chose, e.g.
   ``4000``, whereas single-quotes would keep it as the literal letters
   ``$IPYNB_PORT``.

Run IPython Notebooks on TSCC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now that you have everything configured, you can run IPython notebooks on
TSCC! Here are the steps to follow.

1. Log on to TSCC.

2. Start up a ``screen`` session, which allows you to have something running
   continuously, without being logged in:

   .. code::

       screen -x

   .. note::

       If this gives you an error saying "There is no screen to be attached,"
       then you need to run plain old ``screen`` (no ``-x``) first. If this
       gives you an error saying you need to pick one session, make life
       easier for yourself and pick one to kill all the windows in (using
       ``Ctrl-j K`` if you're using the ``.screenrc`` that I recommended
       earlier; otherwise the default is ``Ctrl-a K``). Once you've killed all
       screen sessions except for one, you can run ``screen -x`` with abandon,
       and it will connect you to the only one you have open.

3. In this ``screen`` session, request an interactive job, e.g.:

   .. code::

       qsub -I -l walltime=2:00:00 -q home-yeo -l nodes=1:ppn=2

4. Wait for the job to start.

5. Run your TSCC-specific aliases on the compute node:

   .. code::

       ipynb
       sshtscc

6. **Back on your laptop**, now run your tunneling command:

   .. code::

       tunneltscc

7. Open up ``http://localhost:[YOUR IPYNB PORT]`` in your browser.
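To summarize all the moving pieces, here is a minimal end-to-end sketch, using
a hypothetical port (``8889``) and a placeholder username; substitute your own
values:

.. code::

    # --- laptop: ~/.bash_profile ---
    IPYNB_PORT=8889
    alias tscc='ssh YOUR_TSCC_USERNAME@tscc-login2.sdsc.edu'
    alias tunneltscc="ssh -NL $IPYNB_PORT:localhost:$IPYNB_PORT YOUR_TSCC_USERNAME@tscc-login2.sdsc.edu &"

    # --- TSCC: ~/.bashrc ---
    IPYNB_PORT=8889
    alias ipynb="ipython notebook --no-browser --port $IPYNB_PORT &"
    alias sshtscc="ssh -NR $IPYNB_PORT:localhost:$IPYNB_PORT tscc-login2 &"

    # --- each session: run `ipynb` then `sshtscc` on the compute node,
    #     `tunneltscc` on the laptop, then open http://localhost:8889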
Share your Dropbox account for easy figure syncing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is not necessary, but a nice feature to have. For the initial setup, I
would skip this step. Instructions for this can be found here:
http://www.dropboxwiki.com/tips-and-tricks/using-the-official-dropbox-command-line-interface-cli

1) Make a ``bin`` directory in your home:

   .. code::

       mkdir -p ~/bin

2) Download the CLI interface via ``wget``:

   .. code::

       wget -O ~/bin/dropbox.py "https://www.dropbox.com/download?dl=packages/dropbox.py"

3) Set the permissions so you can execute the CLI interface:

   .. code::

       chmod +x ~/bin/dropbox.py

4) Run the script with ``help`` to check that it is working. A help menu
   should come up.

   .. code::

       ~/bin/dropbox.py help

5) Now download the Dropbox daemon for 64-bit Linux and run it using the
   following two commands (more information can be found here:
   https://www.dropbox.com/install?os=lnx):

   .. code::

       cd ~ && wget -O - "https://www.dropbox.com/download?plat=lnx.x86_64" | tar xzf -
       ~/.dropbox-dist/dropboxd

6) This should generate a link. Paste the link into a web browser and open
   your dropbox. This will complete the sync. Now try it by running the
   following dropbox start command on TSCC:

   .. code::

       ~/bin/dropbox.py start

All done! If everything worked properly, you should now be able to view your
dropbox files on TSCC in the newly created ``Dropbox`` directory in your home.
Files saved to that location can also be viewed in the web browser version of
Dropbox.

NOTE: If you have a lot of stuff in your dropbox, it may be too big to sync
without using an interactive job. If you see the error listed below, try
starting an interactive job FIRST and then re-running the dropbox start
command.

.. code::

    Fatal python error: couldn't create thread state for new thread

Installing and upgrading Python packages
----------------------------------------

To install Python packages, first try ``conda install``:

.. code::

    conda install <package_name>

If the package is not in conda, then try `bioconda` (a google search for your
package along with the keyword "bioconda" will tell you if it is available):

.. code::

    conda install -c bioconda <package_name>

If the package is not in conda at all, then (and ONLY then) try ``pip``:

.. code::

    pip install <package_name>

To upgrade packages, do (using ``conda``):

.. code::

    conda update <package_name>

or (using ``pip``):

.. code::

    pip install -U <package_name>

NOTE - you can check that an executable is correctly installed in your
anaconda with:

.. code::

    which <executable_name>

Alternatively, you can open python on your command line with:

.. code::

    python

And then try to import the package you just installed. If it doesn't throw an
error, it installed successfully!

.. code::

    import <package_name>

To get out of python on your command line:

.. code::

    quit()

Installing R packages (beta!)
-----------------------------

You can also use ``conda`` to install ``R`` and ``R`` packages. This will
allow you to run a jupyter notebook in R rather than Python, which can be VERY
helpful for some analysis software that runs in R.

.. code::

    conda install -c r r-essentials

After the install, load your jupyter notebooks and start a new notebook. You
should see options available to choose between Python2 and R.

Submitting and managing compute jobs on TSCC
--------------------------------------------

Submit jobs
~~~~~~~~~~~

To submit a script that you wrote, in this case called ``myscript.sh``, to
TSCC, do:

.. code::

    qsub -q home-yeo -l nodes=1:ppn=2 -l walltime=0:30:00 myscript.sh
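For reference, here is a minimal sketch of what a ``myscript.sh`` might look
like (the job name and commands are hypothetical). PBS directives inside the
script are an alternative to passing the same flags on the ``qsub`` command
line:

.. code::

    #!/bin/bash
    #PBS -N myjob                 # job name (hypothetical)
    #PBS -q home-yeo              # queue to submit to
    #PBS -l nodes=1:ppn=2
    #PBS -l walltime=0:30:00

    # jobs start in $HOME by default, so move to the submission directory
    cd $PBS_O_WORKDIR

    echo "Running on $(hostname)"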
Submit interactive jobs
~~~~~~~~~~~~~~~~~~~~~~~

To submit interactive jobs, do:

.. code::

    qsub -I -q home-yeo -l nodes=1:ppn=2 -l walltime=0:30:00

Submit jobs to ``home-scrm``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To submit to the ``home-scrm`` queue, add ``-W group_list=scrm-group`` to your
``qsub`` command:

.. code::

    qsub -I -l walltime=0:30:00 -q home-scrm -W group_list=scrm-group

Submitting many jobs at once
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you have a bunch of commands you want to run at once, you can use this
script to submit them all at once. In the next example, ``commands.sh`` is a
file that has the commands you want to run, one command per line.

.. code::

    java -Xms512m -Xmx512m -jar /home/yeo-lab/software/gatk/dist/Queue.jar \
        -S ~/gscripts/qscripts/do_stuff.scala --input commands.sh -run -qsub \
        -jobQueue <queue> -jobLimit <n_jobs> --ncores <n_cores> --jobname <name> -startFromScratch

This runs a scala job that submits sub-jobs to the PBS queue under the name
you fill in where the ``<name>`` placeholder now sits (likewise, fill in the
other ``<...>`` placeholders).

Check job status, aka "why is my job stuck?"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Check the status of your jobs:

.. code::

    qme

.. note::

    This will only work if you have followed the instructions and
    ``source``'d ``~/gscripts/bashrc/tscc_bash_settings_current`` :)

``qme`` outputs:

.. code::

    (olga)[obotvinnik@tscc-login2 ~]$ qme

    tscc-mgr.sdsc.edu:
                                                                                     Req'd    Req'd      Elap
    Job ID                  Username    Queue    Jobname          SessID  NDS   TSK  Memory   Time    S  Time
    ----------------------- ----------- -------- ---------------- ------ ----- ---- ------ --------- - ---------
    2006527.tscc-mgr.local  obotvinnik  home-yeo STDIN             35367     1   16    --   04:00:00 R  02:35:36
    2007542.tscc-mgr.local  obotvinnik  home-yeo STDIN              6168     1    1    --   08:00:00 R  00:28:08
    2007621.tscc-mgr.local  obotvinnik  home-yeo STDIN                --     1   16    --   04:00:00 Q        --
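If you just want a quick count of how many of your jobs are running versus
queued, a rough one-liner (a sketch: in the output format shown above, the
state ``S`` is the 10th column, but double-check that on your system):

.. code::

    qstat -u $USER | awk '$10 == "R" {r++} $10 == "Q" {q++} END {print r+0, "running,", q+0, "queued"}'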
Check job status of array jobs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To check the status of your array jobs, you need to specify ``-t`` to see the
status of the individual array pieces:

.. code::

    qstat -t

Killing jobs
~~~~~~~~~~~~

If you have a job you want to stop, kill it with ``qdel JOBID``, e.g.:

.. code::

    qdel 2006527

Kill an array job
~~~~~~~~~~~~~~~~~

If the job is an array job, you'll need to add brackets, like this:

.. code::

    qdel 2006527[]

Kill all your jobs
~~~~~~~~~~~~~~~~~~

To kill all the jobs that you've submitted, do:

.. code::

    qdel $(qselect -u $USER)

Which queue do I submit to? (check status of queues)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Check the status of the queues (so you know which queues NOT to submit to!):

.. code::

    qstat -q

Example output is:

.. code::

    (olga)[obotvinnik@tscc-login2 ~]$ qstat -q

    server: tscc-mgr.local

    Queue            Memory CPU Time Walltime Node  Run Que Lm  State
    ---------------- ------ -------- -------- ---- ---- --- --  -----
    home-dkeres        --      --       --     --     2   0 --   E R
    home-komunjer      --      --       --     --     0   0 --   E R
    home-ong           --      --       --     --     2   0 --   E R
    home-tg            --      --       --     --     0   0 --   E R
    home-yeo           --      --       --     --     3   1 --   E R
    home-visres        --      --       --     --     0   0 --   E R
    home-mccammon      --      --       --     --    15  29 --   E R
    home-scrm          --      --       --     --     1   0 --   E R
    hotel              --      --    168:00:0  --   232  26 --   E R
    home-k4zhang       --      --       --     --     0   0 --   E R
    home-kkey          --      --       --     --     0   0 --   E R
    home-kyang         --      --       --     --     2   1 --   E R
    home-jsebat        --      --       --     --     1   0 --   E R
    pdafm              --      --    72:00:00  --     1   0 --   E R
    condo              --      --    08:00:00  --    18   6 --   E R
    gpu-hotel          --      --    336:00:0  --     0   0 --   E R
    glean              --      --       --     --    24  75 --   E R
    gpu-condo          --      --    08:00:00  --    16  36 --   E R
    home-fpaesani      --      --       --     --     4   2 --   E R
    home-builder       --      --       --     --     0   0 --   E R
    home               --      --       --     --     0   0 --   E R
    home-mgilson       --      --       --     --     0   4 --   E R
    home-eallen        --      --       --     --     0   0 --   E R
                                                  ----- -----
                                                    321   180

So right now is not a good time to submit to the ``hotel`` queue, since it has
a bunch of both running and queued jobs!

Show available "Service Units"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

List the available Service Units (1 SU = 1 core-hour) ... for a quick ego
boost. Also note that our supercomputer allocation is separated in two,
``yeo-group`` and ``scrm-group``, but the total balance is 5.29 million SU -
just enough to secure us the top honors :-)

.. code::

    gbalance | sort -nrk 3 | head
    Id Name                  Amount  Reserved Balance CreditLimit Available
    -- -------------------- ------- -------- ------- ----------- ---------
    19 tideker-group        5211035    27922 5183113           0   5183113
    82 yeo-group            3262925        0 3262925           0   3262925
    81 scrm-group           2039328        0 2039328           0   2039328
    14 mgilson-group         663095   208000  455095           0    455095
    73 nanosprings-ucm       650000        0  650000           0    650000
    17 kkey-group            635056     7104  627952           0    627952
    16 k4zhang-group         534430        0  534430           0    534430

List the available TORQUE queues, for a quick boost in motivation!

.. code::

    qstat -q

    Queue            Memory CPU Time Walltime Node  Run Que Lm  State
    ---------------- ------ -------- -------- ---- ---- --- --  -----
    home-tideker       --      --       --     16     1   0 --   E R
    home-visres        --      --       --      1     0   0 --   E R
    hotel              --      --    72:00:00  --    25  18 --   E R
    home-k4zhang       --      --       --      4    21   0 --   E R
    home-kkey          --      --       --      5     0   0 --   E R
    pdafm              --      --    72:00:00  --     0   0 --   E R
    condo              --      --    08:00:00  --     0   0 --   E R
    glean              --      --       --     --     0   0 --   E R
    home-builder       --      --       --      8     0   0 --   E R
    home               --      --       --     --     0   0 --   E R
    home-ewyeo         --      --       --     15     0   0 --   E R
    home-mgilson       --      --       --      8     0   0 --   E R
                                                  ----- -----
                                                     47    18

Show available processors
~~~~~~~~~~~~~~~~~~~~~~~~~

To show available processors, do:

.. code::

    showbf

Show specs of all nodes
~~~~~~~~~~~~~~~~~~~~~~~

.. code::

    pbsnodes -a

Uploading Data to Amazon S3 buckets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1) Create an AWS IAM user.

   Sign into the AWS Console (https://console.aws.amazon.com/) and click on
   Identity & Access Management (IAM).

   - On the left, click Users.
   - At the top, click Create New Users.
   - Create a username and generate an access key.
   - Click Create in the lower right.
   - On the next page, click Download Credentials in the lower right to
     download the access key for your new user.
   - Safeguard the downloaded file. It contains a key which is like a password
     for the AWS account.
   - Click Close in the lower right.
   - While still on the IAM Users page, click on the newly-created user's
     name.
   - Click the Permissions tab.
   - Under Managed Policies, click Attach Policy.
   - Type "S3" in the Filter field and check the AmazonS3FullAccess policy.
   - Click Attach Policy in the lower right.
   - Now your username is set up to use AWS S3 services.

2) Download and install the AWS Command Line Interface (CLI).

   - Download from https://aws.amazon.com/cli/
   - Open a terminal and run:

     .. code::

         aws configure

   - Fill in the AWS Access Key ID and AWS Secret Access Key when prompted,
     using the information from the credentials file downloaded above.
   - Pick an appropriate region name:
     http://docs.aws.amazon.com/general/latest/gr/rande.html
   - I used the default output format by not entering any information when
     prompted at that step.

3) Create a new bucket.

   - Log in to your AWS account. Click on Services -> S3.
   - In the upper left hand corner, click "Create Bucket".
   - Name your bucket as you please!

4) From the command line, upload your data to the newly created bucket.

   - You can find information about the available commands here:
     http://docs.aws.amazon.com/cli/latest/userguide/using-s3-commands.html
   - This is a sample of the command I used to upload my data:

     .. code::

         aws s3 sync --acl bucket-owner-full-control --sse AES256 /home/ecwheele/scratch/mds_splicing_v4/final_bams/ s3://mds-splicing-v4-bams
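To confirm the upload finished, you can list the bucket contents afterwards
(using the same example bucket name as above):

.. code::

    aws s3 ls s3://mds-splicing-v4-bams/ --human-readable --summarize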
Random notes
------------

Software goes in ``/projects/ps-yeolab/software``.

Make sure to recursively set group read/write permissions on the software
directory so others can use and update the common software, using:

.. code::

    chmod -R ug+rw /projects/ps-yeolab/software

If you're installing something from source and using ``./configure`` and
``make`` and all that, then always set the flag
``--prefix=/projects/ps-yeolab/software`` when you run ``./configure``:

.. code::

    ./configure --prefix=/projects/ps-yeolab/software

When possible, install binaries to ``/projects/ps-yeolab/software/bin``.

Running RNA-seq, CLIP-Seq, Ribo-Seq, etc. qscripts GATK Queue pipelines
-----------------------------------------------------------------------

We use the Broad Institute's Genome Analysis Toolkit (GATK_) Queue_ software
to run our pipelines. This software solves a lot of problems for us, such as
dealing with multiple-stage pipelines that have cross-dependencies (e.g. you
can't calculate splicing until you've mapped the reads, and you can't map the
reads until after you've removed adapters and repetitive genomic regions from
them), and properly scheduling jobs so that one person's analysis doesn't
completely take over the compute cluster.

Gabe has created a bunch of helpful template scripts for GATK Queue in his
folder ``/home/gpratt/templates``:

.. code::

    $ ls -lh /home/gpratt/templates
    total 26K
    -rwxr-xr-x 1 gpratt yeo-group 660 May  7  2014 bacode_split.sh
    -rwxr-xr-x 1 gpratt yeo-group 554 May  7  2014 bacode_split.sh~
    -rwxr-xr-x 1 gpratt yeo-group 524 Sep 18 00:08 #clipseq.sh#
    -rwxr-xr-x 1 gpratt yeo-group 524 Jul 12  2014 clipseq.sh
    -rwxr-xr-x 1 gpratt yeo-group 516 Mar 26  2014 clipseq.sh~
    -rwxr-xr-x 1 gpratt yeo-group 473 Aug 21 18:47 riboseq.sh
    -rwxr-xr-x 1 gpratt yeo-group 528 Aug 21 18:46 riboseq.sh~
    -rwxr-xr-x 1 gpratt yeo-group 530 Sep  5 17:29 rnaseq.sh
    -rwxr-xr-x 1 gpratt yeo-group 527 Mar 26  2014 rnaseq.sh~

Each Queue job requires a manifest file with a list of all files to process,
and the genome to process them on.

.. warning::

    All further instructions depend on you having followed the directions in
    `Create workflow and projects folders`_, where for this particular
    project, you've created these folders:

    .. code::

        ~/projects/PROJECT_NAME
        ~/processing_scripts/PROJECT_NAME/scripts
        ~/scratch/PROJECT_NAME/data
        ~/scratch/PROJECT_NAME/analysis

    And that you've linked the scratch and home directories correctly.

For example, here's how you can create the project directory structure for a
project called ``singlecell_pnms``:

.. code::

    NAME=singlecell_pnms
    mkdir -p ~/projects/$NAME ~/scratch/$NAME ~/scratch/$NAME/data ~/scratch/$NAME/analysis ~/processing_scripts/$NAME/scripts
    ln -s ~/scratch/$NAME/data ~/projects/$NAME/data
    ln -s ~/scratch/$NAME/analysis ~/projects/$NAME/analysis
    ln -s ~/processing_scripts/$NAME/scripts ~/projects/$NAME/scripts
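Since this is easy to get wrong, you could wrap the pattern in a small shell
function in your ``~/.bashrc``. This is a hypothetical helper (not part of
``gscripts``) that just bundles the commands above:

.. code::

    # hypothetical helper -- not part of gscripts
    newproject () {
        local NAME=$1
        mkdir -p ~/projects/$NAME \
                 ~/scratch/$NAME/data ~/scratch/$NAME/analysis \
                 ~/processing_scripts/$NAME/scripts
        ln -s ~/scratch/$NAME/data ~/projects/$NAME/data
        ln -s ~/scratch/$NAME/analysis ~/projects/$NAME/analysis
        ln -s ~/processing_scripts/$NAME/scripts ~/projects/$NAME/scripts
    }
    # usage: newproject singlecell_pnms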
Here's an example queue script for single-end, non-strand-specific RNA-seq,
from the file ``singlecell_pnms_se_v3.sh``:

.. code::

    #!/bin/bash
    NAME=singlecell_pnms_se
    VERSION=v3
    DIR=singlecell_pnms
    java -Xms512m -Xmx512m -jar /projects/ps-yeolab/software/gatk/dist/Queue.jar -S $HOME/gscripts/qscripts/analyze_rna_seq.scala --input ${NAME}_${VERSION}.txt --adapter TCGTATGCCGTCTTCTGCTTG --adapter ATCTCGTATGCCGTCTTCTGCTTG --adapter CGACAGGTTCAGAGTTCTACAGTCCGACGATC --adapter GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -qsub -jobQueue home-yeo -runDir ~/projects/${DIR}/analysis/${NAME}_${VERSION} -log ${NAME}_${VERSION}.log --location ${NAME} --strict -keepIntermediates --not_stranded -single_end -run

Notice that the "``--input``" is the file ``${NAME}_${VERSION}.txt``, which
translates to ``singlecell_pnms_se_v3.txt`` in this case, since
``NAME=singlecell_pnms_se`` and ``VERSION=v3`` are defined at the beginning of
the file. This file is the "manifest" of the sequencing run. In the case of
single-end reads, this is a file where each line has
``/path/to/read1.fastq.gz\tspecies\n``, where ``\t`` indicates a tab character
and ``\n`` indicates a newline. Here are the first 10 lines of
``singlecell_pnms_se_v3.txt`` (obtained via
``head singlecell_pnms_se_v3.txt``):

.. code::

    /home/obotvinnik/projects/singlecell_pnms/data/CVN_01_R1.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/CVN_02_R1.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/CVN_03_R1.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/CVN_04_R1.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/CVN_05_R1.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/CVN_06_R1.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/CVN_07_R1.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/CVN_08_R1.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/CVN_09_R1.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/CVN_10_R1.fastq.gz	hg19
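Manifest files are plain text, so rather than typing paths by hand, you can
usually generate them with a short loop. A sketch for the single-end case,
assuming your FASTQ files follow the ``*_R1.fastq.gz`` naming used above:

.. code::

    cd ~/projects/singlecell_pnms/data
    for fastq in $(pwd)/*_R1.fastq.gz; do
        printf '%s\thg19\n' "$fastq"    # absolute path, tab, genome
    done > ~/projects/singlecell_pnms/scripts/singlecell_pnms_se_v3.txt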
For paired-end, non-strand-specific RNA-seq, here's the script of the file
``singlecell_pnms_pe_v3.sh``:

.. code::

    #!/bin/bash
    NAME=singlecell_pnms_pe
    VERSION=v3
    DIR=singlecell_pnms
    java -Xms512m -Xmx512m -jar /projects/ps-yeolab/software/gatk/dist/Queue.jar -S $HOME/gscripts/qscripts/analyze_rna_seq.scala --input ${NAME}_${VERSION}.txt --adapter TCGTATGCCGTCTTCTGCTTG --adapter ATCTCGTATGCCGTCTTCTGCTTG --adapter CGACAGGTTCAGAGTTCTACAGTCCGACGATC --adapter GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -qsub -jobQueue home-yeo -runDir ~/projects/${DIR}/analysis/${NAME}_${VERSION} -log ${NAME}_${VERSION}.log --location ${NAME} --strict -keepIntermediates --not_stranded -run

Notice that the "``--input``" is the file ``${NAME}_${VERSION}.txt``, which
translates to ``singlecell_pnms_pe_v3.txt`` in this case, since
``NAME=singlecell_pnms_pe`` and ``VERSION=v3`` are defined at the beginning of
the file. This file is the "manifest" of the sequencing run. In the case of
paired-end reads, each line has
``read1.fastq.gz;read2.fastq.gz\tspecies\n``, where ``\t`` indicates a tab
character and ``\n`` indicates a newline. Here are the first 10 lines of
``singlecell_pnms_pe_v3.txt`` (obtained via
``head singlecell_pnms_pe_v3.txt``):

.. code::

    /home/obotvinnik/projects/singlecell_pnms/data/M1_01_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_01_R2.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/M1_02_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_02_R2.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/M1_03_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_03_R2.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/M1_04_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_04_R2.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/M1_05_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_05_R2.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/M1_06_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_06_R2.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/M1_07_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_07_R2.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/M1_08_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_08_R2.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/M1_09_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_09_R2.fastq.gz	hg19
    /home/obotvinnik/projects/singlecell_pnms/data/M1_10_R1.fastq.gz;/home/obotvinnik/projects/singlecell_pnms/data/M1_10_R2.fastq.gz	hg19

For this project, I had a mix of both paired-end and single-end reads; that's
why ``DIR`` is the same for both the ``singlecell_pnms_se_v3.sh`` and
``singlecell_pnms_pe_v3.sh`` scripts, but ``NAME`` is different - that way,
their outputs are saved in different folders.

Running GATK Queue pipeline scripts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now that you've created a manifest file called ``${NAME}_${VERSION}.txt`` and
a script called ``${NAME}_${VERSION}.sh``, you are almost ready to run the
pipeline.

.. note::

    You should be using ``screen`` quite often now. You'll want to run your
    pipeline in a ``screen`` session, because then even when you log out of
    TSCC, the pipeline will still be running. If you've already run
    ``screen``, reattach the session with:

    .. code::

        screen -x

    If that gives you the error ``There is no screen to be attached.``, then
    you haven't run ``screen`` before, and you can start a session with:

    .. code::

        screen

These scripts take quite a bit of memory to compile, so to be nice to
everyone, log in to a compute node by requesting an interactive job on TSCC.
Also, your script may simply run out of memory and fail if you're not on a
compute node, so that is even more incentive to log in to one!

This command will create an interactive job for 40 hours, on the ``home-scrm``
queue, with 1 node and 1 processor (you don't need more than that for the
script itself - the script will submit jobs that request more nodes/processors
for compute-intensive stuff like STAR or Sailfish). If you have a lot of
samples, you may need more time, but try just 40 hours first. So here's what
you do:

.. code::

    qsub -I -l walltime=40:00:00 -q home-scrm
    # Wait for the job to be ready. This may take a while
    cd ~/projects/$NAME/scripts
    sh ${NAME}_${VERSION}.sh

For example, for the ``singlecell_pnms`` project from before, I would do:

.. code::

    qsub -I -l walltime=40:00:00 -q home-scrm
    # Waited for job to get scheduled/be ready ....
    cd ~/projects/singlecell_pnms/scripts
    sh singlecell_pnms_se_v3.sh
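The script will start printing status messages (shown below). While it runs, a
useful trick is to open a second ``screen`` window and watch the sub-jobs it
submits (a sketch; the log filename assumes the ``-log`` flag from the script
above, relative to the scripts directory):

.. code::

    # in a second screen window (Ctrl-j c):
    qme                                  # Queue's sub-jobs appear here as they are submitted
    tail -f singlecell_pnms_se_v3.log    # follow the pipeline's own log file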
This outputs:

.. code::

    INFO 12:24:42,840 QScriptManager - Compiling 1 QScript
    INFO 12:24:55,100 QScriptManager - Compilation complete
    INFO 12:24:55,359 HelpFormatter - ----------------------------------------------------------------------
    INFO 12:24:55,359 HelpFormatter - Queue v2.3-1095-gdb26a3f, Compiled 2015/01/26 15:22:32
    INFO 12:24:55,359 HelpFormatter - Copyright (c) 2012 The Broad Institute
    INFO 12:24:55,359 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 12:24:55,360 HelpFormatter - Program Args: -S /home/obotvinnik/gscripts/qscripts/analyze_rna_seq.scala --input singlecell_pnms_se_v3.txt --adapter TCGTATGCCGTCTTCTGCTTG --adapter ATCTCGTATGCCGTCTTCTGCTTG --adapter CGACAGGTTCAGAGTTCTACAGTCCGACGATC --adapter GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -qsub -jobQueue home-yeo -runDir /home/obotvinnik/projects/singlecell_pnms/analysis/singlecell_pnms_se_v3 -log singlecell_pnms_se_v3.log --location singlecell_pnms_se --strict -keepIntermediates --not_stranded -single_end -run
    INFO 12:24:55,360 HelpFormatter - Date/Time: 2015/01/27 12:24:55
    INFO 12:24:55,360 HelpFormatter - ----------------------------------------------------------------------
    INFO 12:24:55,361 HelpFormatter - ----------------------------------------------------------------------
    INFO 12:24:55,370 QCommandLine - Scripting AnalyzeRNASeq
    INFO 12:24:58,436 QCommandLine - Added 773 functions
    INFO 12:24:58,438 QGraph - Generating graph.
    INFO 12:24:58,664 QGraph - Running jobs.
    ... more output ...

Pipeline frequently asked questions (FAQ)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

How do I ...
++++++++++++

... deal with multiple species? Do I have to create different manifest files?

Fortunately, no! You can create a single manifest file. Looking at
``/home/gpratt/projects/msi2/scripts``, we see the file ``msi2_v2.txt``, which
has the contents:

.. code::

    /home/gpratt/projects/msi2/data/msi2/MSI2_ACAGTG_ACAGTG_L008_R1.fastq.gz	hg19
    /home/gpratt/projects/msi2/data/msi2/MSI2_CAGATC_CAGATC_L008_R1.fastq.gz	mm9
    /home/gpratt/projects/msi2/data/msi2/MSI2_GCCAAT_GCCAAT_L008_R1.fastq.gz	mm9
    /home/gpratt/projects/msi2/data/msi2/MSI2_TAGCTT_TAGCTT_L008_R1.fastq.gz	hg19
    /home/gpratt/projects/msi2/data/msi2/MSI2_TGACCA_TGACCA_L008_R1.fastq.gz	hg19
    /home/gpratt/projects/msi2/data/msi2/MSI2_TTAGGC_TTAGGC_L008_R1.fastq.gz	hg19

So you can reference multiple genomes in a single manifest file!

... deal with both single-end and paired-end reads in one project? Do I need
to create separate manifest files?

Yes, unfortunately. :( Check out the ``singlecell_pnms`` project above as an
example.

... see the documentation for a queue script?

This command will show the documentation for ``analyze_rna_seq.scala``. For
further documentation, see the `GATK Queue website`_.

.. code::

    java -Xms512m -Xmx512m -jar /projects/ps-yeolab/software/gatk/dist/Queue.jar -S ~/gscripts/qscripts/analyze_rna_seq.scala

analyze_rna_seq
~~~~~~~~~~~~~~~

The queue script ``analyze_rna_seq.scala`` runs or generates:

1. RNA-SeQC_
2. cutadapt
3. MISO
4. OldSplice
5. Sailfish
6. A->I editing predictions
7. bigWig files
8. Counts of reads mapping to repetitive elements
9. Estimates of PCR duplication

Detailed description of `analyze_rna_seq.scala`_ outputs.

analyze_rna_seq_gently
~~~~~~~~~~~~~~~~~~~~~~

The queue script ``analyze_rna_seq_gently.scala`` runs:

1. RNA-SeQC_
2. ...

Combining outputs from the pipeline into matrices
-------------------------------------------------

See the rnaseek_ software for how to combine Sailfish, STAR, and MISO outputs.
.. _TSCC: http://rci.ucsd.edu/computing/index.html
.. _dry lab: mailto:dryyeo-l@googlegroups.com
.. _TSCC users: mailto:tscc-l@mailman.ucsd.edu
.. _Jim Hayes: mailto:jhayes@sdsc.edu
.. _hub: https://hub.github.com/
.. _Screen: https://kb.iu.edu/d/acuy
.. _rcfiles: https://github.com/olgabot/rcfiles
.. _passwordless ssh: http://www.linuxproblem.org/art_9.html
.. _GATK Queue website: http://gatkforums.broadinstitute.org/discussion/1306/overview-of-queue
.. _RNA-SeQC: http://www.broadinstitute.org/cancer/cga/rna-seqc
.. _analyze_rna_seq.scala: analyze_rna_seq
.. _Github's instructions for generating SSH keys: https://help.github.com/articles/generating-ssh-keys/
.. _GATK: https://www.broadinstitute.org/gatk/
.. _Queue: http://gatkforums.broadinstitute.org/discussion/1306/overview-of-queue
.. _R packages he's added so far: https://binstar.org/asmeurer/
.. _rnaseek: http://github.com/olgabot/rnaseek