If you use the GATK, at some point you might want to start using Queue to
a) build pipelines, which saves time and effort when you rerun an analysis and makes it less error-prone
b) use a compute cluster like LSF or Univa Grid Engine to distribute your jobs, potentially speeding up your pipeline many-fold.
I started out with just downloading the 'Queue.jar' from GATK and writing my qscripts in a simple text editor. Most of my time was lost on finding the right functions or classes to use and debugging the qscripts.
To write your own qscripts efficiently, you should use the GATK development version in combination with IntelliJ IDEA. Properly set up, IntelliJ is tremendously helpful: it generates import statements automatically, suggests valid class functions, highlights code and much more.
I had a hard time making my first steps to a working qscript, so I give some examples (see the other GATK use-cases).
Here is a short summary of how to set up the development environment.
I am on Debian 8. I know it works likewise on 7 (since our compute cluster has 7 installed), but I generally recommend the latest stable release of your Unix operating system as well as of GATK (GATK 3.x, that is; none of this will work with the upcoming GATK 4 release).
You also need Java 8. OpenJDK works for me as well as Oracle Java.
If you need Java 7 for most of your applications, then put Java 8 on your path temporarily. For me it looks like this (shell command):
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$JAVA_HOME/bin:$PATH
Although it is 'only for developers', use the GATK-protected repository!
So get the latest GATK-protected release here:
https://github.com/broadgsa/gatk-protected.
And the latest IntelliJ IDEA here:
https://www.jetbrains.com/idea/
If you don't already have it, get and install Maven:
http://maven.apache.org/install.html
Check if Java home is set correctly to java 8:
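One way to do this check is sketched below. It is only a sketch: java_major is a helper name I made up, and the parsing assumes the usual version strings such as "1.8.0_292" (Java 8) or "11.0.2" (Java 11). "mvn -version" additionally reports which Java version and Java home Maven itself picked up.

```shell
#!/bin/sh
# Extract the major Java version from a version string such as
# "1.8.0_292" (Java 8) or "11.0.2" (Java 11).
java_major() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;   # old scheme: 1.<major>.x
        *)   echo "$1" | cut -d. -f1 ;;   # new scheme: <major>.x.y
    esac
}

# Report the java found on PATH, if any.
if command -v java >/dev/null 2>&1; then
    # "java -version" writes to stderr; the version is the quoted token.
    version=$(java -version 2>&1 | grep 'version "' | head -n 1 | cut -d'"' -f2)
    echo "java on PATH: $version (major version $(java_major "$version"))"
fi

echo "JAVA_HOME: ${JAVA_HOME:-<not set>}"

# Maven prints the Java version and Java home it is going to use.
if command -v mvn >/dev/null 2>&1; then
    mvn -version
fi
```

If the major version is not 8, export JAVA_HOME and PATH as shown above and run the check again.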
Now go to the gatk-protected directory, and compile with:
This will compile the whole thing, and produce a GenomeAnalysisTK.jar and Queue.jar in
Later on, when you write your own classes, you need to recompile your version of GATK for your changes to take effect. This can again be done with "mvn verify", or faster with "mvn -Ddisable.shadepackage verify".
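The two build variants can be wrapped in a small helper so that you don't have to remember the flag. This is only a sketch under assumptions: the clone is assumed to live in ~/gatk-protected, the full build is assumed to be "mvn verify", and GATK_DIR, BUILD_MODE and build_cmd are names I made up for illustration.

```shell
#!/bin/sh
# Assumed location of the git clone; override by setting GATK_DIR.
GATK_DIR="${GATK_DIR:-$HOME/gatk-protected}"
# "full" or "fast"; override by setting BUILD_MODE.
BUILD_MODE="${BUILD_MODE:-full}"

# Print the Maven command for the requested build mode.
build_cmd() {
    if [ "$1" = "fast" ]; then
        echo "mvn -Ddisable.shadepackage verify"   # faster variant from the post
    else
        echo "mvn verify"                          # assumed full build
    fi
}

if [ -f "$GATK_DIR/pom.xml" ] && command -v mvn >/dev/null 2>&1; then
    # Run the build inside the clone; the subshell keeps the current directory unchanged.
    (cd "$GATK_DIR" && $(build_cmd "$BUILD_MODE"))
else
    # No clone or no Maven found: only show what would be run.
    echo "would run in $GATK_DIR: $(build_cmd "$BUILD_MODE")"
fi
```

Saved as a script (the file name is up to you), it could be started with BUILD_MODE=fast to get the faster variant.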
MORE INFORMATION
Now to set up the IntelliJ IDEA (modified from here):
- Run mvn test-compile in your git clone's root directory
- Open IntelliJ
- File -> import project, select your git clone directory, then click "ok"
- On the next screen, select "import project from external model", then "maven", then click "next"
- Click "next" on the next screen without changing any defaults -- in particular:
  - DON'T check "Import maven projects automatically"
  - DON'T check "Create module groups for multi-module maven projects"
- On the "Select Profiles" screen, make sure private and protected ARE checked, then click "next".
- On the next screen, the "gatk-aggregator" project should already be checked for you -- if not, then check it.
- Click "next".
- Select the 1.8 SDK, then click "next".
- Select an appropriate project name (can be anything), then click "next" (or "finish", depending on your version of IntelliJ).
- Click "Finish" to create the new IntelliJ project.
It should look something like this:
Using qscripts
I haven't found a recommendation for this, but I suggest you create your new qscripts (Scala scripts) in
~/gatk-protected/protected/gatk-queue-extensions-distribution/src/main/qscripts/org/broadinstitute/gatk/queue/qscripts/
Putting them in other folders eventually results in them being "hard coded" during compilation, so that code changes won't take effect until you compile the whole thing again.
To make this clearer:
In general, you only have to compile after creating new Java or Scala classes, like a new filter or walker, not for a pipeline qscript.
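If you are unsure whether a recompile is due, a rough check is to look for Java or Scala sources outside the qscripts folder that are newer than the built jar. The following is a heuristic sketch, not part of GATK: needs_rebuild is a name I made up, and it assumes the jar is at target/Queue.jar inside a clone in ~/gatk-protected.

```shell
#!/bin/sh
# Succeeds (exit status 0) if a recompile is probably needed: either the jar
# does not exist yet, or some .java/.scala file outside a "qscripts" directory
# is newer than the jar. Plain POSIX sh, so the variables are not local.
needs_rebuild() {
    src_root=$1
    jar=$2
    [ -f "$jar" ] || return 0
    changed=$(find "$src_root" \( -name '*.java' -o -name '*.scala' \) \
                  -newer "$jar" ! -path '*/qscripts/*' ! -path '*/target/*' \
              | head -n 1)
    [ -n "$changed" ]
}

# Assumed locations; override GATK_DIR if your clone lives elsewhere.
GATK_DIR="${GATK_DIR:-$HOME/gatk-protected}"
if needs_rebuild "$GATK_DIR" "$GATK_DIR/target/Queue.jar"; then
    echo "recompile needed"
else
    echo "no recompile needed, qscripts can be run as they are"
fi
```

The check only compares modification times, so it will also ask for a rebuild after operations such as a git checkout that touch source files without changing them.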
Now you are good to go.
To get an overview (or rather a glimpse) read the recent threads on GATK development and pipelining, but make sure to go through the comments. Many of the threads refer to old GATK versions, when ant was used instead of maven, and before the sting-to-gatk-renaming, so keep that in mind.
For starters, you could play around with the HaplotypeCaller qscript in
~/gatk-protected/protected/gatk-queue-extensions-distribution/src/main/qscripts/org/broadinstitute/gatk/queue/qscripts/examples/ExampleHaplotypeCaller.scala
You only need a reference sequence and a BAM file mapped to that reference to test it. Ideally, starting it (on an LSF cluster) would be as easy as
java -jar ~/gatk-protected/target/Queue.jar -S ~/gatk-protected/protected/gatk-queue-extensions-distribution/src/main/qscripts/org/broadinstitute/gatk/queue/qscripts/examples/ExampleHaplotypeCaller.scala -O raw.vcf -R ref.fa -I reads.bam -run
If this works, you can start playing around with the HaplotypeCaller parameters, or process multiple input files at once ('-I 1.bam -I 2.bam').
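For more than a handful of BAM files, typing every '-I' by hand gets tedious. A small sketch of how the command line could be assembled instead: bam_args and queue_cmd are helper names I made up, and the jar and qscript paths are simply the ones from the command above. The sketch only prints the command and leaves out '-run', so you can review it first; as far as I know, Queue only does a dry run when '-run' is omitted.

```shell
#!/bin/sh
# Paths follow the command in the post; override them if your setup differs.
QUEUE_JAR="${QUEUE_JAR:-$HOME/gatk-protected/target/Queue.jar}"
QSCRIPT="${QSCRIPT:-$HOME/gatk-protected/protected/gatk-queue-extensions-distribution/src/main/qscripts/org/broadinstitute/gatk/queue/qscripts/examples/ExampleHaplotypeCaller.scala}"

# Turn a list of BAM files into repeated "-I file" arguments.
# File names containing spaces are not handled by this simple helper.
bam_args() {
    args=""
    for bam in "$@"; do
        args="$args -I $bam"
    done
    printf '%s\n' "${args# }"   # drop the leading space
}

# Print the Queue command for the given BAM files, without "-run".
queue_cmd() {
    printf '%s\n' "java -jar $QUEUE_JAR -S $QSCRIPT -O raw.vcf -R ref.fa $(bam_args "$@")"
}

# Example: print the command for two input files.
queue_cmd 1.bam 2.bam
```

Once the printed command looks right, append '-run' and execute it.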