Donnerstag, 13. Oktober 2016

GATK use-cases: Implementing a new read filter

Another day, another GATK hack.

Stacks, the tool I use to analyze my RAD-Seq data has the constraint, that it should only be run with reads of the exact same length to calculate the rad tags. Depending on your setup, this concerns forward and reverse reads (in case of a double cut RAD-Seq), or forward reads alone and the reverse reads are not used to calculate RAD stacks. So why not just filter for reads of the exact length?
Well, most read filtering tools accept a range of length for both reads.  So the whole read-pair is filtered out, if one of the reads doesn't match the length criteria, even if the other read was perfectly fine.
What I needed was a filter, that made it possible to filter on criteria that were different for the two respective reads of a pair. This could rather easily be done with python and ngs_plumbing to filter the fastq files before mapping. But I wanted to filter the reads after mapping. Here's why:
One of the major advantages of paired reads over single reads is the more accurate mapping, because the information of both reads of a pair is considered to find the best matching position. Therefore people (me included) tend to use
only the properly paired reads for their analysis. By filtering the unmapped reads from the fastq files, I lose both reads of a pair, if the forward read is filtered out due to the length constraints. In contrast, when filtering after the mapping, it is okay to retain the reverser read, because the information of the forward read has already been used to find the best matching position. Thereby, I lose only half of the reads by filtering after than before the mapping.


Since I use GATK and Queue a lot I set myself on implementing such a read filter in GATK.

If you have gatk-protected-3.6 and IntelliJ IDEA installed (using java8), you're ready to go.

First, I created a new java class "forwardReadLengthFilter.java" in


gatk-protected/public/gatk-engine/src/main/java/org/broadinstitute/gatk/engine/filters/




 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
package org.broadinstitute.gatk.engine.filters;
import htsjdk.samtools.SAMRecord;
import org.broadinstitute.gatk.utils.commandline.Argument;

/**
 * Filter out forward reads based on length
 *
 * <p>This filter is useful for running on only forward reads (the first of a pair) that are longer (or shorter) than the given threshold sizes.
 *  Implemented, because for the stacks RAD-Seq analysis pipeline, all forward reads need to have the exact same length.</p>
 *
 * <h3>Usage example</h3>
 *
 * <pre>
 *     java -jar GenomeAnalysisTk.jar \
 *         -T ToolName \
 *         -R reference.fasta \
 *         -I input.bam \
 *         -o output.file \
 *         -rf ReadLength \
 *         -minForwardRead 50 \
 *         -maxForwardRead 101
 * </pre>
 *
 * @author ries
 * @version 0.1
 */

public class forwardReadLengthFilter extends ReadFilter {
    @Argument(fullName = "maxForwardReadLength", shortName = "maxForwardRead", doc = "Discard forward reads with length greater than the specified value", required = true)
    private int maxReadLength;

    @Argument(fullName = "minForwardReadLength", shortName = "minForwardRead", doc = "Discard forward reads with length shorter than the specified value", required = true)
    private int minReadLength = 1;

    public boolean filterOut(SAMRecord read) {
        // check the length
        return (((read.getReadLength() > maxReadLength) || (read.getReadLength() < minReadLength)) && read.getFirstOfPairFlag());
    }
}

By importing the SAMRecord, you can use all the nice getter Methods, to check the properties of your reads. I only needed to check if the read was the first read of a read pair (getFirstOfPairFlag()) and if it is to long (getReadLength() > maxReadLength) or to short (getReadLength() < minReadLength). But if you want to design your own filters (and you surely do) have a look here:

SAM record methods

The maxReadLength and minReadLength values are commited to the read filter via the @Argument lines.

All the real filtering is done by the ReadFilter class we extended.

To really use the filter, it has to be compiled and incorporated into the Queue.jar file. This is done by calling from the terminal


1
mvn clean; mvn verify


in the gatk-protected folder.

This takes a while (so have a Martini), but finally the new Queue.jar can be found in:




1
 ../gatk-protected/target/Queue.jar

So weit so gut.

 Now I had to actually use the new filter. The easiest way for this is to run a "PrintReads" instance with the filter: I created a "FilterUniqBAM.scala" (never mind the name) in a directory where I store my RAD-Seq related scripts:


1
/home/ries/gatk-protected/public/gatk-queue-extensions-public/src/main/scala/org/broadinstitute/gatk/queue/extensions/RADseq/filterUniq/FilterUniq2.scala





 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
package org.broadinstitute.gatk.queue.extensions.RADseq.filterUniq
import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.gatk._
import org.broadinstitute.gatk.queue.util.QScriptUtils
/**
  * Created by ries on 10/12/16.
  */

class FilterUniqBAM extends QScript{
  @Input(doc="File containing a list of input SAM or BAM files to analyze. Files must be coordinate sorted.", shortName = "I", fullName = "input_bam_files", required = true)
  //var bamFiles: Seq[File] = Nil
  var bamFiles: File = _

  @Input(doc="The reference file for the bam files.", shortName="R", required=true)
  var referenceFile: File = _

  @Argument(fullName = "maxForwardReadLength", shortName = "maxForwardRead", doc="Discard forward reads with length greater than the specified value", required=false)
  var maxForwardLength: Int = 1;

  @Argument(fullName = "minForwardReadLength", shortName = "minForwardRead", doc="Discard forward reads with length shorter than the specified value", required=false)
  var minForwardLength: Int = 100000;

  def script() {

    val bamFilesList = QScriptUtils.createSeqFromFile(bamFiles)
    for (bamFile <- bamFilesList){
      val filterBam = new PrintReads with forwardReadLength

      filterBam.input_file = Seq(bamFile)
      filterBam.reference_sequence = referenceFile
      filterBam.out = swapExt(bamFile,".bam",".uniq.bam")
      filterBam.maxForwardReadLength = maxForwardLength
      filterBam.minForwardReadLength = minForwardLength
      add(filterBam)
    }
  }

}




The "with" statement allows to directly set the filter arguments via "maxForwardReadLength" and
"minForwardReadLength" (remember, we defined those in the forwardReadLengthFilter.java class).

Finally, to run the scala script on the cluster:



1
java -jar ~/gatk-protected/target/Queue.jar -S  /home/ries/gatk-protected/public/gatk-queue-extensions-public/src/main/scala/org/broadinstitute/gatk/queue/extensions/RADseq/filterUniq/FilterUniqBAM.scala -bsub  -I P2_A04_a.rmdup.bam -R /biodata/irg/grp_stich/RAD-Seq_ries/data/B.napus_reference/referencesequence.fa -run --maxForwardReadLength 90 --minForwardReadLength 88

This generated me a file "P2_A04_a.rmdup.uniq.bam" with forward exactly 89bp in length, and reverse reads of all kinds of length. Keep in mind, that some of the reverse reads lost their forward pair, though. This means, that they were properly paired when mapped to the reference sequence, but are not anymore.


Cheers,

Dave


Keine Kommentare:

Kommentar veröffentlichen