So here are my fickle attempts to become a GATK developer. One at a time.
Today I needed to remove duplicate reads from a number of bam files from a RAD-Seq experiment. Removing duplicate reads is generally not recommended for RAD-Seq experiments, but in some cases it is appropriate:
"Because of the mechanical shearing step, this method can also be used to identify PCR duplicates in sequence data generated using original RADseq with paired-end sequencing" (Andrews 2016, DOI: 10.1038/nrg.2015.28)

Since I have hundreds of files, I wanted to distribute the work on our Dell cluster, running "IBM Platform LSF 8.3.0.196409" (bsub).
So, to use GATK's Queue, I had to write a Scala script, "removeDuplicates.scala".
The main points were:
- MarkDuplicates has to be imported from the picard extension of queue
- the input is a file containing a list of bam files
- duplicates are not merely marked, but removed, thus the name of the script
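The list-file input is expanded by QScriptUtils.createSeqFromFile. As a rough sketch of what that expansion does (this is a simplified stand-in, not the actual GATK Queue implementation, which does more validation):

```scala
import java.io.File
import scala.io.Source

// Simplified stand-in for QScriptUtils.createSeqFromFile (assumption: the
// real helper in GATK Queue behaves similarly, with extra error handling):
// a bam file is used directly; anything else is treated as a text file
// listing one bam per line.
def createSeqFromFile(input: File): Seq[File] =
  if (input.getName.endsWith(".bam"))
    Seq(input)
  else
    Source.fromFile(input).getLines().filter(_.trim.nonEmpty).map(new File(_)).toSeq
```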
package org.broadinstitute.gatk.queue.extensions.RADseq

import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.picard.MarkDuplicates
import org.broadinstitute.gatk.queue.util.QScriptUtils

/**
 * Created by ries on 10/12/16.
 */
class removeDuplicates extends QScript {

  @Input(doc="File containing a list of input SAM or BAM files to analyze. Files must be coordinate sorted.",
    shortName = "I", fullName = "input_bam_files", required = true)
  //var bamFiles: Seq[File] = Nil
  var bamFiles: File = _

  @Argument(doc="If true do not write duplicates to the output file instead of writing them with appropriate flags set.",
    shortName = "remdup", fullName = "remove_duplicates", required = false)
  var REMOVE_DUPLICATES: Boolean = false

  @Argument(doc = "Maximum number of file handles to keep open when spilling read ends to disk. Set this number a little lower than the per-process maximum number of files that may be open. This number can be found by executing the 'ulimit -n' command on a Unix system.",
    shortName = "max_file_handles", fullName = "max_file_handles_for_read_ends_maps", required = false)
  var MAX_FILE_HANDLES_FOR_READ_ENDS_MAP: Int = -1

  @Argument(doc = "This number, plus the maximum RAM available to the JVM, determine the memory footprint used by some of the sorting collections. If you are running out of memory, try reducing this number.",
    shortName = "sorting_ratio", fullName = "sorting_collection_size_ratio", required = false)
  var SORTING_COLLECTION_SIZE_RATIO: Double = -1

  def script() {
    val bamFilesList = QScriptUtils.createSeqFromFile(bamFiles)
    for (bamFile <- bamFilesList) {
      val dedupedBam = new MarkDuplicates
      dedupedBam.input = Seq(bamFile)
      dedupedBam.output = swapExt(bamFile, ".bam", ".rmdup.bam")
      dedupedBam.metrics = swapExt(bamFile, ".bam", ".rmdup.metrics")
      dedupedBam.REMOVE_DUPLICATES = true
      add(dedupedBam)
    }
  }
}
You can run the script like this:
java -jar Queue.jar -S removeDuplicates.scala -bsub -I bamlist.input -run
For each input file, it will create a file with the duplicates removed ending in ".rmdup.bam", as well as a metrics file ending in ".rmdup.metrics" and a file containing job-specific information ending in ".out". The bam file is indexed for convenience.
For the example input file
"P1_H08_b.bam",
it will generate
"P1_H08_b.rmdup.bam", "P1_H08_b.rmdup.bai", "P1_H08_b.rmdup.metrics", and "P1_H08_b.rmdup.bam.out".
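The output names come from QScript's swapExt helper, which swaps the trailing extension for a new one. A minimal sketch of that renaming (assumption: the real swapExt in Queue additionally preserves the parent directory):

```scala
// Minimal sketch of the extension swap done by QScript.swapExt
// (assumption: the real method also handles parent directories).
def swapExt(name: String, oldExt: String, newExt: String): String =
  name.stripSuffix(oldExt) + newExt

val in = "P1_H08_b.bam"
println(swapExt(in, ".bam", ".rmdup.bam"))     // P1_H08_b.rmdup.bam
println(swapExt(in, ".bam", ".rmdup.metrics")) // P1_H08_b.rmdup.metrics
```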
Regarding the input: Normally, these scripts want each input bam file fed separately with the "-I" option. If you have large numbers of files, this can become tedious. So I decided to change the input behavior. The input can now also be a text file containing one bam file per row. For example:
bamlist.input
P1_A07_a.bam
P1_A07_b.bam
P1_A08_a.bam
P1_A08_b.bam
P1_A09_a.bam
P1_A09_b.bam
P1_A10_a.bam
P1_A10_b.bam
P1_A11_a.bam
P1_A11_b.bam
P1_A12_a.bam
P1_A12_b.bam
P1_B07_a.bam
P1_B07_b.bam
P1_B08_a.bam
P1_B08_b.bam
P1_B09_a.bam
P1_B09_b.bam
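If all the bam files sit in one directory, such a list file can be generated with a single shell command (the path pattern here is an assumption; adjust it to your layout):

```shell
# write one bam filename per line into the list file
ls *.bam > bamlist.input
```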
All the files named in bamlist.input will be processed in parallel, provided the cluster has some resources to spare.
The old behavior with '-I P1_A07_a.bam -I P1_A07_b.bam ...' and so on still works.
That's it.