Tuesday, 22 November 2016

GATK use-cases: Getting started with Queue and IntelliJ IDEA

If you use the GATK, at some point you might want to start using Queue to

a) build pipelines, thus saving time and effort when rerunning your analyses and making them less error-prone
b) use a compute cluster like LSF or Univa Grid Engine to distribute your jobs, potentially speeding up your pipeline many-fold.


I started out by just downloading the 'Queue.jar' from GATK and writing my qscripts in a simple text editor. Most of my time was lost on finding the right functions or classes to use and on debugging the qscripts.

To write your own qscripts efficiently, you should use the GATK development version in combination with IntelliJ IDEA. Properly set up, IntelliJ is tremendously helpful: it automatically generates import statements, suggests valid class methods, highlights code, and much more.

I had a hard time taking my first steps towards a working qscript, so I give some examples (see the other GATK use-cases).

Here is a short summary of how to set up the development environment.


I am on Debian 8. I know it works on 7 as well (since our compute cluster has 7 installed), but I generally recommend the latest stable release of your Unix operating system as well as of GATK (GATK 3.x, that is; none of this will work on the upcoming GATK 4 release).

You also need Java 8; OpenJDK works for me as well as Oracle Java.
If you need Java 7 for most of your applications, then export Java 8 temporarily to your path. For me it looks like this (shell commands):


export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$JAVA_HOME/bin:$PATH



Although it is 'only for developers', use the GATK-protected repository!

So get the latest GATK-protected release here:
https://github.com/broadgsa/gatk-protected.

And the latest IntelliJ IDEA here:
https://www.jetbrains.com/idea/

If you don't already have it, get and install Maven:
http://maven.apache.org/install.html

Check that JAVA_HOME is set correctly to Java 8:



mvn --version

Now go to the gatk-protected directory, and compile with:


mvn clean
mvn verify

This will compile the whole thing, and produce a GenomeAnalysisTK.jar and Queue.jar in


gatk-protected/target/

Later on, when you write your own classes, you need to recompile your version of GATK for your changes to take effect. This can again be done with "mvn verify", or faster with "mvn -Ddisable.shadepackage verify".
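For quick reference, both ways of recompiling, run from the gatk-protected root directory:


mvn clean
mvn verify

# faster during development: skip building the shaded packages
mvn -Ddisable.shadepackage verify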

Now to set up IntelliJ IDEA (modified from here):

  • Run mvn test-compile in your git clone's root directory
  • Open IntelliJ, choose File -> Import Project, select your git clone directory, then click "OK".
  • On the next screen, select "Import project from external model", then "Maven", then click "Next".
  • Click "Next" on the screen after that without changing any defaults -- in particular:
    • DON'T check "Import maven projects automatically"
    • DON'T check "Create module groups for multi-module maven projects"
  • On the "Select Profiles" screen, make sure private and protected ARE checked, then click "next".
  • On the next screen, the "gatk-aggregator" project should already be checked for you -- if not, then check it.
  • Click "next".
  • Select the 1.8 SDK, then click "next". 
  • Select an appropriate project name (can be anything), then click "next" (or "finish", depending on your version of IntelliJ).
  • Click "Finish" to create the new IntelliJ project.


Using qscripts

I haven't found an official recommendation for this, but I suggest you create your new qscripts (Scala scripts) in

~/gatk-protected/protected/gatk-queue-extensions-distribution/src/main/qscripts/org/broadinstitute/gatk/queue/qscripts/

Putting them in other folders eventually results in them being "hard coded" during compilation, so that code changes won't take effect until you compile the whole thing again.

To make this clearer: in general, you only have to compile after creating new Java or Scala classes, like a new filter or walker, not after changing a pipeline qscript.

Now you are good to go.
To get an overview (or rather a glimpse), read the recent threads on GATK development and pipelining, but make sure to go through the comments. Many of the threads refer to old GATK versions, when ant was used instead of maven and before the sting-to-gatk renaming, so keep that in mind.

For starters, you could play around with the HaplotypeCaller qscript in

~/gatk-protected/protected/gatk-queue-extensions-distribution/src/main/qscripts/org/broadinstitute/gatk/queue/qscripts/examples/ExampleHaplotypeCaller.scala

You only need a reference sequence and a BAM file mapped against it to test it. Ideally, starting it (on an LSF cluster) would be as easy as


java  -jar ~/gatk-protected/target/Queue.jar -S ~/gatk-protected/protected/gatk-queue-extensions-distribution/src/main/qscripts/org/broadinstitute/gatk/queue/qscripts/examples/ExampleHaplotypeCaller.scala -O raw.vcf -R ref.fa -I reads.bam  -run  

If this works, you can start playing around with the HaplotypeCaller parameters, or with processing multiple input files at once ('-I 1.bam -I 2.bam'), as shown below.
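
For example, the same call as above with two input files (the BAM names are just placeholders):


java -jar ~/gatk-protected/target/Queue.jar -S ~/gatk-protected/protected/gatk-queue-extensions-distribution/src/main/qscripts/org/broadinstitute/gatk/queue/qscripts/examples/ExampleHaplotypeCaller.scala -O raw.vcf -R ref.fa -I 1.bam -I 2.bam -run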

Thursday, 13 October 2016

GATK use-cases: Implementing a new read filter

Another day, another GATK hack.

Stacks, the tool I use to analyze my RAD-Seq data, has the constraint that it should only be run with reads of exactly the same length to calculate the RAD tags. Depending on your setup, this concerns forward and reverse reads (in the case of double-cut RAD-Seq), or only the forward reads, with the reverse reads not used to calculate RAD stacks. So why not just filter for reads of the exact length?
Well, most read filtering tools accept one length range for both reads. So the whole read pair is filtered out if one of the reads doesn't match the length criteria, even if the other read was perfectly fine.
What I needed was a filter that made it possible to filter on criteria that differ between the two reads of a pair. This could rather easily be done with python and ngs_plumbing to filter the fastq files before mapping. But I wanted to filter the reads after mapping. Here's why:
One of the major advantages of paired reads over single reads is the more accurate mapping, because the information of both reads of a pair is considered to find the best matching position. Therefore people (me included) tend to use only the properly paired reads for their analysis. By filtering the fastq files before mapping, I lose both reads of a pair if the forward read is filtered out due to the length constraints. In contrast, when filtering after the mapping, it is fine to retain the reverse read, because the information of the forward read has already been used to find the best matching position. So filtering after the mapping loses only about half as many reads as filtering before.


Since I use GATK and Queue a lot, I set out to implement such a read filter in GATK.

If you have gatk-protected-3.6 and IntelliJ IDEA installed (using Java 8), you're ready to go.

First, I created a new Java class "forwardReadLengthFilter.java" in


gatk-protected/public/gatk-engine/src/main/java/org/broadinstitute/gatk/engine/filters/




package org.broadinstitute.gatk.engine.filters;
import htsjdk.samtools.SAMRecord;
import org.broadinstitute.gatk.utils.commandline.Argument;

/**
 * Filter out forward reads based on length
 *
 * <p>This filter is useful for running on only forward reads (the first of a pair) that are longer (or shorter) than the given threshold sizes.
 *  Implemented, because for the stacks RAD-Seq analysis pipeline, all forward reads need to have the exact same length.</p>
 *
 * <h3>Usage example</h3>
 *
 * <pre>
 *     java -jar GenomeAnalysisTK.jar \
 *         -T ToolName \
 *         -R reference.fasta \
 *         -I input.bam \
 *         -o output.file \
 *         -rf forwardReadLength \
 *         -minForwardRead 50 \
 *         -maxForwardRead 101
 * </pre>
 *
 * @author ries
 * @version 0.1
 */

public class forwardReadLengthFilter extends ReadFilter {
    @Argument(fullName = "maxForwardReadLength", shortName = "maxForwardRead", doc = "Discard forward reads with length greater than the specified value", required = true)
    private int maxReadLength;

    @Argument(fullName = "minForwardReadLength", shortName = "minForwardRead", doc = "Discard forward reads with length shorter than the specified value", required = true)
    private int minReadLength = 1;

    public boolean filterOut(SAMRecord read) {
        // check the length
        return (((read.getReadLength() > maxReadLength) || (read.getReadLength() < minReadLength)) && read.getFirstOfPairFlag());
    }
}

By importing SAMRecord, you can use all the nice getter methods to check the properties of your reads. I only needed to check whether the read was the first read of a read pair (getFirstOfPairFlag()) and whether it is too long (getReadLength() > maxReadLength) or too short (getReadLength() < minReadLength). But if you want to design your own filters (and you surely do), have a look here:

SAM record methods

The maxReadLength and minReadLength values are passed to the read filter via the @Argument annotations.

All the real filtering is done by the ReadFilter class we extended.

To really use the filter, it has to be compiled and incorporated into the Queue.jar file. This is done by calling from the terminal


mvn clean; mvn verify


in the gatk-protected folder.

This takes a while (so have a Martini), but finally the new Queue.jar can be found in:




 ../gatk-protected/target/Queue.jar

So far, so good.
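
At this point you can already smoke-test the filter on a single BAM file with any GATK walker, e.g. PrintReads. A minimal sketch, assuming the freshly built GenomeAnalysisTK.jar, the argument names defined in the class above, and placeholder file names (GATK derives the filter name 'forwardReadLength' from the class name minus the 'Filter' suffix):


java -jar ~/gatk-protected/target/GenomeAnalysisTK.jar -T PrintReads -R ref.fa -I reads.bam -o reads.filtered.bam -rf forwardReadLength -minForwardRead 88 -maxForwardRead 90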

Now I had to actually use the new filter. The easiest way to do this is to run a "PrintReads" instance with the filter attached. I created a "FilterUniqBAM.scala" (never mind the name) in a directory where I store my RAD-Seq related scripts:


/home/ries/gatk-protected/public/gatk-queue-extensions-public/src/main/scala/org/broadinstitute/gatk/queue/extensions/RADseq/filterUniq/FilterUniqBAM.scala





package org.broadinstitute.gatk.queue.extensions.RADseq.filterUniq
import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.gatk._
import org.broadinstitute.gatk.queue.util.QScriptUtils
/**
  * Created by ries on 10/12/16.
  */

class FilterUniqBAM extends QScript{
  @Input(doc="File containing a list of input SAM or BAM files to analyze. Files must be coordinate sorted.", shortName = "I", fullName = "input_bam_files", required = true)
  //var bamFiles: Seq[File] = Nil
  var bamFiles: File = _

  @Input(doc="The reference file for the bam files.", shortName="R", required=true)
  var referenceFile: File = _

  @Argument(fullName = "maxForwardReadLength", shortName = "maxForwardRead", doc="Discard forward reads with length greater than the specified value", required=false)
  var maxForwardLength: Int = 100000

  @Argument(fullName = "minForwardReadLength", shortName = "minForwardRead", doc="Discard forward reads with length shorter than the specified value", required=false)
  var minForwardLength: Int = 1

  def script() {

    val bamFilesList = QScriptUtils.createSeqFromFile(bamFiles)
    for (bamFile <- bamFilesList){
      val filterBam = new PrintReads with forwardReadLength

      filterBam.input_file = Seq(bamFile)
      filterBam.reference_sequence = referenceFile
      filterBam.out = swapExt(bamFile,".bam",".uniq.bam")
      filterBam.maxForwardReadLength = maxForwardLength
      filterBam.minForwardReadLength = minForwardLength
      add(filterBam)
    }
  }

}




The "with" statement mixes the filter into PrintReads and lets you set the filter arguments "maxForwardReadLength" and "minForwardReadLength" directly (remember, we defined those in the forwardReadLengthFilter.java class).

Finally, to run the scala script on the cluster:



java -jar ~/gatk-protected/target/Queue.jar -S  /home/ries/gatk-protected/public/gatk-queue-extensions-public/src/main/scala/org/broadinstitute/gatk/queue/extensions/RADseq/filterUniq/FilterUniqBAM.scala -bsub  -I P2_A04_a.rmdup.bam -R /biodata/irg/grp_stich/RAD-Seq_ries/data/B.napus_reference/referencesequence.fa -run --maxForwardReadLength 90 --minForwardReadLength 88

This generated a file "P2_A04_a.rmdup.uniq.bam" with forward reads of exactly 89 bp and reverse reads of all kinds of lengths. Keep in mind that some of the reverse reads lost their forward mate, though: they were properly paired when mapped to the reference sequence, but are not anymore.
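
To double-check the forward read lengths, a quick sketch with samtools and awk (samtools is an extra assumption here, not part of the GATK setup); -f 64 keeps only first-in-pair reads, and the pipe prints every distinct read length:


samtools view -f 64 P2_A04_a.rmdup.uniq.bam | awk '{print length($10)}' | sort -u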


Cheers,

Dave


Wednesday, 12 October 2016

GATK use-cases: removing duplicate reads with queue

I have been using GATK for some years and somehow managed to glue a Queue-based bam -> vcf pipeline together. Now I have decided that it is time to introduce myself to the wonderful world of Queue development.

So here are my fickle attempts to become a GATK developer. One at a time.

Today I needed to remove duplicate reads from a number of bam files from a RAD-Seq experiment. Removing duplicate reads is not generally recommended for RAD-Seq experiments, but in some cases it is:
"Because of the mechanical shearing step, this method can also be used to identify PCR duplicates in sequence data generated using original RADseq with paired-end sequencing" (Andrews 2016, DOI: 10.1038/nrg.2015.28)
Since I have hundreds of files, I wanted to distribute the work over our Dell cluster, running "IBM Platform LSF 8.3.0.196409" (bsub).

So, to use GATK's Queue, I had to write a Scala script, "removeDuplicates.scala".

The main points were:
  • MarkDuplicates has to be imported from the picard extension of queue
  • the input is a file containing a list of bam files
  • duplicates are not merely marked, but removed, thus the name of the script
The implementation of  "removeDuplicates.scala" is as follows:

package org.broadinstitute.gatk.queue.extensions.RADseq

import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.picard.MarkDuplicates
import org.broadinstitute.gatk.queue.util.QScriptUtils

/**
  * Created by ries on 10/12/16.
  */
class removeDuplicates extends QScript{
  @Input(doc="File containing a list of input SAM or BAM files to analyze. Files must be coordinate sorted.", shortName = "I", fullName = "input_bam_files", required = true)
  //var bamFiles: Seq[File] = Nil
  var bamFiles: File = _

  @Argument(doc="If true do not write duplicates to the output file instead of writing them with appropriate flags set.", shortName = "remdup", fullName = "remove_duplicates", required = false)
  var REMOVE_DUPLICATES: Boolean = false

  @Argument(doc = "Maximum number of file handles to keep open when spilling read ends to disk.  Set this number a little lower than the per-process maximum number of file that may be open.  This number can be found by executing the 'ulimit -n' command on a Unix system.", shortName = "max_file_handles", fullName ="max_file_handles_for_read_ends_maps", required=false)
  var MAX_FILE_HANDLES_FOR_READ_ENDS_MAP: Int = -1

  @Argument(doc = "This number, plus the maximum RAM available to the JVM, determine the memory footprint used by some of the sorting collections.  If you are running out of memory, try reducing this number.", shortName = "sorting_ratio", fullName = "sorting_collection_size_ratio", required = false)
  var SORTING_COLLECTION_SIZE_RATIO: Double = -1


  def script() {

    val bamFilesList = QScriptUtils.createSeqFromFile(bamFiles)
    for (bamFile <- bamFilesList){
      val dedupedBam = new MarkDuplicates

      dedupedBam.input = Seq(bamFile)
      dedupedBam.output = swapExt(bamFile,".bam",".rmdup.bam")
      dedupedBam.metrics = swapExt(bamFile,".bam",".rmdup.metrics")
      dedupedBam.REMOVE_DUPLICATES = true
      add(dedupedBam)
    }
  }

}  


You can run the script like this:


java -jar Queue.jar -S  removeDuplicates.scala -bsub  -I bamlist.input -run

For each input file, it will create a file with the duplicates removed ending in ".rmdup.bam", as well as a metrics file ending in ".rmdup.metrics" and a file containing job-specific information ending in ".out". The bam file is indexed for convenience.

For the example input file

"P1_H08_b.bam",

it will generate

"P1_H08_b.rmdup.bam", "P1_H08_b.rmdup.bai", "P1_H08_b.rmdup.metrics", and "P1_H08_b.rmdup.bam.out".


Regarding the input: normally, these scripts want each input bam file fed separately with the "-I" option. If you have large numbers of files this can become tedious, so I decided to change the input behavior. The input can now also be a text file containing one bam file per row. For example:

bamlist.input


P1_A07_a.bam
P1_A07_b.bam
P1_A08_a.bam
P1_A08_b.bam
P1_A09_a.bam
P1_A09_b.bam
P1_A10_a.bam
P1_A10_b.bam
P1_A11_a.bam
P1_A11_b.bam
P1_A12_a.bam
P1_A12_b.bam
P1_B07_a.bam
P1_B07_b.bam
P1_B08_a.bam
P1_B08_b.bam
P1_B09_a.bam
P1_B09_b.bam

All the files named in bamlist.input will be processed in parallel, provided the cluster has some resources to spare.
The old behavior with '-I P1_A07_a.bam -I P1_A07_b.bam ...' and so on still works.
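
If you need to create such a list, one simple way (assuming all bam files are in the current directory):


ls *.bam > bamlist.input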


That's it.

Wednesday, 31 August 2016

How to make an image with fading round edges in GIMP

For a pirate-themed party, I wanted to send some nicely printed invitation cards. They should be something special. One of the features I wanted was photos of me and my friends dressed up as pirates on the card.
But they should blur into the background.
It took me some time to figure out how to do this in GIMP, but here it is:

Open the photo.
Filters->Decor->Old Photo
Layer->Transparency->Add alpha channel
Ellipse select tool
Select->Invert
Select->Feather
Press Delete
Then copy and paste the photo as a new layer to the card.

As a background, I used my local area from
http://www.yarrmaps.com/

Wednesday, 6 April 2016

Integrate raspberry pi 3 into spark cluster

I have the spark cluster already set up, so this is only about setting up a fresh raspberry pi 3 as a worker in the cluster.

This is based on setting-up-a-standalone-apache-spark-cluster-of-raspberry-pi-2

Installing Raspbian
Put the SD card (I use a 16 GB Toshiba ADP-HS02) in your card reader.
I then use lsblk to find out the device name of the SD card.

In my case the partition is sdb1, so the device (sdb) is where I want to write the Raspbian image:

sudo dd bs=1M if=~/Downloads/2016-02-09-raspbian-jessie.img of=/dev/sdb

This can take some time. When it's done, unmount the SD card.
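
For example (a small sketch: sync flushes any buffered writes to the card; the umount is only needed if your desktop auto-mounted a partition, and /dev/sdb1 is just the partition from above):

sync
sudo umount /dev/sdb1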

Physically attaching the pi to the cluster

Put the sd card in your pi.
Connect the lan port to the switch.
Power up the pi.

Configuring the pi
ssh into the headnode/master of your cluster (assuming it is a standalone cluster). I gave it the hostname "pi-headnode", so I can access it via:
ssh pi@pi-headnode
You can use nmap to scan the cluster for nodes.
As a reminder: This is a standalone cluster. So it has its own network, with its own IP addresses. The headnode is the bridge between the cluster, and the main network. From the headnode, both networks (the main network and the cluster) can be accessed. Use ifconfig to find out the IP addresses of the networks. eth0 will probably be connected to the cluster, and eth1 to the main network, but this depends on your setup. In my case, the IP of the main network is 192.168.0.*, that of the cluster 192.168.1.*. So to scan the cluster I use:
nmap -sn 192.168.1.0/24
and I see the new node.

run the scripts
sh map-network
sh copy-keys
sh upgrade-workers
to integrate the pi into the cluster and upgrade all pis. This can take some time.
Instead of upgrade-workers, you could ssh into the new pi and upgrade it manually with sudo apt-get upgrade.

The pi is now part of the cluster, but I like to configure it a bit more:
ssh into the new pi
start the configuration tool:
sudo raspi-config
and expand the disk space, set boot to console with automatic login to save RAM, and reduce the RAM for the GPU to 0

copy spark to the cluster
update the slaves.txt
cp ~/workers.txt ~/spark-1.6.1-bin-hadoop2.6/conf/slaves
copy spark to the cluster
scp -r /home/pi/spark-1.6.1-bin-hadoop2.6/ pi@192.168.1.12:/home/pi/spark-1.6.1-bin-hadoop2.6/

 
Start the slaves
spark-1.6.1-bin-hadoop2.6/sbin/start-slaves.sh
this starts all slaves that are not yet running.

Done!

You will hopefully see the new worker listed in the Spark web UI.





Wednesday, 16 March 2016

free disk space in /boot by removing old kernels

Once in a while Ubuntu can't update because there is not enough free space in /boot.

To free up space, old, unused kernels have to be removed. This can be done automatically with:
 
 
sudo apt-get purge $(dpkg -l linux-{image,headers}-"[0-9]*" | awk '/ii/{print $2}' | grep -ve "$(uname -r | sed -r 's/-[a-z]+//')")
 
From:
 
https://askubuntu.com/questions/89710/how-do-i-free-up-more-space-in-boot 
 

Sunday, 21 February 2016

configuring raspbian jessie for easy ssh access


Configure your pi with
sudo raspi-config
Within the menu, do the following
- change the name of your pi, for example 'pi1'
- allow ssh access
- change the password (and remember it)

restart the pi

Now, from some other Unix computer, you want to scan the local network for connected devices. This can be done with nmap. You also need to know the address of your local network, which is normally 192.168.0.* or 192.168.1.*, but it's better to find it out with either ifconfig or ip addr.
Step by step:
- Install nmap
sudo apt-get install nmap
- find your local network's IP address with either
ip addr
or
ifconfig

joker@joker-Ultrabook:~$ ifconfig
eth0      Link encap:Ethernet  HWaddr e8:03:9a:ee:f4:2b 
          inet addr:192.168.0.131  Bcast:192.168.0.255  Mask:255.255.255.0
 
In this example, my device's IP is 192.168.0.131, so my local network's address is 192.168.0.*, with 131 being my laptop's host number.

- search the network with nmap:
nmap 192.168.0.0/24

24 is shorthand for subnet mask 255.255.255.0

As a result, nmap shows you all devices in the network, including your pi. Of course the pi has to be connected to the network via cable or wifi:
Nmap scan report for pi1 (192.168.0.119)
Host is up (0.011s latency).
Not shown: 999 closed ports
PORT   STATE SERVICE
22/tcp open  ssh

Nice, the pi is there, ssh is open, and it's correctly named 'pi1'.
You could use its IP address 192.168.0.119 to ssh into it, but using the name is much easier.
ssh pi@pi1




 



Friday, 19 February 2016

I checked out my master again, but it still looks like the branch I was working on?

Short story:

You have to commit your changes in the branch before checking out the master; otherwise the uncommitted changes simply come along with you. So the solution is to check out the branch again, commit the changes, and check out master afterwards.
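
In commands, using the branch name from the longer story below:

git checkout multiprocessing
git commit -am 'started implementing mp'
git checkout master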


Long story:

I started implementing multiprocessing. Naturally, I didn't want to mess up my code, so I created a new branch 'multiprocessing':
git checkout -b multiprocessing
Some hours of biting my nails later, I had it running, and even saw some improvement in speed. I also learned that implementing multiprocessing in python is easy, but finding the right tasks that can be processed in parallel with a speed improvement is hard.
Anyway, the code I wrote was far from clean, but my curiosity was satisfied for now. I decided to put that project aside and get back to it later.
So I checked out to my master:
git checkout master
When I ran the tool later, I realized that it started working on multiple cores. Had I forgotten to check out? So I had a look:
git branch
and it confirmed that I had correctly checked out:
joker@joker-Ultrabook:~/VCF2AFAnalysis$ git branch
* master
  multiprocessing
  plot_labels

Still, the code was the one I wrote while checked out to multiprocessing.
So I checked out back to multiprocessing:
git checkout multiprocessing
and committed the changes
git commit -am 'started implementing mp'
After checking out to master everything was alright.


I don't know why git doesn't tell you that you always have to commit your changes before checking out, but well, that's just how it wants it.

Friday, 22 January 2016

How to qrsh into a specific host

Working on a grid engine you sometimes want to work on that one specific host, using qrsh instead of ssh.

Somehow it is not super obvious how to get there, and even google didn't help.

So there:

qrsh -l hostname=waldorf
where "waldorf" is the name of the host you want to log onto.

Friday, 15 January 2016

How to prevent "broken pipe" with ssh (Ubuntu 14.10)

Since I work mostly from home, I use ssh a lot.
The "broken pipe" error disrupted my workflow constantly, especially when some task that had been running for a couple of minutes was killed.

At first, I fixed it in a rather n00bish way, by running the task in the background
('&', or Ctrl+Z and bg) and keeping ssh alive with a never-ending while loop:
while true; do sleep 60; qstat -u ries;  done
But according to
https://askubuntu.com/questions/127369/how-to-prevent-write-failed-broken-pipe-on-ssh-connection
there are of course better ways:

send a keepalive message to the server

This is basically the same idea as with the while loop: every 60 seconds or so, a keepalive message is sent to the server, so the connection is not dropped as idle.

edit your
~/.ssh/config
with:

Host *
ServerAliveInterval 60 



use the screen shell for long-running tasks


For tasks running longer than a couple of minutes (in my case it can be days), it doesn't really make much sense to keep a stable ssh connection and your home computer up and running.

A better solution is to start the task in a way that you can detach from it for now and get back to it later, while it keeps running on its own. Probably the best way to deal with this is screen, although it takes some time to get used to it.


Screen is a powerful utility that allows you to control multiple terminals which will stay alive independently of the ssh session.

While logged in via ssh, start screen


screen

You are now in a new screen shell. Start your command. For me it's something like this:



python ../bwa_parallel_arrays/bwa_parallel.py -P tmp -I . -t 4 -e error -o out -n 10 > screen_out.txt &

Detach the screen session and get back to your previous terminal, either by pressing Ctrl+a d inside the session or with:


screen -d



Detach the screen session (disconnect it from the terminal and put it into the background). A detached screen can be resumed by invoking screen with the -r option.

screen and your task keep running on the host you logged on to, as can be seen with the top command:




PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
55557 ries 20 0 3948712 3.686g 6392 R 5.9 1.5 0:41.99 python
32063 ries 20 0 27036 3524 2284 R 1.3 0.0 0:00.23 top
9023 ries 20 0 28976 3076 2440 S 0.0 0.0 0:00.00 screen


reattach your screen session with


screen -r

You can also check the status of your (various) screen sessions with
screen -list

There is a screen on:
9023.pts-8.waldorf (Detached)
1 Socket in /var/run/screen/S-ries.


using 'disown'


Out of convenience, I tend to just start tasks in the background with '&' and detach them from the shell with 'disown', so they don't get killed when the shell gets killed.

This is very easy, and can be done even after starting the task: suspend it with Ctrl+Z, resume it in the background with 'bg', and detach it with 'disown'. For me this somehow doesn't work well with ssh, though: when the pipe breaks, the task breaks.


using 'nohup' [update]

Better than using 'disown' is to use 'nohup' (no hangup).
Just prepend it to your command, and the command will not be terminated when the terminal gets closed.
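
A minimal sketch (long_running_command is just a placeholder; 2>&1 also captures error output, and the trailing & puts the job in the background):

nohup long_running_command > nohup_out.txt 2>&1 &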




Tuesday, 5 January 2016

GitHub: pushing local changes to the remote version (updated)

After fumbling around for a while, I finally got a grip on how to use GitHub.
If you're as slow a learner as me, it might take you a couple of days to really get used to it, but it's also a skill well worth learning.

For now, I'm just using GitHub for remote access and version control for my scripts.

I assume you already have a GitHub project initiated.

The best in-depth explanation I could find:
https://www.atlassian.com/git/tutorials/comparing-workflows/centralized-workflow

 

Task:

Edit some files of the program locally, then merge them with the remote version.


Get a local copy of your own repo
git clone https://github.com/davidries84/bwa_parallel_arrays.git

Edit some files.

If it is your own repo in which you changed files and you just want to update the master:

git commit -am "change time tracking"
-a means all changes to files already tracked by git will be included in the commit
-m is used to describe the changes you made in a short message

By 'pushing' the changes up to the remote version of your code, you update that remote version:
git push -u origin master




If you added some files, instead of only editing already existing files:
git add .
git commit -am "adding test files"
git push -u origin master



Editing the 'master' should only be done for fail-safe changes (like comments). The master should always be fully functional code. So for implementing new features or bugfixes, you should use 'branches', which get merged into the master when the new code is ready and finished.


In your local directory, create a new branch for your intended changes.
By creating a new branch with -b, you automatically switch (check out) to that branch.

git checkout -b update_Readme


Make your changes, e.g.
gedit README.md

"commit" your changes:

 from the man "git commit" pages:
Stores the current contents of the index in a new commit along with a log message from the user describing the changes.

The content to be added can be specified in several ways:
...
4. by using the -a switch with the commit command to automatically
"add" changes from all known files (i.e. all files that are already
listed in the index) and to automatically "rm" files in the index
that have been removed from the working tree, and then perform the
actual commit;

So we commit everything (-a) and add a message describing the change (-m)
git commit -am "added some text to readme"
To update the remote repository, we have to "push" the local branch "update_Readme" to the origin.
git push: "Updates remote refs using local refs, while sending objects necessary to complete the given refs."

 git push origin update_Readme 


Afterwards, we can check out the master again.
git checkout master 

After accepting your own changes (using the website), it makes sense to sync your local master with your remote master, thus bringing it up to date.
git-fetch - Download objects and refs from another repository

git fetch upstream
git-rebase - Forward-port local commits to the updated upstream head
git rebase upstream/master
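
Putting the whole branch workflow together (the same commands as above; whether your remote is called origin or upstream depends on your setup):

git checkout -b update_Readme
# edit files ...
git commit -am "added some text to readme"
git push origin update_Readme
git checkout master
# after merging the branch on the website:
git fetch upstream
git rebase upstream/master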


Sunday, 3 January 2016

using pdb from within IPython

Start your script from within IPython:
%run yourscript.py
While debugging exceptions, I prefer this over embedding IPython, because the shell stops right after the exception is thrown.

When the exception is thrown, you get a stack trace showing the exception.
Now enter the Python debugger from within IPython:
%debug
You can now use all the pdb features (pdb commands) as well as ipython functionality (think tab completion).

Walk up the stack trace until you get to your faulty code
u

edit your code in your favourite editor (outside ipython)

reload the changed code
%load_ext autoreload
%autoreload 2
Test the changed code, by rerunning the changed function or script.

That's how it should work. Somehow it doesn't for me. The same error gets thrown, although ipython shows the corrected code in the stack trace.
After restarting ipython it works fine ...