Bioinformatics Recipe Cookbook

Recipe Description

The recipe demonstrates the use of different short read aligners.

There substantially many more differences between aligners beyond their "performance" or so-called "accuracy". Equally important is how we can customize the alignments and what additional information is produced in each of the alignments.

The recipe generates SAM alignment files with different algorithms bwa, bowtie2 in both single end and paired-end mode.

  1. Downloads the 1976 Ebola - Mayinga Reference Genome
  2. Downloads paired-end sequencing data for the 2014 outbreak
  3. Generates single-end sequence alignments with bwa and bowtie2
  4. Generates paired-end sequence alignments with bwa and bowtie2
  5. Runs samtools flagstat on each of the SAM files.

Readers are invited to investiage the SAM files to identify the differences between aligner outputs.

Lecture

A detailed presentation that explains the steps and rationale for this recipe can be found at:

Please refer to the lecture above for study materials and additional content.

Recipe Code | Recipe Description

Download Recipe

#
# In this script some of the tools are overly verbose and print lots of useless information
# as they run. This can be quite annoying. In those cases we redirect and append the stdout and 
# stderr into the log.txt file  using:  1>>log.txt 2>>log.txt
# We keep the output in case we need to troubleshoot.
#

# Stop on errors.
set -uex

# UNIX trick: Adding the -x flag will also show the commands as these are executed.
# This can help you track down the errors as these happen.

# There will be many ancillary files.
# Better keep them out of sight.
mkdir -p db

#
# When run from command line use the second lines to set the parameters.
#
# Reference genome accession number.
ACC=AF086833

# The SRR number for the sequencing data.
SRR=SRR1972739

# This will be the reference genome we align against.
REF=db/${ACC}.fa

# It is best to build all variables at the top ot the program.
# That way you see how these are constructed.
READ1=${SRR}_1.fastq
READ2=${SRR}_2.fastq

# Get the reference genome. AF086833  - Mayinga Ebola strain from 1976
efetch -db=nuccore -format=fasta -id=${ACC} > $REF

# Obtain the sequencing data from SRA.
fastq-dump -X 10000 --split-files $SRR 1>> log.txt 2>> log.txt

# Build a bowtie2 index. This only needs to be done once per reference file.
bowtie2-build $REF $REF 1>> log.txt 2>> log.txt

# Build a bwa index. This only needs to be done once per reference file.
bwa index $REF  1>> log.txt 2>> log.txt

# Run the bowtie2 aligner in single end mode.
bowtie2 -x $REF -U $READ1 > bowtie2.SE.sam 2>> log.txt
samtools index bowtie2.SE.sam

# Run the bwa aligner in single end mode.
bwa mem $REF $READ1 > bwa.SE.sam 2>> log.txt
samtools index bwa.SE.sam

# The flagstat tool produces a report on the alignment file..
echo "*** BOWTIE2 SE flagstat:"
samtools flagstat bowtie2.SE.sam

echo "*** BWA SE flagstat:"
samtools flagstat bwa.SE.sam

# Run the bowtie2 aligner in paired end mode.
bowtie2 -x $REF -1 $READ1 -2 $READ2 > bowtie2.PE.sam 2>> log.txt
samtools index bowtie2.PE.sam

# Run the bowtie2 aligner in paired end mode.
bwa mem $REF $READ1 $READ2 > bwa.PE.sam 2>> log.txt
samtools index bwa.PE.sam

echo "*** BOWTIE2 PE flagstat:"
samtools flagstat bowtie2.PE.sam

echo "*** BWA PE flagstat:"
samtools flagstat bwa.PE.sam

Powered by the release 1.4