Bioinformatics Recipe Cookbook

Recipe Description

Your mission, should you choose to accept it, is to reproduce a Science paper in just 60 seconds.

This recipe is based on a fun diversion that I show during lectures. The goal of the recipe is to demonstrate that, if researchers were properly documenting their work, it would be possible to reproduce their findings with minimal effort and sometimes under a minute.

This recipe reproduces the main results published in the work:

On a properly set up system, and depending on bandwidth, the recipes completes under a minute. This includes getting the data, references, aligning and calling variants! The results even include annotations of the variant effects relative to the features on the genome.

The best way to enjoy this analysis is to have the following video playing while the analyis is running:

You win if you can finish running the analysis before the music stops.

Note how this theme song makes even the otherwise bland analysis feel super exciting.

Lectures

* coming soon

Recipe Code | Recipe Description

Download Recipe

# Stop on any error. Show the commands.
set -uex

# NCBI Bioproject ID.
ID=PRJNA257197

# The number of samples to process.
N=5

# NCBI accession number for the reference genome.
ACC=KJ660346

# The reference data in GENBANK format.
GENBANK=refs/${ACC}.gb

# The reference data in FASTA format.
FASTA=refs/${ACC}.fa

# The reference data in GFF format.
GFF=refs/${ACC}.gff

# The directory that stores the reference files.
mkdir -p refs

# Get the reference sequence in GenBank format.
efetch -db=nuccore -format=gb -id=$ACC > $GENBANK

# Reformat GenBank to FASTA.
cat $GENBANK | readseq -p -f fasta > $FASTA 2>> log.txt

# Reformat GenBank to GFF. Keep only lines that match CDS.
cat $GENBANK | readseq -p -f gff 2>> log.txt | grep CDS > $GFF

# Index the FASTA file for bwa.
bwa index $FASTA 2>> log.txt

# Index the FASTA file for IGV.
samtools faidx $FASTA >> log.txt

# Obtain the SRR run number for the data deposited into the BioProject
esearch -db sra -query $ID | efetch -format runinfo > runinfo.csv

# Keep the top N SRR numbers for now.
cat runinfo.csv  | cut -f 1 -d , | grep SRR | head -$N  > samples.txt

# Make a directory for the sequencing data.
mkdir -p reads

# Limit each dataset to these many reads.
LIMIT=100000

# Download the sequencing data in parallel.
cat samples.txt | parallel "fastq-dump -X $LIMIT -O reads --split-files {} >> log.txt"

# Generate alignments
mkdir -p bam

# Run the bwa mem processes in parallel.
# Put an echo in front of the command to understand what it does.
cat samples.txt | parallel "bwa mem $FASTA reads/{}_1.fastq reads/{}_2.fastq 2>> log.txt" '|' "samtools sort" '>' "bam/{}.bam 2>> log.txt"

# Unix trick: once your commands become more advances redirecting inputs can be
# a bit challenging. Here we want to run samples in parallel, redirect each parallel process into
# its own file while collecting the overall messages into the log.txt file.
#
# Note the use of double and single quotes above to "protect" certain meaning as it is passed down into
# the shell. It always takes me a few minutes to remember how to use these. For debugging put an echo in front of each command
# and troubleshoot that way.

# Index the BAM files.
cat samples.txt | parallel "samtools index bam/{}.bam 2>> log.txt"

# Call variants on all BAM files in parallel.
bcftools mpileup -Ou -f $FASTA  bam/*.bam 2>> log.txt | bcftools call --ploidy 1 -vm -Ou |  bcftools norm -Ov -f $FASTA -d all - > all_variants.vcf

# Annotate the variants with SNPeff.
cat all_variants.vcf | snpEff ebola_zaire -s all_effects_report.html > all_variants_annotated.vcf

Powered by the release 1.3