Recipe View

Mission Impossible: Multi Sample Variant Calling in 60 seconds

Your mission, should you choose to accept it, is to reproduce major findings of a Science paper in …

1 result • updated 3.7 years ago by Istvan Albert

Your mission, should you choose to accept it, is to reproduce major findings of a Science paper in just 60 seconds.

This recipe is based on a fun diversion that I show during lectures. The goal of the recipe is to demonstrate that, when properly documented it is possible to reproduce major scientific findings with relative ease and perhaps under a minute.

This recipe reproduces the main results published in the work:

Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak published in 2014 in the Science research journal.

On a properly set up system, and depending on bandwidth, the recipe completes under a minute. This includes getting the data, references, aligning and calling variants! The results even include annotations of the variant effects relative to the features on the genome.

The best way to enjoy this analysis is to have the following video playing while the analysis is running:

https://www.youtube.com/watch?v=yLo9k8Gifb0

Note how this theme song makes even the otherwise bland analysis feel super exciting.

Copy recipe

You need write access to the project to edit.

# Stop on any error. Show the commands.
set -uex

# NCBI Bioproject ID.
ID=PRJNA257197

# The number of samples to process.
N=5

# NCBI accession number for the reference genome.
ACC=KJ660346

# The reference data in GENBANK format.
GENBANK=refs/${ACC}.gb

# The reference data in FASTA format.
FASTA=refs/${ACC}.fa

# The reference data in GFF format.
GFF=refs/${ACC}.gff

# The directory that stores the reference files.
mkdir -p refs

# Get the reference sequence in GenBank format.
efetch -db=nuccore -format=gb -id=$ACC > $GENBANK

# Reformat GenBank to FASTA.
cat $GENBANK | readseq -p -f fasta > $FASTA 2>> log.txt

# Reformat GenBank to GFF. Keep only lines that match CDS.
cat $GENBANK | readseq -p -f gff 2>> log.txt | grep CDS > $GFF

# Index the FASTA file for bwa.
bwa index $FASTA 2>> log.txt

# Index the FASTA file for IGV.
samtools faidx $FASTA >> log.txt

# Obtain the SRR run number for the data deposited into the BioProject
esearch -db sra -query $ID | efetch -format runinfo > runinfo.csv

# Keep the top N SRR numbers for now.
cat runinfo.csv  | cut -f 1 -d , | grep SRR | head -$N  > samples.txt

# Make a directory for the sequencing data.
mkdir -p reads

# Limit each dataset to these many reads.
LIMIT=100000

# Download the sequencing data in parallel.
cat samples.txt | parallel "fastq-dump -X $LIMIT -O reads --split-files {} >> log.txt"

# Generate alignments
mkdir -p bam

# Run the bwa mem processes in parallel.
# Put an echo in front of the command to understand what it does.
cat samples.txt | parallel "bwa mem $FASTA reads/{}_1.fastq reads/{}_2.fastq 2>> log.txt | samtools sort > bam/{}.bam 2>> log.txt"

# Unix trick: once your commands become more advances redirecting inputs can be
# a bit challenging. Here we want to run samples in parallel, redirect each parallel process into
# its own file while collecting the overall messages into the log.txt file.

# Index the BAM files.
cat samples.txt | parallel "samtools index bam/{}.bam 2>> log.txt"

# Call variants on all BAM files in parallel.
bcftools mpileup -Ou -f $FASTA  bam/*.bam 2>> log.txt | bcftools call --ploidy 1 -vm -Ou |  bcftools norm -Ov -f $FASTA -d all - > all_variants.vcf

# Annotate the variants with SNPeff.
cat all_variants.vcf | snpEff ebola_zaire -s all_effects_report.html > all_variants_annotated.vcf

You need write access to the original recipe to edit.

Click the buttons on the right to create new fields.

Add text field Add float field Add data field Add checkbox Add dropdown Add upload field Add integer field Add radio button

Name

Recipe display name

Identifier

Unique identifier for the recipe.

Image :

Optional image for the recipe ( 500px Maximum ).

Rank:

Used to order recipes (optional).

Your mission, should you choose to accept it, is to reproduce major findings of a Science paper in just 60 seconds.

This recipe is based on a fun diversion that  I show during lectures.  The goal of the recipe is to demonstrate that, when properly documented it is possible to reproduce major scientific findings with relative ease and perhaps under a minute.

This recipe reproduces the main results  published in the work:

* [Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak][pub] published in 2014 in the Science research journal.

On a properly set up system, and depending on bandwidth, the recipe completes under a minute. This includes getting the data, references, aligning and calling variants! The results even include annotations of the variant effects relative to the features on the genome.

The best way to enjoy this analysis is to have the following video playing while the analysis is running:

https://www.youtube.com/watch?v=yLo9k8Gifb0

Note how this theme song makes even the otherwise bland analysis feel super exciting.

[pub]: http://science.sciencemag.org/content/early/2014/08/27/science.1259657.full

Insert Image

From the web

From your computer

Cancel

Back