Bioinformatics Recipe Cookbook

Recipe Description

Bacterial variant caller

This recipe contains the code presented by Torsten Seemann in the blog post titled A Unix one-liner to call bacterial variants

The code from the above website has been generalized a bit with additional utility.

This recipe provides code that:

  1. Downloads a reference genome (default value: NZ_CP008918 Pasteurella multocida strain ATCC 43137, complete genome)
  2. Downloads a sequencing run from SRA (default value: SRR4124989)
  3. Generates statistics on the sequencing data
  4. Runs minimap2 to align the sequences and create a SAM output
  5. Runs bcftools mpileup to generate the genotype likelihoods of each base
  6. Runs bcftools call to filter for multiallelic variants only
  7. Runs bcftools norm to normalize each variant to a standard form
  8. Runs bcftools filter to remove variants with low quality (QUAL) or low coverage (DP)

Recipe Code | Recipe Description

Download Recipe

# Stop on errors.
set -uexo pipefail

# The reference genome.
ACC=NZ_CP008918

# Select sequencing run.
SRR=SRR4124989

# The directory that stores the reference sequence.
mkdir -p ref

# The file that stores the reference genome.
REF=ref/${ACC}.fa

# Download the reference accession number.
efetch -db nuccore -id $ACC -format fasta > $REF

# The directory that stores the sequencing reads.
mkdir -p reads

# Download the sequencing data.
fastq-dump --origfmt  --split-files -O reads $SRR

# Optional step. Generate a report on the data size
echo "*** Sequence statistics"
seqkit stat reads/${SRR}_?.fastq
echo "*** "
#
# The one-liner code comes from:
#
# https://thegenomefactory.blogspot.com/2018/10/a-unix-one-liner-to-call-bacterial.html
#

# The number of CPUs to use.
CPUS=4

# The names for the read pairs.
R1=reads/${SRR}_1.fastq
R2=reads/${SRR}_2.fastq

# Unix trick: A single command line that is too long may be broken
# into several lines as long as each ends with a \ symbol.
#
# Call the variants.
minimap2 -a -x sr -t $CPUS $REF $R1 $R2 \
 | samtools sort -l 0 --threads $CPUS \
 | bcftools mpileup -Ou -B --min-MQ 60 -f $REF - \
 | bcftools call -Ou -v -m - \
 | bcftools norm -Ou -f $REF -d all - \
 | bcftools filter -Ov -e 'QUAL<40 || DP<10 || GT!="1/1"' \
 > variants.vcf

# Generate a statistics on variants.
bcftools stats variants.vcf > variants-stats.txt

# Plot the variant statistics.
plot-vcfstats -P -p plots variants-stats.txt

Powered by the release 1.3