Visualizing large scale genomic variation

Visualizing large scale genomic variations in short read data.

1 result • updated 3.7 years ago by Istvan Albert

A large-scale variation (over 50 bp) typically does not fit within the sequence of a single short read.

Paired-end sequencing is a methodology that allows us to identify where the reorganization took place. Pairs in an unexpected orientation, together with template sizes of unexpected length, can provide the guidance needed to identify the "reorganization" junction points relative to the reference genome.
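To make these signals concrete, here is a minimal sketch that labels each alignment by the pairing evidence it carries. It assumes the standard SAM columns (FLAG in column 2, TLEN in column 9). The function name `classify_pairs` and the 1000 bp cutoff are illustrative choices and not part of the recipe.

```shell
# Label SAM records by pairing evidence (a sketch; the cutoff is an assumption).
# Reads SAM lines on stdin and prints: read name, label, absolute template length.
classify_pairs() {
    # $1 = largest template length still treated as normal (hypothetical default: 1000).
    awk -v max="${1:-1000}" -v OFS='\t' '
        /^@/ { next }                          # skip SAM header lines
        {
            flag = $2 + 0; raw = $9 + 0
            tlen = (raw < 0) ? -raw : raw
            paired   = flag % 2                # 0x1:  read is part of a pair
            unmapped = int(flag / 4) % 2       # 0x4:  read unmapped
            munmap   = int(flag / 8) % 2       # 0x8:  mate unmapped
            rev      = int(flag / 16) % 2      # 0x10: read on the reverse strand
            mrev     = int(flag / 32) % 2      # 0x20: mate on the reverse strand
            if (!paired)                  label = "unpaired"
            else if (unmapped || munmap)  label = "unmapped"
            else if (rev == mrev)         label = "same_strand"
            # A forward read should be leftmost (TLEN > 0), a reverse read rightmost.
            else if ((rev == 0 && raw < 0) || (rev == 1 && raw > 0)) label = "everted"
            else if (tlen > max)          label = "long_template"
            else                          label = "expected"
            print $1, label, tlen
        }'
}
```

For example, `samtools view results.bam | classify_pairs 1000 | cut -f 2 | sort | uniq -c` would tally the labels for the BAM file this recipe produces. Same-strand pairs tend to point at inversions, long templates at deletions, and everted pairs at duplications or relocated segments.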

This recipe provides code that:

1. Creates a modified genome based on the reference.
2. Simulates sequencing reads from this modified genome.
3. Aligns the simulated reads against the reference.
4. Allows you to visualize the results in IGV.

How to edit the genome

You may manually edit the file specified as $GENOME to modify certain parts. You may also do so programmatically with commands like:

Large deletion. The bases between positions 2000 and 3000 are removed.

 cat $REF | seqret --filter -sbegin 1 -send 2000 > part1
 cat $REF | seqret --filter -sbegin 3000 > part2
 cat part1 part2 | union -filter  > $GENOME

Copy number variation. The first 2000 bases are present three times.

  cat $REF | seqret --filter -sbegin 1 -send 2000 > part1
  cat part1 part1 $REF | union -filter  > $GENOME

Swap regions in the genome. The first 5000 bp are moved to the end.

 cat $REF | seqret --filter -send 5000 > part1
 cat $REF | seqret --filter -sbegin 5001 > part2
 cat part2 part1 | union -filter  > $GENOME

Reverse complement a region of the genome (1000 to 2000).

 cat $REF | seqret --filter -sbegin 1 -send 999 > part1
 cat $REF | seqret --filter -sbegin 1000 -send 2000 -sreverse1 > part2
 cat $REF | seqret --filter -sbegin 2001 > part3
 cat part1 part2 part3 | union -filter  > $GENOME
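After any of these edits it can help to confirm that the genome changed size the way you intended: a deletion should shorten it, the copy number example should lengthen it by 4000 bases (two extra copies of 2000), and a swap or an inversion should leave the length unchanged. The sketch below uses only standard Unix tools. The helper names `fasta_len` and `len_diff` are made up for illustration.

```shell
# Print the number of sequence characters in a FASTA file (header lines excluded).
fasta_len() {
    grep -v '^>' "$1" | tr -d ' \n\r' | wc -c | tr -d ' '
}

# Print how many bases the second FASTA file gained (+) or lost (-) versus the first.
len_diff() {
    echo $(( $(fasta_len "$2") - $(fasta_len "$1") ))
}
```

Running `len_diff $REF $GENOME` after an edit prints the size change. An off-by-one in the `-sbegin` or `-send` coordinates shows up here as an unexpected `1` or `-1`.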

For more information see:

Biostar Handbook: https://www.biostarhandbook.com/
Applied Bioinformatics course: https://www.biostarhandbook.com/edu/course/6/

# Stop on errors.
set -uex

# Reference genome accession number.
ACC=AF086833

# The reference genome stored locally.
REF=refs/$ACC.fa

# The "real" genome that we will simulate reads from.
GENOME=genome.fa

# How many read pairs to simulate.
N=1000

# Output bam file.
BAM=results.bam

# The directory that stores the reference.
mkdir -p refs

# Delete the log file if it exists.
rm -f log.txt

# If the reference file does not exist.
if [ ! -f $REF ]; then
	# Get the reference genome in FASTA format.
	efetch -db nuccore -format fasta -id $ACC > $REF

	# Build the bwa index for the reference genome.
	bwa index $REF 2>> log.txt

	# Build the IGV index for the reference genome.
	samtools faidx $REF
fi

# On the first run only, copy the reference to serve as the starting genome.
if [ ! -f $GENOME ]; then
	cp $REF $GENOME
fi

# Between runs, edit the genome to introduce changes, then rerun this recipe
# and visualize the resulting BAM file in IGV.

# File names for the simulated read pairs.
R1=read1.fq
R2=read2.fq

# Simulate reads from the genome.
# No sequencing errors. Don't mutate the genome.
wgsim -N $N -e 0 -r 0 -R 0 $GENOME $R1 $R2

# Run the bwa aligner and sort the output to create a BAM file.
bwa mem $REF $R1 $R2 | samtools sort > $BAM

# Index the BAM file.
samtools index $BAM
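Whether the junctions are visible in IGV depends on how many pairs cover them, so it can be useful to estimate the average coverage that a given N produces. This is a rough sketch: the helper name `coverage` is hypothetical, and the read length has to be supplied by you (wgsim, as far as I know, defaults to 70 bp per read, but check its usage message for your version).

```shell
# Expected average coverage = read pairs * 2 reads per pair * read length / genome length.
# Arguments: number of read pairs, read length in bp, genome length in bp.
coverage() {
    awk -v n="$1" -v len="$2" -v size="$3" \
        'BEGIN { printf "%.1f\n", n * 2 * len / size }'
}
```

For a single-sequence reference indexed with `samtools faidx`, the genome length is the second column of the index file, so `coverage $N 70 $(cut -f 2 $REF.fai)` would report the estimate for this recipe. If the number is low, increasing N gives more pairs across each junction.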
