Recipe View

Visualizing large scale genomic variations.

1 result • updated 5.6 years ago by Istvan Albert

Visualizing large scale genomic variations.

We call a variation large-scale when the changes do not fit into the sequence portion of a short read.

Paired-end read sequencing is then the methodology that allows us to identify where the reorganization took place. Pairs in the unexpected orientation, template sizes of unexpected lengths together can provide the necessary guidance to identify the "reorganization" junction points relative to the reference genome.

This recipe provides code that:

Creates a modified genome based on the reference.
Simulates sequencing reads from this modified genome
Aligns the simulated reads against the reference
Allow you to visualize the results in IGV

Lecture

A detailed presentation that explains the steps and rationale for this recipe can be found at:

Lecture for Recipe 5: Visualizing Large-Scale Genomic Variation

Please refer to the lecture above for links to the chapters that cover each concept.

Copy recipe

You need write access to the project to edit.

# Stop on errors.
set -uexo pipefail

# UNIX trick: Adding the -x flag will also show the commands as these are executed.
# This can help you track down the errors as these happen.

# There will be many ancillary files.
# Better keep them out of sight.
mkdir -p db

# The Genome Accesion Number
ACC=AF086833

# This will be the name of the file that holds the reference genome that we align against.
REF=db/${ACC}.fa

# It is best to specify all variables at the top ot the program.
# That way it is easier to see their usage.

# The read pairs that will hold the simulation results.
READ1=pair1.fq
READ2=pair2.fq

# Fetch the reference genome from NCBI.
efetch -db=nuccore -format=fasta -id=${ACC} > $REF

# Index the reference genome. This needs to be done only once.
bwa index $REF 2>> log.txt

# Create annother index file for the reference 
#so that we can load it up with IGV
samtools faidx $REF

# Unix trick: bash functions allow repeating batch a commands repeatedly.
# This function construct below will run when called by its name.
# The first parameter will be assigned to $1 and so on.
run_experiment()
{
    # This will be the first parameter assigned to the function.
    BAM=$1
  
    # Run the simulation using the GENOME.fa file.
    wgsim -e 0 -r 0 -R 0 -X 0 -N 2500 GENOME.fa ${READ1} ${READ2} >> simulation.txt 2>> log.txt
  
    # Align the simulated reads against the genome.
    bwa mem ${REF} ${READ1} ${READ2} 2>> log.txt | samtools sort > ${BAM}
  
    # Index the resulting BAM file.
    samtools index ${BAM}
}

# Experiment 1. Take the genome to be exactly the reference.
cp $REF GENOME.fa
run_experiment align1.bam

# Experiment 2. Large deletion. Deletion applied from 2000-3000
cat $REF | seqret --filter -sbegin 1 -send 2000 > part1
cat $REF | seqret --filter -sbegin 3000 > part2
cat part1 part2 | union -filter  > GENOME.fa
run_experiment align2.bam

# Experiment 3. Copy number variation. The first 2000 bases are present three times.
cat $REF | seqret --filter -sbegin 1 -send 2000 > part1
cat part1 part1 $REF | union -filter  > GENOME.fa
run_experiment align3.bam

# Experiment 4. Swap the genome. First 5000 bp  go last.
cat $REF | seqret --filter -send 5000 > part1
cat $REF  | seqret --filter -sbegin 5000 > part2
cat part2 part1 | union -filter  > GENOME.fa
run_experiment align4.bam

# Experiment 5. Reverse complement a region of the genome (1000 to 2000).
cat $REF | seqret --filter -sbegin 1 -send 1000 > part1
cat $REF | seqret --filter -sbegin 1000 -send 2000 -sreverse1 > part2
cat $REF | seqret --filter -sbegin 2000 > part3
cat part1 part2 part3 | union -filter  > GENOME.fa
run_experiment align5.bam

# Experiment 6. Ten unknown mutations, both point and block mutations.
# How many can your identify?
cat $REF | msbar -filter -count 10 -point 1 -block 1 -minimum 10 -maximum 500 > GENOME.fa
run_experiment align6.bam

# Remove temporary files
rm -f part?

You need write access to the original recipe to edit.

Click the buttons on the right to create new fields.

Add text field Add float field Add data field Add checkbox Add dropdown Add upload field Add integer field Add radio button

Edit the content of each interface element.

[acc]
label = "Genome Accession Number"
display = "TEXTBOX"
value = "AF086833"
help = "Must be an NCBI accession number"

[settings]
name = "Large Scale Genomic Variation"
template = "Large_Scale_Genomic_Variation_cookbook_34.sh"
image = "Large_Scale_Genomic_Variation_cookbook_34.png"
id = 34
recipe_uid = "largevariants"
uid = "largevariants"
help = "Visualizing large scale genomic variations.\n\nWe call a variation large-scale when the changes do not fit into the sequence portion of a short read. \n\nPaired-end read sequencing is then the methodology that allows us to identify where the reorganization took place. Pairs in the unexpected orientation, template sizes of unexpected lengths together can provide the necessary guidance to identify the \"reorganization\" junction points relative to the reference genome.\n\nThis recipe provides code that:\n\n1. Creates a modified genome based on the reference.\n1. Simulates sequencing reads from this modified genome\n1. Aligns the simulated reads against the reference\n1. Allow you to visualize the results in IGV\n\n#### Lecture\n\nA detailed presentation that explains the steps and rationale for this recipe can be found at:\n\n* [Lecture for Recipe 5: Visualizing Large-Scale Genomic Variation][slides]\n\nPlease refer to the lecture above for links to the chapters that cover each concept.\n\n[slides]: https://www.biostarhandbook.com/edu/lecture/view/45/"
url = "http://localhost8000"

You need write access to the original recipe to edit.

Name

Recipe display name

Identifier

Unique identifier for the recipe.

Image :

Optional image for the recipe ( 500px Maximum ).

Rank:

Used to order recipes (optional).

Visualizing large scale genomic variations.

We call a variation large-scale when the changes do not fit into the sequence portion of a short read.

Paired-end read sequencing is then the methodology that allows us to identify where the reorganization took place. Pairs in the unexpected orientation, template sizes of unexpected lengths together can provide the necessary guidance to identify the "reorganization" junction points relative to the reference genome.

This recipe provides code that:

1. Creates a modified genome based on the reference.
1. Simulates sequencing reads from this modified genome
1. Aligns the simulated reads against the reference
1. Allow you to visualize the results in IGV

#### Lecture

A detailed presentation that explains the steps and rationale for this recipe can be found at:

* [Lecture for Recipe 5: Visualizing Large-Scale Genomic Variation][slides]

Please refer to the lecture above for links to the chapters that cover each concept.

[slides]: https://www.biostarhandbook.com/edu/lecture/view/45/

Insert Image

From the web

From your computer

Cancel

Back