Recipe View

Half-sequence and half mythical-beast, "unaligned" BAM files are used to store FASTQ files.

0 results • updated 3.6 years ago by Istvan Albert

Half-sequence and half mythical-beast, "unaligned" BAM files are used to store FASTQ files.

SAM/BAM are alignment formats, thus it feels quite anachronistic to use them to store "unaligned" sequences. On the other hand BAM files have quite a few advantages over FASTQ:

Are compressed,
Line oriented (all information on the sequence is on a single line),
Can store sample information via tags,
There are many tools that can operate on BAM files (extract by tags, filter by tags, etc)

In addition, the BAM format also stores not just the alignments but the entire reads sequences. To take advantage of the previously listed features, some bioinformaticians began storing their original, raw data in a so-called unaligned BAM. Now we have seen it all, unaligned reads stored in an alignment format.

This recipe demonstrates code that:

Accesses an NCBI BioProject with a given accession number.
Downloads 5 sequencing runs from the project.
Transforms each downloaded FASTQ file into a BAM file while tagging the reads from that file with the SRR number tag.
Merges the resulting BAM files into a single BAM that now contains all sequencing data for the project in just one file.
Finally the recipe demonstrates the code needed to revert the process of extracting the original data from an unaligned BAM

Copy recipe

You need write access to the project to edit.

# Stop script on error.
set -uex

# The SRR BioProject number for the sequencing data.
PROJECT=PRJNA257197

# The number of datasets to subselect from the project.
N=5

# Get the project run information.
esearch -db sra -query $PROJECT  | efetch -format runinfo > runinfo.txt

# Select the first N elements. Keep only valid SRR numbers.
cat runinfo.txt | cut -f 1 -d , | grep SRR | head -$N > selected.txt

# Store the data in the reads folder.
mkdir -p reads

# Download the SRR data for each
cat selected.txt | parallel fastq-dump -O reads -X 1000 --split-files {}

# Create a directory for bam files
mkdir -p bam

# Generate a separate BAM file for each SAMPLE.
cat selected.txt | parallel "picard FastqToSam F1=reads/{}_1.fastq F2=reads/{}_1.fastq O=bam/{}.bam  RG=GROUP-{} LB=LIB-{} SM=SAMPLE_{} QUIET=true 2>> log.txt"

# Merge all the BAM files into one.
samtools merge -f all.bam bam/*.bam

# Investigate the readgroups in the header.
echo ""
echo "SAM file header:"
samtools view -H all.bam

echo ""
echo "Number of alignments with read group: GROUP-SRR1972919"
samtools view -c -r GROUP-SRR1972919 all.bam

# Reverting the process is to extract reads, tagged with readgroups to paired files.
samtools fastq -t -1 all1.fq -2 all2.fq all.bam

# To convert just one specific read group.
samtools view -r GROUP-SRR1972919 all.bam | samtools fastq -t -1 all_SRR1972919_1.fq -2 all_SRR1972919_2fq -

You need write access to the original recipe to edit.

Click the buttons on the right to create new fields.

Add text field Add float field Add data field Add checkbox Add dropdown Add upload field Add integer field Add radio button

Name

Recipe display name

Identifier

Unique identifier for the recipe.

Image :

Optional image for the recipe ( 500px Maximum ).

Rank:

Used to order recipes (optional).

Half-sequence and half mythical-beast, "unaligned" BAM files are used to store FASTQ files.

SAM/BAM are alignment formats, thus it feels quite anachronistic to use them to store "unaligned" sequences. On the other hand BAM files have quite a few advantages over FASTQ:

1. Are compressed, 
1. Line oriented (all information on the sequence is on a single line), 
1. Can store sample information via tags, 
1. There are many tools that can operate on BAM files (extract by tags, filter by tags, etc)

In addition, the BAM format also stores not just the alignments but the entire reads sequences. To take advantage of the previously listed features, some bioinformaticians began storing their original, raw data in a so-called  *unaligned* BAM. Now we have seen it all, *unaligned* reads stored in an *alignment format*.

This recipe demonstrates code that:

1. Accesses an NCBI BioProject with a given accession number.
2. Downloads 5 sequencing runs from the project.
3. Transforms each downloaded FASTQ file into a BAM file while tagging the reads from that file with the SRR number tag.
4. Merges the resulting BAM files into a single BAM that now contains all sequencing data for the project in just one file.
5. Finally the recipe demonstrates the code needed to revert the process of extracting the original data from an unaligned BAM

Insert Image

From the web

From your computer

Cancel

Back

Unaligned BAM files

Insert Image