Friday 14 October 2011

Overview of paired end data



Summary:

The two half runs have slightly lower quality scores than the 1/8th
run (see plots), but their median read lengths are about 100 bp
longer. Reads from all three sets map equivalently to assembly 2.0.
The distribution of insert sizes for all three sets is NEARLY
IDENTICAL.



Details:

Number of reads / bases from the SFF files:

F2K9UYJ04: 102,441 / 26,134,142
F3R7RWZ01: 605,101 / 200,205,241
F3R7RWZ02: 582,634 / 188,925,518

Total: 1,290,176 / 415,264,901


Percent of reads < 200 bp...

F2K9UYJ04: 29%
F3R7RWZ01: 14%
F3R7RWZ02: 15%


Quality plots attached (as before); see quality.F????????-n.jpg. I'm
still not entirely sure what to read from the quality plots. However,
it's clear that the quality of the two half runs is slightly lower
than the quality of the 1/8th run, but the reads in these two runs
are longer.


Here is a breakdown of the overall read lengths:

Run        Min.  1st Qu.  Median   Mean  3rd Qu.    Max.  N50
F2K9UYJ04  40.0    190.0   252.0  255.1    321.0  1014.0  289
F3R7RWZ01  40.0    266.0   351.0  330.9    411.0  2026.0  380
F3R7RWZ02  40.0    257.0   345.0  324.3    406.0  1103.0  376
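
For reference, here is a minimal Python sketch of how an N50 and the
"percent of reads < 200 bp" figure can be computed from a list of read
lengths. It assumes the usual definition of N50 (the length L such
that reads of length >= L hold at least half of all bases); I have not
confirmed that the numbers above were produced this way. The function
names and the toy lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths.

    Assumed definition: walk the reads from longest to shortest and
    return the length at which the running total of bases first
    reaches half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_of_bases = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_of_bases:
            return length
    raise ValueError("no read lengths given")


def fraction_shorter_than(lengths, cutoff=200):
    """Return the fraction of reads strictly shorter than `cutoff` bp."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no read lengths given")
    return sum(1 for n in lengths if n < cutoff) / len(lengths)


# Toy lengths for illustration only, not the real data.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print("N50:", n50(toy_lengths))
print("fraction < 4 bp:", fraction_shorter_than(toy_lengths, cutoff=4))
```

In practice the lengths would come from the SFF files; how they are
extracted is left out here.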


The distribution of single-end read lengths is given in
plot_lengths-5.jpg.



All the reads were mapped against assembly 2.0 using a 99% identity
threshold. Broadly all three sets had similar mapping characteristics
for either paired reads or single reads. See map_stat-0.jpg and
map_stat-1.jpg. Here is what gsMapper had to say about 'ReadStatus'
and 'PairStatus' after mapping:

ReadStatus:

Unmapped            609,527
Full                589,731
Partial             224,379
Repeat              572,589

Total             1,996,226

PairStatus:

BothUnmapped        145,718
OneUnmapped          66,450
MultiplyMapped      312,034
TruePair            151,787
FalsePair            39,494

Total               715,483


Note about counts:

Looking at the individual reads, 715,483 of the 1,290,176 reads were
split into pairs (leaving 574,693 reads that were not split). This
gives a total of (715,483 * 2) + 574,693 = 2,005,659 reads to map,
but gsMapper reports a ReadStatus for only 1,996,226 reads, a
shortfall of 9,433... not sure how to reconcile this inconsistency.
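
The bookkeeping can be rechecked with a few lines of Python. This is
only a sketch that redoes the arithmetic from the counts reported in
this post; the variable names are mine. One untested possibility is
that the missing reads carry the TooShort status described in the
manual excerpt below, which does not appear in the ReadStatus counts.

```python
# Counts copied from this post.
total_sff_reads = 1_290_176   # reads in the three SFF files
split_into_pairs = 715_483    # reads that were split into two halves

read_status = {
    "Unmapped": 609_527,
    "Full": 589_731,
    "Partial": 224_379,
    "Repeat": 572_589,
}

# Reads that were not split stay as single reads.
unsplit_reads = total_sff_reads - split_into_pairs

# Each split read contributes two halves to the mapping.
expected_to_map = 2 * split_into_pairs + unsplit_reads

# What the mapper actually reported a status for.
reported = sum(read_status.values())

shortfall = expected_to_map - reported

print(f"unsplit reads:    {unsplit_reads:,}")
print(f"expected to map:  {expected_to_map:,}")
print(f"reported:         {reported:,}")
print(f"unaccounted for:  {shortfall:,}")
```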


From the 454 manual:

The status string describes the read's fate in the alignment, and can
be one of the following four values:

Full     - the read is fully aligned to the reference (every base)
Partial  - only part of the read aligned to the reference
Repeat   - the read aligned equally well to multiple locations in
           the reference
Unmapped - the read did not align to the reference
TooShort - the trimmed read was too short to be used in the
           computation (i.e., shorter than 50 bases, unless 454
           Paired End Reads are included in the dataset, in which
           case, shorter than 15 bases).


Status - the status of the pair in the mapping, with the following
possible values:

a. BothUnmapped - both halves of the pair were unmapped
b. OneUnmapped - one of the reads in the pair were unmapped
c. MultiplyMapped - one or both of the reads in the pair were marked
   as Repeat.
d. TruePair - both halves of the pair were mapped into the same
   reference sequence, with the correct relative direction, and are
   within the expected distance of each other.
e. FalsePair - the halves were mapped to the same reference
   sequence, but the directions of the alignment is inconsistent
   with a Paired End pair or the distance between the halves is
   outside the expected distance.



Looking at the TruePair distances, I found that the distributions of
distances from all runs were almost IDENTICAL! (See
plot_distances.4.jpg and plot_distances.5.jpg, attached.) As before,
there is a small peak at 8 kb and a big peak at 14 kb (13 to 15
kb). The similarity in the distribution of insert sizes makes me
wonder how many unique reads we have.
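
One rough way to get at the number of unique pairs would be to count
how many TruePairs share exactly the same mapped coordinates. The
sketch below assumes each pair has already been reduced to a
(reference, left start, right start) tuple; that input format, the
function name and the toy records are hypothetical and are not taken
from the gsMapper output.

```python
from collections import Counter


def duplicate_summary(pairs):
    """Summarise how many mapped pairs are positionally unique.

    `pairs` is an iterable of (reference, left_start, right_start)
    tuples (a hypothetical format). Returns a tuple of
    (total pairs, unique pairs, unique fraction).
    """
    counts = Counter(pairs)
    total = sum(counts.values())
    unique = len(counts)
    fraction = unique / total if total else 0.0
    return total, unique, fraction


# Toy records for illustration only, not the real data.
toy_pairs = [
    ("contig1", 100, 14100),
    ("contig1", 100, 14100),   # same coordinates as the pair above
    ("contig1", 250, 8300),
    ("contig2", 10, 13950),
]

total, unique, fraction = duplicate_summary(toy_pairs)
print(f"{unique} unique of {total} pairs ({fraction:.0%})")
```

Exact coordinate matching is a crude criterion; pairs that differ by a
base or two after trimming would be counted as distinct.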

Is coverage 'random'?



The high fraction of OneUnmapped or BothUnmapped read pairs
((145,718 + 66,450) out of 715,483 ~= 30%) suggests that there is
something wrong with one or more of:

1) the DNA,
2) the sequencing run,
3) the mapper (the mapping run), or
4) the assembly.


I'd like to run an alternative mapper (such as Mira) to rule out
option 3, and hopefully the 454 assembly (when it arrives) can be used
to rule out option 4. I guess this is an area for further
investigation.


Removing 'MultiplyMapped' we have:

(145,718 + 66,450) out of (715,483 - 312,034) ~= 53%
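
The two unmapped fractions can be recomputed from the PairStatus
counts with a short Python sketch. It only redoes the arithmetic from
the numbers reported above; the variable names are mine.

```python
# PairStatus counts copied from this post.
pair_status = {
    "BothUnmapped": 145_718,
    "OneUnmapped": 66_450,
    "MultiplyMapped": 312_034,
    "TruePair": 151_787,
    "FalsePair": 39_494,
}

total_pairs = sum(pair_status.values())

# Pairs with at least one half unmapped.
unmapped_pairs = pair_status["BothUnmapped"] + pair_status["OneUnmapped"]

# Fraction of all pairs.
frac_of_all = unmapped_pairs / total_pairs

# Fraction after setting aside pairs that hit repeats.
unambiguous_pairs = total_pairs - pair_status["MultiplyMapped"]
frac_of_unambiguous = unmapped_pairs / unambiguous_pairs

print(f"unmapped / all pairs:          {frac_of_all:.1%}")
print(f"unmapped / non-repeat pairs:   {frac_of_unambiguous:.1%}")
```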

