BioBlog5000: Overview of paired end data

Summary:

The two half runs have slightly lower quality scores than the 1/8th
run (see plots), but their reads are longer by about 100 bp. Reads
from all three sets map equivalently to assembly 2.0. The distribution
of insert sizes for all three sets is NEARLY IDENTICAL.

Details:

Number of reads / bases from the SFF files:

F2K9UYJ04: 102,441 / 26,134,142

F3R7RWZ01: 605,101 / 200,205,241

F3R7RWZ02: 582,634 / 188,925,518

Total: 1,290,176 / 415,264,901
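
As a quick sanity check, the totals and the mean read length per run can be recomputed from the counts above. This is only a minimal sketch: the numbers are copied from the listing, and the names (sff_counts, mean_lengths) are my own, not anything produced by the 454 tools.

```python
# Reads and bases per SFF file, copied from the listing above.
sff_counts = {
    "F2K9UYJ04": (102_441, 26_134_142),
    "F3R7RWZ01": (605_101, 200_205_241),
    "F3R7RWZ02": (582_634, 188_925_518),
}

total_reads = sum(reads for reads, _ in sff_counts.values())
total_bases = sum(bases for _, bases in sff_counts.values())

# Mean read length per run is simply bases divided by reads.
mean_lengths = {run: bases / reads for run, (reads, bases) in sff_counts.items()}

print(f"Total: {total_reads:,} / {total_bases:,}")
for run, mean_len in mean_lengths.items():
    print(f"{run}: mean read length {mean_len:.1f} bp")
```

The per-run means computed this way can be compared against the Mean column in the read length breakdown further down.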

Percent of reads < 200 bp...

F2K9UYJ04: 29%

F3R7RWZ01: 14%

F3R7RWZ02: 15%
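
For reference, a percentage like this could be computed from a list of read lengths roughly as follows. This is a hedged sketch: the function name and the toy lengths are made up for illustration, and I am assuming "< 200 bp" means strictly shorter than 200.

```python
def percent_shorter_than(lengths, threshold=200):
    """Return the percentage of read lengths strictly below `threshold`."""
    if not lengths:
        raise ValueError("no read lengths given")
    short = sum(1 for length in lengths if length < threshold)
    return 100.0 * short / len(lengths)


# Made-up lengths, only to show the call; not taken from the SFF files.
toy_lengths = [40, 150, 199, 200, 252, 321, 410, 1014]
print(f"{percent_shorter_than(toy_lengths):.1f}% of reads < 200 bp")
```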

Quality plots are attached (as before); see quality.F????????-n.jpg. I'm
still not entirely sure what to read from the quality plots. However,
it's clear that the quality of the two half runs is slightly lower than
that of the 1/8th run, but the reads in these two runs are longer.

Here is a breakdown of the overall read lengths:

Run        Min.  1st Qu.  Median   Mean  3rd Qu.    Max.  N50
F2K9UYJ04  40.0    190.0   252.0  255.1    321.0  1014.0  289
F3R7RWZ01  40.0    266.0   351.0  330.9    411.0  2026.0  380
F3R7RWZ02  40.0    257.0   345.0  324.3    406.0  1103.0  376
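
The N50 column is not the same thing as the median, so here is a minimal sketch of how it is commonly computed. I am assuming the usual definition (the largest length L such that reads of length >= L hold at least half of all bases); how the N50 values above were actually calculated is not recorded here, and the function name is my own.

```python
def n50(lengths):
    """Return the N50 of a list of read lengths.

    Assumed definition: the largest length L such that reads of
    length >= L together contain at least half of all bases.
    """
    if not lengths:
        raise ValueError("no read lengths given")
    half = sum(lengths) / 2
    running = 0
    # Walk from the longest read down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


# Made-up lengths, only to show the call.
print(n50([100, 200, 300, 400]))
```

Because long reads contribute more bases, the N50 sits above the median, which matches the pattern in all three runs.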

The distribution of single-end read lengths is given in
plot_lengths-5.jpg.

All the reads were mapped against assembly 2.0 using a 99% identity

threshold. Broadly all three sets had similar mapping characteristics

for either paired reads or single reads. See map_stat-0.jpg and

map_stat-1.jpg. Here is what gsMapper had to say about 'ReadStatus'

and 'PairStatus' after mapping:

ReadStatus:

  Unmapped          609,527
  Full              589,731
  Partial           224,379
  Repeat            572,589
  Total           1,996,226

PairStatus:

  BothUnmapped      145,718
  OneUnmapped        66,450
  MultiplyMapped    312,034
  TruePair          151,787
  FalsePair          39,494
  Total             715,483

Note about counts:

Looking at the individual reads, 715,483 of the 1,290,176 reads were
split into pairs (leaving 574,693 reads that were not split). This
gives a total of (715,483 * 2) + 574,693 = 2,005,659 reads to map,
which is 9,433 more than the ReadStatus total of 1,996,226. I'm not
sure how to reconcile this inconsistency.
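
The arithmetic in this note can be replayed in a few lines, which also makes the size of the gap explicit. A minimal sketch, using only numbers quoted in this post; the variable names are my own.

```python
total_reads = 1_290_176    # reads in the three SFF files
split_reads = 715_483      # reads that were split into pairs
mapper_total = 1_996_226   # ReadStatus total reported by gsMapper

unsplit_reads = total_reads - split_reads
expected = 2 * split_reads + unsplit_reads   # each split read yields two halves
shortfall = expected - mapper_total

print(f"unsplit reads:           {unsplit_reads:,}")
print(f"expected reads to map:   {expected:,}")
print(f"missing from ReadStatus: {shortfall:,} "
      f"({100 * shortfall / expected:.2f}% of expected)")
```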

From the 454 manual:

The status string describes the read's fate in the alignment, and can
be one of the following four values:

  Full     - the read is fully aligned to the reference (every base)
  Partial  - only part of the read aligned to the reference
  Repeat   - the read aligned equally well to multiple locations in
             the reference
  Unmapped - the read did not align to the reference
  TooShort - the trimmed read was too short to be used in the
             computation (i.e., shorter than 50 bases, unless 454
             Paired End Reads are included in the dataset, in which
             case, shorter than 15 bases).

Status - the status of the pair in the mapping, with the following
possible values:

  a. BothUnmapped - both halves of the pair were unmapped
  b. OneUnmapped - one of the reads in the pair were unmapped
  c. MultiplyMapped - one or both of the reads in the pair were marked
     as Repeat.
  d. TruePair - both halves of the pair were mapped into the same
     reference sequence, with the correct relative direction, and are
     within the expected distance of each other.
  e. FalsePair - the halves were mapped to the same reference
     sequence, but the directions of the alignment is inconsistent
     with a Paired End pair or the distance between the halves is
     outside the expected distance.

Looking at the TruePair distances, I found that the distributions of
distances from all runs were almost IDENTICAL! (See
plot_distances.4.jpg and plot_distances.5.jpg, attached.) As before,
there is a small peak at 8 kb and a big peak at 14 kb (13 to 15
kb). The similarity in the distribution of insert sizes makes me
wonder how many unique reads we have.
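
One way to start answering the "how many unique reads" question would be to count how many TruePairs land on exactly the same placement. The sketch below is an assumption on my part, not something that was run on this data: the tuple layout (reference, left start, right end) is hypothetical and is not the gsMapper output format, and identical placement is only a rough proxy for a duplicated pair.

```python
from collections import Counter


def count_unique_pairs(pairs):
    """Return (unique placements, duplicate pairs) for an iterable of
    (reference, left_start, right_end) tuples.

    The tuple layout is hypothetical; real mapper output would need
    to be parsed into this shape first.
    """
    placements = Counter(pairs)
    duplicates = sum(n - 1 for n in placements.values())
    return len(placements), duplicates


# Made-up placements, only to show the call.
toy_pairs = [
    ("contig1", 100, 14100),
    ("contig1", 100, 14100),   # same placement again: a candidate duplicate
    ("contig1", 250, 8300),
    ("contig2", 40, 13950),
]
unique, duplicates = count_unique_pairs(toy_pairs)
print(f"{unique} unique placements, {duplicates} duplicate pairs")
```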

Is coverage 'random'?

The high fraction of OneUnmapped or BothUnmapped read pairs
(145,718 + 66,450 out of 715,483 ~= 30%) suggests that there is
something wrong with one or more of:

1) the DNA,
2) the sequencing run,
3) the mapper (the mapping run), or
4) the assembly.

I'd like to run an alternative mapper (such as Mira) to rule out
option 3, and hopefully the 454 assembly (when it arrives) can be used
to rule out option 4. This is an area for further investigation.

Removing 'MultiplyMapped' pairs from the denominator, we have:

(145,718 + 66,450) out of (715,483 - 312,034) ~= 53%
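
Both unmapped fractions can be recomputed directly from the PairStatus counts. A minimal sketch using only the numbers reported above; the dictionary and variable names are my own.

```python
# PairStatus counts as reported by gsMapper (copied from this post).
pair_status = {
    "BothUnmapped": 145_718,
    "OneUnmapped": 66_450,
    "MultiplyMapped": 312_034,
    "TruePair": 151_787,
    "FalsePair": 39_494,
}

total_pairs = sum(pair_status.values())
unmapped = pair_status["BothUnmapped"] + pair_status["OneUnmapped"]

# Fraction of all pairs with at least one unmapped half.
frac_all = unmapped / total_pairs
# Same numerator, but with MultiplyMapped pairs left out of the denominator.
frac_no_multi = unmapped / (total_pairs - pair_status["MultiplyMapped"])

print(f"unmapped / all pairs:            {100 * frac_all:.1f}%")
print(f"unmapped / (all - MultiplyMapped): {100 * frac_no_multi:.1f}%")
```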

BioBlog5000

Friday, 14 October 2011
