Friday, 14 October 2011

Managing the sequence assembly life cycle

As discussed, the work plan is very light on details of how the
informatics infrastructure will support the ongoing sequencing
efforts. Clearly we need to implement some infrastructure or
guidelines to manage the growing volume of data. Below are some rough
notes to be put forward for inclusion in the Work Plan.

Managing the sequence assembly life cycle

Several partners are independently collecting sequence data and/or
working on assemblies as well as pursuing their own research interests
using all the available resources. This is a natural situation that is
to be expected on any large international sequencing project.

We would like to outline some simple guidelines for effectively
sharing and managing data. These guidelines apply both internally
between partners and for external, public releases [Although Robin
said she would write something about the latter].

From a bioinformatics perspective, we would like to promote a 'release
early release often' philosophy with regard to the sequence data and
assemblies. We believe that this approach has several important
advantages and, if properly managed, very few disadvantages.

Guidelines for releasing sequencing data:

1) To help partners gain informatics experience with handling sequence
data, we suggest that all (or some portion of) the raw sequence data
should be made available from a centralised FTP site. Sharing raw data
in this way not only has educational benefits, it will also encourage
rigorous validation of the sequence assembly methods employed,
allowing comparison of assemblies prepared by different partners.

2) Assemblies should be made available on a centralised FTP site. To
avoid confusion between assembly versions, each assembly should be
clearly tagged with a version number. Any change to the assembly
would trigger an update of the version number; however, major
(relatively stable) versions should be clearly documented for more
general use. We believe that an initial, unstable assembly can provide
a lot of useful information, but it should be accurately tracked and
treated with appropriate caution.

3) For the reasons stated above, partners should provide descriptions
of the methods used to produce a given assembly as clear 'standard
operating procedures' (SOPs).
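One hypothetical way to implement the version tagging in point 2 would
be to publish a small manifest alongside each release on the FTP site,
pairing a version tag with per-file checksums. The file names and
manifest format below are illustrative only, not an agreed standard:

```python
import hashlib

def md5sum(path, chunk=1 << 20):
    """MD5 hex digest of a file, read in chunks to cope with large assemblies."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def make_manifest(version, files, stable=False):
    """Describe one assembly release: version tag, stability flag, checksums."""
    return {
        "version": version,  # e.g. "2.0"; any change to the assembly bumps this
        "stable": stable,    # mark major, documented releases for general use
        "files": {name: md5sum(name) for name in files},
    }
```

A partner fetching an assembly could then recompute the checksum
locally and compare it against the manifest before trusting any
downstream analysis.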

What other guidelines can we suggest? What other 'support' or
management infrastructure can we reasonably implement?

-- Reiteration of the points made above

From a biological perspective, working with very rough or highly
dynamic assemblies may be counterproductive. However, from a
bioinformatics perspective, even very preliminary data is desirable
for a number of reasons:

* Can be used to provide experience and training with different data
types and data produced on different platforms.

- For example, we would like to study combined assembly using Solexa
and 454 data. The best way to gain an understanding of the methods is
to get hands-on experience.

* Can be used to provide realistic input to explore various
algorithms or bioinformatics methods that are under development.

- For example, if we wanted to map the Solexa data to the sequenced
RH BACs to look for SNPs, we could rigorously compare methods by using
real data.

* Allows independent validation of the data, so that material
improvements between releases can be independently and objectively assessed.

- For example we could be interested in developing a pipeline to
'scaffold' BACs using the assembly, and these scaffolds could be
compared to scaffolds produced by other methods. We would expect each
new release of the assembly to improve the performance of these
methods, and errors could be fed back between partners to improve the
overall level of sequencing quality.

formlink magic

# {{
# #vardefine: CUSTOMER|{{ #show: {{{Project1|}}} | ?Customer | link=none }} }} {{
# #vardefine: REGARDING|{{ #show: {{{Project1|}}} | ?Regarding | link=none }} }} {{
# #vardefine: ADDRESS|{{ #show: {{ #var: CUSTOMER }} | ?Invoice address }} }} {{
# #vardefine: CURRENCY|{{ #show: {{ #var: REGARDING }} | ?Currency }} }} {{
# #vardefine: TOTALAMOUNT|42.42 }} {{
# #vardefine: INVOICEDATE|{{ #time: Y-m-d|{{CURRENTTIMESTAMP}} {{TZ offset}} }} }}
# {{ #ifeq: {{ #var: ADDRESS }} | n/a | Invoice address not available - invoice generation impossible. |
# {{ #formlink: Invoice | Create invoice | link | Invoice[InvoiceNumber]={{ #explode: {{PAGENAME}}| |2 }}&Invoice[Project1]={{{Project1|}}}{{
# #if: {{ #var: ADDRESS }} |&Invoice[InvoiceAddress]={{ #urlencode: {{ #var: ADDRESS }} }} }}{{
# #if: {{ #var: CURRENCY }} |&Invoice[Currency]={{ #var: CURRENCY }} }}{{
# #if: {{ #var: TOTALAMOUNT }} |&Invoice[TotalAmount]={{ #var: TOTALAMOUNT }} }}{{
# #if: {{ #var: INVOICEDATE }}

{{ #if: {{ #forargs: ProjectNumber|param|value | {{ #ifexist: {{ #var: value }} | | {{ #vardefineecho: INVALIDPROJECTS|{{ #var: INVALIDPROJECTS }} {{ #var: value }} }} }} }}
| Error: The following projects are non-existent: {{ #var: INVALIDPROJECTS }} |
#vardefine: CUSTOMER|{{ #show: {{{ProjectNumber1|}}} | ?Customer | link=none }} }} {{
#vardefine: REGARDING|{{ #show: {{{ProjectNumber1|}}} | ?Regarding | link=none }} }} {{
#vardefine: ADDRESS|{{ #show: {{ #var: CUSTOMER }} | ?Invoice address }} }} {{
#vardefine: CURRENCY|{{ #show: {{ #var: REGARDING }} | ?Currency }} }} {{
#vardefine: TOTALAMOUNT|42.42 }} {{
#vardefine: INVOICEDATE|{{ #time: Y-m-d|{{CURRENTTIMESTAMP}} {{TZ offset}} }} }}
{{ #ifeq: {{ #var: ADDRESS }} | n/a | Invoice address not available - invoice generation impossible. |

{{ #formlink: Invoice | Create invoice | post button | Invoice[InvoiceNumber]={{ #explode: {{PAGENAME}}| |2 }}{{
#if: {{ #var: ADDRESS }} |&Invoice[InvoiceAddress]={{ #var: ADDRESS }} }}{{
#if: {{ #var: CURRENCY }} |&Invoice[Currency]={{ #var: CURRENCY }} }}{{
#if: {{ #var: TOTALAMOUNT }} |&Invoice[TotalAmount]={{ #var: TOTALAMOUNT }} }}{{
#if: {{ #var: INVOICEDATE }} |&Invoice[InvoiceDate]={{ #var: INVOICEDATE }} }}{{
#forargs: ProjectNumber|param|value |&Invoice[ProjectNumber{{ #var: param }}]={{ #var: value }} }}{{
#vardefine: i | 1 }}{{ #while: |{{{ProjectNumber{{ #var: i }}|}}}
|&Invoice[ProjectName{{ #var: i }}]={{ #show: {{{ProjectNumber{{ #var: i }} }}} | ?Project name }}{{
#vardefine: i | {{ #expr: {{ #var: i }} + 1 }} }} }}

with the following data:


Invoice address: {{ #tag:pre|{{ #var: ADDRESS }} }}
Currency: {{ #var: CURRENCY }}
Total amount: {{ #var: TOTALAMOUNT }}
Invoice date: {{ #var: INVOICEDATE }}

{{ #forargs: ProjectNumber|param|value |
=== Project {{ #var: param }} ===
Project number: {{ #var: value }}
Project name: {{ #show: {{ #var: value }} | ?Project name }}
}}


Overview of paired end data


The two half runs have slightly lower quality scores than the 1/8th
run (see plots), but have longer read lengths by about 100 bp. Reads
from all three sets map equivalently to assembly 2.0. The distribution
of insert sizes for all three sets is NEARLY IDENTICAL.


Number of reads / bases from the SFF files:

F2K9UYJ04: 102,441 / 26,134,142
F3R7RWZ01: 605,101 / 200,205,241
F3R7RWZ02: 582,634 / 188,925,518

Total: 1,290,176 / 415,264,901

Percent of reads < 200 bp...

F2K9UYJ04: 29%
F3R7RWZ01: 14%
F3R7RWZ02: 15%

Quality plots attached (as before); see quality.F????????-n.jpg. I'm
still not entirely sure what to read from the quality plots. However,
it's clear that the quality of the two half runs is slightly lower
than the quality of the 1/8th run, but the length of the reads is
greater in these two runs.

Here is a breakdown of the overall read lengths (one summary per run;
the means match F2K9UYJ04, F3R7RWZ01 and F3R7RWZ02, in that order):

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    N50
  40.0   190.0   252.0   255.1   321.0  1014.0    289

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    N50
  40.0   266.0   351.0   330.9   411.0  2026.0    380

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    N50
  40.0   257.0   345.0   324.3   406.0  1103.0    376
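For reference, these summaries (including N50) can be recomputed from
the raw read lengths. This is a minimal sketch; it uses a crude
nearest-rank quantile, so the quartiles may differ slightly from the
values reported above:

```python
def n50(lengths):
    """Smallest length L such that reads of length >= L hold half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def length_summary(lengths):
    """Min/quartiles/mean/max/N50 for a list of read lengths."""
    xs = sorted(lengths)
    n = len(xs)
    q = lambda p: xs[int(p * (n - 1))]  # crude nearest-rank quantile
    return {
        "min": xs[0], "q1": q(0.25), "median": q(0.5),
        "mean": sum(xs) / n, "q3": q(0.75), "max": xs[-1],
        "n50": n50(xs),
    }
```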

The distribution of single-end read lengths is given in the attached
plots.

All the reads were mapped against assembly 2.0 using a 99% identity
threshold. Broadly, all three sets had similar mapping characteristics
for both paired and single reads. See map_stat-0.jpg and
map_stat-1.jpg. Here is what gsMapper had to say about 'ReadStatus'
and 'PairStatus' after mapping:

ReadStatus:

  Unmapped         609,527
  Full             589,731
  Partial          224,379
  Repeat           572,589

  Total          1,996,226

PairStatus:

  BothUnmapped     145,718
  OneUnmapped       66,450
  MultiplyMapped   312,034
  TruePair         151,787
  FalsePair         39,494

  Total            715,483

Note about counts:

Looking at the individual reads, 715,483 of the 1,290,176 reads were
split into pairs (leaving 574,693 reads that were not split). This
gives a total of (715,483 * 2) + 574,693 = 2,005,659 reads to map,
versus the 1,996,226 reported above... not sure how to reconcile this
inconsistency.
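Writing the bookkeeping out explicitly (all counts copied from above);
one hedged guess is that the 9,433 missing reads fall into the
TooShort category described in the manual excerpt below, but I have
not confirmed this:

```python
# All counts below are copied from the notes above.
paired = 715_483                 # reads split into pairs
total = 1_290_176                # reads in the SFF files
unsplit = total - paired         # 574,693 reads not split into pairs

expected = 2 * paired + unsplit  # 2,005,659 reads presented to the mapper
reported = 609_527 + 589_731 + 224_379 + 572_589  # 1,996,226 (ReadStatus total)

missing = expected - reported    # 9,433 reads unaccounted for
```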

From the 454 manual:

The status string describes the read's fate in the alignment, and can
be one of the following four values:

Full - the read is fully aligned to the reference (every base)
Partial - only part of the read aligned to the reference
Repeat - the read aligned equally well to multiple locations in
the reference
Unmapped - the read did not align to the reference
TooShort - the trimmed read was too short to be used in the
computation (i.e., shorter than 50 bases, unless 454
Paired End Reads are included in the dataset, in which
case, shorter than 15 bases).

Status - the status of the pair in the mapping, with the following
possible values:

a. BothUnmapped - both halves of the pair were unmapped
b. OneUnmapped - one of the reads in the pair were unmapped
c. MultiplyMapped - one or both of the reads in the pair were marked
as Repeat.
d. TruePair - both halves of the pair were mapped into the same
reference sequence, with the correct relative direction, and are
within the expected distance of each other.
e. FalsePair - the halves were mapped to the same reference
sequence, but the directions of the alignment is inconsistent
with a Paired End pair or the distance between the halves is
outside the expected distance.

Looking at the TruePair distances, I found the distribution of
distances from all runs was almost IDENTICAL! (See
plot_distances.4.jpg and plot_distances.5.jpg, attached.) As before,
there is a small peak at 8 kb and a big peak at 14 kb (13 to 15
kb). The similarity in the distribution of insert sizes makes me
wonder how many unique reads we have.
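A quick way to reproduce this peak-finding is to bin the TruePair
distances into 1 kb bins. The distances themselves would come from the
gsMapper pair output; the example values below are made up for
illustration:

```python
from collections import Counter

def kb_histogram(distances, bin_kb=1):
    """Count insert sizes (in bp) per bin of `bin_kb` kilobases."""
    return Counter(bin_kb * (d // (bin_kb * 1000)) for d in distances)

# Illustrative distances only; real values would come from the mapper output.
example = [8_100, 8_300, 13_900, 14_200, 14_600, 14_100]
hist = kb_histogram(example)
# hist maps bin start in kb -> count, e.g. Counter({14: 3, 8: 2, 13: 1})
```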

Is coverage 'random'?

The high fraction of OneUnmapped or BothUnmapped read pairs (145,718 +
66,450 out of 715,483 ~= 30%) suggests there is something wrong with
one or more of:

1) the DNA
2) the sequencing run,
3) the mapper (the mapping run), or
4) the assembly.

I'd like to run an alternative mapper (such as Mira) to rule out
option 3, and hopefully the 454 assembly (when it arrives) can be used
to rule out option 4. I guess this is an area for further
investigation.

Removing the 'MultiplyMapped' pairs, we have:

(145,718 + 66,450) out of (715,483 - 312,034) ~= 53%