Friday, 14 October 2011

Managing the sequence assembly life cycle

As discussed, the work plan is very light on details of how the
informatics infrastructure will support the ongoing sequencing
efforts. Clearly we need to implement some infrastructure or
guidelines to manage the growing volume of data. Below are some rough
notes to be put forward for inclusion in the Work Plan.

Managing the sequence assembly life cycle

Several partners are independently collecting sequence data and/or
working on assemblies as well as pursuing their own research interests
using all the available resources. This is a natural situation that is
to be expected on any large international sequencing project.

We would like to outline some simple guidelines for effectively
sharing and managing data. These guidelines apply both internally
between partners and for external, public releases [Although Robin
said she would write something about the latter].

From a bioinformatics perspective, we would like to promote a 'release
early release often' philosophy with regard to the sequence data and
assemblies. We believe that this approach has several important
advantages and, if properly managed, very few disadvantages.

Guidelines for releasing sequencing data:

1) To help partners gain informatics experience with handling sequence
data, we suggest that all (or some portion of) the raw sequence data
should be made available from a centralised FTP site. Sharing raw data
in this way not only has educational benefits, it will also encourage
rigorous validation of the sequence assembly methods employed,
allowing comparison of assemblies prepared by different partners.

2) Assemblies should be made available on a centralised FTP site. To
avoid confusion between assembly versions, each assembly should be
clearly tagged with a version number. Any change to the assembly
would trigger an update of the version number; however, major
(relatively stable) versions should be clearly documented for more
general use. We believe that an initial, unstable assembly can provide
a lot of useful information, but it should be accurately tracked and
treated with appropriate caution.

3) For the reasons stated above, partners should provide descriptions
of the methods used to produce a given assembly as clear 'standard
operating procedures' (SOPs).
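One hypothetical way to implement the version tagging in point 2 would
be to publish a small manifest alongside each release on the FTP site,
pairing a version tag with per-file checksums. The file names and
manifest format below are illustrative only, not an agreed standard:

```python
import hashlib

def md5sum(path, chunk=1 << 20):
    """MD5 hex digest of a file, read in chunks to cope with large assemblies."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def make_manifest(version, files, stable=False):
    """Describe one assembly release: version tag, stability flag, checksums."""
    return {
        "version": version,  # e.g. "2.0"; any change to the assembly bumps this
        "stable": stable,    # mark major, documented releases for general use
        "files": {name: md5sum(name) for name in files},
    }
```

A partner fetching an assembly could then recompute the checksum
locally and compare it against the manifest before trusting any
downstream analysis.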

What other guidelines can we suggest? What other 'support' or
management infrastructure can we reasonably implement?

-- Reiteration of the points made above

From a biological perspective, working with very rough or highly
dynamic assemblies may be counterproductive. However, from a
bioinformatics perspective, even very preliminary data is desirable
for a number of reasons:

* Can be used to provide experience and training with different data
types and data produced on different platforms.

- For example, we would like to study combined assembly using Solexa
and 454 data. The best way to gain an understanding of the methods is
to get hands-on experience.

* Can be used to provide realistic input to explore various
algorithms or bioinformatics methods that are under development.

- For example, if we wanted to map the Solexa data to the sequenced
RH BACs to look for SNPs, we could rigorously compare methods by using
real data.

* Allows independent validation of the data, so that material
improvements between releases can be independently and objectively assessed.

- For example we could be interested in developing a pipeline to
'scaffold' BACs using the assembly, and these scaffolds could be
compared to scaffolds produced by other methods. We would expect each
new release of the assembly to improve the performance of these
methods, and errors could be fed back between partners to improve the
overall level of sequencing quality.

formlink magic

# {{
# #vardefine: CUSTOMER|{{ #show: {{{Project1|}}} | ?Customer | link=none }} }} {{
# #vardefine: REGARDING|{{ #show: {{{Project1|}}} | ?Regarding | link=none }} }} {{
# #vardefine: ADDRESS|{{ #show: {{ #var: CUSTOMER }} | ?Invoice address }} }} {{
# #vardefine: CURRENCY|{{ #show: {{ #var: REGARDING }} | ?Currency }} }} {{
# #vardefine: TOTALAMOUNT|42.42 }} {{
# #vardefine: INVOICEDATE|{{ #time: Y-m-d|{{CURRENTTIMESTAMP}} {{TZ offset}} }} }}
# {{ #ifeq: {{ #var: ADDRESS }} | n/a | Invoice address not available - invoice generation impossible. |
# {{ #formlink: Invoice | Create invoice | link | Invoice[InvoiceNumber]={{ #explode: {{PAGENAME}}| |2 }}&Invoice[Project1]={{{Project1|}}}{{
# #if: {{ #var: ADDRESS }} |&Invoice[InvoiceAddress]={{ #urlencode: {{ #var: ADDRESS }} }} }}{{
# #if: {{ #var: CURRENCY }} |&Invoice[Currency]={{ #var: CURRENCY }} }}{{
# #if: {{ #var: TOTALAMOUNT }} |&Invoice[TotalAmount]={{ #var: TOTALAMOUNT }} }}{{
# #if: {{ #var: INVOICEDATE }}

{{ #if: {{ #forargs: ProjectNumber|param|value | {{ #ifexist: {{ #var: value }} | | {{ #vardefineecho: INVALIDPROJECTS|{{ #var: INVALIDPROJECTS }} {{ #var: value }} }} }} }}
| Error: The following projects are non-existent: {{ #var: INVALIDPROJECTS }} |
#vardefine: CUSTOMER|{{ #show: {{{ProjectNumber1|}}} | ?Customer | link=none }} }} {{
#vardefine: REGARDING|{{ #show: {{{ProjectNumber1|}}} | ?Regarding | link=none }} }} {{
#vardefine: ADDRESS|{{ #show: {{ #var: CUSTOMER }} | ?Invoice address }} }} {{
#vardefine: CURRENCY|{{ #show: {{ #var: REGARDING }} | ?Currency }} }} {{
#vardefine: TOTALAMOUNT|42.42 }} {{
#vardefine: INVOICEDATE|{{ #time: Y-m-d|{{CURRENTTIMESTAMP}} {{TZ offset}} }} }}
{{ #ifeq: {{ #var: ADDRESS }} | n/a | Invoice address not available - invoice generation impossible. |

{{ #formlink: Invoice | Create invoice | post button | Invoice[InvoiceNumber]={{ #explode: {{PAGENAME}}| |2 }}{{
#if: {{ #var: ADDRESS }} |&Invoice[InvoiceAddress]={{ #var: ADDRESS }} }}{{
#if: {{ #var: CURRENCY }} |&Invoice[Currency]={{ #var: CURRENCY }} }}{{
#if: {{ #var: TOTALAMOUNT }} |&Invoice[TotalAmount]={{ #var: TOTALAMOUNT }} }}{{
#if: {{ #var: INVOICEDATE }} |&Invoice[InvoiceDate]={{ #var: INVOICEDATE }} }}{{
#forargs: ProjectNumber|param|value |&Invoice[ProjectNumber{{ #var: param }}]={{ #var: value }} }}{{
#vardefine: i | 1 }}{{ #while: |{{{ProjectNumber{{ #var: i }}|}}}
|&Invoice[ProjectName{{ #var: i }}]={{ #show: {{{ProjectNumber{{ #var: i }} }}} | ?Project name }}{{
#vardefine: i | {{ #expr: {{ #var: i }} + 1 }} }} }}

with the following data:


Invoice address: {{ #tag:pre|{{ #var: ADDRESS }} }}
Currency: {{ #var: CURRENCY }}
Total amount: {{ #var: TOTALAMOUNT }}
Invoice date: {{ #var: INVOICEDATE }}

{{ #forargs: ProjectNumber|param|value |
=== Project {{ #var: param }} ===
Project number: {{ #var: value }}
Project name: {{ #show: {{ #var: value }} | ?Project name }}
}}


Overview of paired end data


The two half runs have slightly lower quality scores than the 1/8th
run (see plots), but have longer read lengths by about 100 bp. Reads
from all three sets map equivalently to assembly 2.0. The distribution
of insert sizes for all three sets is NEARLY IDENTICAL.


Number of reads / bases from the SFF files:

F2K9UYJ04: 102,441 / 26,134,142
F3R7RWZ01: 605,101 / 200,205,241
F3R7RWZ02: 582,634 / 188,925,518

Total: 1,290,176 / 415,264,901

Percent of reads < 200 bp...

F2K9UYJ04: 29%
F3R7RWZ01: 14%
F3R7RWZ02: 15%

Quality plots attached (as before); see quality.F????????-n.jpg. I'm
still not entirely sure what to read from the quality plots. However,
it's clear that the quality of the two half runs is slightly lower
than the quality of the 1/8th run, but the length of the reads is
greater in these two runs.

Here is a breakdown of the overall read lengths (one summary per run;
the means match F2K9UYJ04, F3R7RWZ01 and F3R7RWZ02, in that order):

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    N50
  40.0   190.0   252.0   255.1   321.0  1014.0    289

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    N50
  40.0   266.0   351.0   330.9   411.0  2026.0    380

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    N50
  40.0   257.0   345.0   324.3   406.0  1103.0    376
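For reference, these summaries (including N50) can be recomputed from
the raw read lengths. This is a minimal sketch; it uses a crude
nearest-rank quantile, so the quartiles may differ slightly from the
values reported above:

```python
def n50(lengths):
    """Smallest length L such that reads of length >= L hold half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def length_summary(lengths):
    """Min/quartiles/mean/max/N50 for a list of read lengths."""
    xs = sorted(lengths)
    n = len(xs)
    q = lambda p: xs[int(p * (n - 1))]  # crude nearest-rank quantile
    return {
        "min": xs[0], "q1": q(0.25), "median": q(0.5),
        "mean": sum(xs) / n, "q3": q(0.75), "max": xs[-1],
        "n50": n50(xs),
    }
```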

The distribution of single-end read lengths is given in the attached
plots.

All the reads were mapped against assembly 2.0 using a 99% identity
threshold. Broadly, all three sets had similar mapping characteristics
for both paired and single reads. See map_stat-0.jpg and
map_stat-1.jpg. Here is what gsMapper had to say about 'ReadStatus'
and 'PairStatus' after mapping:

ReadStatus:

  Unmapped         609,527
  Full             589,731
  Partial          224,379
  Repeat           572,589

  Total          1,996,226

PairStatus:

  BothUnmapped     145,718
  OneUnmapped       66,450
  MultiplyMapped   312,034
  TruePair         151,787
  FalsePair         39,494

  Total            715,483

Note about counts:

Looking at the individual reads, 715,483 of the 1,290,176 reads were
split into pairs (leaving 574,693 reads that were not split). This
gives a total of (715,483 * 2) + 574,693 = 2,005,659 reads to map,
versus the 1,996,226 reported above... not sure how to reconcile this
inconsistency.
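Writing the bookkeeping out explicitly (all counts copied from above);
one hedged guess is that the 9,433 missing reads fall into the
TooShort category described in the manual excerpt below, but I have
not confirmed this:

```python
# All counts below are copied from the notes above.
paired = 715_483                 # reads split into pairs
total = 1_290_176                # reads in the SFF files
unsplit = total - paired         # 574,693 reads not split into pairs

expected = 2 * paired + unsplit  # 2,005,659 reads presented to the mapper
reported = 609_527 + 589_731 + 224_379 + 572_589  # 1,996,226 (ReadStatus total)

missing = expected - reported    # 9,433 reads unaccounted for
```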

From the 454 manual:

The status string describes the read's fate in the alignment, and can
be one of the following four values:

Full - the read is fully aligned to the reference (every base)
Partial - only part of the read aligned to the reference
Repeat - the read aligned equally well to multiple locations in
the reference
Unmapped - the read did not align to the reference
TooShort - the trimmed read was too short to be used in the
computation (i.e., shorter than 50 bases, unless 454
Paired End Reads are included in the dataset, in which
case, shorter than 15 bases).

Status - the status of the pair in the mapping, with the following
possible values:

a. BothUnmapped - both halves of the pair were unmapped
b. OneUnmapped - one of the reads in the pair were unmapped
c. MultiplyMapped - one or both of the reads in the pair were marked
as Repeat.
d. TruePair - both halves of the pair were mapped into the same
reference sequence, with the correct relative direction, and are
within the expected distance of each other.
e. FalsePair - the halves were mapped to the same reference
sequence, but the directions of the alignment is inconsistent
with a Paired End pair or the distance between the halves is
outside the expected distance.

Looking at the TruePair distances, I found the distribution of
distances from all runs was almost IDENTICAL! (See
plot_distances.4.jpg and plot_distances.5.jpg, attached.) As before,
there is a small peak at 8 kb and a big peak at 14 kb (13 to 15
kb). The similarity in the distribution of insert sizes makes me
wonder how many unique reads we have.
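A quick way to reproduce this peak-finding is to bin the TruePair
distances into 1 kb bins. The distances themselves would come from the
gsMapper pair output; the example values below are made up for
illustration:

```python
from collections import Counter

def kb_histogram(distances, bin_kb=1):
    """Count insert sizes (in bp) per bin of `bin_kb` kilobases."""
    return Counter(bin_kb * (d // (bin_kb * 1000)) for d in distances)

# Illustrative distances only; real values would come from the mapper output.
example = [8_100, 8_300, 13_900, 14_200, 14_600, 14_100]
hist = kb_histogram(example)
# hist maps bin start in kb -> count, e.g. Counter({14: 3, 8: 2, 13: 1})
```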

Is coverage 'random'?

The high fraction of OneUnmapped or BothUnmapped read pairs (145,718 +
66,450 out of 715,483 ~= 30%) suggests there is something wrong with
one or more of:

1) the DNA
2) the sequencing run,
3) the mapper (the mapping run), or
4) the assembly.

I'd like to run an alternative mapper (such as Mira) to rule out
option 3, and hopefully the 454 assembly (when it arrives) can be used
to rule out option 4. I guess this is an area for further
investigation.

Removing the 'MultiplyMapped' pairs, we have:

(145,718 + 66,450) out of (715,483 - 312,034) ~= 53%