BioBlog5000: Managing the sequence assembly life cycle

As discussed, the work plan is very light on details of how the

informatics infrastructure will support the ongoing sequencing

efforts. Clearly we need to implement some infrastructure or

guidelines to manage the growing volume of data. Below are some rough

notes to be put forward for inclusion in the Work Plan.

Managing the sequence assembly life cycle

Several partners are independently collecting sequence data and/or

working on assemblies as well as pursuing their own research interests

using all the available resources. This is a natural situation that is

to be expected on any large international sequencing project.

We would like to outline some simple guidelines for effectively

sharing and managing data. These guidelines apply both internally

between partners and for external, public releases [Although Robin

said she would write something about the latter].

From a bioinformatics perspective, we would like to promote a 'release

early release often' philosophy with regard to the sequence data and

assemblies. We believe that this approach has several important

advantages and, if properly managed, very few disadvantages.

Guidelines for releasing sequencing data:

1) To help partners gain informatics experience with handling sequence

data, we suggest that all (or some portion of) the raw sequence data

should be made available from a centralised FTP site. Sharing raw data

in this way not only has educational benefits, it will encourage

rigorous validation of sequence assembly methods employed by the

partners, allowing comparison of assemblies prepared by different

methods.

2) Assemblies should be made available on a centralised FTP site. To

avoid confusion between assembly versions, each assembly should be

clearly tagged with a version number. Any change to the assembly

would trigger an update to the version number. However, major

(relatively stable) versions should be clearly documented for more

general use. We believe that an initial unstable assembly can provide

a lot of useful information, but it should be accurately tracked and

treated with appropriate caution.

3) For the reasons stated above, partners should provide descriptions

of the methods used to produce a given assembly as clear 'standard

operating procedures' (SOPs).
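
To make guideline 2 a little more concrete, here is a minimal sketch of how version tags might be handled. The 'major.minor' scheme, the class name AssemblyVersion and the file naming pattern are all assumptions for illustration, not something the partners have agreed on: any change bumps the minor number, and a documented, relatively stable release starts a new major number.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class AssemblyVersion:
    """Hypothetical 'major.minor' tag for an assembly release."""

    major: int
    minor: int = 0

    def bump(self, stable: bool = False) -> "AssemblyVersion":
        # Any change to the assembly bumps the minor number; a documented,
        # relatively stable release starts a new major number instead.
        if stable:
            return AssemblyVersion(self.major + 1, 0)
        return AssemblyVersion(self.major, self.minor + 1)

    def filename(self, prefix: str = "assembly") -> str:
        # Assumed naming convention for files on the shared FTP site.
        return f"{prefix}_v{self.major}.{self.minor}.fasta"


if __name__ == "__main__":
    current = AssemblyVersion(0, 3)
    print(current.bump().filename())             # unstable update
    print(current.bump(stable=True).filename())  # documented stable release
```

Keeping the tag in the file name means a partner can tell from a download alone which assembly they are working with, and the ordering on the tag makes it easy to check which of two files is the more recent.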

What other guidelines can we suggest? What other 'support' or

management infrastructure can we reasonably implement?

-- Reiteration of the points made above

From a biological perspective, working with very rough or highly

dynamic assemblies may be counterproductive. However, from a

bioinformatics perspective, even very preliminary data is desirable

for a number of reasons:

* Can be used to provide experience and training with different data

types and data produced on different platforms.

- For example, we would like to study combined assembly using Solexa

and 454 data. The best way to gain an understanding of the methods is to get

hands-on experience.

* Can be used to provide realistic input to explore various

algorithms or bioinformatics methods that are under development.

- For example, if we wanted to map the Solexa data to the sequenced

RH BACs to look for SNPs, we could rigorously compare methods by using

real data.

* Allows independent validation of the data, so that material

improvements between releases can be independently and objectively assessed.

- For example, we could be interested in developing a pipeline to

'scaffold' BACs using the assembly, and these scaffolds could be

compared to scaffolds produced by other methods. We would expect each

new release of the assembly to improve the performance of these

methods, and errors could be fed back between partners to improve the

overall sequencing quality.
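
As a rough illustration of what an objective comparison between releases could look like, the sketch below computes a few summary statistics from a list of contig lengths. N50 is used here only as one commonly quoted contiguity measure; it is my choice of example rather than an agreed project metric, and the function names and the idea of feeding in pre-extracted contig lengths are assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty).

    N50 is the length of the contig at which the running total, taken
    from the longest contig downwards, first reaches half the assembly.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


def summarise(lengths):
    """Summary statistics for one assembly release."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Made-up contig lengths for two hypothetical releases.
    releases = {
        "v0.1": [1200, 900, 800, 400, 300],
        "v0.2": [2500, 1100, 400],
    }
    for tag, lengths in releases.items():
        print(tag, summarise(lengths))
```

A higher N50 does not by itself mean a more correct assembly, so numbers like these would only be a starting point alongside the SOPs and the independent checks described above.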


Friday, 14 October 2011
