As discussed, the work plan is very light on details of how the
informatics infrastructure will support the ongoing sequencing
efforts. Clearly we need to implement some infrastructure or
guidelines to manage the growing volume of data. Below are some rough
notes to be put forward for inclusion in the Work Plan.
Managing the sequence assembly life cycle
Several partners are independently collecting sequence data and/or
working on assemblies as well as pursuing their own research interests
using all the available resources. This is a natural situation that is
to be expected on any large international sequencing project.
We would like to outline some simple guidelines for effectively
sharing and managing data. These guidelines apply both internally
between partners and for external, public releases (although Robin
said she would write something about the latter).
From a bioinformatics perspective, we would like to promote a 'release
early, release often' philosophy with regard to the sequence data and
assemblies. We believe that this approach has several important
advantages and, if properly managed, very few disadvantages.
Guidelines for releasing sequencing data:
1) To help partners gain informatics experience with handling sequence
data, we suggest that all (or some portion of) the raw sequence data
should be made available from a centralised FTP site. Sharing raw data
in this way not only has educational benefits, it will encourage
rigorous validation of sequence assembly methods employed by the
partners, allowing comparison of assemblies prepared by different
methods.
2) Assemblies should be made available on a centralised FTP site. To
avoid confusion between assembly versions, each assembly should be
clearly tagged with a version number. Any change to the assembly
would trigger an update in the version number. However, major
(relatively stable) versions should be clearly documented for more
general use. We believe that an initial unstable assembly can provide
a lot of useful information, but should be accurately tracked and
treated with appropriate caution.
3) For the reasons stated above, partners should provide descriptions
of the methods used to produce a given assembly as clear 'standard
operating procedures' (SOPs).
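
As a rough illustration of the version tagging suggested in point 2,
the sketch below shows one possible scheme. The class name
AssemblyVersion, the major.minor format, the rule for what counts as
stable, and the file naming pattern are all assumptions for discussion,
not an agreed convention. Here any change bumps the minor number, and a
documented stable release bumps the major number and resets the minor
one.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class AssemblyVersion:
    """Hypothetical major.minor version tag for an assembly."""
    major: int
    minor: int

    def bump_minor(self) -> "AssemblyVersion":
        # Any change to the assembly gets a new minor number.
        return AssemblyVersion(self.major, self.minor + 1)

    def bump_major(self) -> "AssemblyVersion":
        # A documented, relatively stable release gets a new major number.
        return AssemblyVersion(self.major + 1, 0)

    @property
    def is_stable(self) -> bool:
        # Assumption: only x.0 releases (x >= 1) count as stable.
        return self.major >= 1 and self.minor == 0

    def filename(self, project: str) -> str:
        # Assumed naming pattern for assembly files on the FTP site.
        return f"{project}_assembly_v{self.major}.{self.minor}.fasta"


if __name__ == "__main__":
    v = AssemblyVersion(0, 1)
    v = v.bump_minor()   # small change to an unstable assembly
    print(v.filename("example"), v.is_stable)
    v = v.bump_major()   # promoted to a documented stable release
    print(v.filename("example"), v.is_stable)
```

Because the tags compare in order, partners could check at a glance
whether the assembly they hold is older than the one on the FTP site.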
What other guidelines can we suggest? What other 'support' or
management infrastructure can we reasonably implement?
-- Reiteration of the points made above
From a biological perspective, working with very rough or highly
dynamic assemblies may be counterproductive. However, from a
bioinformatics perspective, even very preliminary data is desirable
for a number of reasons:
* Can be used to provide experience and training with different data
types and data produced on different platforms.
- For example, we would like to study combined assembly using Solexa
and 454 data. The best way to gain an understanding of the methods is
to get hands-on experience.
* Can be used to provide realistic input to explore various
algorithms or bioinformatics methods that are under development.
- For example, if we wanted to map the Solexa data to the sequenced
RH BACs to look for SNPs, we could rigorously compare methods by using
real data.
* Allows independent validation of the data, so that material
improvements between releases can be independently and objectively assessed.
- For example, we could be interested in developing a pipeline to
'scaffold' BACs using the assembly, and these scaffolds could be
compared to scaffolds produced by other methods. We would expect each
new release of the assembly to improve the performance of these
methods, and errors could be fed back between partners to improve the
overall sequencing quality.
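
To make the idea of objectively assessing improvement between releases
a little more concrete, here is a minimal sketch that compares releases
by contig N50. N50 is only one commonly used contiguity statistic and
is our assumption here, not a metric agreed by the partners. The
function names and the release tags in the usage example are
hypothetical.

```python
from typing import Dict, Iterable, List


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return ordered[-1]  # not reached for non-empty input


def compare_releases(releases: Dict[str, List[int]]) -> Dict[str, int]:
    # Map each release tag to its N50 so releases can be compared.
    return {tag: n50(lengths) for tag, lengths in releases.items()}


if __name__ == "__main__":
    # Made-up contig lengths, for illustration only.
    example = {
        "v0.1": [2, 2, 2, 3, 3, 4, 8, 8],
        "v0.2": [4, 6, 10, 12],
    }
    for tag, value in compare_releases(example).items():
        print(tag, value)
```

A single number like this would not replace validation against the
BACs, but it gives a quick, repeatable check that each partner can run
on every release.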