Sunday, 5 February 2012

Time to update your MediaWiki?

Newer versions of MediaWiki get better and better (see the release notes). I'm constantly impressed with this project. It produces slick**, usable, and 'web-scale' (yes, I said web-scale) software for free, and it runs perhaps the most important resource on the internet.
        ** Easy to install, configurable, language independent, easy to upgrade, easy to extend, and extend, ...

Newer versions are not only easy to install and constantly improving in usability and functionality, but they also fix important security issues (more info, more reasons). There are only two or three real reasons not to update (and they are all arguable):

  • You forked the codebase instead of using (or developing) the extension mechanism (silly you!), or
  • You rely on functionality provided by an extension that is incompatible with newer versions of MW (poor you!), or
  • You work in a production environment with an old version of PHP, and upgrading PHP is likely to be problematic.


However, (politely?) asking people to update in no uncertain terms can result in being slapped down...

Subject: Re: Dear wiki owners

Date: Thu, 14 Apr 2011 12:20:57 +0200
Cc: Everyone I was pestering
To: Dan Bolser 

Dear Dr. Bolser,

I hope you must be joking.

Although we are happy to collaborate to the best of our abilities and resources with any project that might benefit from it, I object to the arrogant tone of your message. Our XXXwiki web site is functional and we do have to juggle many priorities in our development work. If we can upgrade our system without breaking it, we will consider doing the upgrade quickly.

In the future, I do suggest that you approach people you don't know personally in a more diplomatic way. It might be just as helpful for the people you know personally too.

In the meanwhile, do whatever you please with the BioWiki catalogue.

Sincerely,

Xxxx Xxxxxx



On 14 Apr 2011, at 00:41, Dan Bolser wrote:

> Hi all,
> 
> Just a quick weekly reminder that your wikis, XXXwiki, XxxxxxxWiki,
> XXXX wiki and Xxxxxxxxxxx, are woefully out of date and/or chronically
> and needlessly broken (see below).
> 
> I'm happy to take good excuses or to remove your sites from the
> BioWiki catalogue. Until then, I'll probably keep sending weekly
> reminders.
> 
> 
> Faithfully,
> Dan Bolser.



Yeah... I guess I overdid it.

DNA as a digital data storage medium

This was an idea to use DNA to store data, since it's now very cheap to sequence DNA. In the same way people use magnetic tape for long term storage, you could back up to DNA...

The idea was developed with the help of Utopiah, Kanzure, and others in irc://irc.freenode.net/##hplusroadmap

and was added to the awesome (but sadly now historical) Seedea project, here:
http://fabien.benetou.fr/innovativ.it/www/HistoricalArchives/Seedea/Oimp/MetaDNA

where details were thrashed out by various people.

This idea depends on DNA printing technology.





EDIT: Here is the text from the Seedea page (around 2009)



metaDNA

General discussion

use the dedicated page.

Goal

To encode data as DNA, allowing the storage of vast quantities of data 'in a cupboard'. Advantage: we get massive data transfer rates by shipping DNA (e.g. using UPS).

Abstract

New advances in DNA sequencing technology promise to revolutionize the fields of biology and health care. The human genome project, initiated in 1990, took 13 years to complete at a cost of approximately $3 Bn [cite HGP]. Today, the complete sequence of an individual costs 100,000 times less, at approximately $30,000, and takes approximately one day to obtain [cite 1000 GP]. The acceleration in the progress of DNA sequencing technology shows no sign of slowing: it is estimated that capacity increases by a factor of 100 every year [cite Richard Durbin, personal communication].
However, these phenomenal technological advances in molecular biology have created a new bottleneck in the scientific discovery pipeline: the cost of data storage.
DNA can store 10^21 bits per gram (http://www.sciencemag.org/cgi/reprint/296/5567/478.pdf). This compares favorably with conventional storage at around 10^14 bits per gram (Blu-ray: 200 GB in a 16 g disc), an improvement of many orders of magnitude (a quick sanity check of these numbers is sketched below). How can we effectively utilize this awesome storage capacity?
Here we propose DNA as an alternative storage medium for the long-term archival of data. We present a DNA encoding algorithm that is optimized for data recovery, outline a novel design for a microfluidic DNA sequencing chip, and describe a DNA protectant that will allow long-term storage of DNA in ambient conditions.
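
This is just a back-of-the-envelope check of the figures quoted above, nothing more; note that the 200 GB / 16 g Blu-ray parenthetical on its own works out to roughly 10^11 bits per gram, so the exact improvement factor depends on what you count as 'conventional storage':

```python
# Back-of-the-envelope check of the storage densities quoted above.
# All figures come from the text; nothing here is measured.
dna_bits_per_gram = 1e21                   # figure quoted from the Science paper

bluray_bits = 200e9 * 8                    # 200 GB expressed in bits
bluray_grams = 16                          # approximate mass of a disc
bluray_bits_per_gram = bluray_bits / bluray_grams

print(f"Blu-ray: {bluray_bits_per_gram:.1e} bits per gram")                   # ~1.0e11
print(f"DNA / Blu-ray: {dna_bits_per_gram / bluray_bits_per_gram:.1e} fold")  # ~1.0e10
```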

Problem abstraction

The problem with 'next generation' DNA sequencing (nextGen) is that it is too good. The technologies are generating too much data too quickly. Simply put, we don't have enough hard disks to keep pace with the data storage requirements.
How do you cope with this situation?

In situ example

  1. Company A gathers a sample S from a living organism
  2. Company A studies it and produces a result R: a very large amount of data, including specific DNA samples (original and modified)
  3. Company A works for a client K that requires additional work on R (and eventually S) by company B
  4. R and S need to be shipped from A to B as fast as possible
  5. We encode R and S into a package P using our specific method and ship it to B

Proposed solutions

Design a 'DNA encoding' that maximizes ease of reading (a minimal sketch is given below):
  1. the DNA encoding - lots of checksums and handling of repeat regions
  2. the 'DNA protectant' molecule that we use to store data at room temperature and pressure (RTP)
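
For concreteness, here is a minimal sketch of the kind of encoding meant in point 1, assuming a naive 2-bits-per-base mapping, an arbitrary fixed block size, and CRC32 checksums (all placeholder choices; a real design would also need to avoid long homopolymer runs and other hard-to-sequence motifs):

```python
import zlib

# Hypothetical 2-bits-per-base mapping; a real scheme would be chosen to
# avoid long runs of the same base and other problematic motifs.
BASE_FOR_BITS = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
BITS_FOR_BASE = {base: bits for bits, base in BASE_FOR_BITS.items()}

BLOCK_SIZE = 32  # bytes per addressed block (arbitrary choice for this sketch)

def bytes_to_bases(data: bytes) -> str:
    """Encode raw bytes as a string of A/C/G/T, two bits per base."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE_FOR_BITS[(byte >> shift) & 0b11])
    return "".join(bases)

def bases_to_bytes(bases: str) -> bytes:
    """Invert bytes_to_bases."""
    out = bytearray()
    for i in range(0, len(bases), 4):
        byte = 0
        for base in bases[i:i + 4]:
            byte = (byte << 2) | BITS_FOR_BASE[base]
        out.append(byte)
    return bytes(out)

def encode(data: bytes) -> list[str]:
    """Split data into blocks, each carrying an offset header and a CRC32
    checksum so fragments can be re-ordered and verified after sequencing."""
    fragments = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        header = offset.to_bytes(4, "big")
        crc = zlib.crc32(header + block).to_bytes(4, "big")
        fragments.append(bytes_to_bases(header + block + crc))
    return fragments

if __name__ == "__main__":
    fragments = encode(b"Hello, DNA archival storage!")
    print(fragments[0])
    # Round-trip check on the first fragment:
    raw = bases_to_bytes(fragments[0])
    header, block, crc = raw[:4], raw[4:-4], raw[-4:]
    assert zlib.crc32(header + block).to_bytes(4, "big") == crc
```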

Complete process

  1. ?
  2. design a microfluidic DNA sequencing chamber
  3. ?

Opportunity

Microfluidics is getting very cheap, so it's easy to design and print a 'chip' that will control the flow of A, T, C, and G into a reaction chamber.

Business model

  • Advertise cost optimization during difficult times of 'pipe clogging' and high energy costs (logistics, network congestion)

Market and trends

  • Familybuilder DNA on Sale: Familybuilder Introduces Low Cost Testing
  • AsperBio How to send DNA samples?

Alternatives

  • Efficient data compression. One human is much like another at the genetic level. Perhaps we can simply compress the data produced by the personal genomics initiative (for example) against a 'reference' genome (see the toy sketch below).
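
As a toy illustration of the reference-based idea, storing only the positions where a sample differs from the reference (this assumes two already-aligned, equal-length sequences, which real data certainly isn't):

```python
def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """Return (position, sample_base) for every mismatch; assumes the
    sequences are pre-aligned and of equal length (a toy simplification)."""
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]

def reconstruct(reference: str, variants: list[tuple[int, str]]) -> str:
    """Rebuild the sample sequence from the reference plus the variant list."""
    seq = list(reference)
    for pos, base in variants:
        seq[pos] = base
    return "".join(seq)

reference = "ACGTACGTACGTACGT"
sample    = "ACGTACGAACGTACGG"
variants = diff_against_reference(reference, sample)
print(variants)                                    # [(7, 'A'), (15, 'G')]
assert reconstruct(reference, variants) == sample  # lossless round trip
```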

To explore

  • "retarded polymerase"

References

  • article A from X written by Y on date Z
  • book A from X written by Y on date Z
  • ...

Important related patents

Contacts that might be interested

  • Paola (positive about the idea but doubtful about the ethical or moral slippery slope)
  • Laurent (feedback yet to ask)
  • Kenza (feedback yet to ask)
  • Contacts from KAO (genetic engineering in the BayArea)

Marketing

Project name

  • metaDNA doesn't really sum it up.
  • EnCodeMe
  • DNAStore
  • DNA bank
  • Molecular Storage
  • DNCode
  • DNAta
  • DNA backup
  • DNA Storage
  • DNA data
  • backDNA
  • DNAstics
  • DNA Logistics

Tag line

DNA is a fantastic storage medium. It has a track record of 4 billion years.
DNA, it'll store your ass off.

To explore

SEQwiki ontology integration



TBD...

I need to read, digest, and implement information from these sources:
* http://semantic-mediawiki.org/wiki/Help:Ontology_import
* http://semantic-mediawiki.org/wiki/Help:Import_vocabulary

Hmm... if smw.org is still down, use cached versions:
http://www.google.com/search?q=semantic+mediawiki+ontology

SEQwiki integration with NeuroLex


Topic 1) Sharing data between SMWs:

There are three broad ways to do it (if not more):
** If you want to cut to the chase, skip to method 3.

1) Export / import of pages.

The most reliable but least 'dynamic' way to share content between SMWs is to use (automated?) export / import of pages using standard MW functionality (http://meta.wikimedia.org/wiki/Help:Export).
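
For what it's worth, the export half is easy to script against standard MediaWiki; a rough sketch (the wiki URL and page titles below are placeholders):

```python
import urllib.parse
import urllib.request

# Placeholder values; substitute the real wiki and the pages to copy.
WIKI = "https://example.org/w"
PAGES = ["Service_Provider", "Sequencing_Platform"]

def export_pages(wiki_base: str, pages: list[str]) -> bytes:
    """Fetch the current revisions of the given pages as MediaWiki export XML,
    suitable for Special:Import (or importDump.php) on the other wiki."""
    form = urllib.parse.urlencode({
        "pages": "\n".join(pages),  # Special:Export takes newline-separated titles
        "curonly": "1",             # current revision only
    }).encode()
    url = f"{wiki_base}/index.php?title=Special:Export"
    with urllib.request.urlopen(url, data=form) as response:
        return response.read()

with open("export.xml", "wb") as handle:
    handle.write(export_pages(WIKI, PAGES))
```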

This isn't a great solution for several reasons:

  • By copying pages around, we require the wikis to synchronize the SMW structure precisely, including properties, templates, forms, etc. This isn't ideal, given the different focus of the two wikis, the effort required to define an agreed standard, and subsequent migration overhead required to meet the standard.
  • Most importantly, synchronization (merging) of edits on the two different wikis will cause problems unless carefully managed. We could code something to perform merging, or implement policy that keeps edits separate. Both would take time to develop. There may be existing tools for doing this, but I don't know any...



2) Nice but 'unsupported' tools.

There are a couple of solutions that look nice, but the underlying projects are in an uncertain state:

2.1) Distributed Semantic MediaWiki (DSMW)
http://www.mediawiki.org/wiki/Extension:DSMW

Looks impressive, but I can't even find the documentation any more! The purpose isn't (wasn't?) 100% identical to what we want, as it seems to be aimed at mirroring *all* content (or more precisely, all edits) between two SMW instances, rather than just a subset of pages or properties. It's possible it could be tailored to our needs, but I'm not sure how it would handle different user accounts on the two wikis, for example. I think it needs some time to investigate if it's a viable solution, and again, it probably requires us to harmonize data models.

2.2) Remote queries via the Exhibit extension
http://semantic-mediawiki.org/wiki/Help:Exhibit_format

This is a 'format' plugin for SMW's #ask query. It has a lot of nice functionality for the faceted display of query results (for example, http://seqanswers.com/wiki/Service_Provider, see the facets on the right).

This extension additionally has a simple mechanism for issuing #ask queries against a 'remote' SMW instance. This allows us to display (and potentially 'process') the results of a query on wiki X within wiki Y.

Unfortunately, the SMW format plugin project isn't being maintained, although the underlying 'exhibit' project is:
http://www.simile-widgets.org/exhibit/

However, the latter project isn't really relevant to our needs.

Perhaps we can just extract the 'remote' part of the code and create our own extension?


3) Using the External Data extension

This is the best solution I've found so far:
http://www.mediawiki.org/wiki/Extension:External_Data

Using this extension, we issue a query against the remote wiki (or a variety of other sources) and then display and/or annotate the results on the local wiki. This is dynamic, and flexible.

I've put a basic demonstration of a NeuroLex query on SEQwiki here:
http://seqanswers.com/wiki/NeuroLex_test

Configuring the query, the mapping between remote and local 'variables', and displaying and/or annotating the results is quite a painful process, but it does work (as demonstrated). To be clear, I've only implemented displaying the results above, not semantically annotating properties or creating 'objects'.

There are some clear and (relatively?) simple ways we can develop this extension to make it more useful and easier to deploy.

  • The mapping of remote and local 'variables' can be simplified (factored out).
  • The need for a separate annotation step could be removed, allowing us to 'replicate' remote properties and classes locally. This is a bit more complex, but doable.
  • Created objects should be viewable in Semantic Drilldown; this is a broader SIO/SD bug. Is it resolved by the native SIO support?



4) The semantic web solution?

It's annoying that SMW exports RDF and uses SPARQL (internally), but it can't function as a SPARQL endpoint out of the box and has no support for issuing queries against remote endpoints. I've pestered the developers about this a few times (I'm not offering to do any work, so I can't complain), and it seems like all of this IS possible by 'plumbing' together existing functionality. It just requires a few good programming hours.

If I were in a position to magically create a solution, I'd pick this one, as it not only solves our problem succinctly, but also has the widest possible benefit to the SMW community, which directly feeds back to us.
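
Just to illustrate what the 'remote endpoint' half would look like from the consumer side, here is a sketch using the SPARQLWrapper library; the endpoint URL, prefix, and property are entirely made up, since SMW doesn't expose such an endpoint out of the box:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint and property names; nothing here exists yet.
endpoint = SPARQLWrapper("https://neurolex.example.org/sparql")
endpoint.setQuery("""
    PREFIX property: <https://neurolex.example.org/property/>
    SELECT ?page ?label WHERE {
        ?page property:Category ?label .
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)

# Run the query and walk the standard SPARQL JSON result bindings.
results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["page"]["value"], row["label"]["value"])
```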

UPDATE:

I forgot to cover the various RDF/SPARQL enabled extensions that do this (please add comments and I'll work things in):


I got an email from Christian Becker mentioning this work:
http://www4.wiwiss.fu-berlin.de/bizer/ldif/

and previously I've seen the following work by Alfredas Chmieliauskas and Chris Davis:




5) Nice and supported tools:

SMW's Ask API is in its early stages, but it avoids the messy URLs involved in solution 3.
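
Even in its early state, the Ask API makes the remote query itself fairly painless; a rough sketch (the wiki URL, query conditions, and printout below are placeholders):

```python
import json
import urllib.parse
import urllib.request

# Placeholder wiki and query; adjust to the real remote SMW instance.
API = "https://example.org/w/api.php"
ask_query = "[[Category:Resource]]|?Category|limit=10"

params = urllib.parse.urlencode({
    "action": "ask",      # SMW's Ask API module
    "query": ask_query,    # conditions, printouts, and parameters joined by '|'
    "format": "json",
})
with urllib.request.urlopen(f"{API}?{params}") as response:
    data = json.load(response)

# Results are keyed by page name, with the requested printouts attached.
for page, result in data["query"]["results"].items():
    print(page, result["printouts"])
```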


From Jesse Wang:
        I believe the Wiki Object Model (WOM) can help here. It allows you to make a WOM API call, which forwards the query to the remote site (by WOM / Ask API / ...) and gets JSON data (or other supported formats) back. It requires both wikis to be running WOM, so maybe it isn't a universal solution.




SPECIFIC TODO
* Configure SIO from the returned query results.
* Think about how to implement the 'recreate' function vis-à-vis SIO.

New purpose

I'm going to start using this blog as a work log and archive of random ideas and projects. Mostly for my own reference, as I've got too much stuff spread around different computers, notebooks, brains, emails, wikis, etc. etc.

If I can use this to keep track of all the things I want to get done, I figure that's progress... right?

I also like the idea of a public archive of ideas, not that there aren't 1001 'ideagoras' out there...