[LINK] Guidelines for Digital Repositories, Canberra, 27 July

Karl Auer kauer at biplane.com.au
Fri Jul 21 15:07:02 AEST 2006


On Fri, 2006-07-21 at 14:29 +1000, Markus Buchhorn wrote:

> Presuming you mean digital preservation of the content, rather than
> the physical media

However, physical preservation of the medium on which your digital
archive is written is itself a HUGE topic. Paper, good, old-fashioned
paper, is STILL overall the most robust medium known to humankind short
of chiselled rocks (which have a bandwidth even worse than paper). Until
we come up with an incorruptible (if not unbreakable) storage medium, we
will be spending a lot of time recopying.

Film, mag tape, CD, DVD, MO (magneto-optical) - they all have serious
longevity problems. And many have bulk problems: the digital version of
a feature film can occupy many times the space of the physical film.
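As a back-of-envelope only (the scan settings below are my assumptions,
not anything from this thread), Python makes the arithmetic easy:

frame_bytes = 4096 * 3112 * 3            # one 4K full-aperture frame, 8-bit RGB: ~38 MB
total = frame_bytes * 24 * 2 * 60 * 60   # 24 frames/sec for two hours
print(total / 1e12)                      # roughly 6.6 terabytes, uncompressed

And that is before you store the multiple copies argued for below.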

> There's also the issue that basically anything that goes through an
> algorithm of any kind has to also preserve the algorithm itself -
> which can be fun :-/ (can get into issues of preserving languages,
> compilers and operating systems).

Preserving mathematical languages, maybe; there is no real issue in
preserving human languages, computer languages, compilers or operating
systems. But don't forget reader technology. Measured against the tide
of time, the few years it may take to recreate a reader are negligible -
yet recreating one becomes extremely difficult, verging on impossible,
if the format is compressed but not documented. For obvious reasons the
documentation needs to be on a DIFFERENT, simpler medium to the data.
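By way of illustration only - the filenames and wording are invented,
and an uncompressed text file merely gestures at "a different, simpler
medium" - here is a Python sketch of writing the format notes as plain
7-bit ASCII alongside the compressed payload:

import gzip
from pathlib import Path

def archive_with_format_note(data: bytes, stem: str) -> None:
    """Write a compressed payload plus an uncompressed ASCII format note."""
    Path(stem + ".gz").write_bytes(gzip.compress(data))
    note = ("FORMAT NOTE - plain 7-bit ASCII, deliberately uncompressed\n"
            "Payload     : " + stem + ".gz\n"
            "Compression : RFC 1952 gzip (DEFLATE)\n"
            "Content     : UTF-8 encoded text\n")
    Path(stem + ".FORMAT.txt").write_text(note, encoding="ascii")

archive_with_format_note(b"Guidelines for digital repositories", "guidelines")

The note itself uses no compression and the simplest encoding going, so
a future reader can bootstrap from it.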

Speaking to ourselves over a thousand years means that any data
reconstructed/restored will be using languages (in the broadest sense)
long since dead. The restoration of the data to its original form is the
least of the difficulties; comprehension of the result will be the
biggy.

> Regarding bit-rot, you can avoid that by reducing your compression
> ratio a bit, by including full error-correction information (basically
> checksums).

Um, full error-correction data is VERY expensive if the data is
voluminous - checksums only detect corruption, and the redundancy needed
to actually correct it grows with the data. So expensive that it is
almost always cheaper to compress the hell out of stuff, store it
multiple times, protect it well, then keep recopying it and checking it
against itself. That is basically what biological systems do.
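A minimal sketch of that recopy-and-check cycle, assuming three replicas
on disk (the paths, replica count and repair policy are my inventions,
not anybody's real archive):

import hashlib
from collections import Counter
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def scrub(replicas: list[Path]) -> None:
    """Compare the copies; rewrite any replica that disagrees with the majority."""
    digests = [sha256(p) for p in replicas]
    majority, votes = Counter(digests).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority - cannot tell which copy is good")
    good = replicas[digests.index(majority)].read_bytes()
    for path, digest in zip(replicas, digests):
        if digest != majority:
            path.write_bytes(good)    # recopy from a known-good replica

# e.g. scrub([Path("copy1.bin"), Path("copy2.bin"), Path("copy3.bin")]),
# run on a schedule, with each copy on separate media.

With three copies a single bad one is outvoted, and the scheduled
recopying is what keeps the media themselves from rotting out from
under you.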

Oh, and let's not forget indexing this stuff. Just the indexes will be
vast quantities of data, with all the above problems. And without
indexes, the researchers of the future (if they are not standing beside
a pile of radioactive rubble) will be looking at the mother of all
midden heaps.

Personally, I think most of this is hubris. We are many; the future will
certainly find enough of our bones to work with.

Regards, K.

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Karl Auer (kauer at biplane.com.au)                   +61-2-64957160 (h)
http://www.biplane.com.au/~kauer/                  +61-428-957160 (mob)
