[LINK] As Go Document Formats, So Goes Video

Rick Welykochy rick at praxis.com.au
Fri Jan 4 00:52:29 AEDT 2008


Alastair Rankine wrote:

> steve jenkin wrote:
>>
>> I've never understood why archival documents aren't stored in "Lowest
>> Common Denominator" form.
>> I.e. ASCII lines until XML+Dublin Core (or whatever equiv) is generally
[SCHNIPPE]
> 
> I quite agree, if they can't use our character set in all its 7-bit 
> glory, all those foreigners can just bloody well learn English anyway.
[SCHNIPPE]
> </sarcasm>

Sarcasm accepted. Good points raised.
What comes to mind is the Voyager Golden Record.

<http://en.wikipedia.org/wiki/Voyager_Golden_Record>

It was sent out into deep space on the Voyager 1 and 2 probes.
The record is encoded in a format that should be decodable
by any intelligent species.

A feature of the Golden Record is that it starts from the simple
and proceeds to the complex.

(*) Understand binary as found on the record's cover, which repeats
    diagrams from the Pioneer plaque
    <http://en.wikipedia.org/wiki/Pioneer_plaque>

(*) Read some analogue raster lines from the disk.

(*) Get a circle to display from the data on the disk.

(*) etc. etc.

We are facing a similar but easier problem for digital archiving.

All approaches to digital archiving that I have seen assume that a
particular encoding is used in a specific language with a known
format to archive content.

Using the Voyager technique, with much credit to Carl Sagan,
we can start from as basic a definition as possible and develop
the complexity as required.

Such a technique allows the archiver to specify as comprehensive
an encoding format (physical media) and as many and varied content
types (logical data) as occur in the archive.

The archiver can commit a proprietary RealMedia/12.5 format audio
stream to the archive, provided that the encoding format
and content type are defined in the DEFINITIONS for the archive
for RealMedia/12.5. If the definition of the proprietary format
cannot be obtained, convert it to a format that can be defined.

The principle is simple. Use the DEFINITIONS section of the
archive to define how the archive is specified, encoded, stored
and formatted. The archive itself then becomes instances of
content as simple or as complex as required.
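
A minimal sketch of that principle, in Python. The class name
Archive, the methods define() and commit(), and the format labels
are all made up for illustration. A real definition would be far
more than a one-line string.

```python
from dataclasses import dataclass, field


@dataclass
class Archive:
    # DEFINITIONS: format name -> human-readable specification text.
    definitions: dict = field(default_factory=dict)
    # ARCHIVE: list of (format name, payload bytes) instances.
    items: list = field(default_factory=list)

    def define(self, fmt: str, spec: str) -> None:
        """Record the specification of a format in DEFINITIONS."""
        self.definitions[fmt] = spec

    def commit(self, fmt: str, payload: bytes) -> None:
        """Store content only if its format has already been defined."""
        if fmt not in self.definitions:
            raise KeyError(f"format {fmt!r} is not in DEFINITIONS")
        self.items.append((fmt, payload))


archive = Archive()
archive.define("text/plain; charset=us-ascii",
               "7-bit ASCII, one character per octet")
archive.commit("text/plain; charset=us-ascii", b"Hello, future reader.\n")

try:
    # No definition has been recorded for the proprietary format,
    # so the commit is refused.
    archive.commit("RealMedia/12.5", b"placeholder payload")
    refused = False
except KeyError:
    refused = True
```

The refusal is the point: content in an undefined format either
gets its definition added first or is converted to a format that
is already defined.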

DEFINITIONS

(*) Bootstrap the definitions with hard copy in English ... or
    French ... or in whatever language(s) you decide.
(*) Define the binary encoding, i.e. 8-bit octets.
(*) Define Unicode and its 7-bit ASCII subset.
(*) List the languages represented in the texts.
(*) Define the media formats: magnetic disc, optical ...
(*) Define the content formats: plain text, markup, audio, video ...
(*) Provide the definitions using Unicode in as many languages
    as required, the more the better.
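
The ordering of the list above matters: each definition should lean
only on what has already been defined. A rough Python sketch of
checking that, where the entry names and the "relies on" links are
my own guesses for illustration, not a worked-out scheme:

```python
# Each hypothetical entry: (name, names it relies on). Order matters.
DEFINITIONS = [
    ("hard copy preamble", []),
    ("8-bit octets", ["hard copy preamble"]),
    ("Unicode", ["8-bit octets"]),
    ("7-bit ASCII", ["Unicode"]),
    ("plain text", ["Unicode"]),
    ("markup", ["plain text"]),
]


def first_unbootstrapped(definitions):
    """Return the first entry relying on something not yet defined, or None."""
    seen = set()
    for name, relies_on in definitions:
        if any(dep not in seen for dep in relies_on):
            return name
        seen.add(name)
    return None


# The list above proceeds from the simple to the complex.
ok = first_unbootstrapped(DEFINITIONS)

# Defining markup before plain text breaks the bootstrap.
broken = first_unbootstrapped([("markup", ["plain text"]), ("plain text", [])])

# ASCII text has identical bytes under UTF-8, one common Unicode encoding.
ascii_is_subset = "RFC 2822".encode("ascii") == "RFC 2822".encode("utf-8")
```

Here ok is None because every entry relies only on earlier ones,
while broken names the entry that was defined too early.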


ARCHIVE

(*) Archive the content per the DEFINITIONS



If you are archiving a pile of ASCII RFCs written in English
to a CD, the DEFINITIONS task is relatively simple.

OTOH, archiving all of the ABC's text, audio and video
requires meticulous care in the DEFINITIONS phase so that
the archive itself is straightforward and comprehensible.

Nothing precludes the archiving of any format or any presentation
layer, provided that it is DEFINED. You have an encyclopedia
written in XHTML + CSS? Fine. Archive it, but first DEFINE
XHTML + CSS. May I be so bold as to say that one can also DEFINE
an XHTML + CSS + JavaScript document?


THE CATCH

Yes, there is always a catch. The one that comes to mind is archiving
executable content. This has already been discussed on LINK and
gives me a headache every time I think about it. 


cheers
rickw



-- 
_________________________________
Rick Welykochy || Praxis Services

Say what you will about the miracle of unquestioning faith, I consider 
a capacity for it terrifying and absolutely vile.
     -- Howard W Campbell in Kurt Vonnegut Jr's "Mother Night"


