[LINK] FW: FW: Australia's online history 'facing extinction'

Fri May 8 00:47:25 AEST 2009

Paul Koerbin wrote:
> In time we will do what we can to open access. Currently, we don't
> have the legal warrant to do so.

Why is a legal warrant required? Other archival and search sites manage
to provide access to this information without a complicated legal
framework. Is this a case of Australia lacking strong enough fair use
laws, or are you just being much more cautious than others?

> Crawling large amounts of content has its value, and we have
> obviously taken the opportunity to do so, but it comes with all sorts
> of problems, not the least of which is how to make it accessible.
> There are copyright problems and the fact that crawling a domain
> means we will no doubt have picked up all manner of material. To make
> that available, without knowing precisely what we have collected (in
> terms of the nature and specifics of the content) is also
> problematic.

I hope you're looking at some sort of automated filtering system
(although the consensus seems to be that those are flawed as well).
There are probably billions of unique documents on "the Australian web",
and hand evaluating them all is simply not possible.

> We would hope, in the first instance to make government
> material available; and perhaps in due course material we have
> permission for or may assume permission based on Creative Commons
> licenses etc. Ultimately we would make it all available in some
> manner or other, otherwise, as you say, there would be no purpose.
> Point is you can't wait around for all the problems to be resolved
> before you try and collect the material. The UK, for example, has
> never done a domain crawl and was pretty slow to get into selective
> archiving (relying on the NLA's PANDAS system to get them underway a
> few years back).

I agree that this sort of archival should be done. However, I still
think your approach is wrong:

 - selected archival is no better than collecting the Encyclopedia
Britannica and declaring that you have a complete archive of human
knowledge.

 - selecting just .au domains removes a large amount of Australian
content. .au is a pain to register in, so many Australian sites use .com
et al. Additionally, the fashion for vanity domains like nine.tv reduces
the utility of this filtering technique further.

 - selecting by hosting fails for similar reasons. Many Australian sites
are hosted overseas for financial reasons.

I think I favour some form of sitemaps-based crawl instead of a seeded
or discovery crawl. Australians could be encouraged to submit sitemaps
to you, and then those should be archived, regardless of location /
domain name. You could build trust metrics around this as well, where
once a complaint is made about archived content you track that back to
the sitemap submitter, and apply extra examination to other URLs
submitted by that person.
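
To make the trust idea a little more concrete, here is a rough sketch of
how complaints might be traced back to a submitter. Everything in it is
an assumption for illustration: the SitemapRegistry class, its method
names and the complaint threshold are hypothetical, and none of this
describes how PANDORA actually works.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SitemapRegistry:
    """Hypothetical record of who submitted which URLs, plus complaints."""

    # Assumed policy: this many complaints puts a submitter under review.
    complaint_threshold: int = 1
    url_to_submitter: dict = field(default_factory=dict)
    submitter_urls: dict = field(default_factory=lambda: defaultdict(set))
    complaints: dict = field(default_factory=lambda: defaultdict(int))

    def submit(self, submitter, urls):
        """Record the URLs from a submitted sitemap against the submitter."""
        for url in urls:
            # The first person to submit a URL stays its recorded owner.
            owner = self.url_to_submitter.setdefault(url, submitter)
            self.submitter_urls[owner].add(url)

    def complain(self, url):
        """Register a complaint; return the other URLs now needing review."""
        submitter = self.url_to_submitter.get(url)
        if submitter is None:
            return set()
        self.complaints[submitter] += 1
        if self.complaints[submitter] >= self.complaint_threshold:
            return self.submitter_urls[submitter] - {url}
        return set()

    def needs_review(self, url):
        """True if the URL's submitter has reached the complaint threshold."""
        submitter = self.url_to_submitter.get(url)
        if submitter is None:
            return False
        return self.complaints.get(submitter, 0) >= self.complaint_threshold
```

A complaint about one archived URL then flags everything else the same
person submitted, while other submitters' URLs are left alone. A real
system would also need to verify submitter identity somehow, which this
sketch ignores.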

Finally, I am not convinced your engineering approach is correct. Why
not partner with a large scale crawler like a search engine or the CSIRO
search project (whose name escapes me)?

I've asked all these questions in the past, and members of the PANDORA
project have consistently failed to address them. Perhaps that's just
because they're not interested in outside feedback on their pet project.

Mikal