[LINK] FW: FW: FW: Australia's online history 'facing extinction'

Paul Koerbin pkoerbin at nla.gov.au
Fri May 8 09:19:19 AEST 2009

As Stil points out in respect to the archiving of his site, the PANDORA Archive is in no way concerned only with the .au domain. The selection process is focused on content relating to Australia and Australians, and the top-level domain is irrelevant. The "domain harvests" are a different matter. They start from seed lists, and the crawler is then scoped to follow links and collect only what falls within scope, which is defined by domains and other location filters. So the large broad crawls are largely based on the .au domain. In fact the NLA domain crawls also include non-.au content that is found during the course of the crawl and which is then checked against a DNS look-up to identify whether the host is located in Australia. This of course still does not address the .com content that is located on hosts outside Australia. Currently that content is dealt with through the selective PANDORA Archive process.
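
The scoping rule described above can be sketched roughly as follows. This is only an illustration, not the NLA's actual crawler configuration: the function name in_domain_harvest_scope and the host_in_australia callback are hypothetical, and the callback stands in for whatever look-up is used to decide where a host is located.

```python
from typing import Callable
from urllib.parse import urlsplit


def in_domain_harvest_scope(
    url: str,
    host_in_australia: Callable[[str], bool],
) -> bool:
    """Return True if a discovered URL would fall within the broad crawl.

    host_in_australia is a hypothetical callback that reports whether a
    host is located in Australia (an assumption, not a real library call).
    """
    # Extract the host name; URLs without one (e.g. mailto:) are out of scope.
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return False

    # Rule 1: anything under the .au domain is in scope.
    if host == "au" or host.endswith(".au"):
        return True

    # Rule 2: non-.au content is kept only if its host is in Australia.
    return host_in_australia(host)
```

With a stub look-up that always answers "no", a .au address is still in scope while an overseas-hosted .com address is not, which is the gap the selective PANDORA process fills.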

As for the other question about embedded content, we generally take the view that this forms part of the target website. It forms part of what the user experiences through their browser view. There are obviously lots of complications in dealing with this, both technical and legal. In the selective approach we can identify whether we need to seek additional permissions to include the embedded content; we can also set the harvest filters to include the content; and, more often than not, we can do the patch-up work to make sure we pick up such content. A lot of this content is not gathered well by crawler robots and requires additional manual work to collect in full. Because of the amount of work that can be involved, we often have to decide whether it is feasible to collect all such content. For example, if we are collecting a blog, we may not collect all the embedded YouTube videos. These are curation decisions we need to make.


From: link-bounces at mailman1.anu.edu.au [link-bounces at mailman1.anu.edu.au] On Behalf Of Stilgherrian [stil at stilgherrian.com]
Sent: Friday, 8 May 2009 7:06 AM
To: Link list
Subject: Re: [LINK] FW: FW: Australia's online history 'facing extinction'

I'm stepping into this thread late, so I may have the wrong end of the
stick, but on these two points...

On 08/05/2009, at 12:47 AM, Michael Still wrote:
> - selecting just .au domains removes a large amount of Australian
> content. .au is a pain to register in, so many Australian sites
> use .com et al. Additionally, the fashion for vanity domains like
> nine.tv reduces the utility of this filtering technique further.
> - selecting by hosting fails for similar reasons. Many Australian sites
> are hosted overseas for financial reasons.

... my own website at http://stilgherrian.com is neither a .au domain
nor hosted in Australia, but IS being archived in PANDORA, with a
snapshot taken each January for the last couple of years.

But, chucking in another issue, there's things "in" my website which I
consider to be part of "my" content but which are embedded from other
seemingly unrelated domains such as video content at Viddler and Ustream
and JavaScript-driven recording of liveblogs at CoveritLive. How can /
does / should an archive handle this?


Stilgherrian http://stilgherrian.com/
Internet, IT and Media Consulting, Sydney, Australia
mobile +61 407 623 600
fax +61 2 9516 5630
Twitter: stilgherrian
Skype: stilgherrian
ABN 25 231 641 421
