Flawed AltaVista Internet Search Engine

Roger Clarke Roger.Clarke@anu.edu.au
Thu, 27 Mar 1997 18:05:29 +1100 (EST)


Forwarded from Red Rock Eater via Ooi Chuin Nee.

One of the significant things about this message is that a Comp Sci person
and two Info Sys people are involved in saying 'it's about time we had a
yarn with a few librarians about this' ...


Phil Agre (Mr rre himself) writes:
> [It seems to me that the whole keyword-based search engine paradigm on the
> Web collapsed back in the fall sometime.  At least that's when I stopped
> being able to find anything on the Web using Lycos, Alta Vista, etc unless
> I had an obviously unique set of words to search on, if then.  Now that
> the Web has outgrown indexing and search methods that librarians rejected
> decades ago, maybe the time will come to get some serious ideas about the
> subject.  We may even have to listen to the librarians' opinions.  Now,
> some people are out there trying to catalog the Web using library cataloging
> principles.  But (as the librarians well know) that doesn't work because
> URLs are too impermanent; I've given up trying to cooperate with people who
> think they're cataloging Web-based periodicals such as The Network Observer.
> We need some different metaphors for cataloging and for the Web.  Once we
> get over this IPO-driven mania about "push" technology, maybe we can get
> back to business and rethink what it means to order information in a totally
> decentralized environment.]
>
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> This message was forwarded through the Red Rock Eater News Service (RRE).
> Send any replies to the original author, listed in the From: field below.
> You are welcome to send the message along to others but please do not use
> the "redirect" command.  For information on RRE, including instructions
> for (un)subscribing, send an empty message to  rre-help@weber.ucsd.edu
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>
> Date: Wed, 26 Mar 1997 08:20:28 -0500
> From: John Pike <johnpike@fas.org>
> To: pagre@weber.ucsd.edu
> Subject: Flawed AltaVista Internet Search Engine
>
> "As web-surfing enthusiasts already know, AltaVista is a program
> that will search the entire Web..." was the way Amy Schwartz
> introduced a review of the new book "The AltaVista Search
> Revolution" on the oped page of the Washington Post ["The
> Information Laundromat" 22 March 1997].
> http://discuss.washingtonpost.com/wp-srv/WPlate/1997-03/22/015L-032297-idx.html
>
> While AltaVista is indeed an estimable implementation, most
> web.surfers will be astonished to learn that, contrary to this
> conventional wisdom, AltaVista indexes only a small, flawed,
> arbitrary and not even random sample of what is on the web today.
>
> Estimates of the total content of the web are of necessity
> speculative, but run as high as 150 million pages. AltaVista
> claims  < http://altavista.digital.com/ > to be "the largest Web
> index: 31 million pages found on 476,000 servers." So where are
> the missing pages?? [or, as Ronald Reagan asked, "where is
> the rest of me??"].
>
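
The two figures quoted above imply a rough coverage ratio. A minimal
Python sketch of that arithmetic follows. The 150 million total is the
speculative upper estimate from the message, so the result is only a
rough lower bound on coverage, not a measurement. The function name is
hypothetical and used only for illustration.

```python
def coverage_fraction(indexed_pages: int, estimated_total_pages: int) -> float:
    """Return the share of the estimated web that an index covers."""
    if estimated_total_pages <= 0:
        raise ValueError("estimated_total_pages must be positive")
    return indexed_pages / estimated_total_pages


# Figures quoted in the message: 31 million indexed pages, and a
# speculative upper estimate of 150 million pages on the whole web.
share = coverage_fraction(31_000_000, 150_000_000)
print(f"{share:.1%}")  # prints 20.7%
```

If the true size of the web is smaller than 150 million pages, the
covered share is correspondingly larger.
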
> There are many reasons a web page might not show up in the
> AltaVista index. Some parts of some sites are hidden from public
> view with the Robots Exclusion Protocol, which tells search
> engines not to index certain pages. Other types of content, such
> as the Adobe Portable Document Format [PDF], do not currently
> support indexing. Some large sites dynamically generate
> their content, rendering it invisible to search engines. And other
> sites have security access controls which may [or may not!!!
> but that is another story.... ] preclude indexing their pages.
>
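
To illustrate the first of those reasons, here is a minimal sketch of
how a well-behaved crawler consults Robots Exclusion Protocol rules,
using the robotparser module from the Python standard library. The
robots.txt content and the example.org URLs are invented for
illustration. "Scooter" is the crawler name given in the AltaVista FAQ
quoted further on; nothing here is taken from how AltaVista actually
implements it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every crawler is asked to stay
# out of the /private/ part of the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler checks each URL before fetching it.
public_ok = parser.can_fetch("Scooter", "http://www.example.org/index.html")
private_ok = parser.can_fetch("Scooter", "http://www.example.org/private/report.html")

print(public_ok)   # True: not covered by any Disallow rule
print(private_ok)  # False: falls under Disallow: /private/
```

Pages excluded this way are left out of the index by design, which is
a different matter from pages the crawler was free to fetch and did not.
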
> But surely this does not explain why the estimable AltaVista
> indexes only 20% of the web.
>
> The AltaVista FAQ sez:
> http://altavista.digital.com/cgi-bin/query?pg=tmpl&v=faq.html
>
> >How do I submit my site to AltaVista?
> >Use our Add URL feature, found at the bottom of every
> >page. Simply type in the main URL for your site. You can
> >submit several URLs, but it is considered bad taste to
> >manually submit your entire site: just let Scooter do this for you.
>
> This certainly creates the impression that once AltaVista has even
> one URL from a site, it will automatically [in the fullness of time,
> but that is another story as well....] include the entire site in
> its widely used index. Certainly, this claim is the reason that
> AltaVista is so widely relied upon, and the reason that most
> web.users assume that "if it ain't in AltaVista, it ain't online."
>
> I webmaster the Federation of American Scientists site,
>
>                     http://www.fas.org/
>
> which is a medium-sized website with some 6,000 pages and about 1/2 Gig
> online. Recently I noticed that the AltaVista search engine seemed to only
> index about 600 of our pages. I thought that this was rather odd, since I had
> long had the impression that AltaVista indexed pretty much everything, or at
> least made a good-faith best effort to do so. I asked them about this, and
> this is what I got back:
>
> >Date: Tue, 18 Mar 1997 09:08:39 -0800 (PST)
> >From: Alta Vista Support
> >To: johnpike
> >Subject: Re: AltaVista not indexing www.fas.org
> >That is probably a good estimate...We have 600 pages from you indexed in
> >the system. You will probably not see much more than that for any one
> >domain. Geocities has 300...and they have 300,000 members.
>
> I confess that I was rather horrified as I contemplated the implications of
> this [which can be verified by searching AltaVista on < host:geocities.com >
> ... try this trick on your own domain and see what happens!!!].
>
> For a medium to large site, such as ours, it means that they are only
> indexing some arbitrarily selected subset of our total content. Thus
> corporations, universities, or most other really content-rich sites will be
> poorly represented in their index.
>
> It also means that for smaller entities that do not have their own domain,
> their content will also not be indexed. As in, are the reported 300,000 users
> of Geocities aware that the fact that their pages are hosted @
> www.geocities.com [or the larger number of folks who are hosted @
> members.aol.com] means that they are effectively invisible to AltaVista, one
> of the most widely used and admired search engines???
>
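
The per-domain numbers in the message can be put side by side. This is
a small sketch under stated assumptions: the page counts are the
approximate ones quoted above ("about 600" of "some 6,000"), and the
Geocities figure assumes, generously, that each of the 300 indexed
pages belongs to a different member. The function name is hypothetical.

```python
def indexed_share(indexed: int, total: int) -> float:
    """Return indexed / total as a fraction, rejecting empty totals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return indexed / total


# fas.org: about 600 pages indexed out of some 6,000 online.
fas = indexed_share(600, 6_000)

# geocities.com: 300 pages indexed for a reported 300,000 members.
# Assumes, at best, one indexed page per distinct member.
geocities = indexed_share(300, 300_000)

print(f"fas.org pages indexed: {fas:.0%}")                     # 10%
print(f"Geocities members reachable, at most: {geocities:.1%}")  # 0.1%
```
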
> What this seems to mean is that medium-sized sites of a few hundred
> pages are going to show up nicely in AltaVista, but larger and smaller
> implementations will be nearly invisible, which is a rather odd way of doing
> things. I mean, this is sorta like buying a map that shows some arbitrary
> number of roads but doesn't have any of the main interstates, or a phone
> book that only has even-numbered phone numbers, or something.
>
> I confess that I was not previously aware of this practice of AltaVista,
> which has certainly not been previously reported anywhere, and is certainly @
> variance with their apparent claims that if you supply them with one URL
> from your site they will spontaneously include the rest of your site in
> their index.
>
> This is not to trash AltaVista, which at least has an implementation that
> enables one to determine just how many of your pages are in their index [I
> can't seem to make the other engines do this neat trick]. But it is to say
> that anyone whose online presence has been predicated on their entire site
> [large or small] showing up in AltaVista had better think again. And that
> anyone trying to search the 'entire' web [as opposed to some arbitrary
> sample thereof] had best look somewhere other than AltaVista.
>
> Frankly, I think this is a more significant story than the widely
> reported "flawed Pentium chip" or "browser security flaws" stories.
> These highly visible episodes affected only a small number of
> users, or were more in the nature of theoretical problems. But
> AltaVista claims to be used nearly 30 million times a day,
> so this "undocumented feature" of AltaVista affects nearly
> everyone who uses the web [doesn't everyone???].
>
> As someone who uses AltaVista many times a day, and whose
> webpresence strategy had been predicated on "If I build it, they will come,
> 'cause they will find it in AltaVista" this has really come as a shock to me,
> and I imagine that it would come as a shock to many others as well. I
> mean, it is one thing to admit that regenerating a web.wide index takes a
> long time, and that your index goes stale after a month or so, but it is
> another to admit that you are just not even trying to index large sites, or
> small sites that are appended to an ISP's domain, and I am pretty
> astounded.
>
> To keep track of this issue, Melee's Indexing Coverage Analysis (MICA)
>
> 	http://www.melee.com/mica/index.html
>
> examines the relative page coverage for a select group of search
> engines. Each week, Melee Productions will retest the engines
> on the list and publish an update to the MICA Report. They
> will be happy to test any publicly accessible search engine that
> supports date-range and host/domain constraints, and purports to
> index at least one fifth of the "web".
>
> Stay tooned for further developments!!!
>
>
>
> @@@@@@@@@@@@@@@@@@@@@@@@@@@@@
>
> John Pike
> Director, Space Policy Project
> Federation of American Scientists
> 307 Massachusetts Ave. NE
> Washington, DC 20002
> V 202-675-1023,   F 202-675-1024,  http://www.fas.org/spp/
>
>
>
-- End forwarded message --

--
OOI Chuin Nee, ETC Electronic Trading Concepts
Any opinions expressed above are solely mine.
Email: C.Ooi@etc.com.au, Chuin-Nee.Ooi@etc.com.au
Tel: +61 2 9299 4755 Fax: +61 2 9299 4544 Mob: 041 926 8594
The ETC WWW site is at: http://www.etc.com.au


Roger Clarke              http://www.anu.edu.au/people/Roger.Clarke/
                                        http://www.etc.com.au/Xamax/
Xamax Consultancy Pty Ltd, 78 Sidaway St, Chapman ACT 2611 AUSTRALIA
Tel: +61 6 288 1472, and 288 6916     mailto:Roger.Clarke@anu.edu.au

Visiting Fellow,   Faculty of Engineering and Information Technology
The Australian National University     Canberra  ACT  0200 AUSTRALIA
Information Sciences Building Room 211        Tel:  +61  6  249 3666