[LINK] A new metadata tag proposal
Dr. Bob Jansen
Bob.Jansen@turtlelane.com.au
Tue, 27 Nov 2001 16:18:46 +1100
>Hi there,
>
>Here's an interesting proposal. Nicholas Carroll of Hastings Research
>has posted a paper proposing a new "nonwords" metadata tag. See
>'The Anti-Thesaurus: A Proposal For Improving Internet Search While
>Reducing Unnecessary Traffic Loads', which explains how this may benefit
>everyone: http://www.hastingsresearch.com/net/06-anti-thesaurus.shtml
>
>Cheers, Darelyn
>Stephen Loosley
>stephen@schools.net.au
Whilst working on information retrieval systems such as Status at ICL,
and later doing R&D at CSIRO into retrieval technologies, I learned
very quickly that stop words, such as those in the proposal mentioned,
only work effectively in well-defined contexts. For example, we built a
prototype of the Hansard system for the Federal Parliamentary Library
and concluded that we could only have one stop word, the word 'a',
since all other potential stop words could also be used as acronyms etc.
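To illustrate the acronym problem, here is a minimal Python sketch. The stop list and the query are made up for this example, and the case-insensitive matching is an assumption about how a simple retrieval system might behave; it is not the Hansard prototype's actual logic.

```python
# Hypothetical stop list, for illustration only.
STOP_WORDS = {"a", "it", "who", "us", "or", "in"}


def strip_stop_words(query, stop_words=STOP_WORDS):
    """Drop stop words from a query, matching case-insensitively."""
    return [w for w in query.split() if w.lower() not in stop_words]


def lost_acronyms(query, stop_words=STOP_WORDS):
    """Return upper-case tokens (likely acronyms) that the stop list removes."""
    return [
        w for w in query.split()
        if w.isupper() and len(w) > 1 and w.lower() in stop_words
    ]


query = "WHO funding in the US"
print(strip_stop_words(query))  # the acronyms WHO and US are silently dropped
print(lost_acronyms(query))     # the tokens a searcher probably cared about
```

In a collection full of acronyms, almost any short common word risks this kind of collision, which is consistent with ending up with a stop list containing only 'a'.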
The other interesting thing is that the words we might pick as stop
words (or their alternative, go words) are not always what we might
expect. The word distribution in most document spaces follows a bell
curve. From a retrieval aspect, the words that discriminate most are
those on the left- and right-hand sides of the curve, not those in the
middle. Thus the stop words should come from the middle set, i.e. those
words that occur frequently, since they are less discriminating.
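As a rough sketch of choosing stop words from the collection itself rather than from intuition, the Python below counts the fraction of documents each word appears in and flags the words above a threshold. The sample documents, the function names and the threshold values are all assumptions made for this example.

```python
from collections import Counter


def document_frequencies(docs):
    """Map each word to the fraction of documents that contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return {word: n / len(docs) for word, n in counts.items()}


def stop_word_candidates(docs, min_fraction=0.8):
    """Words found in at least min_fraction of the documents, sorted."""
    freqs = document_frequencies(docs)
    return sorted(w for w, f in freqs.items() if f >= min_fraction)


# Hypothetical mini-collection, loosely in the style of parliamentary text.
docs = [
    "the bill was read a second time",
    "the minister tabled the report",
    "the committee heard the evidence",
    "a question on notice to the minister",
]
print(stop_word_candidates(docs))                    # strict threshold
print(stop_word_candidates(docs, min_fraction=0.5))  # looser threshold
```

With the looser threshold, a content word such as "minister" is flagged alongside "a" and "the", which shows how frequency-based candidates are not always the words we might expect.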
There are some interesting books on this, for example, Information
Retrieval by van Rijsbergen.
bobj
Dr. Bob Jansen
Managing Director,
Turtle Lane Studios Pty Ltd,
Physical: 1 Turtle Lane, Erskineville, NSW 2043, AUSTRALIA
Postal: PO Box 26, Erskineville, NSW 2043, AUSTRALIA
Phone & Fax: +61-2-95 19 99 85
Mobile: 0414 297 448
Email: Bob.Jansen@turtlelane.com.au
WWW: http://www.turtlelane.com.au/