[LINK] A new metadata tag proposal

Dr. Bob Jansen Bob.Jansen@turtlelane.com.au
Tue, 27 Nov 2001 16:18:46 +1100


>Hi there,
>
>Here's an interesting proposal. A Nicholas Carroll of Hastings Research
>has posted a paper proposing a new "nonwords" metadata tag. See
>'The Anti-Thesaurus: A Proposal For Improving Internet Search While
>Reducing Unnecessary Traffic Loads', which explains how this may benefit
>everyone: http://www.hastingsresearch.com/net/06-anti-thesaurus.shtml
>
>Cheers, Darelyn
>Stephen Loosley
>stephen@schools.net.au

Whilst working on information retrieval systems, such as Status at 
ICL, and later doing R&D at CSIRO into retrieval technologies, I 
learned very quickly that stop words, like the nonwords in the 
proposal mentioned, only work effectively in well-defined contexts. 
For example, we built a prototype of the Hansard system for the 
Federal Parliamentary Library and concluded that we could have only 
one stop word, the word 'a', since all other potential stop words 
could also be used as acronyms etc.
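
To illustrate the acronym problem, here is a minimal Python sketch. 
The stop list, the query and the index_terms function are made up 
for the example (they are not taken from the Hansard prototype), and 
it assumes terms are lowercased before the stop list is applied.

```python
# Hypothetical stop list for illustration only; not the Hansard list.
STOP_WORDS = {"a", "it", "us", "who", "the", "of", "in", "is"}


def index_terms(text, stop_words=STOP_WORDS):
    """Lowercase each token, strip punctuation, and drop stop words."""
    terms = []
    for token in text.split():
        word = token.strip(".,;:!?'\"()").lower()
        if word and word not in stop_words:
            terms.append(word)
    return terms


query = "Who funds IT in the US?"

# With the larger stop list the acronyms WHO, IT and US all vanish.
print(index_terms(query))

# With 'a' as the only stop word, every term in the query survives.
print(index_terms(query, stop_words={"a"}))
```

The first call keeps only 'funds', so a search on the acronyms could 
never match; the second keeps all six terms.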

The other interesting thing is that the words we might pick as stop 
words (or, alternatively, as go words) are not always what we might 
expect. The word distribution in most document spaces follows a 
bell curve. From a retrieval aspect, the words that discriminate 
most are those on the left- and right-hand sides of the curve, not 
those in the middle. Thus the stop words should come from the middle 
set, i.e., those words that occur frequently, since they are less 
discriminating.
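
One way to pick frequency-based stop word candidates can be sketched 
in Python as follows. This is only a sketch under stated assumptions: 
the four sample documents, the function names and the 0.8 threshold 
are all invented for the example, and it simply flags words that 
appear in most documents rather than fitting any particular curve.

```python
from collections import Counter


def document_frequencies(documents):
    """Count, for each word, how many documents contain it."""
    df = Counter()
    for doc in documents:
        # A set ensures a word is counted once per document.
        df.update(set(doc.lower().split()))
    return df


def stop_word_candidates(documents, min_fraction=0.8):
    """Return words found in at least min_fraction of the documents.

    min_fraction is an assumed, tunable threshold, not a standard value.
    """
    df = document_frequencies(documents)
    cutoff = min_fraction * len(documents)
    return sorted(word for word, n in df.items() if n >= cutoff)


# Invented sample documents, loosely in the style of a Hansard record.
docs = [
    "the minister answered the question on tariffs",
    "the senator asked a question on fisheries",
    "the committee tabled the report on tariffs",
    "the member spoke on the budget",
]

print(stop_word_candidates(docs))
```

On this tiny sample only 'on' and 'the' occur in every document, so 
they are the only candidates; a word like 'tariffs', which occurs in 
half the documents, still helps tell them apart and is kept.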

There are some interesting books on this, for example, Information 
Retrieval by van Rijsbergen.

bobj

Dr. Bob Jansen
Managing Director,
Turtle Lane Studios Pty Ltd,
Physical: 1 Turtle Lane, Erskineville, NSW 2043, AUSTRALIA
Postal: PO Box 26, Erskineville, NSW 2043, AUSTRALIA
Phone & Fax: +61-2-95 19 99 85
Mobile: 0414 297 448
Email: Bob.Jansen@turtlelane.com.au
WWW: http://www.turtlelane.com.au/
