[LINK] XML isn't evil, just misunderstood

Tue Nov 11 09:35:30 AEDT 2008

On Tue, Nov 11, 2008 at 08:57:35AM +1100, Marghanita da Cruz wrote:
> > (b) I can't, for the life of me, find a 'generic' parser - a stand-alone
> > app that would (for example) take an arbitrary XML file and produce a
> > table of "tags" so I can make sense of it. That would be a godsend ...
> > but it seems my Googling is letting me down.
> <snip>
> 
> My experience of XML is limited to RSSFeeds, but I think what you're
> looking for is something to generate a DTD from a data file.

sounded more like he wants a simple list of fields ("tags"). such a list
can be produced very easily with the xml2 tool i mentioned on Saturday,
combined with sed and sort.

e.g. "shepherd" (a program to download TV guide listings and prepare
them for insertion into MythTV's database) has an intermediary file in
XMLTV format, which is a de facto standard XML format for
television-related data. there are numerous programs written to use,
manipulate, and/or generate TV data in this format.

to get a list of "fields" for each "record" in this file, you'd run
something like this:

xml2 <~/.shepherd/output.xmltv >/dev/stdout | sed -e 's/=.*//' | sort -u

(note: i'm using the terms "records" and "fields" quite loosely,
although they're accurate enough for this example.)

that converts the XML file into xml2's "flat" format, strips off
everything after the '=' sign on each line, and then produces a sorted
listing of unique lines - i.e. the "field" names.
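
to see what the sed and sort stages do without needing xml2 installed,
here's a small sketch that feeds a few hand-written lines in the same
flat "path=value" style through them. the sample lines and the function
name field_names are made up for illustration, and LC_ALL=C is my own
addition to make the sort order the same on every system (it's not part
of the command above).

```shell
#!/bin/sh
# field_names: hypothetical helper. reads xml2-style flat lines on stdin
# and prints each distinct path once.
field_names() {
    # drop the '=' and everything after it, then sort and de-duplicate.
    # LC_ALL=C is an assumption, used only to get a predictable order.
    sed -e 's/=.*//' | LC_ALL=C sort -u
}

# hand-written sample in the flat style (not real shepherd output)
sample='/tv/channel/@id=abc.example
/tv/channel/display-name=ABC
/tv/programme/@start=20081111093000
/tv/programme/title=News
/tv/programme
/tv/programme/@start=20081111100000
/tv/programme/title=Weather'

printf '%s\n' "$sample" | field_names
```

each path is printed once, even though the @start and title lines
appear twice in the sample. with a real file the same helper would go
on the end of the xml2 command, e.g.
"xml2 <~/.shepherd/output.xmltv | field_names".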

running it shows that there is an overall "tv" type, which contains both
"tv/channel" and "tv/programme" records, each of which has different
fields.

to demonstrate:

$ xml2 <~/.shepherd/output.xmltv >/dev/stdout | sed -e 's/=.*//' | sort -u
/tv/channel
/tv/channel/display-name
/tv/channel/display-name/@lang
/tv/channel/@id
/tv/@generator-info-name
/tv/programme
/tv/programme/category
/tv/programme/category/@lang
/tv/programme/@channel
/tv/programme/country
/tv/programme/country/@lang
/tv/programme/credits/actor
/tv/programme/credits/director
/tv/programme/credits/writer
/tv/programme/date
/tv/programme/desc
/tv/programme/desc/@lang
/tv/programme/episode-num
/tv/programme/episode-num/@system
/tv/programme/icon/@src
/tv/programme/language
/tv/programme/language/@lang
/tv/programme/last-chance
/tv/programme/last-chance/@lang
/tv/programme/length
/tv/programme/length/@units
/tv/programme/premiere
/tv/programme/premiere/@lang
/tv/programme/previously-shown
/tv/programme/rating
/tv/programme/rating/@system
/tv/programme/rating/value
/tv/programme/star-rating/value
/tv/programme/@start
/tv/programme/@stop
/tv/programme/sub-title
/tv/programme/sub-title/@lang
/tv/programme/subtitles/@type
/tv/programme/title
/tv/programme/title/@lang
/tv/programme/url
/tv/programme/video/aspect
/tv/programme/video/colour
/tv/programme/video/quality
/tv/@source-info-name

Note the hierarchical structure of the "fields". each "field" can have
any number of sub-fields (each of which can also have sub-fields, and
so on). e.g. "tv/programme/credits" can have credit details for actors,
writers, and directors.

also note the field names with '@' in them.  those are XML attributes
(xml2 prefixes attribute names with '@'), and they're usually metadata
about the "field" they're attached to.

e.g. /tv/programme/category/@lang gives the language that the contents
of the /tv/programme/category field are written in.

you can get a much shorter list of just the non-meta fields with:

$ xml2 < ~/.shepherd/output.xmltv >/dev/stdout | grep -v '@' | sed -e 's/=.*//' | sort -u
/tv/channel
/tv/channel/display-name
/tv/programme
/tv/programme/category
/tv/programme/country
/tv/programme/credits/actor
/tv/programme/credits/director
/tv/programme/credits/writer
/tv/programme/date
/tv/programme/desc
/tv/programme/episode-num
/tv/programme/language
/tv/programme/last-chance
/tv/programme/length
/tv/programme/premiere
/tv/programme/previously-shown
/tv/programme/rating
/tv/programme/rating/value
/tv/programme/star-rating/value
/tv/programme/sub-title
/tv/programme/title
/tv/programme/url
/tv/programme/video/aspect
/tv/programme/video/colour
/tv/programme/video/quality

be careful doing this, though, because it can lose useful info: the
@start and @stop fields of /tv/programme, for example, hold each
programme's start and stop times.
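
one way to check what the grep throws away is to list only the '@'
names. a rough sketch, again on made-up sample lines with hypothetical
function names: it strips the values first, so an '@' inside a value (an
email address in a description, say) can't affect the filtering. with
the grep placed before the sed, as in the command above, such a line
would be dropped along with the attributes.

```shell
#!/bin/sh
# all helpers read xml2-style flat lines on stdin. the names are hypothetical.

# strip values first, so the '@' test only ever sees the path
paths() { sed -e 's/=.*//' | LC_ALL=C sort -u; }

# paths with no attribute component
element_names() { paths | grep -v '/@'; }

# only the attribute paths, i.e. what the shorter list leaves out
attribute_names() { paths | grep '/@'; }

# hand-written sample (not real shepherd output)
sample='/tv/programme/@start=20081111093000
/tv/programme/@stop=20081111100000
/tv/programme/title=News
/tv/programme/desc=feedback to news@example.com'

printf '%s\n' "$sample" | attribute_names
```

comparing the two lists side by side makes it easier to decide which
attributes are just decoration and which ones carry real data.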

another thing to be wary of is that using the tool like this doesn't
give you a summary of everything that is possible in a particular XML
format. it only shows you the subset that is actually present in the
XML file you are working with.
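
a count of how often each name turns up gives a rough hint about which
fields are common and which are rare in the file at hand, though it
still can't tell you about fields that never appear in it. a sketch,
with a hypothetical function name and made-up sample lines:

```shell
#!/bin/sh
# field_counts: hypothetical helper. reads xml2-style flat lines on stdin
# and prints "count path" for each distinct path, ordered by path.
field_counts() {
    sed -e 's/=.*//' |
        awk '{ n[$0]++ } END { for (p in n) print n[p], p }' |
        LC_ALL=C sort -k2
}

# hand-written sample (not real shepherd output)
sample='/tv/programme/title=News
/tv/programme/desc=Headlines
/tv/programme
/tv/programme/title=Weather'

printf '%s\n' "$sample" | field_counts
```

a field with a low count may be optional, or may just be rare in this
particular file, so running the same thing over several files gives a
better picture than any single one.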

I think that many of Richard's problems with XML are due to the fact
that he doesn't yet know of the tools to manipulate XML files.  Most of
those problems will disappear as he learns of the tools and learns how
to use them.

craig

-- 
craig sanders <cas at taz.net.au>