[LINK] XML isn't evil, just misunderstood

Sat Nov 8 17:43:36 AEDT 2008

Craig,

First, thanks for the pointers to the xml[2] tools, which I will examine
to see if they make life easier...

OK; there's a utility in having the data and metadata in the same
"band", so to speak. But where it breaks (IMO only) is in usability. If
you don't want to have to learn the language, the data is
unapproachable. Or to put it another way, it's the time involvement that
gets me.

The Census data occupies 36 tables, which I had to learn once and only
needed an hour of reading. If, OTOH, I wanted to try and approach
deconstructing the XML in (say) Openstreetmap (or for that matter in the
XML used in various other geo-based services), then I have another layer
of knowledge required to approach the data.

My alternative is to have some other piece of software mediate between
the XML and me, but for the occasional user - someone who might want to
get at the XML-containered data once, and might not want it again - the
overhead is daunting.

So it comes down to horses-for-courses. Yes, in the circumstances you
describe, XML is useful; but its 'evil' (a rantingly-strong word, but
after a couple of days of pain I was in a ranting mood last night!) is
that its properties as an enabler are accompanied by restricting the
audience who can take advantage of that enablement. I'm a democrat; I'd
like to see an open format which achieved both ends - better data
transfer, and more accessible data transfer!

But thanks again for the pointers to the tools, which simply didn't show
up in any of my desperate Googling ...

Cheers,
Richard Chirgwin

Craig Sanders wrote:
> On Fri, Nov 07, 2008 at 08:50:40PM +1100, Richard Chirgwin wrote:
>   
>> I've tried to see things in the light of standardisation, extensibility
>> and power, and I can't. XML is evil.
>>     
>
> actually, XML is a useful way of transmitting both the data and the
> meta-data describing that data.  The metadata not only gives the name of
> each field, it also describes the data-type of each field (e.g. integer
> or floating point number, string, etc), whether it is an optional or
> required field, whether it is unique per record (e.g. an ID field) or
> can have multiple instances (e.g. a list), and so on.
>
> these details are necessary because data often doesn't fit neatly into
> simple flat files like CSV - at least, not without either loss of detail
> (i.e. the data-file equivalent of lossy compression) or space and
> bandwidth wasting repetition, or both.
>
>
>
> it is true XML is often misused, and used inappropriately, but the same
> can be said of any technology.
>
> the most common misuse of XML is to regard it as a data storage protocol(*)
> when it should, for the most part, be seen as purely a data TRANSFER
> protocol - a convenient and documented way of moving data from one
> system to another without loss of descriptive meta-data.
>
>
> (*) e.g. databases with built-in XML storage engines to satisfy buzzword
> compliance.  Oracle is one of several offenders here.
>
>   
>> Just two examples.
>>
>> If I buy the ABS CDs, I get hundreds of CSV files, which with a little
>> script that took an hour to work out, once, I can import into a database
>> in very little time at all. The data set is huge, but everything works
>> easy-as.
>>     
>
> scripting is also fairly straight-forward with XML data files, with most
> scripting languages containing libraries/modules for working with XML.
> with the added bonus that you get to work with particular fields by
> *NAME* rather than by field *NUMBER* (which screws up if the field order
> changes or if any fields are added to or deleted from the file format)
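
To make the name-versus-number point concrete, here is a minimal Python sketch using only the standard library. The sample records, the element names (places, place, name, state) and both helper functions are invented for illustration; they come from neither the ABS nor the OpenStreetMap data.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: the same record before and after a
# "state" field was added to the file format.
CSV_V1 = "1,Sydney,4300000\n"
CSV_V2 = "1,NSW,Sydney,4300000\n"

XML_V1 = ("<places><place><id>1</id><name>Sydney</name>"
          "<population>4300000</population></place></places>")
XML_V2 = ("<places><place><id>1</id><state>NSW</state><name>Sydney</name>"
          "<population>4300000</population></place></places>")


def csv_name_by_number(text):
    """Take the name from column 1; goes wrong if the column order changes."""
    return [row[1] for row in csv.reader(io.StringIO(text))]


def xml_name_by_name(text):
    """Take the name by element name; unaffected by added or moved fields."""
    root = ET.fromstring(text)
    return [place.findtext("name") for place in root.iter("place")]


print(csv_name_by_number(CSV_V1), csv_name_by_number(CSV_V2))
# ['Sydney'] ['NSW']
print(xml_name_by_name(XML_V1), xml_name_by_name(XML_V2))
# ['Sydney'] ['Sydney']
```

The CSV lookup returns the wrong field without any error once the layout changes, while the lookup by element name keeps working.
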
>
> e.g. out of the dozens of perl modules (some specialised, some generic)
> for working with XML files, my two favourites for Q&D scripting are:
>
> XML::Simple     - Easy API to maintain XML (esp config files)
>
> and
>
> XML::Mini       - Perl implementation of the XML::Mini XML create/parse ...
>
>
>   
>> Openstreetmap.org allows you to export maps. In X-M-damn-L. You need
>> a parser to do anything with the data, of which there are several,
>> none of which work properly. You cannot, without first studying XML
>> and poring over the schema, do anything off your own bat with the
>> data. You can't load the data into a database without a parser, which
>> won't work properly. You can't put the data into a GIS without a
>> parser, which won't work properly.
>>     
>
> there are numerous tools to convert XML data to flat-files like
> CSV or other formats - for example xml2[1].  They're relatively
> straight-forward and simple, and they're even easy to write *BECAUSE*
> the structure of the data is well-defined in an XML document, so there's
> no need to *guess* what any given field is.
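
As a rough idea of how small such a converter can be, the following Python sketch flattens simple records into CSV. It is not how xml2 or any other existing tool works; the function name records_to_csv, the record_tag parameter and the sample document are assumptions made up for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET


def records_to_csv(xml_text, record_tag):
    """Write one CSV row per <record_tag> element, one column per child tag.

    Limitation: a child tag repeated within one record keeps only its
    last value, which is the kind of detail a flat file loses.
    """
    root = ET.fromstring(xml_text)
    records = [{child.tag: (child.text or "") for child in rec}
               for rec in root.iter(record_tag)]

    # Column order follows first appearance across all records.
    columns = []
    for rec in records:
        for tag in rec:
            if tag not in columns:
                columns.append(tag)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


# Hypothetical sample: the second record has an optional extra field.
SAMPLE = ("<places>"
          "<place><id>1</id><name>Sydney</name></place>"
          "<place><id>2</id><name>Melbourne</name><state>VIC</state></place>"
          "</places>")

print(records_to_csv(SAMPLE, "place"))
```

The header row comes straight from the element names, so nothing about the columns has to be guessed. Optional fields simply become empty cells.
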
>
> as for poring over the schema, at least there *IS* a schema to pore
> over. i've wasted many days out of my life trying to figure out what
> some undocumented field in the middle of a CSV file is for or, worse,
> trying to figure out exactly which fields in the CSV file contain the
> data elements i'm interested in extracting and reporting on (it's not
> always obvious, especially if you have CSV lines with dozens or even
> hundreds of fields all with similar looking data. e.g. a line with 20 or
> 30 numeric fields, only 2 or 3 are of interest to your current needs).
>
> some CSV files helpfully have the field names in a comment as the first
> line of the file. not all of them. it's very useful, but it doesn't
> solve all CSV annoyances.
>
> BTW, don't get me wrong. i'm not saying that CSV sucks and XML should
> be used for everything.  I'm saying that XML, like CSV, has its uses as
> well as its annoyances and limitations.  There are some kinds of data
> that fit perfectly in CSV-style one-line-per-record flat files.  And
> there are other kinds of data that just don't, which are better suited
> to a hierarchical structured data format like XML.
>
> i use both routinely, and it's really not that difficult to convert from
> one to the other as needed by your task at hand.
>
> where XML particularly shines is that it gives you the ability to say
> "ah, just give me a dump of everything in XML format and i'll extract
> what i need from that" instead of having to laboriously identify and
> list exactly which fields you want and ask for just them, only to
> find that you forgot one or more fields (or didn't know about them - in my
> experience, you often don't know what fields you need until AFTER you've
> seen the data and if you only see a CSV-dump subset you'll never know
> that the field you really want is available).
>
>
>
> [1] xml2 - XML/Unix Processing Tools
>
> http://dan.egnor.name/xml2/
>
>
> from the man page for 'xml2':
>
> NAME
>        xml2 - convert xml documents in a flat format
>        2xml - convert flat format into xml
>        html2 - convert html documents in a flat format
>        2html - convert flat format into html
>        csv2 - convert csv files in a flat format
>        2csv - convert flat format into csv
>
> SYNOPSIS
>        <xml2|2xml|html2|2html|csv2|2csv> > outfile < infile
>
> DESCRIPTION
>        There are six tools.  None of them take any command-line
>        arguments.  They are all simple filters which can be used to
>        read files from standard input in one format and output it to
>        standard output in another format.
>
>        The flat format used by the tools is specific to these tools.
>        It is a syntax for representing structured markup in a way
>        that makes it easy to process with line-oriented tools.  The
>        same format is used for HTML and XML; in fact, you can think
>        of html2 as converting HTML to XHTML and running xml2 on the
>        result; likewise 2html and 2xml.  (Of course, this isn't how the
>        implementation works.)
>
>
>
> similarly, there are tools for analysing an XML file (with or without
> the associated DTD) and giving you a compact summary of what kinds of
> records are in the file and what fields are available.  
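
A summary of that kind can be sketched in a few lines of Python. This is not any particular existing tool; the function summarise and the sample fragment (a simplified, hand-written imitation of an OpenStreetMap export, not real data) are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarise(xml_text):
    """Count how often each element path and attribute path occurs."""
    counts = Counter()

    def walk(elem, path):
        here = path + "/" + elem.tag
        counts[here] += 1
        for attr in elem.attrib:
            counts[here + "/@" + attr] += 1
        for child in elem:
            walk(child, here)

    walk(ET.fromstring(xml_text), "")
    return counts


# Hypothetical, simplified OSM-like fragment.
SAMPLE = ('<osm>'
          '<node id="1" lat="-33.8" lon="151.2">'
          '<tag k="name" v="Sydney"/>'
          '</node>'
          '<node id="2" lat="-37.8" lon="144.9"/>'
          '</osm>')

for path, n in sorted(summarise(SAMPLE).items()):
    print(n, path)
```

Run against an unfamiliar file, a listing like this shows which record types and fields exist before any schema has been read.
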
>
>
> IMO, the self-documenting nature of XML more than makes up for the
> slight extra hassle of parsing/extracting from it, especially when that
> extra hassle is mostly handled automatically by existing tools and
> libraries.
>
> craig
>
>