[LINK] XML isn't evil, just misunderstood

Craig Sanders cas at taz.net.au
Sat Nov 8 10:30:03 AEDT 2008


On Fri, Nov 07, 2008 at 08:50:40PM +1100, Richard Chirgwin wrote:
> I've tried to see things in the light of standardisation, extensibility
> and power, and I can't. XML is evil.

actually, XML is a useful way of transmitting both the data and the
metadata describing that data.  The metadata not only gives the name of
each field, it also describes the data-type of each field (e.g. integer,
floating point number, string, etc.), whether the field is optional or
required, whether it is unique per record (e.g. an ID field) or can
have multiple instances (e.g. a list), and so on.
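
for example, a small (entirely hypothetical) record might look like
this, with every field labelled by name and repeatable fields simply
repeated:

    <person id="42">
      <name>Jane Citizen</name>
      <phone type="home">02 9999 9999</phone>
      <phone type="mobile">0400 000 000</phone>
    </person>

the accompanying DTD or schema is where the data-types and the
optional/required/repeatable rules get spelled out.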

these details are necessary because data often doesn't fit neatly into
simple flat files like CSV - at least, not without either loss of detail
(i.e. the data-file equivalent of lossy compression) or space- and
bandwidth-wasting repetition, or both.
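
flattening the hypothetical record above into one-line-per-phone-number
CSV shows the repetition problem straight away:

    id,name,phone_type,phone
    42,Jane Citizen,home,02 9999 9999
    42,Jane Citizen,mobile,0400 000 000

the name gets duplicated on every line, and nothing in the file itself
tells you that 'id' is unique or that 'phone' can repeat.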



it is true that XML is often misused and applied inappropriately, but
the same can be said of any technology.

the most common misuse of XML is to regard it as a data storage format(*)
when it should, for the most part, be seen as purely a data TRANSFER
format - a convenient and documented way of moving data from one
system to another without loss of descriptive metadata.


(*) e.g. databases with built-in XML storage engines to satisfy buzzword
compliance.  Oracle is one of several offenders here.

> Just two examples.
> 
> If I buy the ABS CDs, I get hundreds of CSV files, which with a little
> script that took an hour to work out, once, I can import into a database
> in very little time at all. The data set is huge, but everything works
> easy-as.

scripting is also fairly straightforward with XML data files, as most
scripting languages have libraries/modules for working with XML - with
the added bonus that you get to work with particular fields by *NAME*
rather than by field *NUMBER* (which breaks if the field order changes
or if any fields are added to or deleted from the file format).

e.g. out of the dozens of perl modules (some specialised, some generic)
for working with XML files, my two favourites for Q&D scripting are:

XML::Simple     - Easy API to maintain XML (esp config files)

and

XML::Mini       - Perl implementation of the XML::Mini XML create/parse ...
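
to make that concrete, here's a minimal XML::Simple sketch that reads
the hypothetical people file from earlier (file name, root element and
field names are all made up for illustration - assume a <people> root
wrapping the <person> records):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Simple;

    # parse the whole document into nested hashes/arrays.
    # ForceArray keeps one-element lists as lists, and
    # KeyAttr => [] stops XML::Simple folding records by their id.
    my $data = XMLin('people.xml',
                     ForceArray => [qw(person phone)],
                     KeyAttr    => []);

    # fields are addressed by NAME, so the script keeps working
    # if the file gains new fields or reorders existing ones.
    for my $person (@{ $data->{person} }) {
        print "$person->{name} ($person->{id}):\n";
        print "  $_->{type}: $_->{content}\n" for @{ $person->{phone} };
    }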


> Openstreetmap.org allows you to export maps. In X-M-damn-L. You need
> a parser to do anything with the data, of which there are several,
> none of which work properly. You cannot, without first studying XML
> and poring over the schema, do anything off your own bat with the
> data. You can't load the data into a database without a parser, which
> won't work properly. You can't put the data into a GIS without a
> parser, which won't work properly.

there are numerous tools to convert XML data to flat files like CSV
or other formats - for example xml2[1].  They're simple and
straightforward, and they're even easy to write *BECAUSE* the structure
of the data is well-defined in an XML document, so there's no need to
*guess* what any given field is.
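
by way of illustration, xml2 flattens the hierarchy into grep-able
path=value lines, one per leaf.  for the made-up people file from
earlier, the output looks roughly like this:

    $ xml2 < people.xml
    /people/person/@id=42
    /people/person/name=Jane Citizen
    /people/person/phone/@type=home
    /people/person/phone=02 9999 9999

from there it's ordinary line-oriented unix text processing.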

as for poring over the schema, at least there *IS* a schema to pore
over. i've wasted many days of my life trying to figure out what some
undocumented field in the middle of a CSV file is for or, worse, trying
to figure out exactly which fields in a CSV file contain the data
elements i'm interested in extracting and reporting on. it's not always
obvious, especially when the lines have dozens or even hundreds of
fields full of similar-looking data - e.g. a line with 20 or 30 numeric
fields of which only 2 or 3 matter for the task at hand.

some CSV files helpfully have the field names in a comment as the first
line of the file. not all of them do. it's very useful, but it doesn't
solve all of CSV's annoyances.

BTW, don't get me wrong. i'm not saying that CSV sucks and XML should
be used for everything.  I'm saying that XML, like CSV, has its uses as
well as its annoyances and limitations.  There are some kinds of data
that fit perfectly in CSV-style one-line-per-record flat files.  And
there are other kinds of data that just don't, and those are better
suited to a hierarchical, structured data format like XML.

i use both routinely, and it's really not that difficult to convert
from one to the other as the task at hand requires.

where XML particularly shines is that it gives you the ability to say
"ah, just give me a dump of everything in XML format and i'll extract
what i need from that" instead of having to laboriously identify and
list exactly which fields you want and ask for just them, only to find
that you forgot one or more fields - or didn't know about them. in my
experience, you often don't know what fields you need until AFTER
you've seen the data, and if you only ever see a CSV-dump subset you'll
never know that the field you really want is available.
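
and with a full dump in hand, pulling out the one field you turned out
to need is a one-liner (paths made up to match the earlier example):

    $ xml2 < dump.xml | grep '^/people/person/name=' | cut -d= -f2-

no re-export needed when you discover you want another field - just
change the grep pattern.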



[1] xml2 - XML/Unix Processing Tools

http://dan.egnor.name/xml2/


from the man page for 'xml2':

NAME
       xml2 - convert xml documents in a flat format
       2xml - convert flat format into xml
       html2 - convert html documents in a flat format
       2html - convert flat format into html
       csv2 - convert csv files in a flat format
       2csv - convert flat format into csv

SYNOPSIS
       <xml2|2xml|html2|2html|csv2|2csv> > outfile < infile

DESCRIPTION
       There are six tools.  None of them take any command-line
       arguments.  They are all simple filters which can be used to
       read files from standard input in one format and output it to
       standard output in another format.

       The flat format used by the tools is specific to these tools.
       It is a syntax for representing structured markup in a way
       that makes it easy to process with line-oriented tools.  The
       same format is used for HTML and XML; in fact, you can think
       of html2 as converting HTML to XHTML and running xml2 on the
       result; likewise 2html and 2xml.  (Of course, this isn't how the
       implementation works.)



similarly, there are tools for analysing an XML file (with or without
the associated DTD) that give you a compact summary of what kinds of
records are in the file and what fields are available.
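
for instance, xmlstarlet (the binary is installed as plain 'xml' on
some systems) has an 'el' sub-command that lists the element paths
occurring in a document - against the made-up people file it prints
something like:

    $ xmlstarlet el -u people.xml
    people
    people/person
    people/person/name
    people/person/phone

the -u flag prints each unique path only once, which gives you a quick
map of an unfamiliar file's structure.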


IMO, the self-documenting nature of XML more than makes up for the
slight extra hassle of parsing/extracting from it, especially when that
extra hassle is mostly handled automatically by existing tools and
libraries.

craig

-- 
craig sanders <cas at taz.net.au>


