[Nauty] Warning about LOCALEs in Unix/Linux

Brendan McKay bdm at cs.anu.edu.au
Tue Feb 3 23:18:01 EST 2004


Recent versions of the Unix/Linux programs sort, comm, and maybe others
use the collation locale to determine the ordering of the characters.
This is true of the nauty program shortg also, because it uses sort in
the background.

The collation locale is kept in the environment variable LC_COLLATE,
or if that doesn't exist, in the environment variable LANG.  If the
variable LC_ALL is set, it overrides both.
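
As a rough illustration of that lookup order, the following sketch prints
which setting would decide the collation locale in the current shell.  The
function name effective_collate is made up for this example.  It checks
LC_ALL first because POSIX gives a non-empty LC_ALL precedence over both
LC_COLLATE and LANG.

```shell
# Hypothetical helper (not part of nauty): report the value that decides
# the collation locale, following the POSIX precedence
# LC_ALL > LC_COLLATE > LANG.  Empty values count as unset.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Nothing set: the system falls back to its own default.
        printf '%s\n' '(system default)'
    fi
}

effective_collate
```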

The problem is that a locale need not specify a complete ordering of
all possible characters.  This seems to be the case for the locale
"en_US.UTF-8", which is a common default these days.  Other locales
probably have the same property.  It means that sort does not necessarily
even place equal lines together in the output, which is just what we
DON'T need when we sort a file of graphs.

The best way to avoid this problem is to use the collation locale "C".
This defines the ordering of characters to be the same as in ASCII.
It has the advantage that the ordering used by sort will be the same
as you get in a program if you compare characters using < and > or library
functions like strcmp().
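
A quick way to see this ordering is to sort a few lines with the collation
locale forced to "C".  This is only a small sketch: the sample lines are
invented, and LC_ALL is unset inside the command substitution because it
would otherwise take precedence over LC_COLLATE.

```shell
# Sort four sample lines with the collation locale set to "C".
# The $( ... ) runs in a subshell, so the caller's settings are
# left untouched.
sorted=$(
    unset LC_ALL
    printf 'b\nA\na\nB\n' | LC_COLLATE=C sort
)
printf '%s\n' "$sorted"
# Prints A, B, a, b on separate lines: every upper-case ASCII letter
# has a smaller code than every lower-case one, as strcmp() would see it.
```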

To set the collation locale to "C":

   setenv LC_COLLATE C               # csh, tcsh, and similar
   export LC_COLLATE=C               # bash, ash, bsh
   LC_COLLATE=C; export LC_COLLATE   # Bourne shell

You could set the variable LANG instead, provided LC_COLLATE is undefined,
but that will also affect things like LC_MESSAGES, which tells the system
which language to write error messages in.

Mac OS X is the same.  I don't know what the situation is on machines
other than these, such as Cygwin on Windows.  Can someone report?

Can anyone think of a good reason I shouldn't modify shortg to always
sort using the "C" collation locale?  It would mean that output from
shortg might not be in the same order as output from labelg piped through
sort, but that seems a lesser sin than getting the wrong answer.
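
The idea of always sorting in the "C" locale can be tried from the shell
without touching shortg itself.  The wrapper below is a sketch only: sort_c
is a hypothetical name, and it says nothing about how shortg actually
invokes sort.  It sets LC_ALL for the one sort command, which overrides
both LC_COLLATE and LANG for that command and leaves the rest of the
session alone.

```shell
# Hypothetical wrapper: run sort with the "C" collation order no matter
# what the surrounding environment says.  All arguments are passed on.
sort_c() {
    LC_ALL=C sort "$@"
}

# Example: sort in byte order and drop duplicate lines.
printf 'b\nB\na\nb\n' | sort_c -u
```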

Brendan.



