Hi Folks, I have encountered a weirdness in alphabetical sorting. It originally popped up in an application, but experiment reveals that it arises from the Linux UTF8 collation system.
First, check whether this may interest you by executing
locale
from the command line. If you see in the output: LC_COLLATE="en_GB.UTF-8" or similar, then read on.
When I run the following commands (which you can mouse-copy into a console for yourself) I get the results shown in the lines starting with "#" (you can copy these in too -- they will be ignored):
sort << EOT
"AACD"
"A CD"
EOT
# "AACD"
# "A CD"

sort << EOT
"ABCD"
"A CD"
EOT
# "ABCD"
# "A CD"

sort << EOT
"ACCD"
"A CD"
EOT
# "ACCD"
# "A CD"

sort << EOT
"ADCD"
"A CD"
EOT
# "A CD"
# "ADCD"
From the above, it would seem that in en_GB.UTF-8 the SPACE character " " is sorted to a position between "C" and "D", since "AACD", "ABCD" and "ACCD" all sort prior to "A CD", while "A CD" sorts prior to "ADCD"!
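For comparison, here is a minimal sketch that runs the same four pairs through sort with the collation locale forced to "C" (plain byte order). The helper name sort_pair is made up for illustration, and the only assumption is a POSIX shell with the usual sort and printf. In the "C" locale SPACE (0x20) sorts before every letter, so "A CD" should come first in all four pairs:

```shell
# sort_pair is a hypothetical helper, not a standard command.
# $1 = locale to sort under, $2 and $3 = the two lines to sort.
sort_pair() {
    printf '%s\n' "$2" "$3" | LC_ALL="$1" sort
}

# Run the four test pairs under the "C" locale (byte-order collation).
for s in AACD ABCD ACCD ADCD; do
    sort_pair C "\"$s\"" '"A CD"'
    echo
done
```

Replacing C with en_GB.UTF-8 in the call (assuming that locale is installed) should give back the ordering reported above, which at least narrows the difference down to the locale's collation rules rather than to sort itself.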
That strikes me as completely nuts! I would welcome any comment about it. I have tried to track down where this is done, and to locate on my system (Debian Lenny, also occurs on earlier Debian Etch) any system file which defines the sort order (i.e. collation order) of the standard ASCII (and other) characters.
All help and/or insight much appreciated! Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) Ted.Harding@manchester.ac.uk
Fax-to-email: +44 (0)870 094 0861
Date: 28-May-10 Time: 20:57:48
------------------------------ XFMail ------------------------------