Hi Folks,
I have encountered a weirdness in alphabetical sorting.
It originally popped up in an application, but experiment
reveals that it arises from the Linux UTF-8 collation system.
First, check whether this may interest you by executing
locale
from the command line. If you see in the output:
LC_COLLATE="en_GB.UTF-8"
or similar, then read on.
When I run the following commands (which you can mouse-copy
into a console for yourself) I get the results shown in the
lines starting with "#" (you can copy these in too -- they
will be ignored):
sort << EOT
"AACD"
"A CD"
EOT
# "AACD"
# "A CD"
sort << EOT
"ABCD"
"A CD"
EOT
# "ABCD"
# "A CD"
sort << EOT
"ACCD"
"A CD"
EOT
# "ACCD"
# "A CD"
sort << EOT
"ADCD"
"A CD"
EOT
# "A CD"
# "ADCD"
From the above, it would seem that in en_GB.UTF-8 the
SPACE character " " is sorted to a position between "C" and "D",
since "AACD", "ABCD" and "ACCD" all sort prior to "A CD",
while "A CD" sorts prior to "ADCD"!
That strikes me as completely nuts! I would welcome any comment
about it. I have tried to track down where this is done, and
to locate on my system (Debian Lenny, also occurs on earlier
Debian Etch) any system file which defines the sort order
(i.e. collation order) of the standard ASCII (and other)
characters.
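As a cross-check, here is a minimal sketch (assuming a POSIX shell and
the standard sort utility; c_sort is just a helper name made up for
this example) that repeats the four comparisons with collation forced
to the C locale via LC_ALL=C. In that locale sorting is by plain byte
value, and SPACE (0x20) is lower than any letter, so "A CD" should
come first every time:

```shell
# Sort the given lines using plain byte-order (C locale) collation.
# LC_ALL overrides LC_COLLATE for this one sort invocation only.
c_sort() {
    printf '%s\n' "$@" | LC_ALL=C sort
}

# Repeat the four experiments from above, one pair at a time.
for s in AACD ABCD ACCD ADCD; do
    c_sort "\"$s\"" '"A CD"'
    echo "--"
done
```

If that holds on your system too, the odd ordering would seem to come
from the en_GB.UTF-8 collation rules rather than from sort itself.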
All help and/or insight much appreciated!
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding(a)manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 28-May-10 Time: 20:57:48
------------------------------ XFMail ------------------------------