Hi Folks, I have encountered a weirdness in alphabetical sorting. It originally popped up in an application, but experiment reveals that it arises from the Linux UTF8 collation system.
First, check whether this may interest you by executing
locale
from the command line. If you see in the output: LC_COLLATE="en_GB.UTF-8" or similar, then read on.
When I run the following commands (which you can mouse-copy into a console for yourself) I get the results shown in the lines starting with "#" (you can copy these in too -- they will be ignored):

sort << EOT
"AACD"
"A CD"
EOT
# "AACD"
# "A CD"

sort << EOT
"ABCD"
"A CD"
EOT
# "ABCD"
# "A CD"

sort << EOT
"ACCD"
"A CD"
EOT
# "ACCD"
# "A CD"

sort << EOT
"ADCD"
"A CD"
EOT
# "A CD"
# "ADCD"
From the above, it would seem that in en_GB.UTF-8 the SPACE character " " is sorted to a position between "C" and "D", since "AACD", "ABCD" and "ACCD" all sort prior to "A CD", while "A CD" sorts prior to "ADCD"!
That strikes me as completely nuts! I would welcome any comment about it. I have tried to track down where this is done, and to locate on my system (Debian Lenny, also occurs on earlier Debian Etch) any system file which defines the sort order (i.e. collation order) of the standard ASCII (and other) characters.
All help and/or insight much appreciated! Ted.
-------------------------------------------------------------------- E-Mail: (Ted Harding) Ted.Harding@manchester.ac.uk Fax-to-email: +44 (0)870 094 0861 Date: 28-May-10 Time: 20:57:48 ------------------------------ XFMail ------------------------------
On Fri, May 28, 2010 at 08:57:51PM +0100, Ted Harding wrote:
That strikes me as completely nuts! I would welcome any comment about it. I have tried to track down where this is done, and to locate on my system (Debian Lenny, also occurs on earlier Debian Etch) any system file which defines the sort order (i.e. collation order) of the standard ASCII (and other) characters.
All help and/or insight much appreciated!
man 5 locale should be a starting point.
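As a rough sketch of where to look next: on glibc systems the locale source files usually live under /usr/share/i18n/locales/ (that path is an assumption here, and the files may not be installed on every machine). The helper name show_collate_section is made up for illustration; it just prints the LC_COLLATE section of one source file, if the file is there.

```shell
# Hypothetical helper: print the LC_COLLATE section of a glibc locale source.
# The directory below is the usual glibc location, assumed rather than guaranteed.
show_collate_section() {
    src="/usr/share/i18n/locales/$1"
    if [ -r "$src" ]; then
        # Print everything from the section start to its END marker.
        sed -n '/^LC_COLLATE/,/^END LC_COLLATE/p' "$src"
    else
        printf 'no locale source at %s\n' "$src"
    fi
}

show_collate_section en_GB
```

If the section turns out to be little more than a "copy" line, the real rules are in the file it names, and the same helper can be pointed at that file instead.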
Adam
On 28-May-10 20:37:39, Adam Bower wrote:
man 5 locale should be a starting point.
Adam
Thanks Adam. Thanks also to some folk on the Linux-Users list, whose hints led me to realise that the problem with
sort << EOT
"ABCD"
"A CD"
EOT
# "ABCD"
# "A CD"

sort << EOT
"ADCD"
"A CD"
EOT
# "A CD"
# "ADCD"
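arises because, by default under en_GB.UTF-8, the " " is ignored in sorting. Therefore in the first case it effectively sorted "ABCD" and "ACD", returning "ABCD", "A CD", while in the second case it effectively sorted "ADCD" and "ACD", and returned "A CD", "ADCD". So the space is not being placed between "C" and "D" at all: it is the "C" following the space that gets compared with the "B" or the "D".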
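To check that this explanation accounts for all four results, here is a minimal sketch that imitates "space is ignored" by hand: it sorts in plain byte order on a copy of each line with the spaces stripped out. The function name sort_ignoring_spaces is made up for illustration, and this only approximates what the locale-aware comparison really does.

```shell
# Hypothetical helper: sort lines as if spaces were not there.
sort_ignoring_spaces() {
    # 1. Prefix each line with a space-stripped key and a TAB.
    # 2. Sort on the key only, in byte order (stable, so ties keep input order).
    # 3. Drop the key again.
    awk '{ key = $0; gsub(/ /, "", key); print key "\t" $0 }' |
        LC_ALL=C sort -s -t "$(printf '\t')" -k1,1 |
        cut -f2-
}

printf '%s\n' '"ABCD"' '"A CD"' | sort_ignoring_spaces
# "ABCD"
# "A CD"

printf '%s\n' '"ADCD"' '"A CD"' | sort_ignoring_spaces
# "A CD"
# "ADCD"
```

The same helper also puts "AACD" and "ACCD" ahead of "A CD", which matches the first set of experiments.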
A solution is to export LC_COLLATE=C -- following directly on after the above:
export LC_COLLATE=C
sort << EOT
"ABCD"
"A CD"
EOT
# "A CD"
# "ABCD"

sort << EOT
"ADCD"
"A CD"
EOT
# "A CD"
# "ADCD"
Because "export" makes LC_COLLATE available to processes spawned by the shell, this also works within the application that revealed the problem in the first place.
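Two related points, as a sketch rather than anything definitive. First, if you would rather not change the whole session, the setting can be given to a single command. Second, LC_ALL (if set and non-empty) takes precedence over LC_COLLATE, which in turn takes precedence over LANG, so an exported LC_COLLATE has no effect while LC_ALL is set. The helper effective_collate below is made up for illustration and simply walks that precedence order.

```shell
# One-off override: only this sort invocation uses byte-order (C) collation.
printf '%s\n' '"ABCD"' '"A CD"' | LC_ALL=C sort
# "A CD"
# "ABCD"

# Hypothetical helper: report which setting decides collation right now.
# Precedence: LC_ALL, then LC_COLLATE, then LANG; "C" is assumed as the fallback.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'
    fi
}

effective_collate
```

If effective_collate still reports en_GB.UTF-8 after exporting LC_COLLATE=C, the likely culprit is an LC_ALL setting somewhere in the environment.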
Cheers, Ted.
-------------------------------------------------------------------- E-Mail: (Ted Harding) Ted.Harding@manchester.ac.uk Fax-to-email: +44 (0)870 094 0861 Date: 28-May-10 Time: 23:10:18 ------------------------------ XFMail ------------------------------