[ALUG] Alphabetical sorting in EN-UTF8

28 May 2010


Hi Folks,
I have encountered a weirdness in alphabetical sorting.
It originally popped up in an application, but experiment
reveals that it arises from the Linux UTF8 collation system.
First, check whether this may interest you by executing
locale
from the command line. If you see in the output:
  LC_COLLATE="en_GB.UTF-8"
or similar, then read on.
When I run the following commands (which you can mouse-copy
into a console for yourself) I get the results shown in the
lines starting with "#" (you can copy these in too -- they
will be ignored):
sort << EOT
"AACD"
"A CD"
EOT
# "AACD"
# "A CD"
sort << EOT
"ABCD"
"A CD"
EOT
# "ABCD"
# "A CD"
sort << EOT
"ACCD"
"A CD"
EOT
# "ACCD"
# "A CD"
sort << EOT
"ADCD"
"A CD"
EOT
# "A CD"
# "ADCD"
...
From the above, it would seem that in en_GB.UTF-8 the
SPACE character " " is sorted to a position between "C" and "D",
since "AACD", "ABCD" and "ACCD" all sort prior to "A CD",
while "A CD" sorts prior to "ADCD"!
That strikes me as completely nuts! I would welcome any comment
about it. I have tried to track down where this is done, and
to locate on my system (Debian Lenny, also occurs on earlier
Debian Etch) any system file which defines the sort order
(i.e. collation order) of the standard ASCII (and other)
characters.
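For comparison, here is a minimal sketch of the same kind of test run
under the plain "C" locale, assuming a standard sort that honours the
LC_ALL environment variable. In the C locale strings are compared byte
by byte, so SPACE (0x20) sorts before every letter:

```shell
# Sort the test strings under the byte-order "C" locale.
# Setting LC_ALL on the command line overrides LC_COLLATE
# for this one command only; the shell's own locale is untouched.
LC_ALL=C sort << EOT
"ADCD"
"A CD"
"ACCD"
"AACD"
EOT
# "A CD"
# "AACD"
# "ACCD"
# "ADCD"
```

If the C locale gives the expected order while en_GB.UTF-8 does not,
that would point to the locale's collation rules rather than to sort
itself.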
All help and/or insight much appreciated!
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) Ted.Harding@manchester.ac.uk
Fax-to-email: +44 (0)870 094 0861
Date: 28-May-10                                       Time: 20:57:48
------------------------------ XFMail ------------------------------
