Re: [ALUG] unicode, sqlite and C

3 Nov 2004


MJ Ray mjr@dsl.pipex.com writes:
...
> This is a typical slippery slope... I start off from a general issue
> that affects most users, through a specific application, to a specific
> programming question. Are you holding tight? Is there a C doctor in
> the house?
>
> Some of you may have "enjoyed" the change of character set from
> ISO-8859-1 (an 8-bit character code, so 256 possible characters) to
> UTF-8 (a large character set which is converted into 8-bit codes,
> pairs of 8-bit codes and so on). Basically, 8859-1 only lets you
> display western European text, while UTF-8 lets you have southern or
> eastern European languages, or Greek or Cyrillic or whatever, all at
> once without doing anything unusual with character sets. FAQ at
> http://www.cl.cam.ac.uk/~mgk25/unicode.html
>
> The SQLite database seems to use UTF-16 as a basic datatype.

Not really, the ...16 functions are more complex than the UTF-8
versions and in some cases are wrappers around them.
...
> I was having a browse after it was suggested that I try writing a
> Scheme interface to it. When reading
> http://www.sqlite.org/capi3.html, the following caught my eye:
> "There is no agreement on what the C datatype for a UTF-16 string
> should be."
>
> Is there really such disagreement on this basic datatype? The FAQ
> makes it look clearcut on wchar_t. What types do C programmers really
> use?

In practice I've found using UTF-8 internally, and converting to the
current locale's encoding (or whatever other interfaces require) at
the boundaries, to be the most convenient approach.  Most of your
intuitions about string handling survive, you don't have to worry
about shift states, extracting the actual character code (if you
need it) is easy and efficient, etc.
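
To illustrate how little work it takes to extract a character code
from UTF-8, here is a minimal sketch of a decoder.  The name
utf8_decode is hypothetical (it is not from any library), and it
assumes a NUL-terminated input string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the NUL-terminated
 * UTF-8 string s.  Stores the code point in *cp and returns the number
 * of bytes consumed, or 0 if the sequence is malformed. */
size_t utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;
    uint32_t c, min;

    assert(s != NULL && cp != NULL);

    if (p[0] < 0x80) {                    /* plain ASCII, one byte */
        *cp = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2-byte sequence */
        len = 2; c = p[0] & 0x1F; min = 0x80;
    } else if ((p[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3-byte sequence */
        len = 3; c = p[0] & 0x0F; min = 0x800;
    } else if ((p[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4-byte sequence */
        len = 4; c = p[0] & 0x07; min = 0x10000;
    } else {
        return 0;                         /* stray continuation or bad lead byte */
    }

    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)        /* must be 10xxxxxx; also stops at NUL */
            return 0;
        c = (c << 6) | (p[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates and values beyond U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return len;
}
```

Calling it in a loop, advancing the pointer by the return value each
time, walks a string one character at a time; there is no state to
carry between calls.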
wchar_t is platform-dependent and locale-dependent, though I've no
idea if anyone is mad enough to make it actually differ from locale to
locale.  Linux uses UTF-32; AIUI Windows uses UTF-16.
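
If you want to check what a particular platform does, something like
the following sketch will tell you.  The function names are
hypothetical; __STDC_ISO_10646__ is the standard C macro an
implementation defines when wchar_t values are ISO 10646 code points.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Width of wchar_t in bits on the platform this is compiled for. */
int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/* 1 if the implementation promises ISO 10646 values in wchar_t, else 0. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

Neither answer is something portable code can rely on, which is
rather the point.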
The encoding of multibyte strings is similarly platform- and locale-
dependent and does vary between locales in real life; this makes it
extremely inconvenient to actually do anything interesting with them
in a correct fashion.  You have to remember shift states, you can't
safely use strchr() or anything else that uses the same assumptions,
etc.
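
A concrete case of strchr() going wrong, as a sketch: Shift_JIS is my
own choice of example encoding here.  In Shift_JIS the second byte of
a two-byte character can be 0x5C, which is also ASCII backslash, so a
byte-wise search finds a backslash that isn't there.  In UTF-8 every
byte of a multibyte sequence has the top bit set, so the same search
is safe.  The names below are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* The single character U+8868 in two encodings. */
const char HYOU_SJIS[] = "\x95\x5C";      /* Shift_JIS: second byte is 0x5C */
const char HYOU_UTF8[] = "\xE8\xA1\xA8";  /* UTF-8: all bytes are >= 0x80 */

/* Nonzero if the byte string s contains the byte 0x5C.  This is a
 * byte search, so it is only a correct "contains a backslash" test in
 * encodings where 0x5C never occurs inside another character. */
int contains_backslash(const char *s)
{
    assert(s != NULL);
    return strchr(s, '\\') != NULL;
}
```

contains_backslash(HYOU_SJIS) reports a match even though the string
holds no backslash character, while contains_backslash(HYOU_UTF8)
correctly reports none.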
I've not personally tried to use UTF-16, but the combination of being
both variable length and non-byte-oriented sounds very inconvenient.
As for sqlite, sqlite 3 has UTF-8 and UTF-16 versions of functions.  I
can't imagine why you'd want the UTF-16 versions unless you were
committed to Windows.
-- 
http://www.greenend.org.uk/rjk/
