MJ Ray mjr@dsl.pipex.com writes:
This is a typical slippery slope... I start off from a general issue that affects most users, through a specific application, to a specific programming question. Are you holding tight? Is there a C doctor in the house?
Some of you may have "enjoyed" the change of character set from ISO-8859-1 (an 8-bit character code, so 256 possible characters) to UTF-8 (a large character set which is encoded as single 8-bit codes, pairs of 8-bit codes and so on, up to four bytes per character). Basically, 8859-1 only lets you display western European text, while UTF-8 lets you have southern or eastern European languages, or Greek or Cyrillic or whatever, all at once without doing anything unusual with character sets. FAQ at http://www.cl.cam.ac.uk/~mgk25/unicode.html
The SQLite database seems to use UTF-16 as a basic datatype.
Not really, the ...16 functions are more complex than the UTF-8 versions and in some cases are wrappers around them.
I was having a browse after it was suggested that I try writing a Scheme interface to it. When reading http://www.sqlite.org/capi3.html, the following caught my eye: "There is no agreement on what the C datatype for a UTF-16 string should be."
Is there really such disagreement on this basic datatype? The FAQ makes it look clear-cut in favour of wchar_t. What types do C programmers really use?
In practice I've found using UTF-8 internally, and converting to the current locale's encoding (or whatever other interfaces require) at the boundaries, to be the most convenient approach. Most of your intuitions about string handling survive, you don't have to worry about shift states, extracting the actual character code (if you need it) is easy and efficient, etc.
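To illustrate the "easy and efficient" claim, here is a minimal sketch of pulling one character code out of a UTF-8 string. The name utf8_decode and its return convention are made up for this example and are not from any library; it assumes a NUL-terminated input and treats overlong forms, surrogates and values above U+10FFFF as errors.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the NUL-terminated
 * UTF-8 string s. Stores the code point in *cp and returns the number
 * of bytes consumed (1 to 4), or 0 if the input is malformed. */
size_t utf8_decode(const char *s, uint32_t *cp)
{
    /* Smallest code point that legitimately needs n bytes. */
    static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;
    uint32_t c;
    size_t n, i;

    if (p[0] < 0x80) {                     /* plain ASCII, one byte */
        *cp = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {    /* 110xxxxx: 2-byte sequence */
        c = p[0] & 0x1F; n = 2;
    } else if ((p[0] & 0xF0) == 0xE0) {    /* 1110xxxx: 3-byte sequence */
        c = p[0] & 0x0F; n = 3;
    } else if ((p[0] & 0xF8) == 0xF0) {    /* 11110xxx: 4-byte sequence */
        c = p[0] & 0x07; n = 4;
    } else {
        return 0;                          /* stray continuation or bad lead byte */
    }

    for (i = 1; i < n; i++) {
        /* Continuation bytes look like 10xxxxxx; this also stops at a
         * premature NUL, so we never read past the terminator. */
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (p[i] & 0x3F);
    }

    /* Reject overlong encodings, UTF-16 surrogates and out-of-range values. */
    if (c < min_cp[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return n;
}
```

There is no state carried between calls, which is the point about shift states: each character can be decoded from its own bytes, and an ASCII byte never appears inside a longer sequence.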
wchar_t is platform-dependent and locale-dependent, though I've no idea if anyone is mad enough to make it actually differ from locale to locale. Linux uses UTF-32; AIUI Windows uses UTF-16.
The encoding of multibyte strings is similarly platform-dependent and locale-dependent, and does vary between locales in real life; this makes it extremely inconvenient to actually do anything interesting with them in a correct fashion. You have to remember shift states, you can't safely use strchr() or anything else that uses the same assumptions, etc.
I've not personally tried to use UTF-16, but the combination of being both variable length and non-byte-oriented sounds very inconvenient.
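For what it's worth, here is a rough sketch of what "variable length and non-byte-oriented" means in practice. The function name utf16_decode is invented for the example; it assumes the 16-bit units have already been read into host byte order (so the big-endian versus little-endian question is dealt with elsewhere) and that the caller passes the number of units remaining.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from an array of UTF-16
 * code units in host byte order. len is the number of units available.
 * Returns the number of units consumed (1 or 2), or 0 on error. */
size_t utf16_decode(const uint16_t *u, size_t len, uint32_t *cp)
{
    if (len == 0)
        return 0;

    /* Anything outside the surrogate range is a complete character. */
    if (u[0] < 0xD800 || u[0] > 0xDFFF) {
        *cp = u[0];
        return 1;
    }

    /* A high surrogate must be followed by a low surrogate. */
    if (u[0] <= 0xDBFF && len >= 2 && u[1] >= 0xDC00 && u[1] <= 0xDFFF) {
        *cp = 0x10000
            + (((uint32_t)(u[0] - 0xD800) << 10) | (uint32_t)(u[1] - 0xDC00));
        return 2;
    }

    return 0;   /* lone low surrogate, or high surrogate with no partner */
}
```

So you still have to cope with one-unit and two-unit characters, as with UTF-8, but you also lose the ability to treat the data as an ordinary char string: it contains embedded zero bytes and its meaning depends on byte order.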
As for sqlite, sqlite 3 has UTF-8 and UTF-16 versions of functions. I can't imagine why you'd want the UTF-16 versions unless you were committed to Windows.