On 3/11/2004, "MJ Ray" mjr@dsl.pipex.com wrote:
This is a typical slippery slope... I start off from a general issue that affects most users, through a specific application, to a specific programming question. Are you holding tight? Is there a C doctor in the house?
Some of you may have "enjoyed" the change of character set from ISO-8859-1 (an 8-bit character code, so 256 possible characters) to UTF-8 (an encoding of the much larger Unicode character set, in which each character becomes a single 8-bit code, a pair of 8-bit codes, and so on up to four). Basically, 8859-1 only lets you display western European text, while UTF-8 lets you have southern or eastern European languages, or Greek or Cyrillic or whatever, all at once without doing anything unusual with character sets. FAQ at http://www.cl.cam.ac.uk/~mgk25/unicode.html
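To make the "single 8-bit code, pairs of 8-bit codes and so on" idea concrete, here is a minimal sketch in C of how one Unicode code point could be turned into UTF-8 bytes. The function name utf8_encode is made up for illustration and is not taken from any library; treat it as a sketch of the encoding rules rather than production code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes up to 4 bytes into out and returns how many were written,
   or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* plain ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0xE9, buf) for "é" fills buf with the two bytes 0xC3 0xA9, whereas ISO-8859-1 stores the same character as the single byte 0xE9.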
The SQLite database seems to use UTF-16 as a basic datatype. I was having a browse after it was suggested that I try writing a Scheme interface to it. When reading http://www.sqlite.org/capi3.html, the following caught my eye: "There is no agreement on what the C datatype for a UTF-16 string should be."
Is there really such disagreement on this basic datatype? The FAQ makes it look clear-cut in favour of wchar_t. What types do C programmers really use?
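One likely source of the disagreement is that C does not fix the width of wchar_t: it is commonly 16 bits on Windows compilers and 32 bits with glibc on Linux, so it cannot be relied on to hold exactly one UTF-16 code unit. A rough sketch of the alternative, assuming a C99 compiler with <stdint.h>, is to store UTF-16 in uint16_t. The names utf16_encode and wchar_bits below are invented for this illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how wide is wchar_t with this compiler?
   The answer differs between platforms, which is the problem. */
size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Hypothetical helper: encode one code point as UTF-16 code units.
   Returns 1 or 2 units written, or 0 if cp is a surrogate or is
   above U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                         /* reserved for surrogate pairs */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;            /* fits in a single unit */
        return 1;
    }
    if (cp <= 0x10FFFF) {
        cp -= 0x10000;                    /* 20 bits left to split */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }
    return 0;
}
```

With a fixed-width type the string layout is the same on every platform, at the cost of losing the standard wide-character functions, which only accept wchar_t.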
For my part, Scheme's character and string datatypes seem to cope with Unicode in theory, but the implementation details (such as which character set to use) are still being thrashed out.
One word - Java...
Unicode capability and built-in character conversions for everything from ISO-8859-1, through UTF-8 and UTF-16, to things like BIG-5 for Chinese.
It's one of the things that was designed in from the beginning. There's a reason that "use the right tool for the job" became such a popular saying...
Matt