On Wednesday 20 July 2005 17:52, Sergei Haller wrote:
On Wed, 20 Jul 2005, Ludwig Nussel (LN) wrote:
Klaus Schmidinger wrote:
[...] To me, a character is an entity that's always the same size (preferably one byte). UTF-8 breaks with this, so if you have a string that has, e.g. a strlen() of 10, you can't be sure that this will be really 10 printing characters because there might be some "escaped" characters.
I think the confusion comes from the assumption that a character is exactly one byte long.
strlen() counts bytes, not characters.
In UTF-8, a character can be up to 4 bytes long.
Correct. 7-bit ASCII characters take one byte each; everything else is encoded as a multi-byte sequence, e.g. German umlauts take 2 bytes each.
IIRC, there are new functions to count characters (wcslen, wcscmp, etc.)
Wrong. Those are for wide characters (wchar_t), where every character uses 2 or 4 bytes, depending on the platform.
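For reference, the standard C wide-character counterparts of strlen() and strcmp() are wcslen() and wcscmp() from <wchar.h>. A minimal sketch of what they measure, assuming a C99 compiler; wide_length and wide_bytes are hypothetical helper names used only for illustration:

```c
#include <stddef.h>
#include <wchar.h>

/* Number of wchar_t elements before the terminating L'\0'. */
size_t wide_length(const wchar_t *ws)
{
    return wcslen(ws);
}

/* Storage used by those elements; sizeof(wchar_t) is platform-dependent
   (commonly 2 or 4 bytes). */
size_t wide_bytes(const wchar_t *ws)
{
    return wcslen(ws) * sizeof(wchar_t);
}
```

For the wide string L"Gr\u00FC\u00DF" L"e" ("Grüße"), wide_length() gives 5, while wide_bytes() gives 10 or 20 depending on whether wchar_t is 2 or 4 bytes on the platform.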
In fact, if you want to support Unicode in an application, you are better off using wide characters (wchar_t) internally and UTF-8 for all external interfaces (e.g. file input/output).
Using UTF-8 inside an application gets tricky, as you cannot use strlen to count the characters, for example.
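To illustrate the strlen() problem, here is a rough sketch of counting characters in a UTF-8 string by skipping continuation bytes (those of the form 10xxxxxx). utf8_strlen is a hypothetical helper, not a standard function, and it assumes the input is valid, NUL-terminated UTF-8:

```c
#include <stddef.h>
#include <string.h>

/* Count UTF-8 code points: every byte that is NOT a continuation byte
   (bit pattern 10xxxxxx) starts a new character.
   Assumption: s is valid UTF-8; no validation is performed. */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

With the UTF-8 bytes of "Grüße", written as "Gr\xC3\xBC\xC3\x9F" "e", strlen() returns 7 while utf8_strlen() returns 5. Note that this counts code points, which still need not equal the number of printed columns (combining marks and double-width characters are not handled).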
Kind regards, Stefan