On Wed, 20 Jul 2005, Klaus Schmidinger (KS) wrote:
I think the confusion comes from the assumption that a character is exactly one byte long.
strlen counts bytes, not characters. In UTF-8 a character can be up to 4 bytes long (the original definition allowed up to 6).
IIRC, there are new functions that count and compare characters rather than bytes (wcslen, wcscmp, etc.)
Aren't you confusing this with "wide character" functions?
Yes, I am talking about wide characters. I don't think I am confusing anything (correct me if I'm wrong).
From the glibc manual:
Introduction to Extended Characters
A variety of solutions is available to overcome the differences between character sets with a 1:1 relation between bytes and characters and character sets with ratios of 2:1 or 4:1. [...]
As shown in some other part of this manual, a completely new family has been created of functions that can handle wide character texts in memory. The most commonly used character sets for such internal wide character representations are Unicode and ISO 10646 [...] Unicode was originally planned as a 16-bit character set; whereas, ISO 10646 was designed to be a 31-bit large code space. [...]
UTF-8 is an ASCII compatible encoding where ASCII characters are represented by ASCII bytes and non-ASCII characters by sequences of 2-6 non-ASCII bytes [...]
To represent wide characters the char type is not suitable. For this reason the ISO C standard introduces [...] wchar_t, [...]
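To make the byte vs. character difference concrete, here is a rough sketch in C. utf8_char_count is a hypothetical helper I made up for illustration (it is not a libc function), and it assumes the input is valid UTF-8. It just counts every byte that is not a continuation byte (10xxxxxx):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of libc: count the characters (code points)
   in a UTF-8 string. Assumes the input is valid UTF-8. Every character
   starts with exactly one byte that is NOT of the form 10xxxxxx, so we
   count those bytes and skip the continuation bytes. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the UTF-8 string "na\xC3\xAFve" (the word "naive" with a diaeresis on the i), strlen reports 6 bytes while utf8_char_count reports 5 characters. The wide string L"na\u00efve" holds the same text as 5 wchar_t elements, which is what wcslen counts. Converting between the two forms at run time would normally go through mbstowcs and depends on the current locale, so I left that out of the sketch.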
Sergei