Difference between revisions of "c/string/multibyte"

From cppreference.com

Revision as of 04:59, 13 April 2012

A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character).

Each character stored in the string may occupy more than one byte. The encoding used to represent characters in a multibyte character string is locale-specific: it may be UTF-8, GB18030, EUC-JP, Shift-JIS, etc. For example, the char array {'\xe4','\xbd','\xa0','\xe5','\xa5','\xbd','\0'} is an NTMBS holding the string "你好" in UTF-8 multibyte encoding: the first three bytes encode the character 你, the next three bytes encode the character 好. The same string encoded in GB18030 is the char array {'\xc4','\xe3','\xba','\xc3','\0'}, where each of the two characters is encoded as a two-byte sequence.
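The byte counts described above can be checked directly. The following is a minimal sketch: the array names and the helper `byte_length` are hypothetical and exist only for illustration, and the bytes are written as escapes so that the result does not depend on the encoding of the source file.

```c
#include <stddef.h>
#include <string.h>

/* "你好" in UTF-8: three bytes per character. */
static const char hello_utf8[] = {
    '\xe4', '\xbd', '\xa0',   /* 你 (U+4F60) */
    '\xe5', '\xa5', '\xbd',   /* 好 (U+597D) */
    '\0'
};

/* The same two characters in GB18030: two bytes per character. */
static const char hello_gb18030[] = {
    '\xc4', '\xe3',           /* 你 */
    '\xba', '\xc3',           /* 好 */
    '\0'
};

/* Hypothetical helper: strlen counts bytes up to the terminating
   null character, not characters. */
size_t byte_length(const char *s)
{
    return strlen(s);
}
```

Calling `byte_length(hello_utf8)` gives 6 and `byte_length(hello_gb18030)` gives 4, even though both arrays hold the same two characters.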

In some multibyte encodings, any given multibyte character sequence may represent different characters depending on the previous byte sequences, known as "shift sequences". Such encodings are known as state-dependent: knowledge of the current shift state is required to interpret each character. An NTMBS is only valid if it begins and ends in the initial shift state: if a shift sequence was used, the corresponding unshift sequence has to be present before the terminating null character. Examples of such encodings are BOCU-1 and SCSU.
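The validity rule above (the string must end in the initial shift state) could be checked along the following lines. This is only a sketch: `ends_in_initial_state` is a hypothetical helper, it interprets the bytes according to whatever locale is currently in effect, and in a stateless encoding the final check is trivially true.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: walks the string one multibyte character at a
   time and reports whether decoding finishes in the initial shift state.
   Returns 1 if so, 0 on an invalid or incomplete sequence or if a
   shift is still pending at the end. */
int ends_in_initial_state(const char *s)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);          /* a zeroed mbstate_t is the initial state */

    size_t left = strlen(s);
    while (left > 0) {
        size_t n = mbrtowc(NULL, s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return 0;                   /* invalid or truncated sequence */
        if (n == 0)
            break;                      /* decoded a null character */
        s += n;
        left -= n;
    }
    return mbsinit(&st) != 0;
}
```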

A multibyte character string is layout-compatible with a null-terminated byte string (NTBS); that is, it can be stored, copied, and examined using the same facilities, except for calculating the number of characters. If the correct locale is in effect, I/O functions also handle multibyte strings. Multibyte strings can be converted to and from wide strings using the following locale-dependent conversion functions:

Multibyte/wide character conversions

Defined in header <stdlib.h>
mblen
mbtowc
wctomb
mbstowcs
wcstombs
Defined in header <wchar.h>
mbsinit
btowc
wctob
mbrlen
mbrtowc
wcrtomb
mbsrtowcs
wcsrtombs
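As an illustration of the conversion functions listed above, the sketch below converts a multibyte string to a wide string with mbsrtowcs. The wrapper `to_wide` and its error convention are assumptions made for this example, not part of the library. The conversion uses the current locale, so a non-ASCII string needs a matching locale to be selected first with setlocale.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical wrapper: converts the multibyte string s into dst, which
   has room for dst_len wide characters including the terminator.
   Returns the number of wide characters written (excluding the
   terminator), or (size_t)-1 if s is invalid in the current locale or
   does not fit. */
size_t to_wide(const char *s, wchar_t *dst, size_t dst_len)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);          /* start in the initial shift state */

    const char *src = s;
    size_t n = mbsrtowcs(dst, &src, dst_len, &st);
    if (n == (size_t)-1)
        return (size_t)-1;              /* invalid multibyte sequence */
    if (src != NULL)
        return (size_t)-1;              /* dst filled before the terminator was reached */
    return n;
}
```

The return value is also the character count that strlen cannot provide: for an ASCII string such as "hello" it is 5 in any locale.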

Types

Defined in header <wchar.h>
mbstate_t

Macros

Defined in header <limits.h>
MB_LEN_MAX
Defined in header <stdlib.h>
MB_CUR_MAX
Defined in header <uchar.h>
__STDC_UTF_16__
indicates that UTF-16 encoding is used by mbrtoc16 and c16rtomb
(macro constant)
__STDC_UTF_32__
indicates that UTF-32 encoding is used by mbrtoc32 and c32rtomb
(macro constant)
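The macros above might be used as in the following sketch. Both function names are hypothetical. The example deliberately does not include <uchar.h>, on the assumption that a compiler which defines `__STDC_UTF_32__` predefines it; on other implementations the second function simply reports 0.

```c
#include <limits.h>
#include <stdlib.h>

/* MB_LEN_MAX is a compile-time bound over all supported locales;
   MB_CUR_MAX is the bound for the locale currently in effect, so it
   never exceeds MB_LEN_MAX. */
int cur_max_within_len_max(void)
{
    return (size_t)MB_CUR_MAX <= (size_t)MB_LEN_MAX;
}

/* Reports whether the implementation states that char32_t values
   produced by mbrtoc32 are UTF-32 encoded. */
int char32_is_utf32(void)
{
#ifdef __STDC_UTF_32__
    return 1;
#else
    return 0;
#endif
}
```

A buffer of MB_LEN_MAX bytes is therefore always large enough to hold one multibyte character produced by wctomb or wcrtomb, whichever locale is active.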