Difference between revisions of "c/string/multibyte"
Latest revision as of 17:51, 19 January 2023
A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character).
Each character stored in the string may occupy more than one byte. The encoding used to represent characters in a multibyte character string is locale-specific: it may be UTF-8, GB18030, EUC-JP, Shift-JIS, etc. For example, the char array {'\xe4','\xbd','\xa0','\xe5','\xa5','\xbd','\0'} is an NTMBS holding the string "你好" in UTF-8 multibyte encoding: the first three bytes encode the character 你, the next three bytes encode the character 好. The same string encoded in GB18030 is the char array {'\xc4', '\xe3', '\xba', '\xc3', '\0'}, where each of the two characters is encoded as a two-byte sequence.
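A minimal sketch of the example above, assuming the string is already known to be valid UTF-8: strlen reports bytes rather than characters, so counting characters requires knowledge of the encoding. The helper utf8_char_count and the array names are hypothetical and are not part of the C library.

```c
#include <stddef.h>
#include <string.h>

/* The two byte arrays given in the text. */
const char nihao_utf8[]    = {'\xe4', '\xbd', '\xa0', '\xe5', '\xa5', '\xbd', '\0'};
const char nihao_gb18030[] = {'\xc4', '\xe3', '\xba', '\xc3', '\0'};

/* Hypothetical helper: counts the characters in a string that is assumed
   to be valid UTF-8 by skipping continuation bytes (bit pattern 10xxxxxx).
   It performs no validation and is wrong for any other encoding. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;   /* a lead byte or a single-byte character */
    }
    return n;
}
```

For these arrays, strlen reports 6 bytes for the UTF-8 form and 4 bytes for the GB18030 form, while utf8_char_count reports 2 characters for the UTF-8 form.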
In some multibyte encodings, any given multibyte character sequence may represent different characters depending on the previous byte sequences, known as "shift sequences". Such encodings are known as state-dependent: knowledge of the current shift state is required to interpret each character. An NTMBS is only valid if it begins and ends in the initial shift state: if a shift sequence was used, the corresponding unshift sequence has to be present before the terminating null character. Examples of such encodings are BOCU-1 and SCSU.
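In C, the state a decoder must carry between calls is held in an mbstate_t object, and mbsinit reports whether that object describes the initial state. The following sketch feeds a number of bytes to mbrtowc and checks the state afterwards. The helper name back_in_initial_state is hypothetical, and the result depends on the encoding of the locale in effect.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decodes up to n bytes of s with mbrtowc and returns
   nonzero if the conversion state is the initial state afterwards. */
int back_in_initial_state(const char *s, size_t n)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);          /* an all-zero mbstate_t is the initial state */

    while (n > 0) {
        size_t r = mbrtowc(NULL, s, n, &st);
        if (r == (size_t)-1)
            return 0;                   /* encoding error: the state is unspecified */
        if (r == (size_t)-2)
            break;                      /* the bytes ran out in the middle of a character */
        if (r == 0)
            break;                      /* reached a null character */
        s += r;
        n -= r;
    }
    return mbsinit(&st) != 0;
}
```

In the default "C" locale, back_in_initial_state("abc", 3) is nonzero. With a state-dependent encoding selected through setlocale, stopping between a shift sequence and its unshift sequence would be expected to yield zero.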
A multibyte character string is layout-compatible with a null-terminated byte string (NTBS); that is, it can be stored, copied, and examined using the same facilities, except for calculating the number of characters. If the correct locale is in effect, I/O functions also handle multibyte strings. Multibyte strings can be converted to and from wide strings using the following locale-dependent conversion functions:
Multibyte/wide character conversions
Defined in header <stdlib.h>
mblen | returns the number of bytes in the next multibyte character (function)
mbtowc | converts the next multibyte character to wide character (function)
wctomb, wctomb_s (C11) | converts a wide character to its multibyte representation (function)
mbstowcs, mbstowcs_s (C11) | converts a narrow multibyte character string to wide string (function)
wcstombs, wcstombs_s (C11) | converts a wide string to narrow multibyte character string (function)
Defined in header <wchar.h>
mbsinit (C95) | checks if the mbstate_t object represents initial shift state (function)
btowc (C95) | widens a single-byte narrow character to wide character, if possible (function)
wctob (C95) | narrows a wide character to a single-byte narrow character, if possible (function)
mbrlen (C95) | returns the number of bytes in the next multibyte character, given state (function)
mbrtowc (C95) | converts the next multibyte character to wide character, given state (function)
wcrtomb (C95), wcrtomb_s (C11) | converts a wide character to its multibyte representation, given state (function)
mbsrtowcs (C95), mbsrtowcs_s (C11) | converts a narrow multibyte character string to wide string, given state (function)
wcsrtombs (C95), wcsrtombs_s (C11) | converts a wide string to narrow multibyte character string, given state (function)
Defined in header <uchar.h>
mbrtoc8 (C23) | converts a narrow multibyte character to UTF-8 encoding (function)
c8rtomb (C23) | converts UTF-8 string to narrow multibyte encoding (function)
mbrtoc16 (C11) | generates the next 16-bit wide character from a narrow multibyte string (function)
c16rtomb (C11) | converts a 16-bit wide character to narrow multibyte string (function)
mbrtoc32 (C11) | generates the next 32-bit wide character from a narrow multibyte string (function)
c32rtomb (C11) | converts a 32-bit wide character to narrow multibyte string (function)
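A minimal sketch of how some of these functions might be combined, assuming a fixed scratch buffer is acceptable. It shows a locale-aware length calculation with mbsrtowcs and a round trip through mbstowcs and wcstombs. The helper names mbs_wide_length and roundtrip_matches and the capacity CAP are hypothetical. The count returned is the number of wide characters, which may differ from the number of user-perceived characters on platforms where wchar_t is narrower than a code point.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

enum { CAP = 64 };   /* arbitrary buffer capacity chosen for this sketch */

/* Hypothetical helper: number of wide characters that s converts to in the
   current locale, or (size_t)-1 if s contains an invalid sequence. */
size_t mbs_wide_length(const char *s)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);            /* start in the initial shift state */
    return mbsrtowcs(NULL, &s, 0, &st);   /* null destination: count only */
}

/* Hypothetical helper: returns 1 if mbs survives a multibyte -> wide ->
   multibyte round trip unchanged in the current locale, 0 otherwise. */
int roundtrip_matches(const char *mbs)
{
    wchar_t wide[CAP];
    char back[CAP];

    size_t nw = mbstowcs(wide, mbs, CAP);
    if (nw == (size_t)-1 || nw == CAP)    /* invalid sequence, or result not null-terminated */
        return 0;

    size_t nb = wcstombs(back, wide, CAP);
    if (nb == (size_t)-1 || nb == CAP)    /* unrepresentable character, or no room for the null */
        return 0;

    return strcmp(mbs, back) == 0;
}
```

Both helpers interpret their argument in the encoding of the locale in effect, so a program that handles non-ASCII text would normally call setlocale first. In a state-dependent encoding, a round trip may legitimately produce different bytes if the input contained redundant shift sequences.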
Types
Defined in header <wchar.h>
mbstate_t (C95) | conversion state information necessary to iterate multibyte character strings (type)
Defined in header <uchar.h>
char8_t (C23) | UTF-8 character type, an alias for unsigned char (typedef)
char16_t (C11) | 16-bit wide character type (typedef)
char32_t (C11) | 32-bit wide character type (typedef)
Macros
Defined in header <limits.h>
MB_LEN_MAX | maximum number of bytes in a multibyte character, for any supported locale (macro constant)
Defined in header <stdlib.h>
MB_CUR_MAX | maximum number of bytes in a multibyte character, in the current locale (macro variable)
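The practical difference between the two macros is that MB_LEN_MAX is a compile-time constant and can size a plain array, whereas MB_CUR_MAX is evaluated at run time and may change after setlocale. A minimal sketch, in which the helper name encoded_length is hypothetical:

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: number of bytes needed to encode wc in the current
   locale, or -1 if wc has no multibyte representation there. */
int encoded_length(wchar_t wc)
{
    char buf[MB_LEN_MAX];    /* large enough for one character in any supported locale */
    return wctomb(buf, wc);  /* writes at most MB_CUR_MAX bytes */
}
```

In the default "C" locale, encoded_length(L'A') is 1. MB_CUR_MAX never exceeds MB_LEN_MAX, so a buffer of MB_LEN_MAX bytes remains sufficient after a locale change.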
References
- C11 standard (ISO/IEC 9899:2011):
  - 7.10 Sizes of integer types <limits.h> (p: 222)
  - 7.22 General utilities <stdlib.h> (p: 340-360)
  - 7.28 Unicode utilities <uchar.h> (p: 398-401)
  - 7.29 Extended multibyte and wide character utilities <wchar.h> (p: 402-446)
  - 7.31.12 General utilities <stdlib.h> (p: 456)
  - 7.31.16 Extended multibyte and wide character utilities <wchar.h> (p: 456)
  - K.3.6 General utilities <stdlib.h> (p: 604-614)
  - K.3.9 Extended multibyte and wide character utilities <wchar.h> (p: 627-651)
- C99 standard (ISO/IEC 9899:1999):
  - 7.10 Sizes of integer types <limits.h> (p: 203)
  - 7.20 General utilities <stdlib.h> (p: 306-324)
  - 7.24 Extended multibyte and wide character utilities <wchar.h> (p: 348-392)
  - 7.26.10 General utilities <stdlib.h> (p: 402)
  - 7.26.12 Extended multibyte and wide character utilities <wchar.h> (p: 402)
- C89/C90 standard (ISO/IEC 9899:1990):
  - 4.1.4 Limits <float.h> and <limits.h>
  - 4.10 GENERAL UTILITIES <stdlib.h>
  - 4.13.7 General utilities <stdlib.h>
See also
C++ documentation for Null-terminated multibyte strings