Difference between revisions of "cpp/string/multibyte"

From cppreference.com
(+multibyte macros)
m (fmt)
 
(21 intermediate revisions by 7 users not shown)
Line 1: Line 1:
 
{{title|Null-terminated multibyte strings}}
 
{{title|Null-terminated multibyte strings}}
{{cpp/string/multibyte/sidebar}}
+
{{cpp/string/multibyte/navbar}}
  
{{todo|clearly explain relation to NTBS (NTMBS is a subset of NTBS, etc.)}}
+
A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character).
  
A null-terminated multibyte string (NTMBS), or "multibyte string", is also a sequence of nonzero bytes followed by a byte with value zero (the terminating null character), but each character stored in the string may occupy more than one byte. For example, the char array {{cpp|{'\xe4','\xbd','\xa0','\xe5','\xa5','\xbd','\0'}}} is an NTMBS holding the string {{cpp|"你好"}} in UTF-8 multibyte encoding: the first three bytes encode the character 你, the next three bytes encode the character 好.
+
Each character stored in the string may occupy more than one byte. The encoding used to represent characters in a multibyte character string is locale-specific: it may be UTF-8, GB18030, EUC-JP, Shift-JIS, etc. For example, the char array {{c|{'\xe4','\xbd','\xa0','\xe5','\xa5','\xbd','\0'}<!---->}} is an NTMBS holding the string {{c|"你好"}} in UTF-8 multibyte encoding: the first three bytes encode the character 你, the next three bytes encode the character 好. The same string encoded in GB18030 is the char array {{c|{'\xc4', '\xe3', '\xba', '\xc3', '\0'}<!---->}}, where each of the two characters is encoded as a two-byte sequence.
  
An NTMBS is only valid if it begins and ends in the initial shift state: if the string above began with {{cpp|'\xbd'}}, a byte that cannot appear in the initial shift state of UTF-8 (that is, it cannot be the first byte of a multibyte character), the sequence would not be an NTMBS. A multibyte character string is layout-compatible with byte string, that is, can be stored, copied, and examined using the same facilities, except for the length calculation. Multibyte strings can be converted to and from wide strings using appropriate conversion functions.
+
In some multibyte encodings, any given multibyte character sequence may represent different characters depending on the previous byte sequences, known as "shift sequences". Such encodings are known as state-dependent: knowledge of the current shift state is required to interpret each character. An NTMBS is only valid if it begins and ends in the initial shift state: if a shift sequence was used, the corresponding unshift sequence has to be present before the terminating null character. Examples of such encodings are the 7-bit JIS, BOCU-1 and [https://www.unicode.org/reports/tr6 SCSU].
 +
 
 +
A multibyte character string is layout-compatible with null-terminated byte string (NTBS), that is, can be stored, copied, and examined using the same facilities, except for calculating the number of characters. If the correct locale is in effect, I/O functions also handle multibyte strings. Multibyte strings can be converted to and from wide strings using the {{lc|std::codecvt}} member functions, {{lc|std::wstring_convert}}, or the following locale-dependent conversion functions:
  
 
===Multibyte/wide character conversions===
 
===Multibyte/wide character conversions===
{{dcl list begin}}
+
{{dsc begin}}
{{dcl list header | cstdlib}}
+
{{dsc header|cstdlib}}
{{dcl list fun | cpp/string/multibyte/mblen | returns the number of bytes in the next multibyte character}}
+
{{dsc inc|cpp/string/multibyte/dsc mblen}}
{{dcl list fun | cpp/string/multibyte/mbtowc | converts the next multibyte character to wide character}}
+
{{dsc inc|cpp/string/multibyte/dsc mbtowc}}
{{dcl list fun | cpp/string/multibyte/wctomb | converts a wide character to its multibyte representation}}
+
{{dsc inc|cpp/string/multibyte/dsc wctomb}}
{{dcl list fun | cpp/string/multibyte/mbstowcs | converts a narrow multibyte character string to wide string}}
+
{{dsc inc|cpp/string/multibyte/dsc mbstowcs}}
{{dcl list fun | cpp/string/multibyte/wcstombs | converts a wide string to narrow multibyte character string}}
+
{{dsc inc|cpp/string/multibyte/dsc wcstombs}}
{{dcl list header | cwchar}}
+
{{dsc header|cwchar}}
{{dcl list fun | cpp/string/multibyte/mbsinit | checks if the mbstate_t object represents initial shift state}}
+
{{dsc inc|cpp/string/multibyte/dsc mbsinit}}
{{dcl list fun | cpp/string/multibyte/btowc | widens a single-byte narrow character to wide character, if possible}}
+
{{dsc inc|cpp/string/multibyte/dsc btowc}}
{{dcl list fun | cpp/string/multibyte/wctob | narrows a wide character to a single-byte narrow character, if possible}}
+
{{dsc inc|cpp/string/multibyte/dsc wctob}}
{{dcl list fun | cpp/string/multibyte/mbrlen | returns the number of bytes in the next multibyte character, given state}}
+
{{dsc inc|cpp/string/multibyte/dsc mbrlen}}
{{dcl list fun | cpp/string/multibyte/mbrtowc | converts the next multibyte character to wide character, given state}}
+
{{dsc inc|cpp/string/multibyte/dsc mbrtowc}}
{{dcl list fun | cpp/string/multibyte/wcrtomb | converts a wide character to its multibyte representation, given state}}
+
{{dsc inc|cpp/string/multibyte/dsc wcrtomb}}
{{dcl list fun | cpp/string/multibyte/mbsrtowcs | converts a narrow multibyte character string to wide string, given state}}
+
{{dsc inc|cpp/string/multibyte/dsc mbsrtowcs}}
{{dcl list fun | cpp/string/multibyte/wcsrtombs | converts a wide string to narrow multibyte character string, given state}}
+
{{dsc inc|cpp/string/multibyte/dsc wcsrtombs}}
{{dcl list header | cuchar}}
+
{{dsc header|cuchar}}
{{dcl list fun | cpp/string/multibyte/mbrtoc16 | generate the next 16-bit wide character from a narrow multibyte string | notes={{mark c++11}} }}
+
{{dsc inc|cpp/string/multibyte/dsc mbrtoc8}}
{{dcl list fun | cpp/string/multibyte/c16rtombr | convert a 16-bit wide character to narrow multibyte string| notes={{mark c++11}} }}
+
{{dsc inc|cpp/string/multibyte/dsc c8rtomb}}
{{dcl list fun | cpp/string/multibyte/mbrtoc32 | generate the next 32-bit wide character from a narrow multibyte string| notes={{mark c++11}} }}
+
{{dsc inc|cpp/string/multibyte/dsc mbrtoc16}}
{{dcl list fun | cpp/string/multibyte/c32rtombr | convert a 32-bit wide character to narrow multibyte string| notes={{mark c++11}} }}
+
{{dsc inc|cpp/string/multibyte/dsc c16rtomb}}
{{dcl list end}}
+
{{dsc inc|cpp/string/multibyte/dsc mbrtoc32}}
 +
{{dsc inc|cpp/string/multibyte/dsc c32rtomb}}
 +
{{dsc end}}
 +
 
 +
===Types===
 +
{{dsc begin}}
 +
{{dsc header|cwchar}}
 +
{{dsc inc|cpp/string/multibyte/dsc mbstate_t}}
 +
{{dsc end}}
  
 
===Macros===
 
===Macros===
{{dcl list begin}}
+
{{dsc begin}}
{{dcl list header | climits}}
+
{{dsc header|climits}}
{{dcl list macro const | MB_LEN_MAX | nolink=true | maximum number of bytes in a multibyte character }}
+
{{dsc inc|cpp/string/multibyte/dsc MB_LEN_MAX}}
{{dcl list header | cstdlib}}
+
{{dsc header|cstdlib}}
{{dcl list macro const | MB_CUR_MAX | nolink=true | maximum number of bytes in a multibyte character in the current C locale}}
+
{{dsc inc|cpp/string/multibyte/dsc MB_CUR_MAX}}
{{dcl list header | cuchar}}
+
{{dsc header|cuchar}}
{{dcl list macro const | __STDC_UTF_16__ | nolink=true | indicates that UTF-16 encoding is used by mbrtoc16 and c16rtomb}}
+
{{dsc macro const|__STDC_UTF_16__|notes={{mark c++11}}|nolink=true|indicates that UTF-16 encoding is used by mbrtoc16 and c16rtomb}}
{{dcl list macro const | __STDC_UTF_32__ | nolink=true | indicates that UTF-32 encoding is used by mbrtoc32 and c32rtomb}}
+
{{dsc macro const|__STDC_UTF_32__|notes={{mark c++11}}|nolink=true|indicates that UTF-32 encoding is used by mbrtoc32 and c32rtomb}}
{{dcl list end}}
+
{{dsc end}}
 +
 
 +
===See also===
 +
{{dsc begin}}
 +
{{dsc see c|c/string/multibyte|Null-terminated multibyte strings}}
 +
{{dsc end}}
 +
 
 +
{{langlinks|de|es|fr|it|ja|pt|ru|zh}}

Latest revision as of 04:48, 26 December 2023

A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character).

Each character stored in the string may occupy more than one byte. The encoding used to represent characters in a multibyte character string is locale-specific: it may be UTF-8, GB18030, EUC-JP, Shift-JIS, etc. For example, the char array {'\xe4','\xbd','\xa0','\xe5','\xa5','\xbd','\0'} is an NTMBS holding the string "你好" in UTF-8 multibyte encoding: the first three bytes encode the character 你, the next three bytes encode the character 好. The same string encoded in GB18030 is the char array {'\xc4', '\xe3', '\xba', '\xc3', '\0'}, where each of the two characters is encoded as a two-byte sequence.

In some multibyte encodings, any given multibyte character sequence may represent different characters depending on the previous byte sequences, known as "shift sequences". Such encodings are known as state-dependent: knowledge of the current shift state is required to interpret each character. An NTMBS is only valid if it begins and ends in the initial shift state: if a shift sequence was used, the corresponding unshift sequence has to be present before the terminating null character. Examples of such encodings are 7-bit JIS, BOCU-1, and SCSU.

A multibyte character string is layout-compatible with a null-terminated byte string (NTBS); that is, it can be stored, copied, and examined using the same facilities, except for calculating the number of characters. If the correct locale is in effect, I/O functions also handle multibyte strings. Multibyte strings can be converted to and from wide strings using the std::codecvt member functions, std::wstring_convert, or the following locale-dependent conversion functions:

Multibyte/wide character conversions

Defined in header <cstdlib>
mblen
returns the number of bytes in the next multibyte character
(function)
mbtowc
converts the next multibyte character to wide character
(function)
wctomb
converts a wide character to its multibyte representation
(function)
mbstowcs
converts a narrow multibyte character string to wide string
(function)
wcstombs
converts a wide string to narrow multibyte character string
(function)
Defined in header <cwchar>
mbsinit
checks if the std::mbstate_t object represents initial shift state
(function)
btowc
widens a single-byte narrow character to wide character, if possible
(function)
wctob
narrows a wide character to a single-byte narrow character, if possible
(function)
mbrlen
returns the number of bytes in the next multibyte character, given state
(function)
mbrtowc
converts the next multibyte character to wide character, given state
(function)
wcrtomb
converts a wide character to its multibyte representation, given state
(function)
mbsrtowcs
converts a narrow multibyte character string to wide string, given state
(function)
wcsrtombs
converts a wide string to narrow multibyte character string, given state
(function)
Defined in header <cuchar>
mbrtoc8
(C++20)
converts a narrow multibyte character to UTF-8 encoding
(function)
c8rtomb
(C++20)
converts UTF-8 string to narrow multibyte encoding
(function)
mbrtoc16
(C++11)
converts a narrow multibyte character to UTF-16 encoding
(function)
c16rtomb
(C++11)
converts a 16-bit wide character to narrow multibyte string
(function)
mbrtoc32
(C++11)
converts a narrow multibyte character to UTF-32 encoding
(function)
c32rtomb
(C++11)
converts a 32-bit wide character to narrow multibyte string
(function)

Types

Defined in header <cwchar>
mbstate_t
conversion state information necessary to iterate multibyte character strings
(class)

Macros

Defined in header <climits>
MB_LEN_MAX
maximum number of bytes in a multibyte character
(macro constant)
Defined in header <cstdlib>
MB_CUR_MAX
maximum number of bytes in a multibyte character in the current C locale
(macro variable)
Defined in header <cuchar>
__STDC_UTF_16__
(C++11)
indicates that UTF-16 encoding is used by mbrtoc16 and c16rtomb
(macro constant)
__STDC_UTF_32__
(C++11)
indicates that UTF-32 encoding is used by mbrtoc32 and c32rtomb
(macro constant)

See also

C documentation for Null-terminated multibyte strings