Difference between revisions of "cpp/string/multibyte"

Latest revision as of 04:48, 26 December 2023

A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character).

Each character stored in the string may occupy more than one byte. The encoding used to represent characters in a multibyte character string is locale-specific: it may be UTF-8, GB18030, EUC-JP, Shift-JIS, etc. For example, the char array {'\xe4','\xbd','\xa0','\xe5','\xa5','\xbd','\0'} is an NTMBS holding the string "你好" in UTF-8 multibyte encoding: the first three bytes encode the character 你, the next three bytes encode the character 好. The same string encoded in GB18030 is the char array {'\xc4', '\xe3', '\xba', '\xc3', '\0'}, where each of the two characters is encoded as a two-byte sequence.

In some multibyte encodings, a given multibyte character sequence may represent different characters depending on the preceding byte sequences, known as "shift sequences". Such encodings are called state-dependent: knowledge of the current shift state is required to interpret each character. An NTMBS is valid only if it begins and ends in the initial shift state: if a shift sequence was used, the corresponding unshift sequence has to be present before the terminating null character. Examples of such encodings are 7-bit JIS, BOCU-1, and SCSU.

A multibyte character string is layout-compatible with a null-terminated byte string (NTBS); that is, it can be stored, copied, and examined using the same facilities, except for calculating the number of characters. If the correct locale is in effect, I/O functions also handle multibyte strings. Multibyte strings can be converted to and from wide strings using the std::codecvt member functions, std::wstring_convert, or the following locale-dependent conversion functions:

Defined in header `<cstdlib>`
mblen	returns the number of bytes in the next multibyte character (function)
mbtowc	converts the next multibyte character to wide character (function)
wctomb	converts a wide character to its multibyte representation (function)
mbstowcs	converts a narrow multibyte character string to wide string (function)
wcstombs	converts a wide string to narrow multibyte character string (function)
Defined in header `<cwchar>`
mbsinit	checks if the std::mbstate_t object represents initial shift state (function)
btowc	widens a single-byte narrow character to wide character, if possible (function)
wctob	narrows a wide character to a single-byte narrow character, if possible (function)
mbrlen	returns the number of bytes in the next multibyte character, given state (function)
mbrtowc	converts the next multibyte character to wide character, given state (function)
wcrtomb	converts a wide character to its multibyte representation, given state (function)
mbsrtowcs	converts a narrow multibyte character string to wide string, given state (function)
wcsrtombs	converts a wide string to narrow multibyte character string, given state (function)
Defined in header `<cuchar>`
mbrtoc8 (C++20)	converts a narrow multibyte character to UTF-8 encoding (function)
c8rtomb (C++20)	converts a UTF-8 string to narrow multibyte encoding (function)
mbrtoc16 (C++11)	converts a narrow multibyte character to UTF-16 encoding (function)
c16rtomb (C++11)	converts a 16-bit wide character to a narrow multibyte string (function)
mbrtoc32 (C++11)	converts a narrow multibyte character to UTF-32 encoding (function)
c32rtomb (C++11)	converts a 32-bit wide character to a narrow multibyte string (function)

Types

Defined in header `<cwchar>`
mbstate_t	conversion state information necessary to iterate multibyte character strings (class)

Macros

Defined in header `<climits>`
MB_LEN_MAX	maximum number of bytes in a multibyte character (macro constant)
Defined in header `<cstdlib>`
MB_CUR_MAX	maximum number of bytes in a multibyte character in the current C locale (macro variable)
Defined in header `<cuchar>`
__STDC_UTF_16__ (C++11)	indicates that UTF-16 encoding is used by mbrtoc16 and c16rtomb (macro constant)
__STDC_UTF_32__ (C++11)	indicates that UTF-32 encoding is used by mbrtoc32 and c32rtomb (macro constant)

See also

C documentation for Null-terminated multibyte strings

@@ Line 4: / Line 4: @@
 A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character).
-Each character stored in the string may occupy more than one byte. The encoding used to represent characters in a multibyte character string is locale-specific: it may be UTF-8, GB18030, EUC-JP, Shift-JIS, etc. For example, the char array {{c|{'\xe4','\xbd','\xa0','\xe5','\xa5','\xbd','\0'}}} is an NTMBS holding the string {{c|"你好"}} in UTF-8 multibyte encoding: the first three bytes encode the character 你, the next three bytes encode the character 好. The same string encoded in GB18030 is the char array {{c|{'\xc4', '\xe3', '\xba', '\xc3', '\0'}}}, where each of the two characters is encoded as a two-byte sequence.
+Each character stored in the string may occupy more than one byte. The encoding used to represent characters in a multibyte character string is locale-specific: it may be UTF-8, GB18030, EUC-JP, Shift-JIS, etc. For example, the char array {{c|{'\xe4','\xbd','\xa0','\xe5','\xa5','\xbd','\0'}<!---->}} is an NTMBS holding the string {{c|"你好"}} in UTF-8 multibyte encoding: the first three bytes encode the character 你, the next three bytes encode the character 好. The same string encoded in GB18030 is the char array {{c|{'\xc4', '\xe3', '\xba', '\xc3', '\0'}<!---->}}, where each of the two characters is encoded as a two-byte sequence.
-In some multibyte encodings, any given multibyte character sequence may represent different characters depending on the previous byte sequences, known as "shift sequences". Such encodings are known as state-dependent: knowledge of the current shift state is required to interpret each character. An NTMBS is only valid if it begins and ends in the initial shift state: if a shift sequence was used, the corresponding unshift sequence has to be present before the terminating null character. Examples of such encodings are the 7-bit JIS, BOCU-1 and SCSU.
+In some multibyte encodings, any given multibyte character sequence may represent different characters depending on the previous byte sequences, known as "shift sequences". Such encodings are known as state-dependent: knowledge of the current shift state is required to interpret each character. An NTMBS is only valid if it begins and ends in the initial shift state: if a shift sequence was used, the corresponding unshift sequence has to be present before the terminating null character. Examples of such encodings are the 7-bit JIS, BOCU-1 and [https://www.unicode.org/reports/tr6 SCSU].
 A multibyte character string is layout-compatible with null-terminated byte string (NTBS), that is, can be stored, copied, and examined using the same facilities, except for calculating the number of characters. If the correct locale is in effect, I/O functions also handle multibyte strings. Multibyte strings can be converted to and from wide strings using the {{lc|std::codecvt}} member functions, {{lc|std::wstring_convert}}, or the following locale-dependent conversion functions:
@@ Line 12: / Line 12: @@
 ===Multibyte/wide character conversions===
 {{dsc begin}}
-{{dsc header | cstdlib}}
+{{dsc header|cstdlib}}
-{{dsc inc | cpp/string/multibyte/dsc mblen}}
+{{dsc inc|cpp/string/multibyte/dsc mblen}}
-{{dsc inc | cpp/string/multibyte/dsc mbtowc}}
+{{dsc inc|cpp/string/multibyte/dsc mbtowc}}
-{{dsc inc | cpp/string/multibyte/dsc wctomb}}
+{{dsc inc|cpp/string/multibyte/dsc wctomb}}
-{{dsc inc | cpp/string/multibyte/dsc mbstowcs}}
+{{dsc inc|cpp/string/multibyte/dsc mbstowcs}}
-{{dsc inc | cpp/string/multibyte/dsc wcstombs}}
+{{dsc inc|cpp/string/multibyte/dsc wcstombs}}
-{{dsc header | cwchar}}
+{{dsc header|cwchar}}
-{{dsc inc | cpp/string/multibyte/dsc mbsinit}}
+{{dsc inc|cpp/string/multibyte/dsc mbsinit}}
-{{dsc inc | cpp/string/multibyte/dsc btowc}}
+{{dsc inc|cpp/string/multibyte/dsc btowc}}
-{{dsc inc | cpp/string/multibyte/dsc wctob}}
+{{dsc inc|cpp/string/multibyte/dsc wctob}}
-{{dsc inc | cpp/string/multibyte/dsc mbrlen}}
+{{dsc inc|cpp/string/multibyte/dsc mbrlen}}
-{{dsc inc | cpp/string/multibyte/dsc mbrtowc}}
+{{dsc inc|cpp/string/multibyte/dsc mbrtowc}}
-{{dsc inc | cpp/string/multibyte/dsc wcrtomb}}
+{{dsc inc|cpp/string/multibyte/dsc wcrtomb}}
-{{dsc inc | cpp/string/multibyte/dsc mbsrtowcs}}
+{{dsc inc|cpp/string/multibyte/dsc mbsrtowcs}}
-{{dsc inc | cpp/string/multibyte/dsc wcsrtombs}}
+{{dsc inc|cpp/string/multibyte/dsc wcsrtombs}}
-{{dsc header | cuchar}}
+{{dsc header|cuchar}}
-{{dsc inc | cpp/string/multibyte/dsc mbrtoc16}}
+{{dsc inc|cpp/string/multibyte/dsc mbrtoc8}}
-{{dsc inc | cpp/string/multibyte/dsc c16rtomb}}
+{{dsc inc|cpp/string/multibyte/dsc c8rtomb}}
-{{dsc inc | cpp/string/multibyte/dsc mbrtoc32}}
+{{dsc inc|cpp/string/multibyte/dsc mbrtoc16}}
-{{dsc inc | cpp/string/multibyte/dsc c32rtomb}}
+{{dsc inc|cpp/string/multibyte/dsc c16rtomb}}
+{{dsc inc|cpp/string/multibyte/dsc mbrtoc32}}
+{{dsc inc|cpp/string/multibyte/dsc c32rtomb}}
 {{dsc end}}
 ===Types===
 {{dsc begin}}
-{{dsc header | cwchar}}
+{{dsc header|cwchar}}
-{{dsc inc | cpp/string/multibyte/dsc mbstate_t}}
+{{dsc inc|cpp/string/multibyte/dsc mbstate_t}}
 {{dsc end}}
 ===Macros===
 {{dsc begin}}
-{{dsc header | climits}}
+{{dsc header|climits}}
-{{dsc inc | cpp/string/multibyte/dsc MB_LEN_MAX}}
+{{dsc inc|cpp/string/multibyte/dsc MB_LEN_MAX}}
-{{dsc header | cstdlib}}
+{{dsc header|cstdlib}}
-{{dsc inc | cpp/string/multibyte/dsc MB_CUR_MAX}}
+{{dsc inc|cpp/string/multibyte/dsc MB_CUR_MAX}}
-{{dsc header | cuchar}}
+{{dsc header|cuchar}}
-{{dsc macro const | __STDC_UTF_16__ |  nolink=true | indicates that UTF-16 encoding is used by mbrtoc16 and c16rtomb}}
+{{dsc macro const|__STDC_UTF_16__|notes={{mark c++11}}|nolink=true|indicates that UTF-16 encoding is used by mbrtoc16 and c16rtomb}}
-{{dsc macro const | __STDC_UTF_32__ |  nolink=true | indicates that UTF-32 encoding is used by mbrtoc32 and c32rtomb}}
+{{dsc macro const|__STDC_UTF_32__|notes={{mark c++11}}|nolink=true|indicates that UTF-32 encoding is used by mbrtoc32 and c32rtomb}}
 {{dsc end}}
-[[de:cpp/string/multibyte]]
+===See also===
-[[es:cpp/string/multibyte]]
+{{dsc begin}}
-[[fr:cpp/string/multibyte]]
+{{dsc see c|c/string/multibyte|Null-terminated multibyte strings}}
-[[it:cpp/string/multibyte]]
+{{dsc end}}
-[[ja:cpp/string/multibyte]]
-[[pt:cpp/string/multibyte]]
+{{langlinks|de|es|fr|it|ja|pt|ru|zh}}
-[[ru:cpp/string/multibyte]]
-[[zh:cpp/string/multibyte]]
