
Difference between revisions of "cpp/language/charset"

From cppreference.com
m (Tiny fix.)
m (Execution character set: c/core.)
 
(18 intermediate revisions by 8 users not shown)
Line 2: Line 2:
 
{{cpp/language/basics/navbar}}
 
{{cpp/language/basics/navbar}}
  
===Current character sets {{mark since c++23}}===
+
This page describes several character sets specified by the C++ standard.
  
====Translation character set====
+
{{rrev|since=c++23|
 +
===Translation character set===
 
The ''translation character set'' consists of the following elements:
 
The ''translation character set'' consists of the following elements:
* each character named by ISO/IEC 10646, as identified by its unique UCS scalar value, and
+
* each abstract character assigned a code point in the [https://www.unicode.org/versions/latest/ Unicode] codespace, and
* a distinct character for each UCS scalar value where no named character is assigned.
+
* a distinct character for each Unicode scalar value not assigned to an abstract character.
  
====Basic character set====
+
The translation character set is a superset of the basic character set and the basic literal character set (see below).
The ''basic character set'' is a subset of the translation character set, consisting of the following 96 characters:
+
}}
{{cpp/language/basic_charset}}
+
  
====Basic literal character set====
+
===Basic character set===
 +
{{anchor|Basic source character set}}
 +
The ''basic character set'' consists of the following {{rev inl|until=c++26|96}}{{rev inl|since=c++26|99}} characters:
 +
{{cpp/language/basic charset}}
 +
{{rrev|since=c++26|
 +
The following characters are added to the basic character set since C++26:
 +
{{cpp/language/ext_charset_single}}
 +
}}
 +
 
 +
===Basic literal character set===
 +
{{anchor|Basic execution character set}}
 
The ''basic literal character set'' consists of all characters of the basic character set, plus the following control characters:
 
The ''basic literal character set'' consists of all characters of the basic character set, plus the following control characters:
{| class="wikitable" style="text-align: left;"
+
{|class="wikitable" style="text-align: left;"
 
|-  
 
|-  
! {{tt|Code unit}} || {{tt|Character}}
+
!Code unit||Character
 
|-
 
|-
 
|U+0000||Null
 
|U+0000||Null
Line 28: Line 38:
 
|}
 
|}
  
====Execution character set====
+
===Execution character set===
 
The execution character set and the execution wide-character set are supersets of the basic literal
 
The execution character set and the execution wide-character set are supersets of the basic literal
 
character set. The encodings of the execution character sets and the sets of additional elements
 
character set. The encodings of the execution character sets and the sets of additional elements
(if any) are locale-specific.
+
(if any) are locale-specific. Each element of execution wide-character set must be representable as a distinct {{c/core|wchar_t}} code unit.
  
====Code unit and literal encoding====
+
===Code unit and literal encoding===
 
A ''code unit'' is an integer value of character type. Characters in a {{rlp|character literal}} other than a multicharacter or non-encodable character literal or in a {{rlp|string literal}} are encoded as a sequence of one or more code units, as determined by the encoding prefix; this is termed the respective ''literal encoding''.
 
A ''code unit'' is an integer value of character type. Characters in a {{rlp|character literal}} other than a multicharacter or non-encodable character literal or in a {{rlp|string literal}} are encoded as a sequence of one or more code units, as determined by the encoding prefix; this is termed the respective ''literal encoding''.
  
Line 45: Line 55:
 
For a UTF-8, UTF-16, or UTF-32 literal, the UCS scalar value corresponding to each character of the translation character set is encoded as specified in ISO/IEC 10646 for the respective UCS encoding form.
 
For a UTF-8, UTF-16, or UTF-32 literal, the UCS scalar value corresponding to each character of the translation character set is encoded as specified in ISO/IEC 10646 for the respective UCS encoding form.
  
===Pre-C++23 character sets {{mark until c++23}}===
+
===Notes===
 +
The standard names of some character sets are changed in C++23 via {{wg21|P2314R4}}.
  
====Basic source character set====
+
{|class="wikitable" style="text-align: left;"
The ''basic source character set'' consists of 96 characters:
+
|-  
* the space character,
+
!New name(s)||Old name(s)
* the control characters representing
+
** horizontal tab,
+
** vertical tab,
+
** form feed,
+
** and new-line,
+
* plus the following 91 graphical characters:
+
{| class="wikitable" style="text-align: center;"
+
|-
+
|{{tt|a}}||{{tt|b}}||{{tt|c}}||{{tt|d}}||{{tt|e}}||{{tt|f}}||{{tt|g}}||{{tt|h}}||{{tt|i}}||{{tt|j}}||{{tt|k}}||{{tt|l}}||{{tt|m}}||{{tt|n}}||{{tt|o}}||{{tt|p}}||{{tt|q}}||{{tt|r}}||{{tt|s}}||{{tt|t}}||{{tt|u}}||{{tt|v}}||{{tt|w}}||{{tt|x}}||{{tt|y}}||{{tt|z}}
+
|-
+
|{{tt|A}}||{{tt|B}}||{{tt|C}}||{{tt|D}}||{{tt|E}}||{{tt|F}}||{{tt|G}}||{{tt|H}}||{{tt|I}}||{{tt|J}}||{{tt|K}}||{{tt|L}}||{{tt|M}}||{{tt|N}}||{{tt|O}}||{{tt|P}}||{{tt|Q}}||{{tt|R}}||{{tt|S}}||{{tt|T}}||{{tt|U}}||{{tt|V}}||{{tt|W}}||{{tt|X}}||{{tt|Y}}||{{tt|Z}}
+
 
|-
 
|-
|{{tt|0}}||{{tt|1}}||{{tt|2}}||{{tt|3}}||{{tt|4}}||{{tt|5}}||{{tt|6}}||{{tt|7}}||{{tt|8}}||{{tt|9}}
+
|basic character set||basic source character set
 
|-
 
|-
|{{tt|_}}||{{tt|{}}||{{tt|}}}||{{tt|[}}||{{tt|]}}||{{tt|#}}||{{tt|(}}||{{tt|)}}||{{tt|<}}||{{tt|>}}||{{tt|%}}||{{tt|:}}||{{tt|;}}||{{tt|.}}||{{tt|?}}||{{tt|*}}||{{tt|+}}||{{tt|-}}||{{tt|/}}||{{tt|^}}||{{tt|&}}||{{tt|&#124;}}||{{tt|~}}||{{tt|!}}||{{tt|{{=}}}}||{{tt|,}}||{{tt|\}}||{{tt|"}}||{{tt|’}}
+
|basic literal character set||basic execution character set<br>basic execution wide-character set
 
|}
 
|}
  
The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in {{rlp|translation phases#Phase1|translation phase 1}}) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.
+
Mapping from source file {{rev inl|since=c++23|(other than a UTF-8 source file)}} characters to the {{rev inl|until=c++23|basic character set}}{{rev inl|since=c++23|translation character set}} during {{rlp|translation phases#Phase1|translation phase 1}} is implementation-defined, so an implementation is required to document how the basic source characters are represented in source files.
  
====Basic execution character set====
+
===Defect reports===
The ''basic execution character set'' and the ''basic execution wide-character set'' shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose value is 0.
+
{{dr list begin}}
 +
{{dr list item|wg=cwg|dr=788|std=C++98|before=the values of the members of the execution character sets<br>were implementation-defined, but were not locale-specific|after=they are locale-specific}}
 +
{{dr list item|wg=cwg|dr=1796|std=C++98|before=the representation of the null (wide) character in<br>basic execution (wide-)character set had all zero bits|after=only required value to be zero}}
 +
{{dr list end}}
  
For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
+
===See also===
 +
{{dsc begin}}
 +
{{dsc|{{rlp|ascii|ASCII chart}}}}
 +
{{dsc inc|cpp/locale/dsc text_encoding}}
 +
{{dsc see c|c/language/charset|Character sets and encodings|nomono=true}}
 +
{{dsc end}}
  
====Execution character set (Old definition)====
+
{{langlinks|es|ja|ru|zh}}
The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.
+

Latest revision as of 19:49, 28 July 2024

 
 
 
 

This page describes several character sets specified by the C++ standard.

Translation character set

The translation character set consists of the following elements:

  • each abstract character assigned a code point in the Unicode codespace, and
  • a distinct character for each Unicode scalar value not assigned to an abstract character.

The translation character set is a superset of the basic character set and the basic literal character set (see below).

(since C++23)

Basic character set

The basic character set consists of the following 96 (until C++26) or 99 (since C++26) characters:

Code unit Character Glyph
U+0009 Character tabulation
U+000B Line tabulation
U+000C Form feed (FF)
U+0020 Space
U+000A Line feed (LF) new-line
U+0021 Exclamation mark !
U+0022 Quotation mark "
U+0023 Number sign #
U+0025 Percent sign %
U+0026 Ampersand &
U+0027 Apostrophe '
U+0028 Left parenthesis (
U+0029 Right parenthesis )
U+002A Asterisk *
U+002B Plus sign +
U+002C Comma ,
U+002D Hyphen-minus -
U+002E Full stop .
U+002F Solidus /
U+0030 .. U+0039 Digit zero .. nine 0 1 2 3 4 5 6 7 8 9
U+003A Colon :
U+003B Semicolon ;
U+003C Less-than sign <
U+003D Equals sign =
U+003E Greater-than sign >
U+003F Question mark ?
U+0041 .. U+005A Latin capital letter A .. Z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
U+005B Left square bracket [
U+005C Reverse solidus \
U+005D Right square bracket ]
U+005E Circumflex accent ^
U+005F Low line _
U+0061 .. U+007A Latin small letter a .. z a b c d e f g h i j k l m n o p q r s t u v w x y z
U+007B Left curly bracket {
U+007C Vertical line |
U+007D Right curly bracket }
U+007E Tilde ~

The following characters are added to the basic character set since C++26:

Code unit Character Glyph
U+0024 Dollar Sign $
U+0040 Commercial At @
U+0060 Grave Accent `
(since C++26)

Basic literal character set

The basic literal character set consists of all characters of the basic character set, plus the following control characters:

Code unit Character
U+0000 Null
U+0007 Bell
U+0008 Backspace
U+000D Carriage return (CR)

Execution character set

The execution character set and the execution wide-character set are supersets of the basic literal character set. The encodings of the execution character sets and the sets of additional elements (if any) are locale-specific. Each element of the execution wide-character set must be representable as a distinct wchar_t code unit.

Code unit and literal encoding

A code unit is an integer value of character type. Characters in a character literal other than a multicharacter or non-encodable character literal or in a string literal are encoded as a sequence of one or more code units, as determined by the encoding prefix; this is termed the respective literal encoding.

A literal encoding or a locale-specific encoding of one of the execution character sets encodes each element of the basic literal character set as a single code unit with non-negative value, distinct from the code unit for any other such element. A character not in the basic literal character set can be encoded with more than one code unit; the value of such a code unit can be the same as that of a code unit for an element of the basic literal character set. The encodings of the execution character sets can be unrelated to any literal encoding.
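
The single-code-unit guarantee above can be spot-checked at compile time. The following is a minimal sketch, assuming a C++14 compiler; the names `sample`, `sample_size`, and `sample_is_well_encoded` are hypothetical and the array covers only a hand-picked subset of the basic literal character set, not all of it.

```cpp
#include <cassert>
#include <cstddef>

// A hand-picked sample of basic literal character set elements (not the full set).
constexpr char sample[] = {'\0', '\a', '\b', '\r', '\t', '\n',
                           ' ', '0', '9', 'A', 'z', '~'};
constexpr std::size_t sample_size = sizeof(sample) / sizeof(sample[0]);

// Returns true if every sampled code unit is non-negative and
// no two sampled characters share a code unit value. Requires C++14.
constexpr bool sample_is_well_encoded() {
    for (std::size_t i = 0; i < sample_size; ++i) {
        if (sample[i] < 0) return false;
        for (std::size_t j = i + 1; j < sample_size; ++j)
            if (sample[i] == sample[j]) return false;
    }
    return true;
}

static_assert(sample_is_well_encoded(),
              "basic literal characters must have distinct, non-negative code units");
```

The check only exercises the ordinary literal encoding; a locale-specific encoding of the execution character set is chosen at run time and is not covered by it.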

The ordinary literal encoding is the encoding applied to an ordinary character or string literal. The wide literal encoding is the encoding applied to a wide character or string literal.

The U+0000 NULL character is encoded as the value 0. No other element of the translation character set is encoded with a code unit of value 0. The code unit value of each decimal digit character after the digit 0 (U+0030) shall be one greater than the value of the previous. The ordinary and wide literal encodings are otherwise implementation-defined.
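
Because the decimal digits are guaranteed to have consecutive code unit values, a digit character can be converted to its numeric value by subtracting `'0'`. The sketch below illustrates this; `digit_value` is a hypothetical helper name, not a standard function.

```cpp
#include <cassert>

// Converts a decimal digit character to its numeric value, or -1 if the
// character is not a decimal digit. Relies only on the guarantee that
// '0' through '9' have consecutive code unit values.
constexpr int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

static_assert('\0' == 0, "U+0000 NULL is encoded as the value 0");
static_assert('9' - '0' == 9, "digits are consecutive in the ordinary literal encoding");
static_assert(L'9' - L'0' == 9, "digits are consecutive in the wide literal encoding");
static_assert(digit_value('7') == 7, "digit_value maps '7' to 7");
```

No comparable guarantee exists for letters: `'b' - 'a'` need not be 1 in every literal encoding.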

For a UTF-8, UTF-16, or UTF-32 literal, the UCS scalar value corresponding to each character of the translation character set is encoded as specified in ISO/IEC 10646 for the respective UCS encoding form.
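
The effect of the encoding prefix on the number of code units can be observed with universal character names, which avoid any dependence on the source file encoding. This is a sketch assuming a C++11 or later compiler; `code_units` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
// Works for any character type, including char8_t in C++20.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+00E9 (Latin small letter e with acute)
static_assert(code_units(u8"\u00E9") == 2, "two UTF-8 code units");
static_assert(code_units(u"\u00E9") == 1, "one UTF-16 code unit");
static_assert(code_units(U"\u00E9") == 1, "one UTF-32 code unit");

// U+1F600 (outside the Basic Multilingual Plane)
static_assert(code_units(u8"\U0001F600") == 4, "four UTF-8 code units");
static_assert(code_units(u"\U0001F600") == 2, "a UTF-16 surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "one UTF-32 code unit");

// The surrogate pair for U+1F600 is D83D DE00.
static_assert(u"\U0001F600"[0] == 0xD83D, "high surrogate");
static_assert(u"\U0001F600"[1] == 0xDE00, "low surrogate");
```

The same literals without a UTF prefix use the ordinary or wide literal encoding, whose code unit counts are implementation-defined.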

Notes

The standard names of some character sets are changed in C++23 via P2314R4.

New name(s) | Old name(s)
basic character set | basic source character set
basic literal character set | basic execution character set, basic execution wide-character set

Mapping from source file characters (other than those of a UTF-8 source file, since C++23) to the basic character set (until C++23) or the translation character set (since C++23) during translation phase 1 is implementation-defined, so an implementation is required to document how the basic source characters are represented in source files.

Defect reports

The following behavior-changing defect reports were applied retroactively to previously published C++ standards.

DR | Applied to | Behavior as published | Correct behavior
CWG 788 | C++98 | the values of the members of the execution character sets were implementation-defined, but were not locale-specific | they are locale-specific
CWG 1796 | C++98 | the representation of the null (wide) character in basic execution (wide-)character set had all zero bits | only required value to be zero

See also

ASCII chart
std::text_encoding (C++26): describes an interface for accessing the IANA Character Sets registry (class)
C documentation for Character sets and encodings