Difference between revisions of "cpp/language/charset"

Revision as of 17:52, 27 September 2022

Current character sets (since C++23)

Translation character set

The translation character set consists of the following elements:
- each character named by ISO/IEC 10646, as identified by its unique UCS scalar value, and
- a distinct character for each UCS scalar value where no named character is assigned.

Basic character set

The basic character set is a subset of the translation character set, consisting of the following 96 characters:

Code unit	Character	Glyph
U+0009	Character tabulation
U+000B	Line tabulation
U+000C	Form feed (FF)
U+0020	Space
U+000A	Line feed (LF)	new-line
U+0021	Exclamation mark	`!`
U+0022	Quotation mark	`"`
U+0023	Number sign	`#`
U+0025	Percent sign	`%`
U+0026	Ampersand	`&`
U+0027	Apostrophe	`'`
U+0028	Left parenthesis	`(`
U+0029	Right parenthesis	`)`
U+002A	Asterisk	`*`
U+002B	Plus sign	`+`
U+002C	Comma	`,`
U+002D	Hyphen-minus	`-`
U+002E	Full stop	`.`
U+002F	Solidus	`/`
U+0030 .. U+0039	Digit zero .. nine	`0 1 2 3 4 5 6 7 8 9`
U+003A	Colon	`:`
U+003B	Semicolon	`;`
U+003C	Less-than sign	`<`
U+003D	Equals sign	`=`
U+003E	Greater-than sign	`>`
U+003F	Question mark	`?`
U+0041 .. U+005A	Latin capital letter A .. Z	`A B C D E F G H I J K L M` `N O P Q R S T U V W X Y Z`
U+005B	Left square bracket	`[`
U+005C	Reverse solidus	`\`
U+005D	Right square bracket	`]`
U+005E	Circumflex accent	`^`
U+005F	Low line	`_`
U+0061 .. U+007A	Latin small letter a .. z	`a b c d e f g h i j k l m` `n o p q r s t u v w x y z`
U+007B	Left curly bracket	`{`
U+007C	Vertical line	`|`
U+007D	Right curly bracket	`}`
U+007E	Tilde	`~`

Basic literal character set

The basic literal character set consists of all characters of the basic character set, plus the following control characters:

Code unit	Character
U+0000	Null
U+0007	Bell
U+0008	Backspace
U+000D	Carriage return (CR)

Execution character set

The execution character set and the execution wide-character set are supersets of the basic literal character set. The encodings of the execution character sets and the sets of additional elements (if any) are locale-specific. Each element of the execution wide-character set must be representable as a distinct `wchar_t` code unit.

Code unit and literal encoding

A code unit is an integer value of character type. Characters in a character literal other than a multicharacter or non-encodable character literal or in a string literal are encoded as a sequence of one or more code units, as determined by the encoding prefix; this is termed the respective literal encoding.

A literal encoding or a locale-specific encoding of one of the execution character sets encodes each element of the basic literal character set as a single code unit with non-negative value, distinct from the code unit for any other such element. A character not in the basic literal character set can be encoded with more than one code unit; the value of such a code unit can be the same as that of a code unit for an element of the basic literal character set. The encodings of the execution character sets can be unrelated to any literal encoding.
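
The guarantee above (one non-negative, distinct code unit per element) can be checked at compile time for the ordinary literal encoding. The following is a minimal sketch, assuming a C++14 or later compiler; the names `basic_literal` and `all_distinct_non_negative` are made up for this illustration and are not part of the standard.

```cpp
#include <cstddef>

// The 96 characters of the basic character set, followed by the three
// control characters BEL, BS and CR. The terminating null character of
// the array supplies U+0000, the fourth added control character.
constexpr char basic_literal[] =
    "\t\v\f \n!\"#%&'()*+,-./0123456789:;<=>?"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_"
    "abcdefghijklmnopqrstuvwxyz{|}~"
    "\a\b\r";

// Illustrative helper: true if the first n code units of s are all
// non-negative and pairwise distinct.
constexpr bool all_distinct_non_negative(const char* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (s[i] < 0) return false;
        for (std::size_t j = i + 1; j < n; ++j)
            if (s[i] == s[j]) return false;
    }
    return true;
}

// 96 + 3 characters plus the terminating null: one code unit each.
static_assert(sizeof(basic_literal) == 100,
              "each element occupies exactly one code unit");
static_assert(all_distinct_non_negative(basic_literal, sizeof(basic_literal)),
              "code units are non-negative and distinct");
```

Because every element is a single `char`, the array size equals the number of characters, including the terminating null.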

The ordinary literal encoding is the encoding applied to an ordinary character or string literal. The wide literal encoding is the encoding applied to a wide character or string literal.
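
As a small illustration of the two encodings, an ordinary literal is made of `char` code units and a wide literal of `wchar_t` code units. The helper `uses_wide_encoding` below is a hypothetical name introduced only for this sketch.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper: true if the string literal is made of wchar_t
// code units, that is, it uses the wide literal encoding.
template <class CharT, std::size_t N>
constexpr bool uses_wide_encoding(const CharT (&)[N]) {
    return std::is_same<CharT, wchar_t>::value;
}

static_assert(std::is_same<decltype('a'), char>::value,
              "ordinary character literal has type char");
static_assert(std::is_same<decltype(L'a'), wchar_t>::value,
              "wide character literal has type wchar_t");
static_assert(!uses_wide_encoding("abc"), "ordinary string literal");
static_assert(uses_wide_encoding(L"abc"), "wide string literal");
```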

The U+0000 NULL character is encoded as the value 0. No other element of the translation character set is encoded with a code unit of value 0. The code unit value of each decimal digit character after the digit 0 (U+0030) shall be one greater than the value of the previous digit. The ordinary and wide literal encodings are otherwise implementation-defined.
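
The digit rule is what makes the common `c - '0'` idiom portable, whatever the implementation-defined encoding is. A minimal sketch follows; `digit_value` is a name chosen for this example only.

```cpp
// Illustrative helper: numeric value of a decimal digit character,
// or -1 if c is not a decimal digit. This relies only on the digits
// '0'..'9' having consecutive, increasing code unit values.
constexpr int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

static_assert('\0' == 0 && L'\0' == 0, "the null character is encoded as 0");
static_assert('9' - '0' == 9 && L'9' - L'0' == 9, "digits are contiguous");
static_assert(digit_value('7') == 7, "a digit maps to its numeric value");
```

No similar guarantee exists for letters: `'z' - 'a'` is not required to be 25.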

For a UTF-8, UTF-16, or UTF-32 literal, the UCS scalar value corresponding to each character of the translation character set is encoded as specified in ISO/IEC 10646 for the respective UCS encoding form.
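
Since the UTF literal encodings are fixed by ISO/IEC 10646, the number of code units in such a literal can be predicted. The sketch below counts them at compile time; `code_units` is a hypothetical helper, and universal character names are used so that the result does not depend on the encoding of the source file.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null code unit.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+00E9 (Latin small letter e with acute)
static_assert(code_units(u8"\u00E9") == 2, "two UTF-8 code units");
static_assert(code_units(u"\u00E9") == 1, "one UTF-16 code unit");

// U+1F600 lies outside the Basic Multilingual Plane
static_assert(code_units(u8"\U0001F600") == 4, "four UTF-8 code units");
static_assert(code_units(u"\U0001F600") == 2, "a UTF-16 surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "one UTF-32 code unit");
```

The helper is a template over the code unit type, so it compiles whether `u8` literals are made of `char` (before C++20) or `char8_t` (since C++20).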

Pre-C++23 character sets (until C++23)

Basic source character set

The basic source character set consists of 96 characters:

- the space character,
- the control characters representing
  - horizontal tab,
  - vertical tab,
  - form feed,
  - and new-line,
- plus the following 91 graphical characters:

`a`	`b`	`c`	`d`	`e`	`f`	`g`	`h`	`i`	`j`	`k`	`l`	`m`	`n`	`o`	`p`	`q`	`r`	`s`	`t`	`u`	`v`	`w`	`x`	`y`	`z`
`A`	`B`	`C`	`D`	`E`	`F`	`G`	`H`	`I`	`J`	`K`	`L`	`M`	`N`	`O`	`P`	`Q`	`R`	`S`	`T`	`U`	`V`	`W`	`X`	`Y`	`Z`
`0`	`1`	`2`	`3`	`4`	`5`	`6`	`7`	`8`	`9`
`_`	`{`	`}`	`[`	`]`	`#`	`(`	`)`	`<`	`>`	`%`	`:`	`;`	`.`	`?`	`*`	`+`	`-`	`/`	`^`	`&`	`|`	`~`	`!`	`=`	`,`	`\`	`"`	`'`

The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.

Basic execution character set

The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose value is 0.

For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous digit.

Execution character set (Old definition)

The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.

Defect reports

The following behavior-changing defect reports were applied retroactively to previously published C++ standards.

DR	Applied to	Behavior as published	Correct behavior
CWG 788	C++98	the values of the members of the execution character sets were implementation-defined, but were not locale-specific	they are locale-specific
CWG 1796	C++98	the representation of the null (wide) character in the basic execution (wide-)character set had all zero bits	only its value is required to be zero

@@ Line 2: / Line 2: @@
 {{cpp/language/basics/navbar}}
-===Translation character set===
+===Current character sets {{mark since c++23}}===
+====Translation character set====
 The ''translation character set'' consists of the following elements:
 * each character named by [https://www.iso.org/standard/76835.html ISO/IEC 10646], as identified by its unique UCS scalar value, and
 * a distinct character for each UCS scalar value where no named character is assigned.
-===Basic character set===
+====Basic character set====
-{{anchor|Basic source character set}}
 The ''basic character set'' is a subset of the translation character set, consisting of the following 96 characters:
 {{cpp/language/basic charset}}
-Basic character set is historically known as ''basic source character set''.
+====Basic literal character set====
-===Basic literal character set===
-{{anchor|Basic execution character set}}
 The ''basic literal character set'' consists of all characters of the basic character set, plus the following control characters:
 {| class="wikitable" style="text-align: left;"
@@ Line 30: / Line 28: @@
 |}
-Basic literal character set is historically known as ''basic execution character set'' and ''basic execution wide-character set''.
+====Execution character set====
-===Execution character set===
 The execution character set and the execution wide-character set are supersets of the basic literal
 character set. The encodings of the execution character sets and the sets of additional elements
 (if any) are locale-specific. Each element of execution wide-character set must be representable as a distinct {{c|wchar_t}} code unit.
-===Code unit and literal encoding===
+====Code unit and literal encoding====
 A ''code unit'' is an integer value of character type. Characters in a {{rlp|character literal}} other than a multicharacter or non-encodable character literal or in a {{rlp|string literal}} are encoded as a sequence of one or more code units, as determined by the encoding prefix; this is termed the respective ''literal encoding''.
@@ Line 49: / Line 45: @@
 For a UTF-8, UTF-16, or UTF-32 literal, the UCS scalar value corresponding to each character of the translation character set is encoded as specified in ISO/IEC 10646 for the respective UCS encoding form.
-===Notes===
+===Pre-C++23 character sets {{mark until c++23}}===
-Mapping from source file {{rev inl|since=c++23|(other than a UTF-8 source file)}} characters to the {{rev inl|until=c++23|basic character set}}{{rev inl|since=c++23|translation character set}} during {{rlp|translation phases#Phase1|translation phase 1}} is implementation-defined, so an implementation is required to document how the basic source characters are represented in source files.
+====Basic source character set====
+The ''basic source character set'' consists of 96 characters:
+* the space character,
+* the control characters representing
+** horizontal tab,
+** vertical tab,
+** form feed,
+** and new-line,
+* plus the following 91 graphical characters:
+{| class="wikitable" style="text-align: center;"
+|-
+|{{tt|a}}||{{tt|b}}||{{tt|c}}||{{tt|d}}||{{tt|e}}||{{tt|f}}||{{tt|g}}||{{tt|h}}||{{tt|i}}||{{tt|j}}||{{tt|k}}||{{tt|l}}||{{tt|m}}||{{tt|n}}||{{tt|o}}||{{tt|p}}||{{tt|q}}||{{tt|r}}||{{tt|s}}||{{tt|t}}||{{tt|u}}||{{tt|v}}||{{tt|w}}||{{tt|x}}||{{tt|y}}||{{tt|z}}
+|-
+|{{tt|A}}||{{tt|B}}||{{tt|C}}||{{tt|D}}||{{tt|E}}||{{tt|F}}||{{tt|G}}||{{tt|H}}||{{tt|I}}||{{tt|J}}||{{tt|K}}||{{tt|L}}||{{tt|M}}||{{tt|N}}||{{tt|O}}||{{tt|P}}||{{tt|Q}}||{{tt|R}}||{{tt|S}}||{{tt|T}}||{{tt|U}}||{{tt|V}}||{{tt|W}}||{{tt|X}}||{{tt|Y}}||{{tt|Z}}
+|-
+|{{tt|0}}||{{tt|1}}||{{tt|2}}||{{tt|3}}||{{tt|4}}||{{tt|5}}||{{tt|6}}||{{tt|7}}||{{tt|8}}||{{tt|9}}
+|-
+|{{tt|_}}||{{tt|{}}||{{tt|}}}||{{tt|[}}||{{tt|]}}||{{tt|#}}||{{tt|(}}||{{tt|)}}||{{tt|<}}||{{tt|>}}||{{tt|%}}||{{tt|:}}||{{tt|;}}||{{tt|.}}||{{tt|?}}||{{tt|*}}||{{tt|+}}||{{tt|-}}||{{tt|/}}||{{tt|^}}||{{tt|&}}||{{tt|&#124;}}||{{tt|~}}||{{tt|!}}||{{tt|{{=}}}}||{{tt|,}}||{{tt|\}}||{{tt|"}}||{{tt|’}}
+|}
+The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in {{rlp|translation phases#Phase1|translation phase 1}}) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.
+====Basic execution character set====
+The ''basic execution character set'' and the ''basic execution wide-character set'' shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose value is 0.
+For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
+====Execution character set (Old definition)====
+The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.
 ===Defect reports===
