Difference between revisions of "cpp/language/charset"
(Finished TODOs.) |
(Added execution character sets.) |
Revision as of 18:36, 18 November 2021
Current character sets (since C++23)
Translation character set
The translation character set consists of the following elements:
- each character named by ISO/IEC 10646, as identified by its unique UCS scalar value, and
- a distinct character for each UCS scalar value where no named character is assigned.
Basic character set
The basic character set is a subset of the translation character set, consisting of the following 96 characters:
Code unit | Character | Glyph |
---|---|---|
U+0009 | Character tabulation | |
U+000B | Line tabulation | |
U+000C | Form feed (FF) | |
U+0020 | Space | |
U+000A | Line feed (LF) | new-line |
U+0021 | Exclamation mark | ! |
U+0022 | Quotation mark | " |
U+0023 | Number sign | # |
U+0025 | Percent sign | % |
U+0026 | Ampersand | & |
U+0027 | Apostrophe | ' |
U+0028 | Left parenthesis | ( |
U+0029 | Right parenthesis | ) |
U+002A | Asterisk | * |
U+002B | Plus sign | + |
U+002C | Comma | , |
U+002D | Hyphen-minus | - |
U+002E | Full stop | . |
U+002F | Solidus | / |
U+0030 .. U+0039 | Digit zero .. nine | 0 1 2 3 4 5 6 7 8 9 |
U+003A | Colon | : |
U+003B | Semicolon | ; |
U+003C | Less-than sign | < |
U+003D | Equals sign | = |
U+003E | Greater-than sign | > |
U+003F | Question mark | ? |
U+0041 .. U+005A | Latin capital letter A .. Z | A B C D E F G H I J K L M N O P Q R S T U V W X Y Z |
U+005B | Left square bracket | [ |
U+005C | Reverse solidus | \ |
U+005D | Right square bracket | ] |
U+005E | Circumflex accent | ^ |
U+005F | Low line | _ |
U+0061 .. U+007A | Latin small letter a .. z | a b c d e f g h i j k l m n o p q r s t u v w x y z |
U+007B | Left curly bracket | { |
U+007C | Vertical line | \| |
U+007D | Right curly bracket | } |
U+007E | Tilde | ~ |
Basic literal character set
The basic literal character set consists of all characters of the basic character set, plus the following control characters:
Code unit | Character |
---|---|
U+0000 | Null |
U+0007 | Bell |
U+0008 | Backspace |
U+000D | Carriage return (CR) |
Execution character set
The execution character set and the execution wide-character set are supersets of the basic literal character set. The encodings of the execution character sets and the sets of additional elements (if any) are locale-specific.
Code unit and literal encoding
A code unit is an integer value of character type. Characters in a character literal (other than a multicharacter or non-encodable character literal) or in a string literal are encoded as a sequence of one or more code units, as determined by the encoding prefix; this is termed the respective literal encoding.
A literal encoding or a locale-specific encoding of one of the execution character sets encodes each element of the basic literal character set as a single code unit with non-negative value, distinct from the code unit for any other such element. A character not in the basic literal character set can be encoded with more than one code unit; the value of such a code unit can be the same as that of a code unit for an element of the basic literal character set. The encodings of the execution character sets can be unrelated to any literal encoding.
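The single-code-unit guarantee for the basic literal character set can be checked directly. The following is a minimal sketch, not part of the original page; the helper names `code_unit_count` and `single_distinct_units` are hypothetical and introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null code unit.
template <class CharT, std::size_t N>
constexpr std::size_t code_unit_count(const CharT (&)[N]) { return N - 1; }

// An element of the basic literal character set occupies exactly one
// non-negative code unit in both the ordinary and the wide literal encoding.
static_assert(code_unit_count("A") == 1, "one code unit per basic character");
static_assert(code_unit_count(L"A") == 1, "one code unit per basic character");
static_assert('A' >= 0 && L'A' >= 0, "code unit values are non-negative");

// Hypothetical helper: returns true if every code unit in s is
// non-negative and no code unit value occurs twice.
inline bool single_distinct_units(const char* s) {
    assert(s != nullptr);
    for (const char* p = s; *p; ++p) {
        if (*p < 0) return false;
        for (const char* q = p + 1; *q; ++q)
            if (*p == *q) return false;
    }
    return true;
}
```

Passing a string of distinct basic characters, such as `"abcxyz0189"`, to `single_distinct_units` is expected to yield `true`. No fixed count is asserted for a character outside the basic literal character set (for example `"\u00E9"`), because that count depends on the implementation-defined ordinary literal encoding.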
The ordinary literal encoding is the encoding applied to an ordinary character or string literal. The wide literal encoding is the encoding applied to a wide character or string literal.
The U+0000 NULL character is encoded as the value 0. No other element of the translation character set is encoded with a code unit of value 0. The code unit value of each decimal digit character after the digit 0 (U+0030) is one greater than the value of the previous digit. The ordinary and wide literal encodings are otherwise implementation-defined.
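Because the decimal digits are encoded with consecutive code unit values, subtracting `'0'` from a digit character yields its numeric value in any conforming literal encoding. The following is a minimal sketch; `is_decimal_digit` and `digit_value` are hypothetical helpers, not standard library functions.

```cpp
#include <cassert>

// The null character is encoded as the value 0.
static_assert('\0' == 0 && L'\0' == 0, "null character has value 0");

// Digits 0..9 have consecutive code unit values.
static_assert('9' - '0' == 9, "ordinary digits are consecutive");
static_assert(L'9' - L'0' == 9, "wide digits are consecutive");

// Hypothetical helper: true if c is one of the characters '0'..'9'.
constexpr bool is_decimal_digit(char c) { return c >= '0' && c <= '9'; }

// Hypothetical helper: numeric value of a decimal digit character.
inline int digit_value(char c) {
    assert(is_decimal_digit(c));
    return c - '0';
}
```

The text above states this ordering guarantee only for decimal digits; it gives no such guarantee for letters, so an expression like `c - 'a'` does not have the same portability.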
For a UTF-8, UTF-16, or UTF-32 literal, the UCS scalar value corresponding to each character of the translation character set is encoded as specified in ISO/IEC 10646 for the respective UCS encoding form.
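Unlike the ordinary and wide literal encodings, the code units of UTF-8, UTF-16, and UTF-32 literals are fully determined by the UCS encoding forms, so they can be asserted portably. The sketch below is illustrative only; `literal_units` and `utf8_units_for` are hypothetical helpers, and the characters are written as universal character names so that the example does not depend on the source file encoding.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: code units in a string literal, excluding the null.
template <class CharT, std::size_t N>
constexpr std::size_t literal_units(const CharT (&)[N]) { return N - 1; }

// U+00E9 (Latin small letter e with acute)
static_assert(literal_units(u8"\u00E9") == 2, "two UTF-8 code units");
static_assert(literal_units(u"\u00E9") == 1, "one UTF-16 code unit");
static_assert(literal_units(U"\u00E9") == 1, "one UTF-32 code unit");

// U+1F600 lies outside the Basic Multilingual Plane.
static_assert(literal_units(u8"\U0001F600") == 4, "four UTF-8 code units");
static_assert(literal_units(u"\U0001F600") == 2, "a UTF-16 surrogate pair");
static_assert(literal_units(U"\U0001F600") == 1, "one UTF-32 code unit");
static_assert(u"\U0001F600"[0] == 0xD83D && u"\U0001F600"[1] == 0xDE00,
              "high and low surrogate");
static_assert(U"\U0001F600"[0] == 0x1F600, "UTF-32 stores the scalar value");

// Hypothetical helper: number of UTF-8 code units for a UCS scalar value.
inline std::size_t utf8_units_for(char32_t scalar) {
    // Surrogate code points and values above U+10FFFF are not scalar values.
    assert(scalar <= 0x10FFFF && !(scalar >= 0xD800 && scalar <= 0xDFFF));
    if (scalar < 0x80) return 1;
    if (scalar < 0x800) return 2;
    if (scalar < 0x10000) return 3;
    return 4;
}
```

The `static_assert` declarations are evaluated at compile time, so a translation unit containing them compiles only if the implementation encodes these literals as described.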
Pre-C++23 character sets (until C++23)
Basic source character set
The basic source character set consists of 96 characters:
- the space character,
- the control characters representing
  - horizontal tab,
  - vertical tab,
  - form feed,
  - and new-line,
- plus the following 91 graphical characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.
Basic execution character set
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose value is 0.
For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous digit.
Execution character set (Old definition)
The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.