Talk:c/string/multibyte/c16rtomb
Example
The example mentions "UCS-2 code units" and the Notes state that "this function can only convert single-unit 16-bit encoding, meaning it cannot convert UTF-16..." On the other hand, an earlier sentence on this page, "If the macro __STDC_UTF_16__ is defined, the 16-bit encoding used by this function is UTF-16," suggests that the function can convert UTF-16. C11:6.10.8.2 also says "values of type char16_t are UTF-16 encoded"; it does not say UCS-2. I admit that I tried to convert the Gothic hwair character (U+10348, [D800 DF48] in UTF-16) and got nowhere, but the two pairs of sentences seem to me to contradict each other. Any thoughts? Newatthis (talk) 07:17, 18 January 2017 (PST)
- it's a known defect in C11 (report: DR488, fix: n2040). Since UTF-16 was the intent of this function, I think it's worth writing it up for UTF-16 with a note and with an example that can be used to test available implementations. --Cubbi (talk) 08:09, 18 January 2017 (PST)
- Thanks. I'll read the materials and then suggest an example. 08:41, 18 January 2017 (PST)
- I think I understand now. In n1570, functions c16rtomb() and mbrtoc16() are not inverses, but with the suggested fix n2040 they are, or will be in the future. Also, you wrote your example to work when the fix is in. Newatthis (talk) 02:21, 1 February 2017 (PST)
- I continue to study the fix n2040. If, as you say, "it's worth writing it up for UTF-16 with a note," shouldn't the second parameter of the function prototype at the top of the page be a pointer, as in char16_t *pc16, to accommodate the variable length of UTF-16 encoding? Newatthis (talk) 03:24, 1 February 2017 (PST)
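For readers following the hwair question above, here is a minimal sketch of how a code point outside the Basic Multilingual Plane splits into two UTF-16 code units. The helper name encode_utf16 is hypothetical and not part of the C library; it only illustrates why a single char16_t cannot carry U+10348.

```c
#include <assert.h>
#include <stdint.h>
#include <uchar.h>

/* Hypothetical helper (not a library function): writes the UTF-16 code
   units for one code point into out and returns how many were written. */
static int encode_utf16(uint32_t cp, char16_t out[2])
{
    /* Only Unicode scalar values are accepted: no surrogates, max U+10FFFF. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {            /* BMP: one code unit, same value */
        out[0] = (char16_t)cp;
        return 1;
    }
    cp -= 0x10000;                 /* 20 bits remain, split 10 + 10 */
    out[0] = (char16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (char16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate  */
    return 2;
}
```

Calling encode_utf16(0x10348, units) fills units with 0xD800 and 0xDF48, the pair quoted at the top of this thread.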
using the term "wide"
I reread DR488 and the fix. DR488 emphasizes the confusion of the use of the term "wide" as in "wide character" since "wide character" is defined in C11:3.7.3. The fix completely drops the word "wide." I would like to delete the word "wide" from the c16rtomb()'s page. Any objections? Newatthis (talk) 05:31, 3 February 2017 (PST)
- fine, but what is it getting replaced by? This page uses the word 'wide' twice right now:
- "If c16 is the null wide character u'\0'" - here it could say "16-bit wide" to match c/language/character_constant
- "the final code unit in a 16-bit representation of a wide character" - here I suppose it could just say "a character" or even "a code point". --Cubbi (talk) 06:11, 3 February 2017 (PST)
- The third occurrence of "wide character" appears in the See Also subsection. What would replace "wide"? Here, "16-bit." On the page for c32rtomb(), "32-bit." The terms "16-bit character" and "32-bit character" appear in 7.28/2. True, my suggested replacements are not formally defined in 3. Terms, definitions, and symbols, but they conform to the suggested replacement text in the fix n2040, which does not use "wide" at all. On the other hand, 6.4.4.4/11 seems to extend the definition of "wide character" in 3.7.3 to include types char16_t or char32_t. And "16-bit wide character" might prompt a reader to think about the 16-bit wchar_t of Windows. I am still thinking about this. Newatthis (talk) 03:02, 4 February 2017 (PST)
- my point in saying "to match c/language/character_constant" (which is, basically, our version of 6.4.4.4) is to point out that it would require careful editing of more than just this page, since it would introduce something that disagrees with the standard site-wide - the standard calls all of u/U/L characters and strings "wide" all over the place despite the new wording of c16rtomb avoiding the word. To another person reading the rest of the standard (or perhaps even to the editor that merges n2040 into the standard), not using 'wide' when talking about c16 would be clearly wrong in terminology. --Cubbi (talk) 04:20, 4 February 2017 (PST)
hiding 0xd83c
I hid "the fact that the first 0xd83c should be swallowed with no output" to emphasize that the code point's representation of [ 0xd83c 0xdf4c ] was converted to the code point's representation of [ 0xf0 0x9f 0x8d 0x8c ] and to show how to use the return value of zero. This function is about converting a code point from one representation to another representation. Newatthis (talk) 02:43, 5 February 2017 (PST)
- it comes down to what the goal of the example is: practical use case (take a string, produce a string, all individual call details omitted as already explained in full detail on the same page - the kind of example people copy into real code) or mechanics of a function call (take a char16_t, produce zero or more chars, a return code, and a state update - the kind of example people copy into bug reports). Showing only some individual calls and treating only some of a char16_t string as a string, to me, appears to miss both goals: it's not something you'd use as a programmer and it's not showing what happens in the difficult case. --Cubbi (talk) 07:38, 5 February 2017 (PST)
- Thanks. Your explanation is helpful. Newatthis (talk) 02:33, 13 February 2017 (PST)
- Since your preference for examples is the "practical use case (take a string, produce a string," shouldn't the example here produce a string? Currently, the example yields only an array of chars which holds the encoding of one multibyte character. If you agree, I would like to make an attempt at modifying the example. Newatthis (talk) 05:21, 15 February 2017 (PST)
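To make the two example styles discussed above concrete, here is a rough sketch that shows both the per-call mechanics (a high surrogate is swallowed with a return value of zero) and a string-to-string wrapper. Everything in it is an assumption for illustration: sketch_c16rtomb and sketch_c16stombs are hypothetical names, the output is hard-wired to UTF-8 instead of following the locale, a plain char16_t stands in for mbstate_t, and the special cases of the real function (null s, null c16) are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <uchar.h>

/* Hypothetical stand-in for c16rtomb as n2040 describes it.
   *pending holds a stored high surrogate, or 0 when there is none. */
static size_t sketch_c16rtomb(char *s, char16_t c16, char16_t *pending)
{
    uint32_t cp;
    assert(s != NULL && pending != NULL);

    if (*pending != 0) {                       /* expecting a low surrogate */
        char16_t high = *pending;
        *pending = 0;
        if (c16 < 0xDC00 || c16 > 0xDFFF)
            return (size_t)-1;                 /* broken pair */
        cp = 0x10000 + (((uint32_t)high - 0xD800) << 10)
                     + ((uint32_t)c16 - 0xDC00);
    } else if (c16 >= 0xD800 && c16 <= 0xDBFF) {
        *pending = c16;                        /* swallowed, no output yet */
        return 0;
    } else if (c16 >= 0xDC00 && c16 <= 0xDFFF) {
        return (size_t)-1;                     /* stray low surrogate */
    } else {
        cp = c16;
    }

    if (cp < 0x80) {                           /* 1-byte UTF-8 */
        s[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {                          /* 2-byte UTF-8 */
        s[0] = (char)(0xC0 | (cp >> 6));
        s[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                        /* 3-byte UTF-8 */
        s[0] = (char)(0xE0 | (cp >> 12));
        s[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        s[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    s[0] = (char)(0xF0 | (cp >> 18));          /* 4-byte UTF-8 */
    s[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    s[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    s[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* String in, string out: converts n code units from src. dst needs room
   for 3 * n + 1 bytes. Returns the byte count without the terminating
   null character, or (size_t)-1 on malformed input. */
static size_t sketch_c16stombs(char *dst, const char16_t *src, size_t n)
{
    char16_t pending = 0;
    size_t len = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t rc = sketch_c16rtomb(dst + len, src[i], &pending);
        if (rc == (size_t)-1)
            return rc;
        len += rc;                             /* rc == 0 adds nothing */
    }
    if (pending != 0)
        return (size_t)-1;                     /* ended inside a pair */
    dst[len] = '\0';
    return len;
}
```

With this sketch, feeding 0xd83c returns 0 and feeding 0xdf4c next returns 4 with the bytes 0xf0 0x9f 0x8d 0x8c, which is the sequence described at the top of this thread. The wrapper is the "take a string, produce a string" shape; the inner function is the "mechanics of a call" shape.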
Scope of object named state
- In the examples of functions c16rtomb(), etc., the scope of variable state is file scope; in the examples of functions wcrtomb(), etc., the scope of variable state is block scope. Is one choice to be preferred over the other in these examples? Is thread safety something to be considered here? I think file scope objects are shared while block scope objects are not. I am trying to understand why the examples of the two sets of functions apply different scopes to variable state. Newatthis (talk) 03:50, 24 February 2017 (PST)
- there is no need to complicate the examples by bringing in memset etc. --Cubbi (talk) 13:53, 26 February 2017 (PST)
- I mentioned the question about the scope of variable state because I had noticed that the ported examples of the restartable wide character functions used the block-scoped variable state and initialized variable state with memset(). Are those examples also unnecessarily complicated by their using memset() etc.? Newatthis (talk) 03:06, 27 February 2017 (PST)
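As a small illustration of the trade-off in this thread, the sketch below puts the three ways of obtaining an initial mbstate_t side by side and checks each with mbsinit. The names file_state and check_initial_states are hypothetical. A file-scope object is zero-initialized without any code but is shared by every caller, while a block-scope object is private to the call and needs an explicit initializer or memset.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Static storage duration: zero-initialized, so no memset is needed,
   but every function (and thread) that uses it shares this one object. */
static mbstate_t file_state;

static void check_initial_states(void)
{
    /* Automatic storage duration: indeterminate until initialized. */
    mbstate_t zeroed;
    memset(&zeroed, 0, sizeof zeroed);

    /* Same effect as memset, written as an initializer. */
    mbstate_t braced = {0};

    /* A zero-valued mbstate_t describes an initial conversion state. */
    assert(mbsinit(&file_state));
    assert(mbsinit(&zeroed));
    assert(mbsinit(&braced));
}
```

Whether the shared object matters depends on the program: in a single-threaded example it is simply the shortest form, which seems to be the point of the reply above.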
Parameter ps
When describing parameter ps, six functions use the subordinate clause "when interpreting the multibyte string". This clause makes sense for functions mbrtowc, mbrtoc16, and mbrtoc32; these functions read a multibyte character. On the other hand, is this clause correct for functions wcrtomb, c16rtomb (when fixed), and c32rtomb? For the latter set of three functions, shouldn't the clause read, "when interpreting the wide character string?" The latter functions read a wide character, not a multibyte character. Newatthis (talk) 03:57, 16 April 2017 (PDT)
- the latter functions read from *ps in order to determine ("interpret") the conversion state of the multibyte string. Perhaps as worded it's not immediately clear, but 'interpreting the wide string' would make it incorrect. --Cubbi (talk) 07:08, 16 April 2017 (PDT)
- I understand that all six functions read from *ps to determine the conversion state. But the conversion state of what? The three mbrto* functions convert a multibyte character, so *ps tracks the conversion state during the conversion of a multibyte character; the three *rtomb functions convert a wide character (wchar_t, char16_t, char32_t), so *ps tracks the conversion state during the conversion of a wide character. When c16rtomb is fixed, will it not convert from a variable-length encoding schema, namely UTF-16? Moreover, will it not use *ps to track the conversion state during the conversion of the one to two UTF-16 code units in a way that is similar to the way that the mbrto* functions use *ps to track the conversion state during the conversion of the one to four UTF-8 code units? I am suggesting replacing the clause "when interpreting the multibyte string" with "when converting the multibyte character string" in the cases of the mbrto* functions and with "when converting the wide character string" in the cases of the *rtomb functions. Newatthis (talk) 04:06, 17 April 2017 (PDT)
- all six functions read from *ps to determine the conversion state of the multibyte string. Three of them then read from that string, the other three then append to that string. The meaning of the bytes read and the values of the bytes written depend on the state obtained from *ps. It is not just "conversion state during the conversion of a wide character", multibyte encodings are, in general, stateful. --Cubbi (talk) 06:11, 17 April 2017 (PDT)
- Many thanks for your patience. I am just starting to see my misreading of ISO C on conversions. I'll read more about "appending," which I have seen described in other sources. I wish that the defect of c16rtomb were fixed so that I could compare what I am understanding about how mbrtoc16 and c16rtomb handle variable-length encoding schemas. Newatthis (talk) 08:02, 17 April 2017 (PDT)
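The point that the state belongs to the multibyte string being appended to may be easier to see with a toy stateful encoding. The sketch below is entirely made up for illustration: toy_rtomb, struct toy_state, and the shift bytes are hypothetical and do not correspond to any real locale or library function. Printable ASCII is written as-is in the unshifted mode, hiragana U+3041..U+3096 is written as one byte in the shifted mode, and a shift byte is emitted only when the mode of the output string has to change.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up shift bytes for the toy encoding. */
enum { TOY_SHIFT_OUT = 0x0E, TOY_SHIFT_IN = 0x0F };

/* Plays the role of mbstate_t: records which mode the multibyte
   string written so far has been left in. */
struct toy_state { int shifted; };

/* Hypothetical analogue of wcrtomb for the toy encoding. Writes at most
   2 bytes to s and returns the count, or (size_t)-1 if wc has no
   representation in the toy encoding. */
static size_t toy_rtomb(char *s, wchar_t wc, struct toy_state *ps)
{
    size_t n = 0;
    assert(s != NULL && ps != NULL);

    if (wc >= 0x3041 && wc <= 0x3096) {        /* hiragana: shifted mode */
        if (!ps->shifted) {
            s[n++] = TOY_SHIFT_OUT;
            ps->shifted = 1;
        }
        s[n++] = (char)(0x21 + (wc - 0x3041));
    } else if (wc >= 0x20 && wc <= 0x7E) {     /* printable ASCII */
        if (ps->shifted) {
            s[n++] = TOY_SHIFT_IN;
            ps->shifted = 0;
        }
        s[n++] = (char)wc;
    } else {
        return (size_t)-1;
    }
    return n;
}
```

Converting the same wide character U+3042 twice from an initial state yields two bytes the first time and one byte the second time. The bytes written differ because the state of the output string differs, not because of anything in the wide character input.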
Update the link in Notes section?
The link in the Notes section points to the document n2059, "Defect Report Summary for C11 Version 1.10." Additional committee discussion about DR 488 appears in "Defect Report Summary for C11 Version 1.11." Should the link be updated to point to the newer document?
http://www.open-std.org/jtc1/sc22/WG14/www/docs/summary.htm#dr_488