Talk:c/string/multibyte/mbrtowc

From cppreference.com

Revision as of 16:49, 18 February 2015

May I know the thinking behind this revision which changed

    static mbstate_t state; // zero-initialized

to

    mbstate_t state;
    memset(&state, 0, sizeof state);

? Newatthis (talk) 03:05, 10 March 2017 (PST)

the comment on that revision says "local state". I think I thought keeping the state local makes more sense for this demo because a function like "print_mb" might end up copied into beginner code even though there's a comment right above it saying how to do it right (wprintf). As noted in Talk:c/string/multibyte/c16rtomb, I currently think it's ok either way. --Cubbi (talk) 05:45, 10 March 2017 (PST)
Many thanks for sharing. I too think that it's ok either way but within the context of serial code. I had been wondering whether you made the change to the "local state" to position your example closer to thread safe code since a block scoped variable with automatic storage duration has a unique address in each thread. Newatthis (talk) 03:11, 11 March 2017 (PST)
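For illustration, here is a minimal sketch of the "local state" variant discussed above. The helper name count_mb is hypothetical (it is not the page's print_mb), and the sketch assumes the caller has already selected a suitable locale. Because the mbstate_t object has automatic storage duration, each call, and therefore each thread, works on its own conversion state.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* count_mb is a hypothetical helper, not the page's print_mb.
   It counts the multibyte characters in s under the current locale. */
size_t count_mb(const char *s)
{
    mbstate_t state;                  /* automatic storage: one object per call */
    memset(&state, 0, sizeof state);  /* put it in the initial conversion state */

    size_t remaining = strlen(s);
    size_t count = 0;
    while (remaining > 0) {
        size_t rc = mbrtowc(NULL, s, remaining, &state);
        if (rc == (size_t)-1 || rc == (size_t)-2)
            return (size_t)-1;        /* invalid or incomplete sequence */
        if (rc == 0)
            break;                    /* null character; not expected before the terminator */
        s += rc;
        remaining -= rc;
        ++count;
    }
    return count;
}
```

With `static mbstate_t state;` instead, the object would be zero-initialized once and then shared by every call, which is fine in serial code but means a failed conversion in one call can leave a non-initial state behind for the next.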

(size_t) in the list of return values

Should the text (size_t) be deleted from this page? While ISO C11 mentions (size_t) in the list of return values of functions mbrtowc, mbrtoc16, and mbrtoc32, only function mbrtowc still includes it. I see the revision history of mbrtoc16 back in September, 2012, and wonder whether the text should be in or out. Newatthis (talk) 03:46, 24 March 2017 (PDT)

The return value of this function has type size_t, the expression -1 does not, and the value -1 is not a value of type size_t. Although it is obvious what happens to the target audience of the standard (language implementors), it may not be clear to the target audience of cppreference (language users). I would rather add the explicit cast to the description of the return value in mbrtoc32 and anything else that claims to return negative numbers despite the return type being unsigned. --Cubbi (talk) 06:34, 24 March 2017 (PDT)
Me too. Will do. Newatthis (talk) 07:12, 24 March 2017 (PDT)
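A small sketch of why the explicit cast matters to a language user: the special return values are the results of converting -1 and -2 to size_t, so they are very large positive numbers. The enum and the helper names classify and classify_first are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical result categories, for illustration only. */
enum mb_result { MB_NUL, MB_CONVERTED, MB_INVALID, MB_INCOMPLETE };

/* Classifies a value returned by mbrtowc. */
enum mb_result classify(size_t rc)
{
    if (rc == (size_t)-1) return MB_INVALID;     /* same value as SIZE_MAX */
    if (rc == (size_t)-2) return MB_INCOMPLETE;  /* same value as SIZE_MAX - 1 */
    return rc == 0 ? MB_NUL : MB_CONVERTED;
}

/* Converts the first character of s (at most n bytes) using a fresh
   conversion state and classifies the result. */
enum mb_result classify_first(const char *s, size_t n)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);
    return classify(mbrtowc(NULL, s, n, &state));
}
```

A test such as `rc < 0` can never be true for a size_t, which is one reason to spell the comparison as `rc == (size_t)-1`.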

mbrtowc(NULL, "", 1, ps)

Here is the discussion before the edit: I wish to include a note addressing the intent and a conflict related to 7.29.6.3.2/2.

Notes

In C11 as published, the intended usage of "mbrtowc(NULL, "", 1, ps)" seems to be resetting the conversion state object to an initial conversion state. In other words, a '\0' byte occurring in a continuation byte of a multibyte character sequence should be unambiguously considered a null character, not an illegal sequence, regardless of the current conversion state. This conflicts with the semantics of mbrtowc's storage of a partially-converted multibyte character. Therefore, there is no way to reset the internal conversion state (ps == NULL) once an illegal sequence has been encountered despite that being the intent. Programs needing the functionality of mbrtowc should use it with a non-null conversion-state pointer. Newatthis (talk) 04:17, 16 September 2017 (PDT)

way too much text for a note, and it is addressing the wrong audience. What are you actually trying to say, to a programmer, not a WG14 member? That ps should not be a null pointer? This page never suggested it could be. That multibyte string is terminated on the first '\0'? That would be part of the definition of multibyte string. --Cubbi (talk) 05:24, 16 September 2017 (PDT)
My suggested note is just a draft. It is not cast in stone. It has two purposes: lay out a topic and start a discussion. At the same time, I am always looking for signs which will explain to me the rules that include/exclude text from the pages of cppreference. For a brief moment, I wish to pursue that last thought. The standards document introduces the ordered list of return values of function mbrtowc() with the parenthetical phrase "(given the current conversion state):." Help me out here. Why was this wording excluded from the Returns section of cppreference's description of this function? Excluding the phrase seems to me to change the grammar of the ordered list. So, what is the thinking behind the exclusion? Newatthis (talk) 03:11, 19 September 2017 (PDT)
No part of the standard is "included" or "excluded". It is a different document with a different target audience. If someone wants to read it, there are links at Cppreference:FAQ. Have you used this function in production? What problems did you encounter? What questions did you and the programmers who work with you have? What did it do that wasn't obvious from the existing description? What did you have to look up in other references or Q&A forums? What do other references (posix, linux, etc) say about this function? Quoting the standard for everything, besides being potentially illegal, is not going to make this useful as a reference. Speaking of usefulness, __STDC_MB_MIGHT_NEQ_WC__ is both irrelevant and already described on the predefined macros page, why did you copy it here? --Cubbi (talk) 06:44, 19 September 2017 (PDT)
I shall try to unpack your questions one by one starting with the last question since it is about a different topic. Also, this question gives me a chance to explain how I approach cppreference before making an edit. I noticed in the history of both mbrtoc16()/c16rtomb() and mbrtoc32()/c32rtomb() that from the beginning, 10 Sep 2012, these pages mentioned macros __STDC_UTF_16__ and __STDC_UTF_32__, respectively. Today, I verified that these two macro names appeared in cppreference's predefined macros page on 28 June 2014. During the three years since, no one saw fit to make the argument that the macros are now already described on the predefined macros page and that their descriptions should now be removed from the function pages. Seeing the macros on both sets of pages did not prompt me to think that the predefined macros page was sufficient by itself. Then I realized that macro __STDC_ISO_10646__ has a similar relation with mbrtowc()/wcrtomb(). So, on 7 June 2017, I added a description while using P12's efforts as a role model. From there, it was a small step to argue that macro __STDC_MB_MIGHT_NEQ_WC__ is also related to mbrtowc()/wcrtomb() and like __STDC_UTF_16__ and __STDC_UTF_32__ deserves a mention on the function pages. I admit that at the time I did not consider the issue of relevancy. I saw something specified in the standards document and thought that it deserved an explanation in cppreference's reference document. I thought that a brief comment on the function pages would reveal a larger picture about mbrtowc()/wcrtomb(). Further, I cannot know all that you consider relevant. What I can consider is what the standards deemed sufficiently relevant to include in its specifications. Newatthis (talk) 06:48, 20 September 2017 (PDT)
The macro __STDC_MB_MIGHT_NEQ_WC__ has nothing at all to do with these functions or with any part of the C standard library. It describes a core language feature. --Cubbi (talk) 07:43, 20 September 2017 (PDT)
"Quoting the standard for everything, besides being potentially illegal, is not going to make this useful as a reference." I did not mean to imply that the missing parenthetical phrase should be lifted verbatim from the standards document. If I seemed to ask why the exact phrase was missing, I should have been more explicit by asking why something like the phrase was missing from cppreference's page. (I just noticed that the same parenthetical phrase qualifies the Returns section on the standard's subclauses specifying mbrtoc16() and mbrtoc32().) If I learned anything from the C books which I have purchased since the 1980s, it is that there are several ways to say the same thing and that some ways are clearer than others. I'll return to that parenthetical phrase in a minute. Our discussion is sharpening my thoughts about it. Newatthis (talk) 06:48, 20 September 2017 (PDT)
"What questions did you and the programmers who work with you have?" Unfortunately, I have no programming colleagues. What questions do I have? Well, I "asked" a single question in the rough draft of my suggested note in order to start this discussion. In a moment I shall re-ask that question in, I hope, a clearer fashion. Newatthis (talk) 06:48, 20 September 2017 (PDT)
"Have you used this function in production?" No. I approach this function as a novitiate seeking enlightenment. I am reading the standards document, cppreference's take on the standards, and C books. 99% of the C books in my library do not even mention the C library. I surf the Internet with the search string "mbrtowc" and experience disappointment. Newatthis (talk) 06:48, 20 September 2017 (PDT)
"What do other references (posix, linux, etc) say about this function?" The Web pages http://pubs.opengroup.org/onlinepubs/9699919799/functions/mbrtowc.html and https://www.freebsd.org/cgi/man.cgi?query=mbrtowc&sektion=3&apropos=0&manpath=FreeBSD+11.1-RELEASE+and+Ports exclude the parenthetical phrase. BUT, n1570 does include it. Newatthis (talk) 06:48, 20 September 2017 (PDT)
"What problems did you encounter? What did it do that wasn't obvious from the existing description?" Now we enter the weeds. Consider the banana symbol in a multibyte character string: f0 9f 8d 8c 00. Next, consider that string to be corrupted: f0 9f 8d 00 00. I understand that the standard states that this case is not allowed (5.2.1.1/Bullet #4): "A byte with all bits zero shall be interpreted as a null character independent of shift state. Such a byte shall not occur as part of any other multibyte character." But, what if it does? What if the null character is part of some other multibyte character, as in the corruption above? How does mbrtowc() handle it, and how does this function rebound from the unspecified conversion state?
"What happens with f0 9f 8d 00 00?" is the only part of this that makes sense to me. Since 00 cannot follow 8d in UTF-8, I expect EILSEQ. Violation of the standard clause "Such a byte shall not occur as part of any other multibyte character." is undefined behavior (per $4.2), but to get there you would need to load an imaginary locale that defines an MB encoding that uses embedded zeroes (or leading zeroes in non-initial shift states). --Cubbi (talk) 07:43, 20 September 2017 (PDT)
May I impose on you to continue your line of thought about what happens with f0 9f 8d 00 00? I too expect EILSEQ but only after I take into account the current conversion state (by that I mean I apply the parenthetical phrase); otherwise, I think that a null character follows 8d because that is the first entry in the ordered list of returns. Applying mbrtowc with byte-at-a-time decoding has the conversion state in some partial-character state, i.e. some non-initial state, when the function encounters that first 00 after 8d; after looking at that 00, the function leaves the conversion state unspecified. Can mbrtowc then be invoked in some way which resets its conversion state object to an initial conversion state, or is the object locked in that partial-character state? I have been trying to tease the answer from the C11:7.29.6.3.2/2 and from cppreference without success. Newatthis (talk) 04:15, 21 September 2017 (PDT)
How to reset a trashed mbstate? See mbstate_t or in fact in the very first sentence of this Talk page. --Cubbi (talk) 06:00, 21 September 2017 (PDT)
Thanks. I now understand that we do not use mbrtowc itself to reset the trashed user-provided mbstate. Also, as far as I can tell, there is no way to reset a trashed internal conversion state. Newatthis (talk) 07:04, 21 September 2017 (PDT)
There is no internal conversion state here. Eliminating that was the whole point of mbrtowc. --Cubbi (talk) 07:35, 21 September 2017 (PDT)
Is the first occurrence of the null character in the banana string an unambiguous string terminator or is it part of an illegal sequence? As far as I can tell, both C translators available at Coliru respond with illegal sequence. And, I guess the current conversion state is unspecified. Perhaps you possess C translators which I cannot afford. Perhaps you can run the corrupted example through those translators. Excluding the parenthetical phrase implies, it seems to me, that the first occurrence of the null character in the corrupted banana string is an unambiguous string terminator, and so zero should be returned. But, including the parenthetical phrase, like the standard does, seems to imply the possibility of bypassing the order of the entries of the list of Returns. Perhaps the first return that applies really is "illegal sequence," but then invoke this function separately with "s==NULL". I get another illegal sequence because the conversion state was left as a non-initial conversion state. Then use memset() to reset the conversion state and again invoke mbrtowc() with "s==NULL" and get the zero return value. I.e. "given the current conversion state." Maybe what I really should be asking is this: is the parenthetical phrase in the standards document superfluous? And is that why it does not appear in cppreference? Thanks for taking the time. Newatthis (talk) 06:48, 20 September 2017 (PDT)
We already say that conversion state is used. I suppose it could say more clearly that it's an in/out parameter, but it doesn't make sense in the section that describes the return value, since it's not part of the return value. As you saw, other references agree. --Cubbi (talk) 07:43, 20 September 2017 (PDT)
Sorry, I am unable to see where this page already says that the conversion state is used. What I do see is, "(including any shift sequences)." I guess I am missing something. On the other hand, the page about wcrtomb() reads, "(including any shift sequences, and taking into account the current multibyte conversion state *ps)." Should cppreference's page for mbrtowc() say the same thing? Newatthis (talk) 05:44, 25 September 2017 (PDT)
yes, saying, in the first sentence (not in return value description) "taking into account the current multibyte conversion state *ps" is what I mean by "it could say more clearly that it's an in/out parameter". Right now this page only says the conversion state is "used when interpreting" in the description of the conversion state parameter. --Cubbi (talk) 08:23, 26 September 2017 (PDT)
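To make the corrupted-banana experiment from this thread reproducible, here is a rough sketch. The function name demo_reset and its out-parameter are hypothetical. What mbrtowc reports for the corrupted bytes depends on the locale in effect, which the sketch does not set, so the caller would have to select a UTF-8 locale first (locale names are platform-specific) to see the behavior discussed above. Only the reset step is checked.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical demo: feeds the corrupted sequence f0 9f 8d 00 00 to
   mbrtowc one byte at a time, then resets the caller-owned state with
   memset. Stores the first result other than (size_t)-2 through
   first_result (if non-null) and returns 1 if the reset state is usable. */
int demo_reset(size_t *first_result)
{
    /* Four explicit bytes plus the literal's own terminating 00. */
    static const char corrupted[] = "\xf0\x9f\x8d\x00";

    mbstate_t state;
    memset(&state, 0, sizeof state);

    size_t rc = (size_t)-2;
    for (size_t i = 0; i < sizeof corrupted; ++i) {
        rc = mbrtowc(NULL, corrupted + i, 1, &state);
        if (rc != (size_t)-2)
            break;                 /* error, null character, or a completed character */
    }
    if (first_result)
        *first_result = rc;

    /* After an error the state is unspecified; start over from scratch. */
    memset(&state, 0, sizeof state);
    return mbsinit(&state) != 0
        && mbrtowc(NULL, "", 1, &state) == 0;
}
```

The point of the sketch is that recovery goes through the mbstate_t object the program owns (memset, or assignment from a zero-initialized object), not through another call to mbrtowc on the damaged state.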

int rc; or size_t rc;

Examples of related conversion functions are now using size_t as the type of the return code variable. The type of rc has flip-flopped in the examples. Earlier this year the rationale that int made an example more readable accompanied an edit of this page. In the interest of consistency, should the type be int or size_t throughout the examples of the six related conversion functions? Newatthis (talk) 02:44, 22 September 2017 (PDT)

I still believe int rc is the right way, but with a second editor changing the code to size_t, I would look for a published design rationale (or at least another editor's professional opinion). To compare my experience with others, I just scanned the first 7 out of 45 pages of Debian code search results (grouped by package) for production use cases
  • size_t users: gcc, zsh, putty, radare2, libiconv, chromium-browser, mutt, xz-utils
  • int users: gnutls28, libiberty, gnupg and other users of regex_internal.c, newt, dbacl
  • both: readline (size_t in mbutil.c's _rl_find_next_mbchar_internal and int in complete.c's compute_lcd_of_matches)
seems current practice does not give a notable preference. --Cubbi (talk) 08:04, 22 September 2017 (PDT)
While I have no strong preference between the two types (probably due to inexperience), I do prefer self-consistency among the examples of the six functions and perhaps a consistency with the type mentioned in the six Synopses: size_t. Newatthis (talk) 03:49, 23 September 2017 (PDT)
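For comparison, a sketch of the two styles side by side. Both helper names are hypothetical. The int variant relies on converting (size_t)-1 and (size_t)-2 to int, which is implementation-defined in C; it is assumed here to yield -1 and -2, as it does on common platforms.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* size_t style: compare against the cast sentinel values.
   Returns 1 if the first n bytes of s start with a complete, valid,
   non-null multibyte character. */
int first_char_ok_size_t(const char *s, size_t n)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);
    size_t rc = mbrtowc(NULL, s, n, &state);
    return rc != (size_t)-1 && rc != (size_t)-2 && rc != 0;
}

/* int style: one signed comparison covers the null character and both
   error values. ASSUMPTION: the size_t-to-int conversion maps the
   sentinels to -1 and -2 (implementation-defined behavior). */
int first_char_ok_int(const char *s, size_t n)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);
    int rc = (int)mbrtowc(NULL, s, n, &state);
    return rc > 0;
}
```

The int form gives the shorter loop condition; the size_t form matches the type in the synopsis and needs no conversion.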