absent std::u8string in C++11


























Why does C++11 provide std::u16string and std::u32string but not std::u8string? Do we need to implement UTF-8 encoding ourselves or use additional libraries?




























  • 5




    Think again about what UTF-8 is... Is it not a multi-byte encoding? Now what data type in C++ typically represents a byte? Is it not char? And what do we have that is a string of char? It is std::string. So no specific std::u8string is really needed.
    – Some programmer dude
    Mar 20 '17 at 9:51








  • 1




    std::wstring uses wchar_t, whose size is underspecified (16 bits on some platforms, 32 on others). u16string and u32string patch that hole. std::string is already a string of char, and a char is a byte, that is, the smallest memory unit your C++ program can address. So u8string either could not (efficiently) exist or would be identical to std::string on a given platform (really, both), assuming CHAR_BIT >= 8.
    – Yakk - Adam Nevraumont
    Mar 20 '17 at 17:20








  • 1




    std::u16string and std::u32string exist because C++11 added new data types for them: char16_t and char32_t, respectively. No new data type was added for handling UTF-8 (just a new u8 prefix for literals). Historically, std::string has always been used for 8-bit string data, and that has not changed. But if you really want a u8string type, there is nothing stopping you from declaring your own typedef/using alias for it.
    – Remy Lebeau
    Mar 21 '17 at 0:12
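
To make the comments above concrete, here is a minimal sketch, assuming a C++11 compiler and an 8-bit char, of std::string used as a container for UTF-8 code units. The alias name utf8_string and the helper count_code_points are hypothetical, invented for this illustration; the helper assumes valid UTF-8 and does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical alias: in C++11 a "u8string" is just std::string by another name.
using utf8_string = std::string;

// Counts code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8; nothing is validated here.
inline std::size_t count_code_points(const utf8_string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {          // each char is converted to unsigned char
        if ((c & 0xC0u) != 0x80u) {      // not a continuation byte: starts a code point
            ++n;
        }
    }
    return n;
}
```

For the string "caf\xC3\xA9" (the word "café" written as explicit UTF-8 bytes), size() reports 5 code units while count_code_points reports 4 code points. std::string stores the bytes without complaint, but it knows nothing about the encoding.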
















c++11 unicode utf-8
















asked Mar 20 '17 at 9:48









Sergio













1 Answer
































C++20 adds char8_t and std::u8string. According to the proposal, the rationale is:




UTF-8 is the only text encoding mandated to be supported by the C++ standard for which there is no distinct code unit type. Lack of a distinct type for UTF-8 encoded character and string literals prevents the use of overloading and template specialization in interfaces designed for interoperability with encoded text. The inability to infer an encoding for narrow characters and strings limits design possibilities and hinders the production of elegant interfaces that work seamlessly in generic code. Library authors must choose to limit encoding support, design interfaces that require users to explicitly specify encodings, or provide distinct interfaces for, at least, the implementation defined execution and UTF-8 encodings.



Whether char is a signed or unsigned type is implementation defined and implementations that use an 8-bit signed char are at a disadvantage with respect to working with UTF-8 encoded text due to the necessity of having to rely on conversions to unsigned types in order to correctly process leading and continuation code units of multi-byte encoded code points.
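
The signedness problem described in this paragraph can be sketched as follows, assuming an 8-bit char. The helper names is_continuation and sequence_length are hypothetical, not taken from the proposal, and the code does not reject overlong or otherwise invalid lead bytes.

```cpp
// True if the code unit is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_continuation(char c) {
    // Convert to unsigned char first: where char is signed, bytes >= 0x80
    // are negative, so a range check such as `c >= 0x80` is never true.
    const unsigned char u = static_cast<unsigned char>(c);
    return u >= 0x80u && u <= 0xBFu;
}

// Length in code units of the sequence introduced by a lead byte,
// or 0 if the byte is a continuation byte or cannot start a sequence.
// Overlong lead bytes (0xC0, 0xC1) and values above 0xF4 are not rejected.
inline int sequence_length(char c) {
    const unsigned char u = static_cast<unsigned char>(c);
    if (u < 0x80u) return 1;             // 0xxxxxxx: ASCII
    if ((u & 0xE0u) == 0xC0u) return 2;  // 110xxxxx
    if ((u & 0xF0u) == 0xE0u) return 3;  // 1110xxxx
    if ((u & 0xF8u) == 0xF0u) return 4;  // 11110xxx
    return 0;                            // 10xxxxxx or 11111xxx
}
```

Without the conversion to unsigned char, the comparisons in both helpers would behave differently depending on whether the implementation makes char signed.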



The lack of a distinct type and the use of a code unit type with a range that does not portably include the full unsigned range of UTF-8 code units presents challenges for working with UTF-8 encoded text that are not present when working with UTF-16 or UTF-32 encoded text. Enclosed is a proposal for a new char8_t fundamental type and related library enhancements intended to remove barriers to working with UTF-8 encoded text and to enable generic interfaces that work with all five of the standard mandated text encodings in a consistent manner.
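
As a rough illustration of the overloading argument, the sketch below dispatches on the string type to name the encoding it implies. The function name encoding_of is hypothetical. The UTF-8 overload is guarded by the __cpp_lib_char8_t feature-test macro, so the block still compiles before C++20, where only four of the five encodings can be told apart by type.

```cpp
#include <string>

// Hypothetical function: each overload names the encoding implied by the type.
inline const char* encoding_of(const std::string&)    { return "execution (narrow)"; }
inline const char* encoding_of(const std::wstring&)   { return "execution (wide)"; }
inline const char* encoding_of(const std::u16string&) { return "UTF-16"; }
inline const char* encoding_of(const std::u32string&) { return "UTF-32"; }

#if defined(__cpp_lib_char8_t)
// Available from C++20: std::u8string is a distinct type, so this overload
// does not collide with the std::string one above.
inline const char* encoding_of(const std::u8string&)  { return "UTF-8"; }
#endif
```

In C++11 through C++17, UTF-8 text in a std::string is indistinguishable from text in the execution encoding, so it would select the first overload. Note also that in C++20 a u8 literal has type const char8_t[N] and no longer initializes a std::string directly.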







        answered Nov 21 at 3:41









        lz96






























