absent std::u8string in C++11

Why does C++11 provide std::u16string and std::u32string but not std::u8string? Do we need to implement UTF-8 encoding ourselves, or use additional libraries?

asked Mar 20 '17 at 9:48

Sergio

160110

5

Think again about what UTF-8 is... Is it not a multi-byte encoding? Now what datatype in C++ typically represents a byte? Is it not char? And what do we have that is a string of char? It is std::string. So no specific std::u8string is really needed.
– Some programmer dude
Mar 20 '17 at 9:51

1

std::wstring uses wchar_t, whose size is underspecified (16 bits on some platforms, 32 on others). u16string and u32string patch that hole. std::string is already a string of char, and a char is a byte, i.e. the smallest memory unit your C++ program can address. So on a given platform, u8string either could not (efficiently) exist or would be identical to std::string, assuming CHAR_BIT >= 8.
– Yakk - Adam Nevraumont
Mar 20 '17 at 17:20
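
To make the size argument concrete, here is a minimal sketch that reports the width of each character type on the current platform. The helper name bits_of is made up for this illustration; only CHAR_BIT and the character types themselves come from the language.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: width in bits of a character type on this platform.
template <typename CharT>
constexpr std::size_t bits_of() { return sizeof(CharT) * CHAR_BIT; }

// char is the smallest addressable unit, so its width is CHAR_BIT by definition.
static_assert(bits_of<char>() == CHAR_BIT, "char is exactly one byte");

// char16_t and char32_t have guaranteed minimum widths.
static_assert(bits_of<char16_t>() >= 16, "char16_t holds a UTF-16 code unit");
static_assert(bits_of<char32_t>() >= 32, "char32_t holds a UTF-32 code unit");

// wchar_t gets no such check here: its width differs between platforms,
// which is the hole that u16string and u32string patch.
```

Calling bits_of<wchar_t>() on different toolchains is a quick way to see the underspecified size mentioned above.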

1

std::u16string and std::u32string exist because C++11 added new data types for them: char16_t and char32_t, respectively. No new data type was added for handling UTF-8 (just a new u8 prefix for literals). Historically, std::string has always been used for 8-bit string data, and that has not changed. But if you really want a u8string type, there is nothing stopping you from declaring your own typedef/using alias for it.
– Remy Lebeau
Mar 21 '17 at 0:12
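
A rough sketch of the alias approach from the comment above. The names utf8_string and count_code_points are invented for this example, and the function assumes its input is already well-formed UTF-8. The bytes are written as hex escapes rather than a u8 literal so the snippet behaves the same before and after C++20.

```cpp
#include <cstddef>
#include <string>

// Hypothetical alias: it only documents intent, the type is still std::string.
using utf8_string = std::string;

// Count code points by counting every byte that is not a continuation
// byte (bit pattern 10xxxxxx). Assumes well-formed UTF-8; no validation.
std::size_t count_code_points(const utf8_string& s) {
    std::size_t n = 0;
    for (char c : s) {
        // char may be signed, so convert to unsigned char before
        // inspecting the byte value.
        const unsigned char byte = static_cast<unsigned char>(c);
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For example, utf8_string("caf\xC3\xA9") holds five bytes, but count_code_points reports four code points, because the accented letter occupies two code units.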

c++11 unicode utf-8

1 Answer
C++20 adds char8_t and std::u8string. According to the proposal, the rationale is:

UTF-8 is the only text encoding mandated to be supported by the C++ standard for which there is no distinct code unit type. Lack of a distinct type for UTF-8 encoded character and string literals prevents the use of overloading and template specialization in interfaces designed for interoperability with encoded text. The inability to infer an encoding for narrow characters and strings limits design possibilities and hinders the production of elegant interfaces that work seamlessly in generic code. Library authors must choose to limit encoding support, design interfaces that require users to explicitly specify encodings, or provide distinct interfaces for, at least, the implementation defined execution and UTF-8 encodings.

Whether char is a signed or unsigned type is implementation defined and implementations that use an 8-bit signed char are at a disadvantage with respect to working with UTF-8 encoded text due to the necessity of having to rely on conversions to unsigned types in order to correctly process leading and continuation code units of multi-byte encoded code points.

The lack of a distinct type and the use of a code unit type with a range that does not portably include the full unsigned range of UTF-8 code units presents challenges for working with UTF-8 encoded text that are not present when working with UTF-16 or UTF-32 encoded text. Enclosed is a proposal for a new char8_t fundamental type and related library enhancements intended to remove barriers to working with UTF-8 encoded text and to enable generic interfaces that work with all five of the standard mandated text encodings in a consistent manner.
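
The overloading point can be illustrated with a small sketch. The function names encoding_of and encoding_of_u8_literal are made up for this example; the char8_t overload is guarded by the __cpp_char8_t feature-test macro so the code still compiles before C++20.

```cpp
#include <string>

// Overloads that infer the encoding from the code unit type.
std::string encoding_of(const char*)     { return "narrow (implementation-defined)"; }
std::string encoding_of(const char16_t*) { return "UTF-16"; }
std::string encoding_of(const char32_t*) { return "UTF-32"; }

#if defined(__cpp_char8_t)
// Only possible once UTF-8 has its own code unit type (C++20).
std::string encoding_of(const char8_t*)  { return "UTF-8"; }
#endif

// Before C++20 a u8 literal is an array of char, so it selects the
// narrow overload; with char8_t it selects the UTF-8 overload.
std::string encoding_of_u8_literal() { return encoding_of(u8"x"); }
```

Without char8_t, encoding_of_u8_literal() cannot tell a UTF-8 literal from an ordinary narrow string, which is the limitation the proposal describes.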

answered Nov 21 at 3:41

lz96

8771229
