absent std::u8string in C++11
Why does C++11 provide std::u16string and std::u32string but not std::u8string? Do we need to implement the UTF-8 encoding ourselves, or use additional libraries?
c++11 unicode utf-8
5
Think again what UTF-8 is... Is it not a multi-byte encoding? Now what datatype in C++ typically represents a byte? Is it not char? And what do we have that is a string of char? It is std::string. So no specific std::u8string is really needed.
– Some programmer dude
Mar 20 '17 at 9:51
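To illustrate the point of this comment, here is a minimal sketch of working with UTF-8 held in a plain std::string. The function name count_code_points is made up for this example, and the sketch assumes the input is already valid UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a std::string that is assumed to hold
// valid UTF-8. Every byte that is not a continuation byte (bit pattern
// 10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (char c : utf8) {
        // Plain char may be signed, so convert to unsigned char before
        // inspecting the high bits.
        const unsigned char byte = static_cast<unsigned char>(c);
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For example, count_code_points("na\xC3\xAF" "ve") sees six bytes but reports five code points, because the two bytes 0xC3 0xAF together encode a single character. The string literal is split so that the hex escape does not swallow the following letters.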
1
std::wstring used wchar_t, and that size was underspecified (on some platforms, 16 bits and on others 32). u16string and u32string patch that hole. std::string is already a string of char, and a char is a byte, aka the smallest memory unit your C++ program can address. So either u8string could not (efficiently) exist, or it would be identical to std::string on a given platform (really, both), assuming CHAR_BIT >= 8.
– Yakk - Adam Nevraumont
Mar 20 '17 at 17:20
1
std::u16string and std::u32string exist because C++11 added new data types for them - char16_t and char32_t, respectively. No new data type was added for handling UTF-8 (just a new u8 prefix for literals). Historically, std::string has always been used for 8-bit string data, and that has not changed. But if you really want a u8string type, there is nothing stopping you from declaring your own typedef/using alias for it.
– Remy Lebeau
Mar 21 '17 at 0:12
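The alias suggested in this comment could look roughly like the sketch below. The namespace mytext and the helper e_acute are hypothetical names chosen for illustration. The alias lives in its own namespace so that it does not collide with the std::u8string that C++20 later added.

```cpp
#include <string>
#include <type_traits>

namespace mytext {  // hypothetical namespace for this sketch

// Documentation-only alias: it tells readers "this holds UTF-8",
// but it does not create a new type.
using u8string = std::string;

// The compiler cannot tell the alias and std::string apart.
static_assert(std::is_same<u8string, std::string>::value,
              "the alias names the same type as std::string");

// "é" spelled as explicit UTF-8 bytes, so the result does not depend
// on the encoding of the source file.
inline u8string e_acute() { return u8string("\xC3\xA9"); }

}  // namespace mytext
```

Because the alias is the same type as std::string, the two convert freely in both directions, which also means the compiler will not catch a non-UTF-8 string being passed where UTF-8 is expected. Note also that from C++20 onward a u8"..." literal has element type char8_t, so it no longer initializes a std::string directly.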
asked Mar 20 '17 at 9:48
Sergio
1 Answer
C++20 adds char8_t and std::u8string. According to the proposal, the rationale is:
UTF-8 is the only text encoding mandated to be supported by the C++ standard for which there is no distinct code unit type. Lack of a distinct type for UTF-8 encoded character and string literals prevents the use of overloading and template specialization in interfaces designed for interoperability with encoded text. The inability to infer an encoding for narrow characters and strings limits design possibilities and hinders the production of elegant interfaces that work seamlessly in generic code. Library authors must choose to limit encoding support, design interfaces that require users to explicitly specify encodings, or provide distinct interfaces for, at least, the implementation defined execution and UTF-8 encodings.
Whether char is a signed or unsigned type is implementation defined and implementations that use an 8-bit signed char are at a disadvantage with respect to working with UTF-8 encoded text due to the necessity of having to rely on conversions to unsigned types in order to correctly process leading and continuation code units of multi-byte encoded code points.
The lack of a distinct type and the use of a code unit type with a range that does not portably include the full unsigned range of UTF-8 code units presents challenges for working with UTF-8 encoded text that are not present when working with UTF-16 or UTF-32 encoded text. Enclosed is a proposal for a new char8_t fundamental type and related library enhancements intended to remove barriers to working with UTF-8 encoded text and to enable generic interfaces that work with all five of the standard mandated text encodings in a consistent manner.
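The overloading argument in the rationale can be sketched as follows. The function encoding_of and its return strings are hypothetical, not part of the proposal or the standard library. The char8_t overload is guarded by the __cpp_char8_t feature-test macro, so the sketch should also compile with pre-C++20 compilers, where it simply has no UTF-8 overload.

```cpp
#include <string>

// One overload per code unit type. Before char8_t existed, UTF-8 had to
// share the const char* overload with the execution encoding, so the two
// could not be told apart by type.
inline std::string encoding_of(const char*)     { return "narrow"; }
inline std::string encoding_of(const wchar_t*)  { return "wide"; }
inline std::string encoding_of(const char16_t*) { return "UTF-16"; }
inline std::string encoding_of(const char32_t*) { return "UTF-32"; }

#if defined(__cpp_char8_t)
// Available only when the compiler supports char8_t (C++20).
inline std::string encoding_of(const char8_t*)  { return "UTF-8"; }
#endif
```

With this in place, encoding_of(u"x") and encoding_of(U"x") select the UTF-16 and UTF-32 overloads by type alone. A call such as encoding_of(u8"x") reaches the UTF-8 overload only under C++20. In C++11 through C++17 the literal is an array of char and lands in the "narrow" overload, which is the gap the proposal describes.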
answered Nov 21 at 3:41
lz96