How not to do Unicode...
It all started with a small problem: how to print Unicode text to the Windows console while keeping the option to redirect the output to a file.
Let's say we have a program, Hello, that prints a few words in several languages to the screen.
#include <stdio.h>
int main()
{
    printf("Мир Peace Ειρήνη\n");
    return 0;
}
The program above is trivial and works fine under Windows if the current console code page is set to UTF-8. This can also be done from within the program by calling SetConsoleOutputCP(CP_UTF8).
Now a simple tweak... Instead of the standard C printf we use the standard C++ std::cout... It works fine with GCC. But under Visual C++ it prints squares...
If I try redirection, test.exe >test.txt, I get perfectly fine UTF-8 text...
I started researching the issue and found a post by Michael Kaplan, one of the Windows Unicode gurus.
I tried calling _setmode(_fileno(stdout), _O_U8TEXT) as recommended by Microsoft's Unicode guru and... my program crashed on the first attempt to write to the output stream.
Still searching for an answer, I came across this bug report...
Short summary:
- User: Can't print UTF-8 to console with std::cout
- MS: Closing - this is by design, see Michael Kaplan's article about writing to console
- User: But if I do what is suggested the program crashes, and I still can't write Unicode to the console
- MS: Reactivate the ticket if it does not work
- User: it does not!
- MS: Use wide output...
- User: I'd rather use fprintf in the first place!?
To summarize...
If you use Visual C++ you can't use UTF-8 to print text to the console through std::cout.
If you still want to, please read this amazingly long article about how to make wcout and cout work. It does not really give a simple solution, though: it ends up redefining the stream buffers...
So please, if you design an API or an operating system, do not use any kind of "Wide" API... This is the wrong way to do Unicode.
Which reminds me... Spread the word:
http://www.utf8everywhere.org/
Related Posts: http://blog.cppcms.com/post/62
Comments
UTF-8 is problematic for the few languages that would have to use 3 bytes for every character, instead of two with UTF-16.
Elazar, I suggest you read the article I linked:
http://www.utf8everywhere.org/
It would clear up many things, including the "overhead" of the 3 bytes.
I did, and that's what it says:
So if you don't care too much about characters not in the BMP (which is reasonable, many programs don't and get away with it), and you're handling Chinese, it makes sense to keep the internal strings in UTF-16, to keep memory footprint low.
Doesn't it?
Elazar,
You could also have seen that any markup or metadata includes many ASCII characters. In real life the content is stored together with such metadata, so in practice UTF-8 is often shorter than UTF-16 even for East Asian texts.
Using UTF-16 as a "storage optimization" is a micro-optimization that does not really help. If you want to optimize storage, you can use one of the many Unicode compression algorithms.
Code points outside of the BMP are an integral part of Unicode. If you ignore them you are programming with bugs.
In your environment it doesn't help. In other environments memory is scarce, and increasing std::string by 20% is a problem. Compression is not always a solution for the internal representation of strings in memory.
We all program with bugs; the question is how much these bugs cost us, and whether fixing them is worth it. I'm not sure it's worth it in the case of characters outside the BMP. Many world-class programs ignore those characters and the users are satisfied, e.g. IntelliJ IDEA: try to delete a character outside the BMP and you'll have to press backspace twice.
There are some frequently used non-BMP Chinese characters (in CJK Unified Ideographs Extensions B, C and D), so caring about storage optimization for Chinese text while not supporting non-BMP characters is either complete ignorance or hypocrisy.
@ybungalobill, good point. Still, you can use UTF-16 as an optimization and have first-class support for characters outside the BMP. Those two discussions aren't necessarily related.
Just wanted to thank you for this post, it was very helpful.