How not to do Unicode...

4/29/12, by artyom ; Posted in: Unicode and Localization; 8 comments

It all started with a small problem: how to print Unicode text to the Windows console, with the option of redirecting the output to a file.

Let's say we have a program, Hello, that prints a few words in several languages to the screen.

#include <stdio.h>

int main()
{
    printf("Мир Peace Ειρήνη\n");
    return 0;   
}

The program above is trivial and works fine under Windows if the current console code page is set to UTF-8. This can also be done from within the program by calling SetConsoleOutputCP(CP_UTF8).
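As a minimal sketch of the in-program fix mentioned above: the helper names enable_utf8_console and print_greeting_utf8 are made up for illustration, and the code assumes the source file is saved as UTF-8. SetConsoleOutputCP and CP_UTF8 come from windows.h; on other platforms the helper does nothing.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: switch the console's output code page to UTF-8.
// Returns true on success; on non-Windows platforms there is nothing to do.
inline bool enable_utf8_console() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;
#endif
}

// Prints the same greeting as the program above after switching the code page.
// Assumes the string literal is stored as UTF-8 bytes in the executable.
inline void print_greeting_utf8() {
    enable_utf8_console();
    std::printf("Мир Peace Ειρήνη\n");
}
```

Calling print_greeting_utf8() from main() is all that is needed to try it. Whether the glyphs actually show up also depends on the console font.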

Now a simple tweak... Instead of the standard C printf we use the standard C++ std::cout... It works fine with GCC. But under Visual C++ it prints squares...

If I try redirection test.exe >test.txt - I get perfectly fine UTF-8 text...

I started researching the issue and found a post by one of the Windows Unicode gurus, Michael Kaplan.

I tried to run _setmode(_fileno(stdout), _O_U8TEXT) as recommended by Microsoft's Unicode guru and... my program crashed on the first attempt to write to the output stream.

Continuing to search for an answer, I came across this bug report...
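The crash is consistent with how the Microsoft CRT documents this mode: once a stream is switched to _O_U8TEXT, only wide output functions are supported on it, and narrow output triggers an assertion. Below is a sketch of the wide-output route under that assumption. The names wide_greeting and print_wide_greeting are invented for illustration.

```cpp
#include <cstdio>
#include <cwchar>
#include <iostream>
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif

// The greeting as a wide literal; \u escapes avoid depending on the
// encoding of the source file.
inline const wchar_t* wide_greeting() {
    return L"\u041C\u0438\u0440 Peace \u0395\u03B9\u03C1\u03AE\u03BD\u03B7\n";
}

// Switch stdout to the CRT's UTF-8 "Unicode mode" (Windows only), then
// write through the wide stream. After this call, narrow output to
// stdout (printf, std::cout) is no longer valid in that CRT.
// On other platforms std::wcout needs a suitable locale; not shown here.
inline void print_wide_greeting() {
#ifdef _WIN32
    _setmode(_fileno(stdout), _O_U8TEXT);
#endif
    std::wcout << wide_greeting();
}
```

This prints correctly, but it forces every piece of code that writes to stdout to use wide strings, which is exactly the trade-off the post objects to.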

Short summary:

User: Can't print UTF-8 to console with std::cout
MS: Closing - this is by design, see Michael Kaplan's article about writing to console
User: But if I do what is suggested, the program crashes, and I still can't write Unicode to the console
MS: Reactivate the ticket if it does not work
User: It does not!
MS: Use wide output...
User: I'd rather use fprintf in the first place!?

To summarize...

If you use Visual C++ you can't use UTF-8 to print text to std::cout.

If you still want to, please read this amazingly long article about how to make wcout and cout work. It does not really give a simple solution, though: in the end it falls back to redefining the stream buffers...

So please, if you design an API or an operating system, do not use any kind of "Wide" API... This is the wrong way to do Unicode.
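To give an idea of what redefining a stream buffer involves, here is a rough sketch. It is not the code from the linked article. The class Utf8ConsoleBuf and the helper complete_utf8_prefix are invented names, and the approach (convert UTF-8 to UTF-16 and call WriteConsoleW when stdout is a real console, otherwise pass the bytes through) is just one common technique.

```cpp
#include <cstddef>
#include <cstdio>
#include <streambuf>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Length of the longest prefix of s that does not end in the middle of a
// UTF-8 sequence (assumes s is otherwise well-formed UTF-8).
inline std::size_t complete_utf8_prefix(const std::string& s) {
    std::size_t i = s.size(), back = 0;
    // Step back over at most 3 trailing continuation bytes (10xxxxxx).
    while (i > 0 && back < 3 &&
           (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) { --i; ++back; }
    if (i == 0) return s.size();  // no lead byte found, nothing to hold back
    unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    std::size_t need = lead < 0x80 ? 1 : (lead >> 5) == 0x06 ? 2
                     : (lead >> 4) == 0x0E ? 3 : (lead >> 3) == 0x1E ? 4 : 1;
    return back + 1 >= need ? s.size() : i - 1;
}

// Hypothetical stream buffer: collects UTF-8 bytes and writes them on flush.
class Utf8ConsoleBuf : public std::streambuf {
public:
    ~Utf8ConsoleBuf() override { sync(); }
protected:
    int_type overflow(int_type ch) override {
        if (!traits_type::eq_int_type(ch, traits_type::eof()))
            pending_.push_back(traits_type::to_char_type(ch));
        return traits_type::not_eof(ch);
    }
    std::streamsize xsputn(const char_type* s, std::streamsize n) override {
        pending_.append(s, static_cast<std::size_t>(n));
        return n;
    }
    int sync() override {
        // Write only whole characters; keep a split sequence for later.
        std::size_t n = complete_utf8_prefix(pending_);
        write_utf8(pending_.data(), n);
        pending_.erase(0, n);
        return 0;
    }
private:
    static void write_utf8(const char* data, std::size_t n) {
        if (n == 0) return;
#ifdef _WIN32
        HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
        DWORD mode = 0;
        if (GetConsoleMode(h, &mode)) {  // real console: go through UTF-16
            int wlen = MultiByteToWideChar(CP_UTF8, 0, data,
                                           static_cast<int>(n), nullptr, 0);
            if (wlen > 0) {
                std::wstring wide(static_cast<std::size_t>(wlen), L'\0');
                MultiByteToWideChar(CP_UTF8, 0, data, static_cast<int>(n),
                                    &wide[0], wlen);
                DWORD written = 0;
                WriteConsoleW(h, wide.data(), static_cast<DWORD>(wlen),
                              &written, nullptr);
            }
            return;
        }
#endif
        std::fwrite(data, 1, n, stdout);  // file or pipe: raw UTF-8 bytes
        std::fflush(stdout);
    }
    std::string pending_;
};
```

To try it, one would construct a Utf8ConsoleBuf, install it with std::cout.rdbuf(&buf), and restore the previous buffer before buf goes out of scope. Even this reduced version shows how much machinery is needed for something printf does in one line.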

Which reminds me... Spread the word:

http://www.utf8everywhere.org/

Related Posts: http://blog.cppcms.com/post/62

Comments

Elazar, at 4/30/12, 9:18 AM

UTF-8 is problematic for the few languages that would have to use 3 bytes for every character, instead of two with UTF-16.

artyom, at 4/30/12, 11:46 AM

Elazar, I suggest you read the article I linked:

http://www.utf8everywhere.org/

It would clear up many things, including the "overhead" of the 3 bytes.

Elazar, at 5/1/12, 2:54 PM

I did, and this is what it says:

For a dedicated storage of Chinese books, UTF-16 may still be used as a fair optimization. As soon as the text is retrieved from such storage, it should be converted to the standard compatible with the rest of the world.

So if you don't care too much about characters not in the BMP (which is reasonable, many programs don't and get away with it), and you're handling Chinese, it makes sense to keep the internal strings in UTF-16, to keep memory footprint low.

Doesn't it?

artyom, at 5/1/12, 3:15 PM

Elazar,

You may also have seen that any markup or metadata includes many ASCII characters. In real life, content is stored together with such metadata, so in practice UTF-8 is shorter than UTF-16 even for East Asian texts.
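A rough illustration of this size argument (not part of the original discussion): the helpers below count encoded bytes per code point. The names utf8_bytes, utf16_bytes and encoded_size are invented, and the sample strings are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Bytes needed to encode one code point in UTF-8.
inline std::size_t utf8_bytes(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed in UTF-16: one 2-byte code unit inside the BMP,
// a surrogate pair (4 bytes) outside it.
inline std::size_t utf16_bytes(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

struct EncodedSize {
    std::size_t utf8;
    std::size_t utf16;
};

// Total encoded size of a sequence of code points in both encodings.
inline EncodedSize encoded_size(const std::u32string& text) {
    EncodedSize total{0, 0};
    for (char32_t cp : text) {
        total.utf8 += utf8_bytes(cp);
        total.utf16 += utf16_bytes(cp);
    }
    return total;
}
```

For two bare CJK characters, U"\u4E2D\u6587", this gives 6 bytes in UTF-8 against 4 in UTF-16. Wrapped in minimal markup, U"<p>\u4E2D\u6587</p>", it gives 13 against 18, because every ASCII character costs two bytes in UTF-16.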

"Chinese, it makes sense to keep the internal strings in UTF-16, to keep memory footprint low.

Doesn't it?"

Using UTF-16 as a "storage optimization" is a micro-optimization that does not really help. If you want to optimize storage, you can use one of the many Unicode compression algorithms.

"So if you don't care too much about characters not in the BMP"

Code points outside the BMP are an integral part of Unicode. If you ignore them, you are programming with bugs.

Elazar, at 5/1/12, 5:54 PM

"Using UTF-16 as a "storage optimization" is a micro-optimization that does not really help. If you want to optimize storage, you can use one of the many Unicode compression algorithms."

In your environment it doesn't help. In other environments memory is scarce, and increasing std::string by 20% is a problem. Compression is not always a solution for the internal representation of strings in memory.

"Code points outside the BMP are an integral part of Unicode. If you ignore them, you are programming with bugs."

We all program with bugs; the question is how much these bugs cost us, and whether fixing them is worth it. I'm not sure it is worth it in the case of characters outside the BMP. Many world-class programs ignore those characters and their users are satisfied. For example, in IntelliJ IDEA, try to delete a character outside the BMP and you'll have to press backspace twice.
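For readers wondering why backspace has to be pressed twice: a character outside the BMP takes two UTF-16 code units (a surrogate pair), and code that deletes one code unit at a time removes only half of it. A small sketch of both sides, with invented helper names and assuming valid input:

```cpp
#include <cstddef>
#include <string>

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Append one code point to a UTF-16 string (assumes cp is a valid
// Unicode scalar value). Non-BMP code points become a surrogate pair.
inline void append_code_point(std::u16string& s, char32_t cp) {
    if (cp < 0x10000) {
        s.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        s.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        s.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Remove the last code point, not merely the last code unit.
inline void erase_last_code_point(std::u16string& s) {
    if (s.empty()) return;
    std::size_t n = 1;
    if (s.size() >= 2 && is_low_surrogate(s.back()) &&
        is_high_surrogate(s[s.size() - 2])) {
        n = 2;  // drop both halves of the surrogate pair
    }
    s.erase(s.size() - n);
}
```

Appending U+20000 (the first code point of CJK Extension B) yields the two units 0xD840 0xDC00, and erase_last_code_point removes both in one step. A plain pop_back() would leave a lone high surrogate behind.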

ybungalobill, at 5/5/12, 9:32 PM

There are some frequently used non-BMP Chinese characters (in CJK Unified Ideographs Extensions B, C and D), so caring about storage optimization for Chinese text and not supporting non-BMP characters at the same time, is either complete ignorance or hypocrisy.

Elazar, at 6/5/12, 12:14 PM

@ybungalobill, good point. Still, you can use UTF-16 as an optimization and have first-class support for characters outside the BMP. Those two discussions aren't necessarily related.

otstrel, at 2/12/13, 10:03 AM

Just wanted to thank you for this post, it was very helpful.

Some rights reserved, the content of this blog is available under Creative Commons Attribution License 2.5 Israel.
