Localization in 2009 and broken standard of C++.

10/8/09, by artyom ; Posted in: Unicode and Localization; 5 comments

The upcoming C++0x standard brings many goodies: both the core language and the standard library have been significantly improved.

However, there is one important part of the library that remains broken -- localization.

Let's write a simple program that prints a number to a file in C++:

#include <iostream>
#include <fstream>
#include <locale>


int main()
{
        // Set global locale to system default;
        std::locale::global(std::locale(""));

        // open file "number.txt"
        std::ofstream number("number.txt");

        // write a number to file and close it
        number<<13456<<std::endl;
}

And in C:

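#include <stdio.h>
#include <locale.h>

int main()
{
        /* Use the locale from the environment */
        setlocale(LC_ALL,"");
        FILE *f=fopen("number.txt","w");
        if(!f)
                return 1;
        /* The ' flag (a POSIX extension) turns on locale-specific digit grouping */
        fprintf(f,"%'d\n",13456);
        fclose(f);
        return 0;
}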

Let's run both programs with the en_US.UTF-8 locale and observe the following number in the output file:

13,456

Now let's run these programs with the Russian locale: LC_ALL=ru_RU.UTF-8 ./a.out. The C version gives us, as expected:

13 456

While the C++ version produces:

13<?>456

Invalid UTF-8 in the output! What happened? What is the difference between the C library and the C++ library, given that both use the same locale database?

According to the locale, the thousands separator in Russian is U+2002 -- EN SPACE, a code point that requires more than one byte in UTF-8. Now let's take a look at the C++ number formatting provider, std::numpunct. Its member function thousands_sep returns a single character, whereas in the C locale definition the thousands separator is represented as a string, so C has no single-character limitation like the C++ standard class does. With char streams, a multi-byte UTF-8 separator simply cannot fit into that one character.

This was just one simple and easily reproducible problem with the C++ standard locale facets. There are many more:

std::time_get -- is not symmetric with std::time_put (unlike strftime/strptime in C) and does not allow easy parsing of times with AM/PM marks.
std::ctype -- is very simplistic: it assumes that toupper/tolower can be done on a per-character basis, although case conversion may change the number of characters and is context dependent.
std::collate -- does not support collation strength (case sensitive or insensitive).
There is no way to specify a time zone different from the global time zone in time formatting and parsing.
Time formatting and parsing always assume the Gregorian calendar.

It's very frustrating that in 2009 such annoying, easily reproducible bugs exist and make the localization facilities totally useless in certain locales.

All the work I have recently done on localization support in the CppCMS framework has convinced me to make an important decision: ICU will be a mandatory dependency and will provide most of the localization facilities by default, because native C++ localization is a no-go...

The question is: "Will the C++0x committee revisit localization support in C++0x?"

Comments

nenTi, at 10/25/09, 10:03 AM

I love the idea of your project :) the blog was a little slow the last few days but I'm soooo cool using this framework :)

artyom, at 10/25/09, 9:16 PM

I'm glad to hear ;)

Viet, at 10/27/09, 12:32 PM

I'm very much looking forward to the growth and development of this project. It's awesome to write web apps/services in C++ ;)

Keep up the good work!

Fatman, at 8/26/10, 12:56 PM

Sigh. If I had a penny for every time this tripped up my code... I'd have 37p. If locales worked in MinGW, wide strings would work too - and coding drivers in Windows would be much easier.

Maybe I should mess around with STLport for a bit. You'd think the MinGW guys could snag the STLport code and squeeze it into the win32api libs.

artyom, at 8/28/10, 3:53 PM

Fatman,

"If locales worked in MinGW,"

In GCC locales are supported only under Linux.

If you want good localization support try to use Boost.Locale.

BTW, there is nothing wrong with wide string support in MinGW as long as you use 4.x series, 4.5 is best.


Some rights reserved, the content of this blog is available under Creative Commons Attribution License 2.5 Israel.
