
Posts in category ‘Unicode and Localization’.

New version of Boost.Locale released

Saturday, March 13, 2010, by artyom ; Posted in: Progress, Unicode and Localization; 0 comments

I'm glad to introduce an updated version of the Boost.Locale library. This library was designed for Boost and was created because of the needs of the CppCMS framework.

Boost.Locale is a library that brings high-quality localization facilities to C++ in a C++ way. It provides the natural glue between the C++ locales framework, iostreams, and the powerful ICU library.

New in this version:

  1. Fully redesigned break iterator interface.
  2. Added date_time and calendar support that allows manipulating dates in non-Gregorian calendars.
  3. Implemented a full set of unit tests.
  4. Added support for many platforms and compilers.
  5. Lots of bug fixes.
  6. Complete reference generated with Doxygen.
  7. Many tutorial updates, thanks to input from Markus Raab.

So if you are interested:

Introducing Boost.Locale

Sunday, November 8, 2009, by artyom ; Posted in: Progress, Unicode and Localization; 0 comments

After a long period of hesitation I came to understand that standard C++ locale facets are a no-go, and I started developing localization tools based on ICU that work in a C++-friendly way. Thus, Boost.Locale was born. It has just been announced to the Boost community for preliminary review.

Boost.Locale provides case conversion and folding, Unicode normalization, collation, numeric, currency, and date-time formatting, message formatting, boundary analysis, and code-page conversions in a C++-aware way.

For example in order to display a currency value it is enough to write this:

cout << as::currency  << 123.34 << endl;

And a currency value like "$123.34" would be formatted. Spelling out a number?

cout << as::spellout << 1024 << endl;

Very simple, and there is much more. This library will be the base for CppCMS localization support, and I hope it will also be accepted into Boost at some point in the future.

I've tested this library with:

Documentation

Source Code

The source code is available from the SVN repository:

svn co https://cppcms.svn.sourceforge.net/svnroot/cppcms/boost_locale/trunk

Building

You need CMake 2.4 or above and ICU 3.6 or above (4.2 is recommended). Check out the code and run:

cmake /path/to/boost_locale/libs/locale
make

Input and comments are welcome.

Localization in 2009 and broken standard of C++.

Thursday, October 8, 2009, by artyom ; Posted in: Unicode and Localization; 5 comments

There are many goodies in the upcoming C++0x standard. Both the core language and the standard library have been significantly improved.

However, there is one important part of the library that remains broken -- localization.

Let's write a simple program that prints a number to a file in C++:

#include <iostream>
#include <fstream>
#include <locale>


int main()
{
        // Set global locale to system default;
        std::locale::global(std::locale(""));

        // open file "number.txt"
        std::ofstream number("number.txt");

        // write a number to file and close it
        number<<13456<<std::endl;
}

And in C:

#include <stdio.h>
#include <locale.h>

int main()
{
        setlocale(LC_ALL,"");
        FILE *f=fopen("number.txt","w");
        /* 13456 is an int, so use %d; the ' flag asks for locale-specific grouping */
        fprintf(f,"%'d\n",13456);
        fclose(f);
        return 0;
}

Let's run both programs with the en_US.UTF-8 locale and observe the following number in the output file:

13,456

Now let's run both programs with the Russian locale: LC_ALL=ru_RU.UTF-8 ./a.out. The C version gives, as expected:

13 456

While the C++ version produces:

13<?>456

Incorrect UTF-8 output text! What happened? What is the difference between the C library and the C++ library, given that both use the same locale database?

According to the locale, the thousands separator in Russian is U+2002 (EN SPACE), a code point that requires more than one byte in UTF-8 encoding. Now take a look at the C++ number formatting provider, std::numpunct: its member function thousands_sep returns a single character. In the C locale definition, by contrast, the thousands separator is represented as a string, so it is not limited to a single character as in the C++ standard class.

This was just one simple and easily reproducible problem with the C++ standard locale facets. There are many more:

It's very frustrating that in 2009 such annoying, easily reproducible bugs exist and make the localization facilities totally useless in certain locales.

All the work I have recently done on localization support in the CppCMS framework has convinced me of an important decision: ICU will be a mandatory dependency and will provide most localization facilities by default, because native C++ localization is a no-go.

The question is: "Will the C++0x committee revisit localization support in C++0x?"

Unicode in 2009? Why is it so hard?

Wednesday, April 15, 2009, by artyom ; Posted in: Framework, Unicode and Localization; 8 comments

From my point of view, one of the most sorely missing features in C++ is good Unicode support. C++ provides some support via std::wstring and std::locale, but it is quite limited for real-life purposes.

This definitely makes the life of C++ (Web) Developers harder.

However, there are several tools and toolkits that provide such support. I checked six of them: the ICU library with its bindings for C++, Java, and Python; Qt3 and Qt4; glib/pango; and the native support in Java/JDK, C++, and Python.

I ran a somewhat challenging test for correctness:

Basic features like encoding conversions and simple case conversion such as "Артём" (my name in Russian) to "АРТЁМ" worked well in all tools. But the results of the more complicated tests were quite bad:

Results

Toolkit         To Upper Case   To Lower Case   Word Boundaries
C++             Fail            Fail            No Support
C++/ICU         Ok              Ok              Ok
C++/Qt4         Ok              Fail            Ok
C++/Qt3         Fail            Fail            No Support
C/glib+pango    Ok              Ok              Fail
Java/JDK        Ok              Ok              Fail
Java/ICU4j      Ok              Ok              Ok
Python          Fail            Fail            No Support
Python/PyICU    Ok              Ok              Ok

Description

ICU: Provides great support, but its API is old and very unfriendly in terms of C++ development. The documentation is really bad.

Qt4: Gives good results and a friendly API, and has great documentation, but as we can see, some tests failed. Generally useful in web projects.

Qt3: Provides very basic Unicode support; there is no reason to use it any more, especially now that Qt 4.5 is released under the LGPL.

C++/STL: Although basic support exists, the API is not too friendly to STL containers and requires explicit use of char * or wchar_t * and manual buffer allocation.

Glib: Gives quite good basic functionality, but finding word boundaries with Pango is really painful and does not work with Chinese. It has a very nice C API and is quite well documented. It uses UTF-8 internally, which makes life easier when working with C strings. It still requires wrapping its functionality in C++ classes or pulling in the huge GtkMM.

Python: Has very basic native Unicode support. PyICU has terrible documentation.

Java: The JDK provides quite good Unicode support, and it can be quite easily replaced by ICU4J (actually, most of the JDK's support is based on ICU).

Summary

It is a shame that in 2009 there is no high quality, well documented, C++ friendly toolkit to work with Unicode.

When will there be a Boost.ICU or Boost.Unicode, just like there is Boost.Math or Boost.Asio?

Thread Safe Implementation of GNU gettext

Saturday, April 26, 2008, by artyom ; Posted in: Progress, Templates, Unicode and Localization; 0 comments

There is a widely available software internationalization tool called GNU gettext. It is used as the base for almost all FOSS tools. It has bindings for almost every language and supports many platforms, including Win32.

How does it work? In any place where you need to display a string that may potentially be shown in a language other than English, you just write:

printf(gettext("Hello World\n"));

And you get the required translation for this string (if available).

In 99% of cases this is good enough. However, as you can see, there is no "target language" parameter. The language is defined for the entire application.

What happens if you need to display this string in different languages? You need to switch the locale, and this operation is not thread safe. In most cases you do not need to do this, because almost all applications "talk" in the single language that the user has asked for. However, this is not the case for web-based applications.

Certain web applications need to display content in several languages: think of a government site that should display information in three languages: Hebrew, Arabic, and English. So you may need to define the translation for each session you open or use.

So, if you write a multithreaded FastCGI application that supports different languages in a single instance, you can't use gettext.
