Posts in category ‘Unicode and Localization’.
New version of Boost.Locale released
I'm glad to introduce an updated version of the Boost.Locale library. This library was designed for Boost and created to meet the needs of the CppCMS framework.
Boost.Locale is a library that brings high-quality localization facilities to C++. It provides the natural glue between the C++ locales framework, iostreams, and the powerful ICU library, giving:
- Correct case conversion, case folding and normalization
- Collation including support of 4 Unicode collation levels.
- Date, time, timezone and calendar manipulations, formatting and parsing, including transparent support of calendars other than Gregorian.
- Boundary analysis for characters, words, sentences and line-breaks.
- Number formatting, spelling and parsing.
- Monetary formatting and parsing.
- Powerful message formatting, including support of plural forms, using GNU catalogs.
- Character set conversion.
- Transparent support of 8-bit character sets like Latin1.
- Support of char, wchar_t, and C++0x char16_t and char32_t strings and streams.
New in this version:
- Fully redesigned break iterator interface
- Added date_time and calendar support that allows manipulating dates in non-Gregorian calendars
- Implemented a full set of unit tests
- Added support of many platforms and compilers.
- Lots of bug fixes.
- Complete reference with Doxygen
- Many tutorial updates, thanks to input from Markus Raab
So if you are interested:
- Documentation: http://cppcms.sourceforge.net/boost_locale/html/index.html
- Tutorial: http://cppcms.sourceforge.net/boost_locale/html/tutorial.html
- Download: https://sourceforge.net/projects/cppcms/files/
Introducing Boost.Locale
After a long period of hesitation I came to a conclusion: the standard C++ locale facets are a no-go, and I started developing localization tools based on ICU that work in a C++-friendly way. Thus, Boost.Locale was born. It has just been announced to the Boost community for preliminary review.
Boost.Locale provides case conversion and folding, Unicode normalization, collation, numeric, currency, and date-time formatting, message formatting, boundary analysis, and code-page conversions, all in a C++-aware way.
For example, in order to display a currency value it is enough to write:
cout << as::currency << 123.34 << endl;
And a currency string like "$123.34" would be formatted. Spelling out a number?
cout << as::spellout << 1024 << endl;
Very simple. And there is much more. This library will be the base for CppCMS localization support, and I hope it will also be accepted into Boost at some point in the future.
I've tested this library with:
- Linux GCC 4.1, 4.3, with ICU 3.6 and 3.8
- Windows MSVC-9 (VC 2008), with ICU 4.2
- Windows MinGW with ICU 4.2
- Windows Cygwin with ICU 3.8
Documentation
- Full tutorials: http://cppcms.sourceforge.net/boost_locale/docs/
- Doxygen reference: http://cppcms.sourceforge.net/boost_locale/docs/doxy/html/
Source Code
It is available from the SVN repository:
svn co https://cppcms.svn.sourceforge.net/svnroot/cppcms/boost_locale/trunk
Building
You need CMake 2.4 or above and ICU 3.6 or above (4.2 recommended). So check out the code and run:
cmake /path/to/boost_locale/libs/locale
make
Input and comments are welcome.
Localization in 2009 and the broken C++ standard
There are many goodies in the upcoming C++0x standard. Both the core language and the standard library were significantly improved.
However, there is one important part of the library that remains broken -- localization.
Let's write a simple program that prints a number to a file in C++:
#include <iostream>
#include <fstream>
#include <locale>

int main()
{
    // Set the global locale to the system default
    std::locale::global(std::locale(""));
    // open file "number.txt"
    std::ofstream number("number.txt");
    // write a number to the file and close it
    number << 13456 << std::endl;
}
And in C:
#include <stdio.h>
#include <locale.h>

int main()
{
    setlocale(LC_ALL, "");
    FILE *f = fopen("number.txt", "w");
    fprintf(f, "%'d\n", 13456);  /* %'d, not %'f: 13456 is an int */
    fclose(f);
    return 0;
}
Let's run both programs under the en_US.UTF-8 locale and observe the following number in the output file:
13,456
Now let's run them with the Russian locale: LC_ALL=ru_RU.UTF-8 ./a.out. The C version gives us, as expected:
13 456
While the C++ version produces:
13<?>456
Incorrect UTF-8 output text! What happened? What is the difference between the C library and the C++ library that use the same locale database?
According to the locale, the thousands separator in Russian is U+2002 EN SPACE, a codepoint that requires more than one byte in UTF-8 encoding. But let's take a look at the C++ number-formatting facet, std::numpunct: its member function thousands_sep returns a single character. In the C locale definition, on the other hand, the thousands separator is represented as a string, so there is no single-character limitation as in the C++ standard class.
This was just one simple and easily reproducible problem with the C++ standard locale facets. There are many more:
- std::time_get is not symmetric with std::time_put (as strftime/strptime are in C) and does not allow easy parsing of times with AM/PM marks.
- std::ctype is very simplistic, assuming that toupper/tolower can be done on a per-character basis (case conversion may change the number of characters and is context dependent).
- std::collate does not support collation strength (case sensitive or insensitive).
- There is no way to specify a timezone different from the global timezone in time formatting and parsing.
- Time formatting/parsing always assumes the Gregorian calendar.
It's very frustrating that in 2009 such annoying, easily reproducible bugs exist and make the localization facilities totally useless in certain locales.
All the work I had recently done on localization support in the CppCMS framework convinced me of an important decision: ICU will be a mandatory dependency and will provide most localization facilities by default, because native C++ localization is a no-go...
The question is: will the C++0x committee revisit localization support in C++0x?
Unicode in 2009? Why is it so hard?
From my point of view, one of the biggest missing features in C++ is good Unicode support. C++ provides some support via std::wstring and std::locale, but it is quite limited for real-life purposes.
This definitely makes the lives of C++ (web) developers harder.
However, there are several tools and toolkits that provide such support. I checked six of them: the ICU library with bindings for C++, Java and Python; Qt3 and Qt4; glib/pango; and the native support of Java/JDK, C++ and Python.
I ran a slightly challenging correctness test:
- To Upper: Is German ß converted to SS?
- To Lower: Is Greek Σ converted to σ in the middle of the word and to ς at its end?
- Word Boundaries: Is Chinese 中文 actually two words?
Basic features like encoding conversions and a simple case conversion like "Артём" (my name in Russian) to "АРТЁМ" worked well in all tools. But the results of the more complicated tests were quite bad:
Results
Toolkit | To Upper Case | To Lower Case | Word Boundaries |
---|---|---|---|
C++ | Fail | Fail | No Support |
C++/ICU | Ok | Ok | Ok |
C++/Qt4 | Ok | Fail | Ok |
C++/Qt3 | Fail | Fail | No Support |
C/glib+pango | Ok | Ok | Fail |
Java/JDK | Ok | Ok | Fail |
Java/ICU4j | Ok | Ok | Ok |
Python | Fail | Fail | No Support |
Python/PyICU | Ok | Ok | Ok |
Description
ICU: Provides great support, but it has a very unfriendly and old API in terms of C++ development. The documentation is really bad.
Qt4: Gives good results and a friendly API, and has great documentation, but as we can see, some tests fail. Generally useful in web projects.
Qt3: Provides very basic Unicode support; there is no reason to use it any more, especially now that Qt 4.5 is released under the LGPL.
C++/STL: Even where basic support exists, the API is not friendly to STL containers and requires explicit usage of char * or wchar_t * and manual buffer allocation.
Glib: Gives quite good basic functionality, but finding word boundaries with Pango is really painful and does not work with Chinese. It has a very nice C API and is quite well documented. It uses UTF-8 internally, which makes life easier when working with C strings. It still requires wrapping its functionality in C++ classes or grabbing the huge GtkMM.
Python: Has very basic native Unicode support. PyICU has terrible documentation.
Java: The JDK provides quite good Unicode support, and it can quite easily be replaced by ICU4J (actually, most of the JDK is based on ICU).
Summary
It is a shame that in 2009 there is no high quality, well documented, C++ friendly toolkit to work with Unicode.
- For real purposes I would take the QtCore part of Qt4 or wrap the ICU library with a friendly API.
- Glib is good as well, and, very importantly, it is widely available on most UNIX systems.
When will there be a Boost.ICU or Boost.Unicode, just as there is Boost.Math or Boost.Asio?
Thread Safe Implementation of GNU gettext
There is a widely available software internationalization tool called GNU gettext. It is used as the base for almost all FOSS software tools. It has bindings for almost every language and supports many platforms, including Win32.
How does it work? In any place where you need to display a string that may potentially be shown in a language other than English, you just write:
printf(gettext("Hello World\n"));
And you get the required translation for this string (if available).
In 99% of cases this is good enough. However, as you can see, there is no "target language" parameter. It is defined once for the entire application.
What happens if you need to display this string in different languages? You need to switch the locale, and this operation is not thread safe. In most cases you do not need to do this, because almost all applications will "talk" in the single language that the user asked for. However, this is not the case for web-based applications.
Certain web applications allow you to display content in several languages: think of a government site that should display information in three languages: Hebrew, Arabic and English. So you may need to define the translation per each session you open or use.
So, if you write a multithreaded FastCGI application that supports different languages in a single instance, you can't use gettext.