Unicode in 2009? Why is it so hard?
From my point of view, one of the things C++ misses most is good Unicode support. C++ provides some support via std::wstring and std::locale, but it is quite limited for real-life purposes.
This definitely makes the life of C++ (Web) developers harder.
However, there are several tools and toolkits that provide such support. I checked 6 of them: the ICU library with its C++, Java and Python bindings, Qt3 and Qt4, glib/pango, and the native support in Java/JDK, C++ and Python.
I ran a somewhat challenging test for correctness:
- To Upper: Is German ß converted to SS?
- To Lower: Is Greek Σ converted to σ in the middle of the word and to ς at its end?
- Word Boundaries: Are Chinese 中文 actually two words?
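The two case-mapping checks are hard for a simple reason: neither can be done with a one-character-to-one-character lookup table. The following is only a minimal sketch of the two rules, not how any of the tested toolkits implement them. The helper names (`to_upper_sketch`, `to_lower_greek_sketch`, `is_basic_greek_letter`) are made up for illustration, only ASCII and the basic Greek letters are handled, and the final-sigma rule is a simplified version of the Final_Sigma condition from Unicode's SpecialCasing.txt (it ignores case-ignorable characters such as combining marks).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Code points used by the two case-mapping checks.
constexpr char32_t kSharpS       = 0x00DF;  // ß
constexpr char32_t kCapitalSigma = 0x03A3;  // Σ
constexpr char32_t kFinalSigma   = 0x03C2;  // ς
constexpr char32_t kSmallSigma   = 0x03C3;  // σ

// Hypothetical helper: true for the basic Greek letters only (Α-Ω, α-ω).
// A real implementation would consult the Unicode character database.
bool is_basic_greek_letter(char32_t c) {
    const bool upper = c >= 0x0391 && c <= 0x03A9 && c != 0x03A2;
    const bool lower = c >= 0x03B1 && c <= 0x03C9;
    return upper || lower;
}

// Upper-casing can change the length: ß becomes the two letters SS.
std::u32string to_upper_sketch(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        if (c == kSharpS) {
            out += U"SS";
        } else if (c >= U'a' && c <= U'z') {
            out += static_cast<char32_t>(c - (U'a' - U'A'));
        } else {
            out += c;  // all other characters are left untouched in this sketch
        }
    }
    return out;
}

// Lower-casing Σ depends on context: ς at the end of a word, σ elsewhere.
std::u32string to_lower_greek_sketch(const std::u32string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t c = in[i];
        if (c == kCapitalSigma) {
            const bool after_letter  = i > 0 && is_basic_greek_letter(in[i - 1]);
            const bool before_letter = i + 1 < in.size() && is_basic_greek_letter(in[i + 1]);
            out += (after_letter && !before_letter) ? kFinalSigma : kSmallSigma;
        } else if (c >= 0x0391 && c <= 0x03A9 && c != 0x03A2) {
            out += static_cast<char32_t>(c + 0x20);  // Α-Ω sit 0x20 below α-ω
        } else {
            out += c;
        }
    }
    assert(out.size() == in.size());  // context-sensitive, but length-preserving
    return out;
}
```

With these helpers, `to_upper_sketch(U"stra\u00DFe")` yields a string one character longer than its input, and `to_lower_greek_sketch` picks ς or σ by looking at the neighbours of each Σ. Any API that maps a single character at a time cannot express either behaviour.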
Basic features like encoding conversion and simple case conversion, such as "Артём" (my name in Russian) to "АРТЁМ", worked well in all tools. But the results of the more complicated tests were quite bad:
Results
Toolkit | To Upper Case | To Lower Case | Word Boundaries |
---|---|---|---|
C++ | Fail | Fail | No Support |
C++/ICU | Ok | Ok | Ok |
C++/Qt4 | Ok | Fail | Ok |
C++/Qt3 | Fail | Fail | No Support |
C/glib+pango | Ok | Ok | Fail |
Java/JDK | Ok | Ok | Fail |
Java/ICU4j | Ok | Ok | Ok |
Python | Fail | Fail | No Support |
Python/PyICU | Ok | Ok | Ok |
Description
ICU: Provides great support but... it has a very unfriendly and dated API from a C++ developer's point of view. The documentation is really bad.
Qt4: Gives good results, has a friendly API and great documentation, but as we can see, some tests failed. Generally useful in web projects.
Qt3: Provides very basic Unicode support. There is no reason to use it any more, especially now that Qt 4.5 is released under the LGPL.
C++/STL: Although basic support exists, the API is not very friendly to STL containers: it requires explicit use of char * or wchar_t * and manual buffer allocation.
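To make the point about pointer-based APIs concrete, here is a minimal sketch of upper-casing a std::wstring through the std::ctype<wchar_t> facet. The function name `upper_with_ctype` is made up for this example, and the classic "C" locale is used only so the sketch does not depend on which locales are installed. What a given locale does with non-ASCII characters is implementation-dependent.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Upper-case a wide string through the std::ctype facet.
// The facet works on a [begin, end) pointer range and maps one wchar_t
// to exactly one wchar_t, so the result always has the input's length.
std::wstring upper_with_ctype(std::wstring text,
                              const std::locale& loc = std::locale::classic()) {
    if (text.empty()) {
        return text;
    }
    const std::size_t original_size = text.size();
    const std::ctype<wchar_t>& facet = std::use_facet<std::ctype<wchar_t> >(loc);
    wchar_t* begin = &text[0];  // the API wants raw pointers, not iterators
    facet.toupper(begin, begin + text.size());
    assert(text.size() == original_size);  // in-place mapping cannot grow the string
    return text;
}
```

Because the conversion happens in place on a fixed-size range, there is no room for ß to become SS: whatever the locale, `upper_with_ctype(L"stra\u00DFe")` still has six characters.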
Glib: Gives quite good basic functionality, but finding word boundaries with Pango is really painful and does not work with Chinese. It has a very nice C API and is quite well documented. Internally it uses UTF-8, which makes life easier when working with C strings. It still requires wrapping its functionality in C++ classes or pulling in the huge GtkMM.
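As a small illustration of why UTF-8 sits comfortably in ordinary byte strings, the sketch below counts code points in a std::string by skipping continuation bytes. It is not Glib code; the function name `utf8_code_points` is made up, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx. Assumes well-formed input.
std::size_t utf8_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII: starts a new code point
        }
    }
    assert(count <= utf8.size());  // never more code points than bytes
    return count;
}
```

For example, "Артём" encoded as UTF-8 is the ten bytes `"\xD0\x90\xD1\x80\xD1\x82\xD1\x91\xD0\xBC"`, and the function reports five code points for it. Code points are still not the same thing as user-visible characters, which is where a real Unicode library comes in.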
Python: has very basic native Unicode support. PyICU has terrible documentation.
Java: The JDK provides quite good Unicode support, and it can quite easily be replaced by ICU4J (actually most of the JDK is based on ICU).
Summary
It is a shame that in 2009 there is no high-quality, well-documented, C++-friendly toolkit for working with Unicode.
- For real purposes I would take the QtCore part of Qt4 or wrap the ICU library with a friendly API.
- Glib is good as well and, very importantly, it is available on most UNIX systems.
When will there be a Boost.ICU or Boost.Unicode, just like there is Boost.Math or Boost.Asio?
Comments
How about ustring (as used by Gtkmm)?
Actually, the Gtkmm ustring is a wrapper around the Glib string and has all its features. However, Gtkmm is much heavier than Glib because it pulls in all the GTK libraries as dependencies, not only glib.
Also, it didn't have a good API for word boundaries, so I actually had to use the C Pango API for this purpose.
Actually I'm working with Qt4 right now and it seems nice: it has lots of nice features and you certainly can't complain when it comes to the documentation. Plus, Qt 4.5 is LGPL, so that makes things easier. If it did fail, it's probably a bug, so you should probably report it so it can get fixed.
Actually, Qt4 was quite fine, but you still need to remember that I did only basic tests...
Such things are usually quite hard to fix, and I pointed only at one specific, well-known problem. How many others exist?
Glib::ustring is part of Glibmm, not Gtkmm.
In raw C++ I typically use libiconv with std::vector<uint8_t> (for UTF-8) and std::vector<utf16_t> (for UTF-16) to handle the buffers.
What does "Word Boundaries" mean? Natural language word segmentation?
Which class/function of Qt4/ICU supports it?
Yes
ICU: BreakIterator, Qt4: QTextBoundaryFinder
Very good! thanks.