Unicode in 2009? Why is it so hard?

4/15/09, by artyom ; Posted in: Framework, Unicode and Localization; 8 comments

From my point of view, one of the features C++ lacks most is good Unicode support. C++ provides some support via std::wstring and std::locale, but it is quite limited for real-life purposes.

This definitely makes the life of C++ (web) developers harder.

However, there are several tools and toolkits that provide such support. I checked 6 of them: the ICU library with its bindings for C++, Java and Python; Qt3 and Qt4; glib/pango; and the native support of Java/JDK, C++ and Python.

I ran a somewhat challenging set of correctness tests:

To Upper: Is German ß converted to SS?
To Lower: Is Greek Σ converted to σ in the middle of the word and to ς at its end?
Word Boundaries: Are Chinese 中文 actually two words?
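
As a rough sketch of what these three checks look like in code, here is how they could be written with nothing but Python 3 built-ins. This assumes a current Python 3 interpreter (3.3 or later), which is not the Python that produced the results reported in this post, so the outcome may well differ. The function names and the sample word are made up for illustration.

```python
import re

# Sample inputs (illustrative choices, not taken from the original tests).
SHARP_S = "ß"
GREEK_UPPER = "ΟΔΥΣΣΕΥΣ"   # contains Σ both inside the word and at its end
CHINESE = "中文"


def upper_case(text):
    """Full upper-casing; a correct result may be longer than the input."""
    return text.upper()


def lower_case(text):
    """Full lower-casing; the result for Greek Σ depends on its position."""
    return text.lower()


def regex_words(text):
    """Naive word split: runs of word characters, no dictionary involved."""
    return re.findall(r"\w+", text)


if __name__ == "__main__":
    # To Upper: does ß expand to SS?
    print(upper_case(SHARP_S))
    # To Lower: is Σ lowered to σ inside the word and to ς at its end?
    print(lower_case(GREEK_UPPER))
    # Word Boundaries: how many pieces does a plain regex split produce?
    print(regex_words(CHINESE))
```

A plain regex split has no notion of language-aware segmentation, so the last check only shows what a naive approach does; it is not a substitute for a real boundary-analysis API.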

Basic features such as encoding conversions and simple case conversion, for example "Артём" (my name in Russian) to "АРТЁМ", worked well in all tools. But the results of the more complicated tests were quite bad:

Results

Toolkit	To Upper Case	To Lower Case	Word Boundaries
C++	Fail	Fail	No Support
C++/ICU	Ok	Ok	Ok
C++/Qt4	Ok	Fail	Ok
C++/Qt3	Fail	Fail	No Support
C/glib+pango	Ok	Ok	Fail
Java/JDK	Ok	Ok	Fail
Java/ICU4j	Ok	Ok	Ok
Python	Fail	Fail	No Support
Python/PyICU	Ok	Ok	Ok

Description

ICU: Provides great support, but it has a very unfriendly, old-fashioned API from a C++ development point of view. The documentation is really bad.

Qt4: Gives good results, has a friendly API and great documentation, but as we can see, it fails the To Lower test. Generally useful in web projects.

Qt3: Provides very basic Unicode support. There is no reason to use it any more, especially now that Qt 4.5 is released under the LGPL.

C++/STL: Although basic support exists, the API is not very friendly to STL containers and requires explicit use of char * or wchar_t * and manual buffer allocation.

Glib: Gives quite good basic functionality, but finding word boundaries with Pango is really painful and does not work with Chinese. It has a very nice C API and is quite well documented. It uses UTF-8 internally, which makes life easier when working with C strings. It still requires wrapping its functionality in C++ classes or pulling in the huge GtkMM.

Python: Has very basic native Unicode support. PyICU has terrible documentation.

Java: The JDK provides quite good Unicode support, and it can be replaced quite easily by ICU4J (actually, most of the JDK is based on ICU).

Summary

It is a shame that in 2009 there is no high-quality, well-documented, C++-friendly toolkit for working with Unicode.

For real-world purposes I would take the QtCore part of Qt4 or wrap the ICU library with a friendly API.
Glib is good as well and, very importantly, it is widely available on most UNIX systems.

When will there be a Boost.ICU or Boost.Unicode, just as there is Boost.Math or Boost.Asio?

Comments

Stu, at 5/1/09, 6:10 PM

How about ustring (as used by Gtkmm)?

artyom, at 5/1/09, 7:46 PM

Actually, Gtkmm's ustring is a wrapper around the Glib string and has all its features. However, Gtkmm is much heavier than Glib because it pulls in all the GTK libraries as dependencies, not only glib.

Also, it didn't have a good API for word boundaries, so I actually had to use the C Pango API for this purpose.

Marius, at 6/3/09, 2:59 PM

Actually I'm working with QT4 right now and it seems nice, has lots of nice features and you certainly can't complain when it comes to the documentation. Plus, Qt 4.5 is LGPL so that makes things easier. If it did fail it's probably a bug so you should probably report it so it can get fixed.

artyom, at 6/3/09, 8:11 PM

Actually, Qt4 was quite fine, but you still need to remember that I did only basic tests...

Such things are usually quite hard to fix, and I pointed out only one specific, well-known problem. How many others exist?

Andrew, at 7/19/09, 1:01 AM

Glib::ustring is part of Glibmm, not Gtkmm.

In raw C++ I typically use libiconv with std::vector<uint8_t> (for UTF-8) and std::vector<utf16_t> (for UTF-16) to handle the buffers.

tinytin, at 4/29/13, 4:50 PM

What does "Word Boundaries" mean? Natural language word segmentation?

Which class/function of QT4/ICU support it?

artyom, at 4/30/13, 11:15 PM

> What does "Word Boundaries" mean? Natural language word segmentation?

Yes

> Which class/function of QT4/ICU support it?

ICU: BreakIterator, Qt4: QTextBoundaryFinder

tinytin, at 5/12/13, 9:14 AM

Very good! thanks.

Some rights reserved, the content of this blog is available under Creative Commons Attribution License 2.5 Israel.
