Issue #1 November 2004

Code Internationalization 101

Why Internationalize?

Most people speak, read, and write only one language fluently; a few more of us are bilingual. That is probably why open source developers who have never attempted to support internationalization before are understandably intimidated by the number of languages spoken by the open source community.

Fortunately, internationalization technology within open source has progressed to the point where most of the heavy lifting has already been done. In particular, software internationalization has developed so that the work can be divided cleanly at the most intuitive point: the labor of translating and the labor of software engineering. The maintainers of the software are responsible for using software tools to separate the text and other translatable or localizable material from the rest of the source code. Contributors then translate this material and provide the translated text back to the maintainer. While the maintainer might attempt to translate the project into one or maybe even two languages, a project's ability to branch out to other languages and locales really depends on the translation contributors. Thus, for this relationship to work well, it is very important for the maintainer to keep the localizable data constantly updated and merged, and to make that data easily accessible to translation contributors.

Internationalization may initially sound like a politically correct, let's-not-leave-anybody-out, good-karma philosophy, but it is also good for business and good for the quality of the code within an open source project. Major software companies figured out long ago that properly internationalizing a project results in a logarithmic cost curve; that is, while the initial cost of making a product support multiple languages is higher than that of an application that does not consider internationalization up front, the cost of adding languages further down the product life cycle is much lower. In contrast, an attitude of “worry about foreign languages later” usually results in a linear cost for the addition of each language. Sometimes the cost is even worse than linear once large character set languages (for example, Chinese, Japanese, and Korean) and complex-script languages (Hebrew and Arabic) enter the picture. Retrofitting internationalization later in the cycle can be an extremely difficult procedure, requiring extensive rewrites to use internationalization-aware functions within libraries.

Even free software projects with no business revenue stand to benefit from internationalization. Adding internationalization makes the software accessible to a far greater user base. Even if it were true that all developers read and write English, thanks to the origins and roots of the most popular computer languages and tools (it is not true), developers are more motivated to contribute to and participate in a project that is accessible to the users within their own communities or locales.

The Low Hanging Fruit: Translating Strings

The three most common areas of software where internationalization needs to be considered are, in increasing order of difficulty: output internationalization, input internationalization, and text/string processing. For the purposes of an introduction, this article concentrates on the easiest part of internationalization: translating the strings in the program.

Output is the first type of internationalization most people consider when they decide to make a project multilingual because it is the most obvious change. A user interface that can display text in another language is impressive to most, and it provides the bragging-rights screenshots used to prove to the world that the program is indeed global.

Fortunately, internationalizing a program's output is much easier than handling input or text and string processing, thanks to well-designed tools and modern system libraries. The most difficult part of handling output, the display of non-trivial languages (usually languages that need more than the Latin alphabet and accent marks), is no longer a problem on modern Linux-based operating systems. The keyword here is modern, though, and in Linux years, libraries that are three years old can be considered ancient. Older open source libraries tend to have spotty support for Unicode, the character set that allows for character exchange and compatibility without requiring code forks to handle different character sets and encoding variations. Linux distributions such as Red Hat Linux 9 and Fedora Core 1 tend to use a complete Unicode stack. That is, everything from the kernel to the X Window System and the GUI (be it GNOME/GTK+ or KDE/Qt) natively understands Unicode. For output, using a modern version of the GUI toolkit is essential; many of the complexities of right-to-left display and of languages with thousands of characters are handled by this layer. Usually Unicode support is provided by accepting strings encoded in UTF-8, a multi-byte encoding that is 100% compatible with ASCII.

If the code that displays output is properly developed and uses modern libraries (in other words, it always uses the internationalized version of a function), the next step is to separate the string data from the code. This is trivial to do with tools such as gettext(), even with source that is already written so that strings are mixed with the code. The gettext() runtime is part of glibc, the GNU C library, and the accompanying gettext package includes a tool that scans existing source code for uninternationalized strings and bundles them together in an easy-to-translate collection within one file. The most common complication that arises during the separation process is when the automated code modification process gets confused by indirection or changes a static string to a function call where such a change is not permitted. Fortunately, gettext() has a method of dealing with this. The next most common problem is when programmers tie the grammar of a language to the code logic. For example, in English, it is not uncommon to see code that adds an “s” to the end of a noun when the number preceding it is not 1.
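
One common fix within the same framework is ngettext(), which selects between a singular and a plural message-id based on a count; translators then supply as many plural forms as their language needs in the PO file. The following is a minimal sketch (the message text and function name are made up for illustration):

/* plural-sketch.c */
#include <libintl.h>
#include <stdio.h>

/* Report how many files were copied without hard-coding English plural
   logic; ngettext() chooses the singular or plural form (or, once a
   translated MO file is installed, the locale's own plural rules)
   based on n. */
void report_copies(unsigned long n)
{
  printf(ngettext("%lu file copied.\n", "%lu files copied.\n", n), n);
}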

Necessary Packages for This Example

The following three packages (along with all of their dependencies) were used to create the Hello World example in this section. A Fedora Core Workstation installation already has these packages. A Fedora Core Desktop installation needs to retrieve these packages, along with their required packages, using either up2date, yum, or Add/Remove Applications (redhat-config-packages).

  • gettext-0.12.1-1

  • libgnomeui-devel-2.8.0-1 (and all of its requirements)

  • gcc-3.2.2-2

Internationalizing the GNOME Hello World Application

To illustrate the very basics of internationalization, this section uses the Hello World program for GNOME, written in C. While a simple text console application would have been less than five lines long and easier to understand, a graphical Hello World program is a better real-world example because the font handling and internationalized text logic in modern toolkits provide access to more languages than a terminal-based application.

Our English-only GNOME application appears in Example 1, “English-only Hello World Application”.


/* hello-world.c */
#include <gnome.h>
int main(int argc, char *argv[]) {
  GtkWidget *window, *label;
  gnome_init("helloworld", "1.0", argc, argv);
  window = gnome_app_new("helloworld", "Hello World!");
  gtk_signal_connect(GTK_OBJECT(window), "delete_event", 
                     GTK_SIGNAL_FUNC(gtk_main_quit), NULL);
  label = gtk_label_new("Hello World!");
  gnome_app_set_contents(GNOME_APP(window), label);
  gtk_widget_show_all(window);
  gtk_main();
  return 0;
}

Example 1. English-only Hello World Application

Note that this Hello World program is missing many macros that would have been generated by the autoconf and automake tools. While these tools are important for real-world projects (especially for internationalization), some of the automation and abstraction they provide is removed to make the example clearer. Because there is no makefile, compile the example from Example 1, “English-only Hello World Application” with the gcc command. Using a Fedora Core system (development packages installed), for example, execute the following command:

gcc -o hello-world hello-world.c `pkg-config --cflags --libs libgnomeui-2.0`

The program itself basically provides three bits of functionality: display of a title, display of a label, and a reaction (quitting) to closing the window. There are two primary candidates for internationalization: the title of the program and the label. Note that the tool tips that may appear around the application are normally internationalized by the window manager, not the application.

Table 1, “Internationalization Functions” describes the gettext() family functions (and the _() macro) used in this article. Each entry lists the prototype followed by a description of its behavior.

char *dgettext(const char *domain_name, const char *msgid)

Looks up and finds the msgstr for the given msgid, within the MO file specified by domain_name (plus the extension .mo). If the msgstr is found, it is returned in the encoding appropriate for the locale, as specified by LC_CTYPE. If it is not found, the msgid is returned, without any encoding conversion.

char *textdomain(const char *domain_name)

Specifies the name of the MO file to use when the domain is not explicitly provided as a parameter. The domain_name may consist of any valid filename characters (anything except the solidus, also known as the slash). An extension of .mo is assumed. If NULL is passed, the current domain is returned (which is messages if it has not been previously set).

char *bindtextdomain(const char *domain_name, const char *path)

Sets the path prefix to search for the given domain name. The actual filename used is locale/LC_MESSAGES/domain_name.mo, where locale is determined by the run environment. If path is NULL and domain_name is not, the current path prefix is returned (the default is /usr/share/locale). Using relative paths (such as .. and .) is possible but not recommended.

char *gettext(const char *msgid)

Exactly the same as dgettext() except that the domain is assumed to be either messages or whatever has been most recently set with textdomain().

char *_(const char *msgid)

A macro simplification of gettext(). This is the preferred form as it is less visually obtrusive in the source code.

Table 1. Internationalization Functions

Modifying the Source

Only five line changes are needed for internationalization: the addition of three lines and the wrapping of two strings with the gettext() function, as displayed in Example 2, “Changes for Internationalization”.


#define ENABLE_NLS 1
#include <gnome.h>
int main(int argc, char *argv[]) {
  GtkWidget *window, *label;
  bindtextdomain("helloworld", "/usr/share/locale");
  textdomain("helloworld");
  gnome_init("helloworld", "1.0", argc, argv);
  window = gnome_app_new("helloworld", gettext("Hello World!"));
  gtk_signal_connect(GTK_OBJECT(window), "delete_event", 
                     GTK_SIGNAL_FUNC(gtk_main_quit), NULL);
  label = gtk_label_new(gettext("Hello World! How are you?"));
  gnome_app_set_contents(GNOME_APP(window), label);
  gtk_widget_show_all(window);
  gtk_main();
  return 0;
}

Example 2. Changes for Internationalization

The ENABLE_NLS (NLS is an abbreviation for National Language Support) macro exists to ensure that the functions used to internationalize the program are properly defined by the included headers. Without it, the C preprocessor generates no real code for the other new functions. Normally, this definition is inside the config.h file generated by the autoconf tool.
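
For illustration, the effect of ENABLE_NLS can be pictured with the kind of guard recommended in the gettext documentation. This is a simplified sketch, not the literal contents of the GNOME headers:

#ifdef ENABLE_NLS
# include <libintl.h>
# define _(String) gettext(String)
#else
/* Internationalization disabled: strings pass through untranslated. */
# define _(String) (String)
#endif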

The textdomain() function is used to cause all subsequent gettext() function calls (explained below) to search for messages only within a particular domain. Without it, the gettext() calls search in the global domain, 'messages.' Actually, using textdomain() isn't absolutely necessary, even if the program isn't using the global domain 'messages,' because setting the domain by explicitly passing the text domain as a parameter to a variant of the gettext() function (either dgettext() or dcgettext()) is possible. But this is tedious; the dgettext() function is better suited for cases where overriding the current domain (for example, borrowing the translations from another package) is needed.
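
For instance, a program could borrow a single string from another package's catalog without disturbing its own default domain. A minimal sketch follows (the “otherpackage” domain name is hypothetical):

#include <libintl.h>

/* Look up "Cancel" in the (hypothetical) "otherpackage" catalog instead
   of the default domain set by textdomain(). */
const char *borrowed_cancel_label(void)
{
  return dgettext("otherpackage", "Cancel");
}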

The bindtextdomain() function can further reduce ambiguity in the case where two or more programs use the same domain name by specifying the directory where the file corresponding to that domain resides. Without it, the system defaults to /usr/share/locale/. Strictly speaking, bindtextdomain() is redundant in this example because it is set to the default, but as real non-trivial projects use GNOMELOCALEDIR as the second parameter to bindtextdomain(), it should be considered a necessary part of the initialization process. Using an absolute path for bindtextdomain() is recommended.

Finally, all strings which need to be translated are wrapped with the gettext() function. The original English strings become keys for lookup within a compiled data file (called a MO file because the filenames usually end in the suffix .mo) that consists of nothing but keys and their translations. The key, known as the message-id, is the entire untranslated string (sometimes called a segment). If the message-id and translation pair does not yet exist in the MO file, the message-id itself is returned. Thus, if a string is yet to be translated, gettext() returns the untranslated phrase. Using untranslated phrases as message-ids is preferable to arbitrary identifiers because displaying the original program language for untranslated segments is better than displaying a meaningless symbol to the end-user. Also, using the original language as the key allows for the quick gettextizing of existing, non-internationalized code.

While it is technically possible to have the message-ids in any language and in any character set and/or encoding, it is good practice to ensure that message-ids are in ASCII only, which limits the original language to English. The reason for this limitation has to do with gettext()'s ability to convert the character set/encoding of the translated phrase on the fly as appropriate for the current locale. Though useful in that it eliminates any concern over whether the character set used by the translator and the programmer match, gettext() does not convert the encoding of the message-id when it is returned in response to not finding a translation within a MO file. ASCII is usually safe because most character encodings/sets such as UTF-8 and Latin-1 are backwards compatible with ASCII.

Now that the gettext() function has been explained, it should be noted that in real code, the lengthy identifier gettext is almost never used; instead, its macro shortcut, the underscore, is used.

Thus, the portion of the Hello World program containing the gettext() calls would normally appear as the code shown in Example 3, “gettext() Calls as Underscores”.


window = gnome_app_new("helloworld", _("Hello World!"));
gtk_signal_connect(GTK_OBJECT(window), "delete_event", 
                   GTK_SIGNAL_FUNC(gtk_main_quit), NULL);
label = gtk_label_new(_("Hello World! How are you?"));

Example 3. gettext() Calls as Underscores
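
One caveat when the underscore shortcut is used: the xgettext() extraction tool described in the next section does not treat _ as a keyword by default, so it is typically invoked with an extra option, for example:

xgettext --keyword=_ hello-world.c

The internationalized example in this article calls gettext() directly, so the plain invocation shown below works as-is.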

Extracting the Messages

Once the source has been peppered with one or more gettext() function calls, the xgettext() program is used to create a master resource template — called a POT file (although the default filename used by xgettext() is messages.po, not messages.pot). This template, which consists of nothing but message-ids (the untranslated phrases) and translated strings (which begin blank), is used to create one PO (an abbreviation for Portable Object) file for every language.

To see xgettext() in action with our internationalized Hello World example, execute the following command:


xgettext hello-world.c

xgettext() produces a file called messages.po in the current directory (because the default domain is messages) as shown in Example 4, “messages.po Produced by xgettext()”.


# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2004-03-01 06:31-0500\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: 8bit\n"

#: hello-world.c:16
msgid "Hello World!"
msgstr ""

#: hello-world.c:20
msgid "Hello World! How are you?"
msgstr ""

Example 4. messages.po Produced by xgettext()

After the initial comments comes the header, which must have a msgid of “”. Note that, as in C, double-quoted strings separated by nothing but whitespace are concatenated, so all the lines from Project-Id-Version to Content-Transfer-Encoding are really just one long string (with embedded newlines represented by the C escape sequence \n) for the first msgstr. It is possible to add additional headers (some PO file editing programs do this). It is also possible to remove some of the headers, although the program that transforms PO files into MO files, msgfmt, complains about this if it checks the headers.

The most important part of the header is the value of CHARSET. It needs to be an encoding name recognized by iconv (the command iconv -l lists them), and it needs to be correct. Correctness means that the contents of all of the msgstr sections are actually encoded in the specified character set. This is harder than it sounds, considering that many people in many different locations, using many different editors and other translation software, may contribute translations.

Tip:
Using UTF-8 is advantageous because, compared to many legacy encodings such as Latin-1 and Latin-2, it is easy to detect non-UTF-8 text that gets accidentally inserted. While glibc's conversion libraries detect encoding-level errors as well as references to non-characters when using UTF-8, they can't detect more advanced (but rare) flaws such as invalid combining sequences or text in the wrong normalization form. Utilities such as W3C's charlint can detect and fix these.
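
For example, one quick sanity check for a PO file that declares charset=UTF-8 (such as the ja.po file created below) is to run it through iconv; if the file contains stray bytes from another encoding, the conversion fails with an error:

iconv -f UTF-8 -t UTF-8 ja.po > /dev/null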

As an example, a phony “Japanese translation” of messages.po, called ja.po, is displayed in Example 5, “Japanese Translations”.


# Translations for simple GNOME Hello World app (ja.po)
# Copyright (C) 2004 Red Hat, Inc.
# Adrian Havill <havill@redhat.com>, 2004.
msgid ""
msgstr ""
"Project-Id-Version: helloworld 1.0\n"
"POT-Creation-Date: 2004-03-01 03:13-0500\n"
"MIME-Version: 1.0\n"
"PO-Revision-Date: 2004-03-01 03:13-0500\n"
"Last-Translator: Adrian Havill <havill@redhat.com>\n"
"Language-Team: Japanese <ja@li.org>\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

#: hello-world-i18n.c:11
msgid "Hello World!"
msgstr "Konnichiwa!"

#: hello-world-i18n.c:15
msgid "Hello World! How are you?"
msgstr "Konnichiwa! Genki desu ka?"

Example 5. Japanese Translations

The msgstr entries for the two lines would normally be in native Japanese characters, but for the sake of a simple example, the translations are transcribed in Latin letters. Note that the fuzzy directive (along with the comma preceding it) that was present in the original POT file was removed. The presence of the term fuzzy in a comment above a message-id is xgettext()'s way of notifying the translator that the msgstr is probably not acceptable as-is and needs to be fixed. The only fuzzy in the example PO file is the header, though, because this is a new template and contains no merged strings from an existing project. Without an argument telling it to do otherwise, the msgfmt program that generates MO files ignores msgid/msgstr pairs that are prefixed with the fuzzy flag in the comment preceding it.
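
For illustration, a fuzzy entry produced during such a merge might look like the following (the pairing shown here is hypothetical); msgfmt skips it until a translator reviews the translation and removes the flag:

#, fuzzy
msgid "Hello World! How are you?"
msgstr "Konnichiwa!"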

The following command transforms the PO file into a MO file that the gettext() system can use:

msgfmt --check --statistics ja.po

Because there is no output filename specified, the default domain is used to create the file messages.mo. The --check and --statistics options are not necessary, but they are useful for catching simple errors and for giving an idea of how much is translated and how much work remains to be done.

Installing the Machine Readable Translations

The MO file that msgfmt produced should normally be renamed to the domain name that is used by the program. For most programs created and maintained with autoconf, this means the value of the macro PACKAGE. In our Hello World example, textdomain() is called with the value helloworld, so messages.mo should be installed as helloworld.mo in the proper locale directory and made world-readable. As root, execute the following command:

cp -i messages.mo /usr/share/locale/ja/LC_MESSAGES/helloworld.mo

/usr/share/locale/ is the standard location for MO files (providing it is not changed with bindtextdomain()). ja is the standardized locale name for Japanese (locales set to ja_JP and ja_JP.UTF-8 also use this directory). The LC_MESSAGES directory name never changes.

Testing the New Translations

Many environment variables affect the displayed translations. The most reliable way to find out which locale a system is currently using for LC_MESSAGES and LC_CTYPE is to execute the locale command:

locale

LC_MESSAGES determines which MO file is used. LC_CTYPE, while normally used to classify characters and determine encoding type, is also used by GNOME to determine how real fonts map to aliases such as Sans. Normally these locale settings are all set to the same value. This is either done by setting the LANG environment variable, which defines the locale for any unset environment variable beginning with LC_, or by setting LC_ALL, which takes precedence over LANG.
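
For example, assuming both locales are installed, LC_ALL wins in the following command, and the Japanese catalog is used even though LANG points elsewhere:

LANG=en_US.UTF-8 LC_ALL=ja_JP.UTF-8 ./hello-world &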

Because our artificial Japanese example uses Latin characters rather than real Japanese characters, the locale controlling the fonts can be changed independently and the messages remain readable. The following combination of commands demonstrates this:

LC_MESSAGES=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 ./hello-world &
LC_MESSAGES=ja_JP.UTF-8 LC_CTYPE=ja_JP.UTF-8 ./hello-world &
LC_MESSAGES=en_US.UTF-8 LC_CTYPE=ja_JP.UTF-8 ./hello-world &
LC_MESSAGES=ja_JP.UTF-8 LC_CTYPE=en_US.UTF-8 ./hello-world &

Rather than set the LC_ variables individually for the sake of demonstration, it is more common to just set the LANG environment variable.
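
For example, assuming the Japanese UTF-8 locale is installed, the following single setting produces the same fully Japanese result as the second command above:

LANG=ja_JP.UTF-8 ./hello-world &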

Working With Gettext in Actual Practice

The examples in this article show how gettext() works at a basic level. Most of the time, people working with gettext() do so through existing projects. Most open source projects that use automake and similar tools automatically update PO files from the sources, merging in and preserving existing translations as the project evolves; a sketch of that merge step appears at the end of this section. Also, many good tools, such as gTranslator, KBabel, and Emacs, exist for working efficiently with PO files. Because gettext() is so easy to use and implement, open source software that is not gettext()-enabled is becoming more and more uncommon. Once the basics are understood, learning the remaining details and the efficiency tools should not be difficult at all.
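
For illustration, the merge step typically looks something like the following; the file names follow the earlier examples, and the --keyword=_ option matters only when the underscore shortcut is used in the source:

xgettext --keyword=_ --output=helloworld.pot hello-world.c
msgmerge --update ja.po helloworld.pot

msgmerge preserves existing translations, marks changed strings as fuzzy, and adds empty entries for any new strings so the translator can fill them in.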

About the Author

Adrian Havill currently works as a software engineer for Red Hat; he started in Tokyo, Japan, and later moved to Raleigh, NC. His favorite area of development is internationalization. In his spare time he enjoys ice hockey and spending time with his bilingual family.