White Paper: Internationalization in GTK+



Terminology

A number of terms come up regularly when discussing software written for international use. Internationalization is the process of making software support a range of languages; localization is the process of adapting the messages and conventions of a program to those of a particular locale. These terms are often abbreviated i18n and l10n respectively, after the number of letters between the first and last letters of each word.
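The abbreviation rule can be written out in a few lines. This is only an illustrative sketch; the function name numeronym is made up for this example and is not part of GTK+ or any library.

```python
def numeronym(word):
    """Abbreviate a word as first letter + count of inner letters + last letter.

    Hypothetical helper for illustration only.
    """
    if len(word) < 3:
        # Nothing lies between the first and last letters, so leave it alone.
        return word
    return f"{word[0]}{len(word) - 2}{word[-1]}"


for term in ("internationalization", "localization"):
    print(term, "->", numeronym(term))
```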

The locale is the set of settings for the user's country and/or language. It is usually specified by a string like en_GB. The first two letters identify the language (English) and the second two the country (the United Kingdom). The locale includes information such as the country's currency and how numbers are formatted but, more importantly, it describes the characters used for the language. The character set is the set of characters used to display the language. When characters are stored in memory or on disk, a given character set may be stored in different ways; the way it is stored is termed the encoding.
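As a rough illustration of these terms, the sketch below splits a locale name into its two parts and then stores one character from a single character set in two different encodings. The helper split_locale is hypothetical and assumes the simple language_COUNTRY form described above, with no further suffixes. The choice of Japanese, EUC-JP, and Shift_JIS is an assumption made for the example; both names refer to codecs in Python's standard library.

```python
def split_locale(name):
    """Split a locale name of the form language_COUNTRY into its two parts.

    Hypothetical helper: assumes no suffix follows the country code.
    """
    language, _, country = name.partition("_")
    return language, (country or None)


print(split_locale("ja_JP"))

# One character, two encodings: the byte sequences differ, but both
# decode back to the same character.
ch = "日"
euc = ch.encode("euc_jp")
sjis = ch.encode("shift_jis")
print(euc.hex(), sjis.hex())
print(euc.decode("euc_jp") == sjis.decode("shift_jis"))
```

A program that reads the bytes therefore has to know which encoding was used; the character set alone does not determine them.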

Handling international text is complicated by the fact that the encoding may be quite different from that used for English or other European text, especially for languages with large character sets, such as many Asian languages. Because such a character set contains more than 256 characters, a character cannot always fit into a single byte.

There are two basic strategies for dealing with such characters. In a multi-byte encoding, each character is represented as a variable number of bytes. For example, in the commonly used EUC encoding, bytes below 128 are simply ASCII characters, while bytes of 128 or above are taken in pairs to represent extended portions of the character set. Since multi-byte encodings are usually backwards compatible with ASCII, they are convenient for programs that just want to use strings opaquely. However, because characters can occupy different numbers of bytes, they are awkward to handle when a program needs to examine a string character by character. In wide-character encodings, every character is the same width. (For instance, each character is two bytes.) Wide-character strings are generally easier to manipulate, but have poor backwards compatibility.
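The difference between the two strategies can be seen in a short sketch. It makes several assumptions that are not in the text above: EUC-JP stands in for the EUC family, the helper split_euc_chars is hypothetical and handles only one-byte and two-byte sequences (it ignores the three-byte sequences that EUC-JP also defines), and UTF-32 stands in for a wide-character encoding, so each character is four bytes here rather than two.

```python
def split_euc_chars(data):
    """Split EUC-encoded bytes into one byte string per character.

    Simplified, hypothetical helper: a byte below 0x80 is one ASCII
    character, and any other byte is assumed to start a two-byte pair.
    """
    chars = []
    i = 0
    while i < len(data):
        width = 1 if data[i] < 0x80 else 2
        chars.append(data[i:i + width])
        i += width
    return chars


text = "GTK+ 日本語"

# Multi-byte: the ASCII part is stored unchanged, but finding a given
# character means scanning from the start of the string.
euc = text.encode("euc_jp")
print([len(c) for c in split_euc_chars(euc)])

# Wide: every character has the same width, so the n-th character sits
# at a fixed offset, but even ASCII text no longer looks like ASCII.
WIDTH = 4
wide = text.encode("utf_32_le")


def nth_wide_char(data, n):
    """Return the n-th character of a fixed-width string by direct indexing."""
    return data[n * WIDTH:(n + 1) * WIDTH].decode("utf_32_le")


print(nth_wide_char(wide, 5))
print("A".encode("utf_32_le"))
```

The zero bytes in the wide form of an ASCII letter are one reason such strings cannot be passed to code that expects ordinary zero-terminated ASCII strings.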

