|
There are a number of terms that are commonly used when
discussing writing software to be used in international
settings. First, the terms internationalization
and localization refer to the
process of making software support a range of languages,
and to the process of adapting the messages and conventions
of a program to those of a particular locale, respectively.
These terms are often abbreviated i18n and l10n respectively,
after the number of letters (18 and 10) between the first
and last letters of each word.
The locale is the set of settings
for the user's country and/or language. It is usually
specified by a string like en_GB.
The first two letters identify the language (English),
the second two the country (the United Kingdom).
Included in the locale is information about things
like the currency for the country and how numbers are
formatted, but, more importantly, it describes
the characters used for the language. The
character set is the set of
characters used to display the language.
When storing characters in memory or on disk, the characters
of a given character set may be represented as bytes in
different ways; the representation used is termed the encoding.
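
To make the distinction between character set and encoding
concrete, here is a small Python sketch. It is only an
illustration: the choice of the character and of the two
codecs (Python's euc_jp and shift_jis, which both cover the
JIS X 0208 character set) is an assumption made for the
example, not something the text above prescribes.

```python
# One character from one character set (JIS X 0208),
# stored under two different encodings.
text = "日"

euc = text.encode("euc_jp")      # EUC-JP byte sequence
sjis = text.encode("shift_jis")  # Shift_JIS byte sequence

# The bytes differ even though the character is the same.
print(euc.hex(), sjis.hex())

# Decoding with the matching codec recovers the character.
print(euc.decode("euc_jp") == sjis.decode("shift_jis"))
```

Decoding with the wrong codec would produce either an error
or unrelated characters, which is why the encoding of stored
text always has to be known.
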
Handling international
text is complicated by the fact that the encoding
(especially for languages with large character sets,
such as Chinese, Japanese, and Korean) may be quite
different from that used for English or European text:
because there are more than 256 characters in the
character set, each character cannot fit into a
single byte.
There are two basic strategies for dealing with
such characters. In a multi-byte
encoding, each character is represented as
a variable number of bytes. As an example of such
an encoding, in the commonly used EUC encoding,
bytes less than 128 are simply ASCII characters,
while bytes of 128 or greater are taken in
pairs to represent extended portions of the character
set. Since multi-byte encodings are usually backwards
compatible with ASCII, they are convenient to handle
for programs that just want to use strings opaquely.
However, because characters may occupy different numbers
of bytes, such strings are awkward for a program that
needs to examine the string character by character.
In wide-character encodings,
every character is the same width. (For instance, each
character is two bytes.) Wide-character strings
are generally easier to manipulate, but have poor
backwards compatibility with ASCII-based software.
|