[Freeipa-devel] python-ldap has unicode issues

John Dennis jdennis at redhat.com
Mon Aug 20 18:11:19 UTC 2007


On Fri, 2007-08-17 at 18:40 -0400, John Dennis wrote:
> I'm trying to wrap up for the day now so I'm being brief, but on Monday
> I can work with you to solve these issues, dig up my documentation, etc.

As promised, here is the documentation; I hope it helps. I had to
write it up this morning from scattered notes I kept elsewhere. It is
probably worth posting someplace; it took me a long time to unravel.


-- 
John Dennis <jdennis at redhat.com>

-------------- next part --------------
Python Internationalization (i18n) Issues

John Dennis (jdennis at redhat.com)


Overview:

i18n strings require integers with a large numeric range in order to
encode their characters (code points), which are drawn from a large
set of numeric values. There are different ways to encode the code
points of a code set (the unicode code set is an international
standard). An obvious solution is to increase the width of the integer
containing a character from the traditional 1 byte per character to 2
or 4 bytes per character. When unicode characters (code points) are
encoded in 2 or 4 byte integers those encodings are called UCS-2 and
UCS-4 respectively.

However, it is also possible to encode characters outside the 1 byte
ASCII range as a sequence of single bytes. UTF-8 is such an encoding
and is popular because traditional 1 byte ASCII is a proper subset of
UTF-8. When it is necessary to encode a character whose code point
requires a larger integer, UTF-8 uses multiple adjacent bytes in the
string sequence. This means the number of characters in a UTF-8
encoded string may not equal its byte length, and simple indexing is
not possible. On the other hand, encodings which use wide characters
such as UCS-2 and UCS-4 retain the property that the number of
characters equals the sequence length, so simple indexing may be used
to access individual characters. UTF-8 is a popular unicode encoding
because it is common for strings to contain only ASCII characters,
making storage and processing efficient in the common case while still
allowing for the expanded character set of unicode when necessary.
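
For example, in Python 2 the same 4 character string occupies 5 bytes
once encoded in UTF-8 (a minimal illustration; any non-ASCII character
would do):

>>> u = u'caf\u00e9'         # 4 characters, the last outside ASCII
>>> s = u.encode('utf-8')    # UTF-8 encoded byte string
>>> len(u)                   # character count
4
>>> len(s)                   # byte length, not character count
5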

How Python deals with i18n strings:

Python has two builtin types which can contain strings: 'str', which
is a conventional byte sequence where each byte is presumed to contain
a character, and 'unicode', which depending on how Python was compiled
is implemented using wide 2 or 4 byte characters (UCS-2 or UCS-4
respectively). The Red Hat (Fedora) Python uses UCS-4 for unicode,
thus a single unicode character is represented as a 32-bit integer.
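
A quick way to verify which wide encoding a given Python 2 interpreter
was built with is to inspect sys.maxunicode (shown here on a UCS-4
build such as Red Hat's):

>>> import sys
>>> sys.maxunicode    # 1114111 (0x10FFFF) on UCS-4 builds, 65535 on UCS-2 builds
1114111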

As is typical when dealing with i18n issues, Python has confused the
vocabulary of i18n. Python's 'unicode' object is a character
*encoding* because it describes how a code point will be represented
(e.g. 2 or 4 bytes). Interpreting the numeric code point to know what
character that integer value represents requires knowing the
code set. Unicode is a code set which may be encoded in a number of
popular encodings such as UCS-2, UCS-4, UTF-8, and others. Python is
making an assumption that its wide character encodings will contain
characters from the unicode code set. The fact that a character is
encoded in 2 or 4 bytes does not in and of itself imply the code set
is unicode, just as the fact that a character is encoded in a single
byte does not mean the code set is ASCII (i.e. it could be EBCDIC or
UTF-8).

There are two fundamental ways an i18n string can enter a Python
application: either hard coded via the 'u' unicode type coercion
(e.g. u'some i18n string') or, most commonly, by looking up an i18n
string in a translation catalog via the gettext package using the _()
function (e.g. _('some i18n string')).
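
For illustration, the two entry points look like this (the catalog
name 'pkgname' is a placeholder; the install call is explained in
detail below):

import gettext
gettext.install('pkgname', unicode=False, codeset='utf-8')

s1 = u'some i18n string'      # hard coded via the u'' literal
s2 = _('some i18n string')    # looked up in the translation catalog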

It is possible for gettext to return strings in a variety of
encodings. By default gettext will return strings in the unicode
encoding Python was compiled with (UCS-2 or UCS-4). However, gettext
can be configured to return strings in other encodings as well, such
as UTF-8. (Note the bad vocabulary: the gettext parameter which
controls the encoding is called 'codeset', but encoding != codeset.)

Recall Python has two ways of representing a character string: str
objects, which encode each character as a single byte, and unicode
objects, which encode each character as 2 or 4 bytes (UCS-2 or
UCS-4). When Python outputs a unicode string object it has to decide
what encoding it will translate the string to prior to output. str
objects are not subject to output encoding translation; it is assumed
a str object is a byte sequence (possibly a character string, but not
necessarily) which is not subject to interpretation. However, unicode
objects most certainly contain i18n strings which might need to be
re-encoded to match the character encoding expected by the receiver.

Note: the term "string output" is used whenever Python passes a string
in Python's internal representation to a non-Python interface. This
occurs when calling external libraries.

Whenever Python outputs a unicode string it will attempt to convert it
to the default encoding set in site.py. It is not possible for a
Python application to set the default encoding; this is prohibited (it
is not known why an application is prohibited from resetting the
default encoding). In many Python installations the default encoding
is set to ASCII :-( Thus when Python attempts to output a unicode
string (UCS-2 or UCS-4) it will try to apply the default encoding to
it (typically ASCII) and the translation will fail because many wide
UCS code points (characters) lie outside the ASCII numeric range.
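
The failure is easy to reproduce in a stock Python 2 interpreter whose
default encoding is ASCII:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> str(u'caf\u00e9')    # implicit conversion applies the default encoding
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)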

When Python outputs a string (including when Python passes that string
to another library) the encoding must match what the receiver expects
or the string will not be understood. Here are the possible options:

1) use str objects encoded in ASCII, output will be ASCII

2) use str objects encoded in UTF-8, output will be UTF-8

3) use unicode objects and depend on the default encoding translation,
output will be in the default encoding (set in site.py, usually
ASCII).

4) use unicode objects and manually encode the output by explicitly
calling string.encode().

Here are the downsides of the above options:

Option 1 does not support i18n, so it is a non-starter for i18n
applications.

Option 3 fails with most i18n strings because the default encoding is
wrong and the default encoding cannot be changed by the application.

Option 4 should be avoided because it depends on programmer diligence
to always remember to encode the string and to know when to encode it,
and it clutters up the source, diminishing readability.

Option 2's trade-offs are examined below; it is the approach we adopt.

Knowing which encoding to use on output requires knowing what the
receiver expects. Only option 4 (manual use of string.encode()) allows
a different encoding per receiver. If all the external libraries the
Python application "links" with expect the same encoding then things
get much simpler because one can rely on a global encoding
scheme. However, if different receivers expect different encodings one
is forced into option 4, complicated by the fact that one needs to
know the required encoding each time string.encode() is called. Yuck!
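
Here is a sketch of option 4 in practice. The receiver functions are
hypothetical stubs; the point is that every call site must know its
receiver's expected encoding:

# hypothetical receivers, stubbed out so the sketch is self-contained
def ldap_modify(dn, value): pass     # pretend this expects UTF-8
def legacy_log(line): pass           # pretend this expects Latin-1

msg = u'caf\u00e9'
ldap_modify('cn=example', msg.encode('utf-8'))
legacy_log(msg.encode('iso-8859-1'))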

A common scenario is for a Python application to link with external
entities all of which expect UTF-8 (common because some Linux
distributions adopt a UTF-8 convention). In this case we need to
assure the strings we output are encoded in UTF-8. There are two ways
to accomplish this:

1) Set the default encoding to UTF-8 and internally use wide unicode
strings. But as of this writing one cannot set the default encoding in
an application, thus this is not really an option. By default gettext
returns unicode strings, so Python programmers get confused because
two of Python's defaults are in direct conflict with one another:
gettext returns unicode, and on output those unicode strings are
automatically translated using the default encoding, usually ASCII,
which often throws an encoding exception because the unicode cannot
be converted to ASCII.

2) Internally use UTF-8, not unicode. Thus all i18n strings will be
conventional byte oriented 'str' objects, not wide unicode
(UCS-*). Python will happily pass these UTF-8 strings around as plain
strings, and because they are plain strings it will not attempt to
apply an output encoding translation to them. Thus on output an i18n
string encoded in UTF-8 remains UTF-8! The downside is that len() no
longer returns the correct number of characters (if there are
multibyte characters in the string) and it is difficult to apply basic
per-character string operations (e.g. indexing or slicing). However,
it is not common to need to perform such string operations on i18n
strings originating from an i18n translation catalog.
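
When a character count or per-character operation really is needed, a
UTF-8 str can be temporarily decoded to unicode (a minimal sketch):

>>> s = 'caf\xc3\xa9'           # UTF-8 encoded str; '\xc3\xa9' is one character
>>> len(s)                      # byte length
5
>>> len(s.decode('utf-8'))      # true character count
4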

Our adopted solution is option 2. We eschew the use of unicode
strings; all strings are represented as 'str', not unicode, and are
encoded in UTF-8. We instruct gettext not to return translations via
_() as unicode, but rather in UTF-8, by specifying the gettext codeset
to be UTF-8. A consequence of this is that any i18n string which is
not obtained by translation catalog lookup must use
string.encode('utf-8').
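
For example, a hard coded i18n string entering the application via a
u'' literal is converted immediately at its point of definition:

title = u'r\u00e9sum\u00e9'.encode('utf-8')   # stored as a UTF-8 str per the convention above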

Using _() in Python:

In Python i18n applications the _() function is used to look up the
translated string. There are two basic approaches to making the _()
function available.

1) Globally. The _() function is installed in Python's builtin name
space. Every module will see this function unless it is overridden by
a local scope. gettext.install() installs _() in the builtin name
space. This is appropriate for applications with a single translation
encoding because of its global effect.

Example:

import gettext
gettext.install(domain    = 'pkgname',            # translation catalog name
                localedir = '/usr/share/locale',  # where the translation catalog is installed
                unicode   = False,                # do not return translations as Python unicode objects,
                                                  # otherwise on output they would be encoded to the
                                                  # site default encoding
                codeset   = 'utf-8')              # return translated strings in the UTF-8 encoding;
                                                  # note the unfortunate parameter name: it's not a
                                                  # codeset, it's an encoding
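
Once installed, _() is visible in every module and, with the settings
above, returns UTF-8 encoded str objects:

print _('some i18n string')    # a UTF-8 encoded str, safe for UTF-8 receivers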
        
2) Per Module. The _() function is bound to the module's name space;
it is available only to callers within the module. This is appropriate
for libraries (i.e. modules) because it is local to the module and
will not affect the application which imports the module. Not only
does this restrict the encoding to the module, but it also makes the
translation catalog local to the module; thus a module can have its
own translations independent of the main application.

Example:

import gettext
_ = gettext.translation(domain    = 'pkgname',            # translation catalog name
                        localedir = '/usr/share/locale',  # where the translation catalog is installed
                        fallback  = True,                 # untranslated strings do not raise an exception
                        codeset   = 'utf-8'               # return translated strings in the UTF-8 encoding
                        ).lgettext
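
Within the module, _() then returns UTF-8 encoded str objects just as
in the global case. Note that lgettext honors the codeset given above;
without a codeset, lgettext falls back to the locale's preferred
encoding, which is not necessarily UTF-8.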




