
[Bug 243541] default encoding is ascii, should be UTF-8, produces exceptions for i18n applications



Dave Malcolm <dmalcolm redhat com> changed:

           What    |Removed                     |Added
            Version|9                           |12

--- Comment #3 from Dave Malcolm <dmalcolm redhat com>  2009-12-14 15:50:37 EDT ---
John: I spent some time reviewing this today; here are my notes:

Looking over the source history in upstream's Subversion:
  - the site.py hook to set the default encoding from the locale was added on
    June 7th 2000 in rev 15634:
    'Added support to set the default encoding of strings
    at startup time to the values defined by the C locale...'
    http://svn.python.org/view?view=rev&revision=15634

  - the code was disabled by default 5 weeks later on July 15th 2000 in rev
    16374 by effbot (Fredrik Lundh):
    -- changed default encoding to "ascii".  you can still change
       the default via site.py...:

  - and the code was optimized two months later on Sept 18th 2000 in rev 17513,
    to only set it if it's changed:

Looking over upstream mailing list archives for this period:
[Python-Dev] changing the locale.py interface?: Fredrik Lundh
<effbot telia com>
followed by:
http://mail.python.org/pipermail/python-dev/2000-July/005954.html "ascii
default encoding":
(unfortunately side-tracked into a debate of "deprecated" vs "depreciated"); I
may have missed some of the discussion though.

The actual effect of calling sys.setdefaultencoding:
It is defined in Python/sysmodule.c, where it calls
PyUnicode_SetDefaultEncoding(encoding) on the string "encoding".
PyUnicode_SetDefaultEncoding is defined in Objects/unicodeobject.c; it has this:
    /* Make sure the encoding is valid. As side effect, this also
       loads the encoding into the codec registry cache. */
    v = _PyCodec_Lookup(encoding);
then copies the encoding into the buffer: "unicode_default_encoding"; this
buffer supplies the return value for PyUnicode_GetDefaultEncoding(), which is
used in many places inside the unicode implementation, plus in
bytearrayobject.c: bytearray_decode()
 and in stringobject.c: PyString_AsDecodedObject()
so it would seem that there's at least some risk in changing this setting.
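
As a rough illustration of the two steps described above (validate the name
through the codec registry, then store it for later lookups), here is a small
pure-Python model. It is not the C implementation: the names
_default_encoding, set_default_encoding and get_default_encoding are made up
for this sketch, and only codecs.lookup is a real API.

```python
import codecs

# Hypothetical stand-in for the C buffer "unicode_default_encoding".
_default_encoding = "ascii"


def set_default_encoding(encoding):
    """Model of the validate-then-store behaviour described above."""
    global _default_encoding
    # Raises LookupError for an unknown encoding; as a side effect the
    # codec is loaded into the codec registry cache.
    codecs.lookup(encoding)
    # Only reached if the lookup succeeded: remember the name as given.
    _default_encoding = encoding


def get_default_encoding():
    """Model of PyUnicode_GetDefaultEncoding(): return the stored name."""
    return _default_encoding
```

Because the lookup happens before the store, an invalid name leaves the
previous setting untouched in this model.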

To add to the confusion, Py_InitializeEx sets up the encoding of each of
stdout, stderr, stdin to the default locale encoding (UTF-8), _provided_ they
are connected to a tty:
#0  PyFile_SetEncodingAndErrors (f=0xb7fc5020, enc=0x80edc28 "UTF-8",
errors=0x0) at Objects/fileobject.c:458
#1  0x04fbdd49 in Py_InitializeEx (install_sigs=<value optimized out>) at
#2  0x04fbe29e in Py_Initialize () at Python/pythonrun.c:359
#3  0x04fc9886 in Py_Main (argc=<value optimized out>, argv=<value optimized
out>) at Modules/main.c:512
#4  0x080485c7 in main (argc=<value optimized out>, argv=<value optimized out>)
at Modules/python.c:23

which means that a simple case (printing lower case greek alpha, beta, gamma)
works when run directly:
[david brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"'
>>> sys.getdefaultencoding()
>>> sys.stdout.encoding
>>> sys.stderr.encoding

...but fails if you redirect it to a file or pipe it into "less":
python -c 'print u"\u03b1\u03b2\u03b3"' > foo.txt
[david brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"' | less
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2:
ordinal not in range(128)
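
The difference between the two cases can be modelled in a few lines. The
sketch below assumes the rule described above (the locale encoding when the
stream is a tty, the default encoding otherwise); choose_encoding and render
are hypothetical helper names for illustration, not part of Python.

```python
def choose_encoding(is_tty, locale_encoding="UTF-8", default_encoding="ascii"):
    """Pick the encoding used for a stream, per the rule described above."""
    return locale_encoding if is_tty else default_encoding


def render(text, is_tty):
    """Encode text the way the stream would; may raise UnicodeEncodeError."""
    return text.encode(choose_encoding(is_tty))


greek = u"\u03b1\u03b2\u03b3"  # lower case alpha, beta, gamma

# Connected to a tty: encoded with the locale encoding (UTF-8 here).
tty_bytes = render(greek, is_tty=True)

# Piped or redirected: falls back to ascii, which cannot encode the text.
try:
    render(greek, is_tty=False)
    piped_error = None
except UnicodeEncodeError as exc:
    piped_error = exc
```

In this model, code that has to work when piped would need to encode
explicitly (for example text.encode("utf-8")) rather than rely on the
stream's encoding.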
