
Non UTF-8 charset fallback support in GLib (Was Re: plans for long term support releases?)



At 09:31 PM 1/17/2007, Bruno Wolff III wrote:
On Wed, Jan 17, 2007 at 23:10:14 +0100,
  Ola Thoresen <redhat olen net> wrote:
>
> One of the worst examples of this is the change to UTF-8 as default
> charset.  I am a devoted UTF-8 user myself, but it is probably the
> single change that has caused most pain for others, and it is still
> causing trouble.

> When we changed to UTF-8 as default, there was no
> easy way to convert filesystems, documents, text-files, webpages...

Not sure if these two utilities could help:
(1) iconv -f old-encoding -t UTF-8 filename > newfilename

(2) utf8ize

The script:
http://ftp.penguin.cz/pub/users/utx/misc/utf8ize.gopts

The web page (search for utf8ize):
http://www.penguin.cz/~utx/


> The first thing almost everyone I know who installs Fedora,
> Redhat or Suse does is to change /etc/sysconfig/i18n to go back to
> en_US as default LANG. Simply because it takes a h... of a lot of work
> to convert all your files and applications and there are no good tools
> out there to help you.

UTF-8 is an encoding and en_US is a locale. You are comparing different
types of things. Perhaps you meant that UTF-8 was being used instead of
ASCII or Latin 1? Note that ASCII is in a sense a subset of UTF-8, so
converting from ASCII to UTF-8 isn't a big deal.

One thing I feel GLib has not done enough of is providing API that supports non-UTF-8 content. For example, if a text file is opened using GIOChannel, the read fails if the file contains anything that is not valid UTF-8.

The fallback could be more graceful; for example, the API could accept a fallback charset and use it to convert bytes that aren't legal UTF-8 bytes into UTF-8. There should be enough API that is as tolerant of non-UTF-8 content as possible (for example, by using a fallback charset).
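
To make the idea concrete, here is a minimal sketch in Python (not GLib API, just an illustration of the behaviour I have in mind). The handler name "latin1_fallback" and the function decode_with_fallback are made up for this example, and ISO-8859-1 as the fallback charset is an assumption. Valid UTF-8 is decoded normally; only the bytes the UTF-8 decoder rejects are reinterpreted in the fallback charset.

```python
import codecs

# Assumption for this sketch: the legacy files use ISO-8859-1.
FALLBACK_CHARSET = "iso-8859-1"


def _fallback_handler(err):
    """Decode the bytes that UTF-8 rejected using the fallback charset."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad_bytes = err.object[err.start:err.end]
    # ISO-8859-1 maps every byte value to a character, so this cannot fail.
    return bad_bytes.decode(FALLBACK_CHARSET), err.end


# "latin1_fallback" is a hypothetical handler name chosen for this example.
codecs.register_error("latin1_fallback", _fallback_handler)


def decode_with_fallback(data: bytes) -> str:
    """Decode as UTF-8, falling back per byte for illegal sequences."""
    return data.decode("utf-8", errors="latin1_fallback")


if __name__ == "__main__":
    # A lone ISO-8859-1 copyright sign (0xA9) no longer breaks the read.
    print(decode_with_fallback(b"Copyright \xa9 2007"))
```

One caveat with any per-byte fallback: a few ISO-8859-1 byte pairs happen to form valid UTF-8 and will be decoded as UTF-8, so this is a best-effort heuristic rather than a guarantee.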

For example, many people were using a single European charset before UTF-8 became mainstream. So, with just one fallback charset specified, all of these people would be covered: their existing files could be opened, and new files would be saved as UTF-8.
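
The open-legacy, save-as-UTF-8 workflow could look roughly like the following Python sketch. The function names decode_legacy and migrate_to_utf8 are hypothetical, and the single fallback charset defaults to ISO-8859-1 purely as an assumption. Unlike a per-byte fallback, this variant decides per file: strict UTF-8 first, and the fallback charset for the whole file if that fails.

```python
from pathlib import Path
from typing import Tuple


def decode_legacy(data: bytes, fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Return (text, charset_used): strict UTF-8 first, then the fallback."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # The file is not valid UTF-8; assume it is in the fallback charset.
        return data.decode(fallback), fallback


def migrate_to_utf8(src: Path, dst: Path, fallback: str = "iso-8859-1") -> str:
    """Read src in either encoding, write dst as UTF-8, report the source charset."""
    text, used = decode_legacy(src.read_bytes(), fallback)
    dst.write_bytes(text.encode("utf-8"))
    return used
```

With something like this, an application only needs to be told the one legacy charset its users had before the switch; everything it writes afterwards is UTF-8.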

As it is now, if you want your application to read both UTF-8 and ISO-8859-1 (just the two most common encodings, no more), most facilities in GLib are not an option: if a text file contains a single copyright symbol encoded in ISO-8859-1, reading the entire file fails... very far from an ideal scenario.

What do people think?


--
Daniel Yek

