Unicode support in fedora (was: Re: flac/mp3 tagging Latin characters)

Sun Dec 19 23:04:19 UTC 2004

James, 

thanks for the detailed explanation. I actually understand Unicode
pretty well, but I am curious why it doesn't "just work" by now in
Fedora and probably other Linux distributions.
I use Japanese on my desktop and I still have some problems, such
as tagging MP3s or exchanging files with people on Japanese Windows.

To be fair, it has improved immensely, but I will be very happy when I
no longer have to use smbchartool
(http://www.samba.gr.jp/project/contrib/smbchartool.html).

Nadeem

On Sun, 19 Dec 2004 20:52:30 +0000, James Wilkinson
<james at westexe.demon.co.uk> wrote:
> Nadeem Bitar wrote:
> > I'm interested to know why en_US works but en_US.UTF-8 doesn't.
> (The context was ID3 tags in MP3s, etc.)
> 
> Computers store letters as binary numbers.
> 
> The standard way of encoding Latin letters is the ASCII encoding. In
> anything ASCII based, for example, A is (decimal) 65. ASCII covers the
> symbols on a standard US keyboard, and uses numbers up to 127.
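
To make the numbering concrete, here is a minimal Python 3 sketch. The helper name is_ascii is made up for illustration; ord() and chr() are Python built-ins.

```python
# Each character in an ASCII string corresponds to a number below 128.
text = "Fedora"
codes = [ord(ch) for ch in text]
print(codes)            # [70, 101, 100, 111, 114, 97]
print(ord("A"))         # 65
print(chr(65))          # A


def is_ascii(s):
    """Hypothetical helper: True if every character fits in ASCII (0-127)."""
    return all(ord(ch) < 128 for ch in s)


print(is_ascii("Fedora"))   # True
print(is_ascii("£5"))       # False, the pound sign is outside ASCII
```

Anything that fails this check needs one of the larger character sets described below.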
> 
> Historically, Western computers have stored each character in one byte.
> That gives you up to 256 characters.
> 
> Many people want to use other symbols. For example, I might want to use
> the £ and € signs for currency. Greeks and Russians will want to use
> their own letters (Ω or Ж). People speaking French or Spanish will want
> to use áçcèñts. And you want to properly tag your MP3s.
> 
> In fact, there are *way* more symbols than can be encoded in one byte.
> So a number of "character sets" were invented: some for Greek letters,
> some for Russian, some for Western European, etc. Usually the first half
> (values 0 to 127) was ASCII, and the second half (128 to 255) was
> specific to the character set.
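
A rough illustration of why this matters, assuming Python 3 and three of its bundled codecs (ISO 8859-1 for Western Europe, ISO 8859-7 for Greek, KOI8-R for Russian): the same two bytes come out as different text depending on which character set the reader assumes.

```python
# 0x41 is in the shared ASCII half; 0xE9 is in the charset-specific half.
raw = bytes([0x41, 0xE9])

charsets = ("iso-8859-1", "iso-8859-7", "koi8-r")
decoded = {name: raw.decode(name) for name in charsets}

for name, text in decoded.items():
    print(name, text)

# The ASCII byte always decodes to "A"; only the high byte changes meaning.
```

Nothing in the bytes themselves says which of the three readings was intended.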
> 
> And the problem is that it isn't always clearly specified which
> character set you're using. I suspect that's what's happening here: the
> encoder and the player are using different character sets.
> 
> UTF-8 is a way of encoding practically any character, using one byte
> for ASCII and up to four bytes for everything else. If and when it
> becomes universal, character set problems should go away. But for now
> it is one more encoding among many, so if an encoding program writes
> symbols in UTF-8 but the readers expect ISO 8859-1 ("Western Europe"),
> you'll have trouble.
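
A small Python 3 sketch of both points: how many bytes UTF-8 spends per character, and what a reader sees when it assumes ISO 8859-1 instead. The sample strings are taken from the message above; the variable names are only for illustration.

```python
# UTF-8 uses a variable number of bytes per character.
samples = ("A", "£", "Ж", "€")
lengths = [len(ch.encode("utf-8")) for ch in samples]
for ch, n in zip(samples, lengths):
    print(ch, n, ch.encode("utf-8").hex())

# A tag written as UTF-8 but read as ISO 8859-1 turns into mojibake.
tag = "áçcèñts"
utf8_bytes = tag.encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")
print(misread)

# As long as nothing was lost, reversing the wrong step recovers the text.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired)
```

The repair only works because ISO 8859-1 maps every byte value to some character; with other character sets the round trip can lose data.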
> 
> Now the LANG variable, among other things, sets which character set is
> in use. en_US uses ISO 8859-1, while en_US.UTF-8 uses UTF-8 (not
> surprisingly). So using en_US gets your MP3s tagged in the ISO 8859-1
> encoding that the MP3 players expect (presumably the encoder follows
> LANG, while the players assume ISO 8859-1 regardless).
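
As a sketch of how a program might read the character set out of a LANG-style value (Python 3; codeset_from_lang is a hypothetical helper, not a real library call, and the ISO-8859-1 fallback is only the en_US case described above, since other locales have other defaults):

```python
def codeset_from_lang(lang, default="ISO-8859-1"):
    """Return the codeset named in a value like 'en_US.UTF-8'.

    The fallback is an assumption that matches en_US as described in
    the message; it is not correct for every locale.
    """
    base = lang.split("@", 1)[0]        # drop a modifier such as @euro
    if "." in base:
        return base.split(".", 1)[1]    # text after the dot is the codeset
    return default


for value in ("en_US", "en_US.UTF-8", "ja_JP.eucJP"):
    print(value, "->", codeset_from_lang(value))
```

A tagging tool that follows LANG this way will write different bytes for the same title depending on how it was launched.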
> 
> I have not been able to find out whether ID3 tags include a character
> set specification that one program or another is ignoring, or whether
> the standard is simply deficient.
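
For what it's worth, as far as I can tell from the ID3 documents (this is not from James's message): ID3v1 has no field for the character set, while ID3v2 text frames start with a one-byte encoding marker. A minimal Python 3 sketch of reading such a frame body follows; decode_text_frame and the table name are hypothetical, and frame headers, sizes and flags are left out entirely.

```python
# Encoding marker values for ID3v2 text frames as I understand them;
# values 2 and 3 only exist from ID3v2.4 onwards.
ID3V2_ENCODINGS = {
    0: "iso-8859-1",
    1: "utf-16",       # with byte-order mark
    2: "utf-16-be",
    3: "utf-8",
}


def decode_text_frame(body):
    """Decode a text frame body: one marker byte, then the encoded text."""
    codec = ID3V2_ENCODINGS[body[0]]
    # Strip any trailing NUL terminator after decoding.
    return body[1:].decode(codec).rstrip("\x00")


print(decode_text_frame(b"\x00Bj\xf6rk"))
print(decode_text_frame(b"\x03" + "Björk".encode("utf-8")))
```

If a tagger writes UTF-8 bytes behind marker 0, or a player ignores the marker, you get exactly the mismatch described above.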
> 
> E-mail, for example, handles this with headers: the MIME-Version and
> Content-Type headers on this message declare that it uses UTF-8
> (because that's the only character set that covers everything I've
> used).
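
A short sketch of those headers using Python 3's standard email package. The subject and body text are placeholders; the point is only that the charset travels with the message, so the reader does not have to guess.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Unicode support in fedora"
msg.set_content("Symbols: £ € Ω Ж\n")   # non-ASCII body

# set_content() adds the MIME headers that declare the encoding.
print(msg["MIME-Version"])
print(msg["Content-Type"])

# A reader uses the declared charset to turn the bytes back into text.
charset = msg.get_content_charset()
body = msg.get_content()
print(charset, body)
```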
> 
> James.
> 
> Yes, I know, I've massively simplified in places.
> 
> --
> E-mail address: james | DON'T be put off by "horror stories" spread by
> @westexe.demon.co.uk  | others.  People who talk about death and serious
>                       | injury are very rarely the ones who have actually
>                       | suffered such things.  -- Adrian Plass