Why is "LANG=en_US.UTF-8" the default in Fedora

Fri May 21 21:00:36 UTC 2004

----- Original Message ----- 
From: "Alan Cox" <alan at redhat.com>
To: "For testers of Fedora Core development releases"
<fedora-test-list at redhat.com>
Sent: Friday, May 21, 2004 2:28 PM
Subject: Re: Why is "LANG=en_US.UTF-8" the default in Fedora

> On Fri, May 21, 2004 at 02:11:37PM -0400, Nico Kadel-Garcia wrote:
> > That's what "sort -i" is for. There is *no*, I repeat *no* way to get sort
> > to operate as a case-sensitive operation short of resetting your locale
> > before beginning.
>
> You can repeat all you like you are still confused about it.
>
> Simple demo. Input file consists of abacus and Abacus randomly ordered
>
>  LANG=en_US.UTF-8 sort demo
> abacus
> abacus
> abacus
> abacus
> abacus
> abacus
> abacus
> abacus
> Abacus
> Abacus
> Abacus
> Abacus
> Abacus
> Abacus
> Abacus
> Abacus
> Abacus
>
> Output is sorted.

Yes, in exactly the "case insensitive" fashion that "sort" has used for the
last 20 years or so.

With "LANG=en_US.UTF-8 ls", we get lists like this:

a
A
ab
aB
Ab
AB
abc
abC
aBc
aBC
Abc
AbC
ABc
ABC

With "LANG=C;  ls | sort -i", we get the same thing:

a
A
ab
aB
Ab
AB
abc
abC
aBc
aBC
Abc
AbC
ABc
ABC

Looks identical, doesn't it? Also, notice that the items starting "AB" are
not together. "AB" is entirely separate from "AB[cC]".

With "LANG=C ls", we get:

A
AB
ABC
ABc
Ab
AbC
Abc
a
aB
aBC
aBc
ab
abC
abc

Notice some wildly different behavior there, such as everything that
starts with "AB" actually being grouped together? Now, guess how much old
source code in the world was written in the days when ASCII meant ASCII and
sorting was predictable, not dependent on a set of semi-randomly assigned
locales that cannot be predicted. That code now requires an additional
programming step: checking whether your system supports locales and setting
them appropriately.
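
To make the two orderings concrete, here is a minimal Python sketch that
reproduces both listings above for these particular names. The helper names
c_order and folded_order are made up for illustration, and folded_order is
only an approximation that happens to match the en_US.UTF-8 listing for this
input; it is not the real glibc collation algorithm.

```python
# File names taken from the listings above.
names = ["a", "A", "ab", "aB", "Ab", "AB", "abc",
         "abC", "aBc", "aBC", "Abc", "AbC", "ABc", "ABC"]

def c_order(items):
    # Plain code point comparison. For ASCII names this matches the
    # byte-wise ordering of the "C" locale: all uppercase before lowercase.
    return sorted(items)

def folded_order(items):
    # Approximation only: compare the case-folded text first, then break
    # ties so that lowercase sorts before uppercase (swapcase flips the
    # case so the lowercase letter gets the smaller code point).
    return sorted(items, key=lambda s: (s.lower(), s.swapcase()))

print(" ".join(c_order(names)))
# A AB ABC ABc Ab AbC Abc a aB aBC aBc ab abC abc
print(" ".join(folded_order(names)))
# a A ab aB Ab AB abc abC aBc aBC Abc AbC ABc ABC
```

Old code that assumes the first ordering will quietly get the second one
whenever it relies on the environment's collation.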

The effective change was from the "C" standard to what you are describing as
the "More Natural" sorting of en_US.UTF-8, or whichever locale we happen to
choose from moment to moment. This is conceptually reasonable, but the
change has been breaking old code, and code that unexpectedly turns out to
be multi-lingual, for the last few years. The shift to Unicode has been
extremely painful for a lot of programmers, including me, and remains
painful as I have to clean up tools or code from old source or other
locations that make unwitting assumptions about this sort of behavior.
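
As a rough illustration of the extra step mentioned above (check whether a
locale is available, then set it explicitly), a sketch using Python's
standard locale module might look like the following. The function name
sort_with_locale and the fallback policy are my own assumptions, not
something any particular tool does.

```python
import locale

def sort_with_locale(items, preferred="en_US.UTF-8"):
    """Sort using the preferred collation if installed, else fall back to "C".

    Returns the name of the locale actually used together with the sorted list.
    """
    try:
        locale.setlocale(locale.LC_COLLATE, preferred)
        used = preferred
    except locale.Error:
        # The requested locale is not installed on this system.
        locale.setlocale(locale.LC_COLLATE, "C")
        used = "C"
    # strxfrm transforms each string so that ordinary comparison
    # follows the collation rules of the current LC_COLLATE setting.
    return used, sorted(items, key=locale.strxfrm)

# Pinning the locale to "C" gives the old, predictable ordering.
used, ordered = sort_with_locale(["abc", "ABC", "Abc"], preferred="C")
print(used, ordered)
```

Code that needs the traditional ordering can pin "C" explicitly instead of
trusting whatever LANG the environment happens to carry.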