Where is the $LANG variable defined?

Tue Nov 9 01:43:31 UTC 2004

Matthew Miller wrote:
> On Sun, Nov 07, 2004 at 07:50:00PM +0100, Björn Persson wrote:
> 
>>>And something does -- set the LANG variable right, and it'll work, right?
>>
>>If it's that easy, why doesn't SSH set LANG? Why should I have a lot of 
>>trouble doing this manually every time? And how does that help with 
>>filenames?
> 
> Answers to your questions in order. :)
> 
> 1. Because it has no idea what to set it to. 

It doesn't know because it doesn't bother to look, but that's the wrong 
answer. The answer is that it's *not* that easy. The encoding has to be 
set for the terminal program where I run SSH, and to do it through LANG, 
LANG must be set before the terminal program is started. So instead of 
just typing "ssh otherbox" I have to set LANG, launch a new terminal 
window and run SSH there.
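
For illustration only, the "set LANG, launch a new terminal window and run SSH there" workaround could be sketched roughly as below. The helper names, the choice of xterm, and the locale name in the comment are assumptions made for this sketch, and whether a given terminal program actually derives its encoding from LANG depends on that program.

```python
import os

def env_with_lang(lang, base_env=None):
    """Return a copy of the environment with LANG overridden (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["LANG"] = lang
    # LC_ALL and LC_CTYPE take precedence over LANG for character handling,
    # so they are removed from the child's environment.
    env.pop("LC_ALL", None)
    env.pop("LC_CTYPE", None)
    return env

def terminal_command(host, terminal="xterm"):
    """Build the command line for a new terminal window that runs ssh."""
    return [terminal, "-e", "ssh", host]

# Possible use (not executed here; the locale name is an assumption):
#   import subprocess
#   subprocess.Popen(terminal_command("otherbox"),
#                    env=env_with_lang("sv_SE.ISO-8859-1"))
```

The point of the sketch is the ordering: the variable has to be in the environment of the terminal process before it starts, which is why it cannot simply be set from inside an already running session.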

> 2. You shouldn't have to go to a lot of trouble. And you shouldn't need to
>    do it manually each time -- at the very least, you can script it.

So everyone and their dog should write a specialized script for every 
combination of local and remote box? Fat chance! We need a solution that 
will work out of the box so it can be packaged and distributed with 
operating systems.

I *could* write a program that connects with SSH, looks up the remote 
system's encoding, disconnects, opens a new terminal window with the 
right encoding set, and runs SSH in that window, but it wouldn't work in 
text mode, it wouldn't work with chained SSH sessions, and it still 
wouldn't help with file transfers.
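
The lookup step of such a program might, as a rough sketch, amount to reading the remote LANG value (or the output of the POSIX command "locale charmap") and extracting the codeset. The function below is a hypothetical helper that assumes the usual language[_territory][.codeset][@modifier] layout of locale names.

```python
def codeset_of_locale(name):
    """Extract the codeset part of a locale name, or None if it has none.

    Assumes the conventional layout language[_territory][.codeset][@modifier].
    """
    base = name.split("@", 1)[0]      # drop an optional @modifier
    if "." not in base:
        return None                   # e.g. "C" or "sv_SE": no explicit codeset
    return base.split(".", 1)[1]
```

A locale name without an explicit codeset gives no answer, which is one more reason a real tool would probably have to ask the remote system for its charmap instead of guessing.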

> 3. The filename encoding problem is kinda sticky. Devising a workable
>    on-the-fly transcoding solution seems like a lot of work on the
>    _symptoms_. Instead, let's work on getting everything to work well with
>    UTF-8.

You don't seem to fully understand the extent of the problem. That's not 
surprising as you're apparently a USian and seldom see the character 
encoding problems I see daily. If you had been regularly forced to spell 
your name "Mutthew" because "a" wasn't a valid character, you might have 
a different view.

What you're suggesting is that everyone should use UTF-8 everywhere so 
that there would only be one character encoding. That just isn't going 
to happen. I'd love to go UTF-8 myself and get access to all the world's 
written languages, but it's not feasible. That's not because of the 
filenames. I could transcode all my filenames easily enough. The real 
problem is the files' contents. Since there's only one big global locale 
setting I have to convert everything or nothing. I've got heaps of text 
files full of non-English letters and they're all encoded in Latin 1. 
(Actually, many of them are probably in Windows-1252, but the extra 
characters in that encoding don't seem to have gotten used very often, 
so they can pass as Latin 1.) Some of them are plain text. They could 
and would have to be transcoded. Others are XML or HTML. They could be 
left as they are but should be transcoded so they could be opened in 
text editors. If they are transcoded the embedded encoding 
specifications would have to be updated. Still others are source code. 
Transcoding those would constitute changes to the programs and could 
require several other changes. Then there are the files that aren't text 
and mustn't be transcoded by mistake. There's no reliable way of 
recognizing the different kinds of files automatically, so I'd have to 
go through them all manually and decide what to do with each one. No thanks!
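
To make the per-file work concrete, here is a minimal sketch of what transcoding one text file's bytes and updating an embedded XML encoding declaration could look like. The function names and the fallback order (Windows-1252 first, then Latin 1) are assumptions for this example. It does nothing to solve the classification problem described above: handed a binary file, it would happily corrupt it.

```python
import re

def transcode_to_utf8(data, source_encoding="cp1252"):
    """Convert bytes assumed to be Windows-1252 or Latin 1 text to UTF-8."""
    try:
        data.decode("utf-8")
        return data                   # already valid UTF-8 (covers pure ASCII)
    except UnicodeDecodeError:
        pass
    try:
        text = data.decode(source_encoding)
    except UnicodeDecodeError:
        # A few byte values are undefined in Windows-1252; Latin 1 maps
        # every byte to a character, so this fallback cannot fail.
        text = data.decode("latin-1")
    return text.encode("utf-8")

def update_xml_declaration(text):
    """Rewrite the encoding named in an XML declaration to UTF-8."""
    pattern = r'(<\?xml[^>]*encoding=)(["\'])[^"\']*\2'
    return re.sub(pattern, r'\g<1>\g<2>UTF-8\g<2>', text, count=1)
```

Even in this simple form the guess can be wrong: a Latin 1 file whose bytes happen to form valid UTF-8 would be left untouched, so the result still needs a human to check it.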

Then there are the various people and computers I need to cooperate with 
and share files with - coworkers, Sourceforge projects and the like. 
These people share files with other people who in turn cooperate with 
still other people. What's the chance of getting all these people to 
switch character encodings at the same time?

While I'm typing this, BitTorrent is downloading Fedora Core 3 for me. I'm 
going to do a fresh install on an unused partition. The first thing I'll 
do after installing is to edit /etc/sysconfig/i18n to change from UTF-8 
to Latin 1. Sure it would be nice if everyone would use the same 
character encoding, but Unicode was created some 50 years too late and 
now we have to live with the consequences. I myself am stuck with Latin 1.
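
As a sketch of the edit to /etc/sysconfig/i18n mentioned above, and assuming the file consists of shell-style KEY="value" lines with a LANG entry, the change could be expressed as follows. The function name, the sample file contents, and the locale name sv_SE.ISO-8859-1 are illustrative assumptions, not taken from an actual installation.

```python
def set_lang(config_text, new_lang):
    """Return config_text with its LANG= line replaced (or appended if absent)."""
    lines = []
    replaced = False
    for line in config_text.splitlines():
        if line.strip().startswith("LANG="):
            lines.append('LANG="%s"' % new_lang)
            replaced = True
        else:
            lines.append(line)        # leave every other setting untouched
    if not replaced:
        lines.append('LANG="%s"' % new_lang)
    return "\n".join(lines) + "\n"

# Assumed sample contents, for illustration only.
sample = 'LANG="en_US.UTF-8"\nSYSFONT="latarcyrheb-sun16"\n'
updated = set_lang(sample, "sv_SE.ISO-8859-1")
```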

Björn Persson