Character encoding
Björn Persson
bjorn at xn--rombobjrn-67a.se
Sat Sep 6 23:56:07 UTC 2008
Adil Drissi wrote:
> I want to know what is the encoding type of a file. So i run this command:
> "file --mime index.php". The output is : index.php: text/html
>
> But this does not give any character encoding type.
>
> I would like to convert this file to UTF-8 but the command convmv cannot be
> run without specifying the type of the file with -f option i think.
There is no general way to find out the character encoding of a random piece
of data. Some encodings are fairly easy to recognize but the numerous
eight-bit encodings can be difficult to tell apart. The character encoding
must always be specified somewhere if it isn't implicitly known.
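
To illustrate the point about recognizability, here is a minimal Python sketch. It assumes nothing about the file in question; the function name is_valid_utf8 and the sample string are made up for the example. UTF-8 has strict structural rules, so a failed decode rules it out, while an eight-bit encoding such as ISO-8859-1 accepts any byte sequence and therefore never tells you that you guessed wrong.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample: the same name encoded two different ways.
as_utf8 = "Björn".encode("utf-8")
as_latin1 = "Björn".encode("iso-8859-1")

print(is_valid_utf8(as_utf8))    # True
print(is_valid_utf8(as_latin1))  # False

# Decoding UTF-8 bytes as ISO-8859-1 raises no error at all;
# it silently produces the wrong characters.
print(as_utf8.decode("iso-8859-1"))  # BjÃ¶rn
```

A True result only means the data is consistent with UTF-8, not that UTF-8 was what the author intended.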
In some file systems it's possible to specify the character encoding of a file
as an attribute, but I've never seen it used. HTML can contain a meta tag
that specifies the encoding, like this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
If the HTML file is served by an HTTP server, then the server can specify the
encoding in the Content-Type header, and there are rules that define what the
encoding is if the server doesn't specify it.
You could open the file in a browser that lets you choose the encoding, and
try an encoding that you think it may be. Then proofread the text. If all the
characters are right, then you guessed right, or close enough to work for
that particular file. If not, try the next encoding.
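
The same trial-and-proofread approach can be done without a browser. This is a sketch under assumptions: the candidate list is something you choose yourself, and the helper names decode_candidates and non_ascii_lines are invented for the example. It decodes the data under each candidate and shows only the lines that contain non-ASCII characters, since those are the ones worth proofreading.

```python
def decode_candidates(data: bytes, candidates):
    """Map each candidate encoding to the decoded text, or None if decoding fails."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Invalid byte sequence for this encoding, or unknown encoding name.
            results[name] = None
    return results


def non_ascii_lines(text: str):
    """Return the lines that contain at least one non-ASCII character."""
    return [line for line in text.splitlines() if not line.isascii()]


# Hypothetical file contents; in practice read them with open(path, "rb").
data = "Smörgåsbord\nplain line\n".encode("iso-8859-1")

for name, text in decode_candidates(data, ["utf-8", "iso-8859-1", "koi8-r"]).items():
    if text is None:
        print(name, "-> not valid in this encoding")
    else:
        print(name, "->", non_ascii_lines(text))
```

A candidate that decodes without error is not necessarily right; only reading the output tells you that.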
> Is there a way to convert this file to UTF-8
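Once you know the current encoding, transcoding won't be a big problem; a tool
such as iconv converts file contents from one encoding to another. (convmv
converts file names, not file contents, so it won't help here.) If the
encoding is specified in the file, such as in a meta tag, then you'll have to
change that too.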
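
As a rough illustration of both steps, here is a Python sketch that decodes with the known source encoding, rewrites the first charset declaration it finds, and encodes the result as UTF-8. The function name transcode_html is made up, and the substitution is deliberately crude: it replaces the first "charset=..." anywhere in the text, so check the result by hand on a file that mixes HTML with other code.

```python
import re


def transcode_html(data: bytes, source_encoding: str) -> bytes:
    """Decode data from source_encoding and return it re-encoded as UTF-8."""
    text = data.decode(source_encoding)
    # Update the first charset declaration (if any) to match the new encoding.
    text = re.sub(
        r"(charset\s*=\s*[\"']?)[A-Za-z0-9._:-]+",
        r"\g<1>utf-8",
        text,
        count=1,
        flags=re.IGNORECASE,
    )
    return text.encode("utf-8")


# Hypothetical input: a small page stored as ISO-8859-1.
original = (
    '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
    "<p>Björn</p>"
).encode("iso-8859-1")

converted = transcode_html(original, "iso-8859-1")
print(converted.decode("utf-8"))
```

Work on a copy of the file; if the source encoding was guessed wrong, the conversion bakes the wrong characters into the output.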
> or better how to set the default character encoding to utf-8?
Default in what context? The locale settings in the environment include a
character encoding. Many programs assume that text files and filenames are
encoded in that encoding, but some programs think they're smarter and assume
something else. (The approach with environment variables will of course fail
if different users use different locales and access the same files.)
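
To see which encoding the locale settings of the current environment imply, something like the following Python sketch can be used. The function name environment_encoding is my own; the result depends entirely on the environment the script runs in, so no particular output should be expected.

```python
import locale
import os


def environment_encoding() -> str:
    """Return the encoding implied by the current locale settings."""
    return locale.getpreferredencoding(False)


# These variables determine the character encoding part of the locale;
# LC_ALL overrides LC_CTYPE, which overrides LANG.
for var in ("LC_ALL", "LC_CTYPE", "LANG"):
    print(var, "=", os.environ.get(var))

print("locale encoding:", environment_encoding())
```

Two users running this on the same machine can get different answers, which is exactly the situation in which a shared file has to carry its encoding some other way.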
Björn Persson