Multibyte (MB) support allows PostgreSQL to handle multibyte character sets
such as EUC (Extended UNIX Code), Unicode, and Mule internal code.
With multibyte support enabled, you can use multibyte character sets in
regular expressions (regexp), LIKE, and some other functions. The default
encoding system is selected when you initialize your PostgreSQL installation
using initdb. This default can be overridden when you create a database using
createdb or the SQL command CREATE DATABASE, so you can have multiple
databases, each with a different encoding system.
Multibyte support also fixes some problems concerning 8-bit single-byte
character sets, including ISO 8859.
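To illustrate why character-aware handling matters for functions such as LIKE and regular expressions, the following minimal sketch compares character and byte counts for a short Japanese string. It is an illustration only: it uses Python's standard "euc_jp" codec as a stand-in for an EUC_JP database and does not call PostgreSQL; the variable names are hypothetical.

```python
# Illustration only: Python's built-in "euc_jp" codec stands in for a
# server using EUC_JP; no PostgreSQL code is involved.
text = "日本語"                    # three Japanese characters
raw = text.encode("euc_jp")        # the same text as EUC_JP bytes

char_count = len(text)             # number of characters
byte_count = len(raw)              # number of bytes (two per character here)

# Cutting the byte string after 3 bytes splits the second character in
# half, so the result is no longer valid EUC_JP.
try:
    raw[:3].decode("euc_jp")
    split_is_valid = True
except UnicodeDecodeError:
    split_is_valid = False

print(char_count, byte_count, split_is_valid)
```

A matcher that counts bytes instead of characters would treat this three-character string as six units and could split a character in the middle, which is the kind of problem multibyte support is meant to avoid.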
Run configure with the multibyte option:
% ./configure --enable-multibyte[=encoding_system]
where encoding_system can be one of the values in the following table:
Table 3-1. PostgreSQL Character Set Encodings
| Encoding | Description |
|---|---|
| SQL_ASCII | ASCII |
| EUC_JP | Japanese EUC |
| EUC_CN | Chinese EUC |
| EUC_KR | Korean EUC |
| EUC_TW | Taiwan EUC |
| UNICODE | Unicode (UTF-8) |
| MULE_INTERNAL | Mule internal code |
| LATIN1 | ISO 8859-1 (English and some Western European languages) |
| LATIN2 | ISO 8859-2 (English and Central European languages) |
| LATIN3 | ISO 8859-3 (English and South European languages) |
| LATIN4 | ISO 8859-4 (English and North European languages) |
| LATIN5 | ISO 8859-5 (English and Cyrillic-script languages) |
| KOI8 | KOI8-R |
| WIN | Windows CP1251 |
| ALT | Windows CP866 |
Here is an example of configuring PostgreSQL to use a Japanese encoding by default:
% ./configure --enable-multibyte=EUC_JP
If the encoding system is omitted (./configure --enable-multibyte), SQL_ASCII is assumed.
initdb defines the default encoding for a PostgreSQL installation. For example:
% initdb -E EUC_JP
sets the default encoding to EUC_JP (Extended UNIX Code for Japanese).
You can use --encoding instead of -E if you prefer longer option strings.
If no -E or --encoding option is given, the encoding specified at configure
time is used.
You can create a database with a different encoding. For example:
% createdb -E EUC_KR korean
creates a database named "korean" with EUC_KR encoding.
Another way to accomplish this is to use an SQL command:
CREATE DATABASE korean WITH ENCODING = 'EUC_KR';
The encoding of a database is stored in the encoding column of the
pg_database system catalog. You can see it by using the -l option or the
\l command of psql.
$ psql -l
            List of databases
   Database    |  Owner  |   Encoding
---------------+---------+---------------
 euc_cn        | t-ishii | EUC_CN
 euc_jp        | t-ishii | EUC_JP
 euc_kr        | t-ishii | EUC_KR
 euc_tw        | t-ishii | EUC_TW
 mule_internal | t-ishii | MULE_INTERNAL
 regression    | t-ishii | SQL_ASCII
 template1     | t-ishii | EUC_JP
 test          | t-ishii | EUC_JP
 unicode       | t-ishii | UNICODE
(9 rows)
PostgreSQL supports automatic encoding translation between the backend
and the frontend for some encodings.
Table 3-2. PostgreSQL Client/Server Character Set Encodings
| Server Encoding | Available Client Encodings |
|---|---|
| EUC_JP | EUC_JP, SJIS |
| EUC_TW | EUC_TW, BIG5 |
| LATIN2 | LATIN2, WIN1250 |
| LATIN5 | LATIN5, WIN, ALT |
| MULE_INTERNAL | EUC_JP, SJIS, EUC_KR, EUC_CN, EUC_TW, BIG5, LATIN1 to LATIN5, WIN, ALT, WIN1250 |
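As a rough picture of what such a translation does for the EUC_JP/SJIS pair in the table, the sketch below re-encodes the same text from one encoding to the other. It is a minimal illustration under stated assumptions, not PostgreSQL's implementation: the helper translate is hypothetical, and Python's standard "euc_jp" and "shift_jis" codecs stand in for the EUC_JP and SJIS encodings.

```python
def translate(data: bytes, server_encoding: str, client_encoding: str) -> bytes:
    """Re-encode data from the server encoding into the client encoding.

    Hypothetical helper for illustration; PostgreSQL performs the
    equivalent conversion internally between backend and frontend.
    """
    return data.decode(server_encoding).encode(client_encoding)


# Bytes as an EUC_JP backend would hold them.
stored = "テスト".encode("euc_jp")

# Bytes that a frontend using SJIS would receive after translation.
sent = translate(stored, "euc_jp", "shift_jis")

# The byte sequences differ, but both represent the same characters.
print(stored.hex(), sent.hex())
```

The characters are unchanged by the round trip; only their byte representation differs between the two sides of the connection.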
To enable automatic encoding translation, you have to tell PostgreSQL
which encoding you would like to use in the frontend. There are several
ways to accomplish this.
Using the \encoding command in psql.
\encoding allows you to change the frontend encoding on the fly. For
example, to change the encoding to SJIS, type:
\encoding SJIS
Using libpq functions.
\encoding actually calls PQsetClientEncoding() for its purpose.
int PQsetClientEncoding(PGconn *conn, const char *encoding)
where conn is a connection to the backend, and encoding is the encoding you
want to use. If the function successfully sets the encoding, it returns 0;
otherwise it returns -1. The current encoding for this connection can be
obtained by using:
int PQclientEncoding(const PGconn *conn)
Note that it returns the encoding ID, not an encoding symbol string
such as "EUC_JP". To convert an encoding ID to an encoding symbol, you
can use:
char *pg_encoding_to_char(int encoding_id)
Using SET CLIENT_ENCODING TO.
You can set the frontend encoding with this SQL command:
SET CLIENT_ENCODING TO 'encoding';
You can also use the SQL92 syntax SET NAMES for this purpose:
SET NAMES 'encoding';
To query the current frontend encoding:
SHOW CLIENT_ENCODING;
To return to the default encoding:
RESET CLIENT_ENCODING;
Using PGCLIENTENCODING.
If the environment variable PGCLIENTENCODING is defined in the client's
environment, that client encoding is automatically selected when a
connection to the backend is made. It can subsequently be overridden
using any of the other methods mentioned above.
Automatic encoding translation between Unicode and other encodings is
supported. However, because this requires large conversion tables, it is
not enabled by default. To enable this feature, run configure with the
--enable-unicode-conversion option. Note that this also requires the
--enable-multibyte option.
If you choose EUC_JP for the backend and LATIN1 for the frontend, some
Japanese characters cannot be translated into LATIN1. In this case, a
letter that cannot be represented in the LATIN1 character set is
transformed as: