[publican-list] sortable lists, esp. glossaries

Tue Feb 14 04:28:25 UTC 2012

On Mon, Feb 13, 2012 at 02:01:23PM +1000, Jeff Fearn wrote:
> On 02/11/2012 01:58 AM, Fred Dalrymple wrote:

> >If I'm reading your post correctly... You're saying that simply invoking
> >a locale-specific collating sequence during publishing (i.e., with no
> >additional collating clues provided) may produce results acceptable to
> >each locale?

Choosing an order for things is harder than one might think, even in English
(deciding whether to ignore a leading "The", deciding how to treat spaces or
punctuation).  But if the emphasis is on "acceptable" rather than "exactly
as a careful human expert might decide", and if "each locale" is understood
to be restricted to something like "each locale used by known Publican users"
(so excluding Cuneiform, Klingon and so on), then yes, that would be the
hypothesis at least.
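
To make the English-only difficulty concrete, here is a minimal Python
sketch of a sort key that skips a leading article and ignores spaces and
punctuation.  The article list and the function name are my own assumptions
for illustration; this is not what Publican or Unicode::Collate does.

```python
# Hypothetical list of leading articles to skip; a real glossary would
# need its own list for each language.
LEADING_ARTICLES = ("the ", "a ", "an ")


def sort_key(title):
    """Return a key that ignores case, one leading article, spaces and punctuation."""
    folded = title.casefold()
    for article in LEADING_ARTICLES:
        if folded.startswith(article):
            folded = folded[len(article):]
            break
    # Keep only letters and digits; the original title breaks ties.
    stripped = "".join(ch for ch in folded if ch.isalnum())
    return (stripped, title)


titles = ["The Glossary", "e-mail", "Email client", "A Guide", "glossary terms"]
plain = sorted(titles)
relaxed = sorted(titles, key=sort_key)
print(plain)
print(relaxed)
```

Whether "e-mail" should sort next to "Email client", or "The Glossary"
under G, is exactly the kind of decision a human expert might make
differently.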

> The link I posted to "a good explanation of the issues" is for the
> tests we ran using unicode collation.
> 
> https://www.redhat.com/archives/publican-list/2010-May/msg00025.html

Here it's important to distinguish between "unicode collation" in the sense of
the default Unicode collation, and "unicode collation" when using the CLDR
tailorings.

The above-linked message says that "we're getting all the Katakana first, then
all the Hiragana", whereas Unicode::Collate::Locale intersperses Katakana and
Hiragana when told to use a 'ja' locale.

[I must say that the result doesn't look to me like it's sorted "according
 to pronunciation", if I may judge solely from the Latin names of the
 Hiragana and Katakana, and I was surprised to find that any Latin
 characters after the Hiragana / Katakana affect the ordering in a complex
 way; but certainly Hiragana and Katakana entries are interspersed in the
 resulting sorted list.]
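
As a rough illustration of what "interspersed" means here, the Python sketch
below folds Katakana onto the corresponding Hiragana before comparing.  This
is only an assumption-laden toy (the function name is mine, and it relies on
the two Unicode blocks being laid out 0x60 apart); it is not the CLDR 'ja'
tailoring and it knows nothing about pronunciation.

```python
def kana_key(text):
    """Fold Katakana (U+30A1..U+30F6) onto Hiragana so the scripts sort together."""
    folded = "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )
    # The original text breaks ties, so Hiragana precedes its Katakana twin.
    return (folded, text)


words = ["カ", "あ", "か", "ア"]
by_code_point = sorted(words)            # each script stays in its own block
interspersed = sorted(words, key=kana_key)
print(by_code_point)
print(interspersed)
```

In plain code-point order the two scripts end up in separate blocks (here
with Hiragana first); with the folded key, あ and ア sit next to each other,
followed by か and カ.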

Btw, the possible issue I mentioned about Spanish seems to be out of date:
the official rules have changed, such that ch is now sorted between
cg and ci, as in most other languages.  I've checked that
Unicode::Collate::Locale does indeed sort ch before ci with locale set to
'es' or 'es-ES', which is what we want.  (The old rules are available by
setting locale to 'es-ES-traditional'.)
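
The difference between the two Spanish orderings can be sketched in Python
as follows.  Both key functions are hypothetical stand-ins that I wrote for
illustration; they ignore accents and everything else that the real
tailorings in Unicode::Collate::Locale handle.

```python
def modern_key(word):
    """Current rule: 'ch' is just 'c' followed by 'h'."""
    return word.casefold()


def traditional_key(word):
    """Old rule: treat 'ch' and 'll' as single letters after 'c' and 'l'."""
    folded = word.casefold()
    # U+FFFF compares greater than any letter, so "ch..." lands after "cz..."
    # but still before "d...", and likewise for "ll".
    return folded.replace("ch", "c\uffff").replace("ll", "l\uffff")


words = ["cielo", "chico", "cuna", "dedo", "casa"]
modern = sorted(words, key=modern_key)
traditional = sorted(words, key=traditional_key)
print(modern)
print(traditional)
```

Under the modern key "chico" falls between "casa" and "cielo"; under the
traditional key it moves to the end of the c words, just before "dedo".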

pjrm.