Mike MacCana wrote:
Also, Lucene suffers from the Java UCS-2 scandal: they chose a 16-bit character encoding which is good for Japanese, but bulks up European languages by a factor of two relative to UTF-8 and doesn't support enough characters to do a good job with Chinese. They (meaning engineers at Red Hat) are discussing this. The solution won't use Lucene, as Lucene treats all file content as equal, i.e. it doesn't know about headings being different from body text and so on.
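
To put rough numbers on the encoding point, here is a minimal sketch that compares byte counts for the same text in UTF-16 and UTF-8. It is in Python rather than Java only because it is short; the sample strings and the `encoded_sizes` helper are made up for illustration and say nothing about how Lucene actually stores its index.

```python
# Illustrative only: compare how many bytes the same text needs
# in UTF-16 versus UTF-8. The helper and samples are hypothetical.

def encoded_sizes(text: str):
    """Return (utf16_bytes, utf8_bytes) for text, without a byte-order mark."""
    return len(text.encode("utf-16-be")), len(text.encode("utf-8"))


samples = {
    "european": "search engine",   # ASCII-range text
    "japanese": "検索エンジン",      # kanji and katakana, all within 16 bits
    "rare_han": "\U0002000B",      # a Han character outside the 16-bit range
}

for name, text in samples.items():
    utf16, utf8 = encoded_sizes(text)
    print(f"{name}: UTF-16 = {utf16} bytes, UTF-8 = {utf8} bytes")
```

The ASCII sample doubles in size under UTF-16, the Japanese sample is a third smaller than in UTF-8, and the last character does not fit in a single 16-bit unit at all, so a pure 16-bit encoding cannot represent it and UTF-16 has to spend two units (a surrogate pair) on it.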
Mike