BreakIterator.getSentenceInstance() deviations
From: Julian Scheid
Subject: BreakIterator.getSentenceInstance() deviations
Date: Wed, 11 Jul 2001 03:47:49 +0100

The javadoc drop-in I am working on relies on
BreakIterator.getSentenceInstance() for detecting the end of the first
sentence of a javadoc comment, as suggested by
http://java.sun.com/j2se/1.3/docs/tooldocs/javadoc/doclet/com/sun/javadoc/Tag.html#firstSentenceTags()
While doing so I found that several free core libraries (Classpath,
libgcj.zip and Klasses.jar) behave differently from the JRE.
A program demonstrating the problem is attached. Here are the results
when run in different environments (true linefeeds replaced by \n's for
clarity):
JRE 1.3.1:
1 first sentence of 'Foo.\nBar.' is 'Foo.\n'
2 first sentence of 'Foo. bar.' is 'Foo. bar.'
3 first sentence of 'Foo. 123 Bar.' is 'Foo. 123 Bar.'
ORP 1.0.3, Kissme 0.13, Kaffe 1.0.6:
1 first sentence of 'Foo.\nBar.' is 'Foo.\nBar.' (inconsistent)
2 first sentence of 'Foo. bar.' is 'Foo. ' (inconsistent)
3 first sentence of 'Foo. 123 Bar.' is 'Foo. ' (inconsistent)
GIJ 0.0.7:
1 first sentence of 'Foo.\nBar.' is 'Foo.\nBar.' (inconsistent)
2 first sentence of 'Foo. bar.' is 'Foo. bar.' (ok)
3 first sentence of 'Foo. 123 Bar.' is 'Foo. ' (inconsistent)
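The attachment itself is not reproduced in this message. A minimal sketch of a test of this kind might look like the following. The class name FirstSentenceDemo and the helper firstSentence() are placeholders for illustration, not taken from the attached SentenceTest.java; only the java.text.BreakIterator calls are the standard API. No expected output is given, since the result depends on the core library in use.

```java
import java.text.BreakIterator;

public class FirstSentenceDemo {

    // Returns the text up to the first sentence boundary reported by
    // the core library's sentence BreakIterator (default locale).
    public static String firstSentence(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance();
        it.setText(text);
        int start = it.first();
        int end = it.next();
        if (end == BreakIterator.DONE) {
            // No boundary after the start: treat the whole text as one sentence.
            return text;
        }
        return text.substring(start, end);
    }

    // Makes line breaks and tabs visible in the printed result.
    static String escape(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n").replace("\t", "\\t");
    }

    public static void main(String[] args) {
        // The three inputs from the results listed above.
        String[] samples = { "Foo.\nBar.", "Foo. bar.", "Foo. 123 Bar." };
        for (int i = 0; i < samples.length; i++) {
            System.out.println((i + 1) + " first sentence of '"
                    + escape(samples[i]) + "' is '"
                    + escape(firstSentence(samples[i])) + "'");
        }
    }
}
```

Running this under each VM and core library and comparing the printed lines should be enough to spot differences of the kind reported here.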
Line #2 reveals that both JRE and libgcj.zip ignore "period followed by
whitespace followed by lowercase letter" when looking for the end of a
sentence, but Classpath and Klasses.jar don't.
As demonstrated by line #3, JRE also ignores "period followed by
whitespace followed by digit", but none of the free implementations do.
Finally, line #1 shows that JRE correctly identifies ".\n" as an
end-of-sentence token, while the other implementations ignore it. I
assume this is due to an incomplete definition of
LocaleInformation.sentence_breaks:
  private static final String[] sentence_breaks = { ". " };

... which should be

  private static final String[] sentence_breaks
    = { ". ", ".\t", ".\r\n", ".\r", ".\n" };

(I'm not sure about element [3], please check this.)
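To illustrate why the shorter array would miss ".\n", here is a rough sketch. It assumes a simple matcher that ends the first sentence at the earliest occurrence of any listed break string. This is an assumption for illustration only, not the actual Classpath code; the class SentenceBreakSketch and its members are hypothetical names.

```java
public class SentenceBreakSketch {

    // Hypothetical stand-ins for the current and proposed
    // LocaleInformation.sentence_breaks values quoted above.
    public static final String[] CURRENT_BREAKS = { ". " };
    public static final String[] PROPOSED_BREAKS
        = { ". ", ".\t", ".\r\n", ".\r", ".\n" };

    // Ends the first sentence after the earliest matching break string.
    // If two break strings match at the same position, the longer one
    // wins, so ".\r\n" is preferred over ".\r".
    public static String firstSentence(String text, String[] breaks) {
        int bestStart = text.length();
        int bestEnd = text.length();
        for (String b : breaks) {
            int idx = text.indexOf(b);
            if (idx < 0) {
                continue; // this break string does not occur in the text
            }
            int end = idx + b.length();
            if (idx < bestStart || (idx == bestStart && end > bestEnd)) {
                bestStart = idx;
                bestEnd = end;
            }
        }
        // With no match at all, the whole text counts as one sentence.
        return text.substring(0, bestEnd);
    }

    static String escape(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n").replace("\t", "\\t");
    }

    public static void main(String[] args) {
        String text = "Foo.\nBar.";
        System.out.println(escape(firstSentence(text, CURRENT_BREAKS)));
        System.out.println(escape(firstSentence(text, PROPOSED_BREAKS)));
    }
}
```

Under this assumption, the single-entry array never sees a break in 'Foo.\nBar.', while the extended array stops after the linefeed. The sketch does not model the lowercase-letter and digit cases from lines #2 and #3.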
With regard to Classpath, both gnu/java/locale/LocaleInformation_en.java
and .../LocaleInformation_nl.java need to be changed for this.

Julian

Attachment: SentenceTest.java (Description: Binary data)