classpath

BreakIterator.getSentenceInstance() deviations


From: Julian Scheid
Subject: BreakIterator.getSentenceInstance() deviations
Date: Wed, 11 Jul 2001 03:47:49 +0100

The javadoc drop-in I am working on relies on
BreakIterator.getSentenceInstance() for detecting the end of the first
sentence of a javadoc comment, as suggested by
http://java.sun.com/j2se/1.3/docs/tooldocs/javadoc/doclet/com/sun/javadoc/Tag.html#firstSentenceTags()

In this context I encountered some inconsistencies between various free
core libs (Classpath, libgcj.zip and Klasses.jar) and the behaviour of
the JRE.

A program demonstrating the problem is attached. Here are the results
when run in different environments (true linefeeds replaced by \n's for
clarity):

JRE 1.3.1:

  1 first sentence of 'Foo.\nBar.' is 'Foo.\n'
  2 first sentence of 'Foo. bar.' is 'Foo. bar.'
  3 first sentence of 'Foo. 123 Bar.' is 'Foo. 123 Bar.'

ORP 1.0.3, Kissme 0.13, Kaffe 1.0.6:

  1 first sentence of 'Foo.\nBar.' is 'Foo.\nBar.'       (inconsistent)
  2 first sentence of 'Foo. bar.' is 'Foo. '             (inconsistent)
  3 first sentence of 'Foo. 123 Bar.' is 'Foo. '         (inconsistent)

GIJ 0.0.7:

  1 first sentence of 'Foo.\nBar.' is 'Foo.\nBar.'       (inconsistent)
  2 first sentence of 'Foo. bar.' is 'Foo. bar.'         (ok)
  3 first sentence of 'Foo. 123 Bar.' is 'Foo. '         (inconsistent)
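
The attachment itself is not reproduced in this archive. As a rough
sketch only, a test of this kind could look like the following. The
class layout and the helper names firstSentence() and escape() are
assumptions made for illustration, not the contents of the attached
file; the only library calls used are the standard
java.text.BreakIterator methods.

```java
import java.text.BreakIterator;

public class SentenceTest {

    // Hypothetical helper: returns the text up to the first sentence
    // boundary reported by the default-locale sentence BreakIterator.
    static String firstSentence(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance();
        it.setText(text);
        int start = it.first();
        int end = it.next();
        if (end == BreakIterator.DONE) {
            // No boundary after the start: treat the whole text as one sentence.
            return text;
        }
        return text.substring(start, end);
    }

    // Hypothetical helper: shows real linefeeds as the two characters \n.
    static String escape(String s) {
        return s.replace("\n", "\\n");
    }

    public static void main(String[] args) {
        String[] samples = { "Foo.\nBar.", "Foo. bar.", "Foo. 123 Bar." };
        for (int i = 0; i < samples.length; i++) {
            System.out.println((i + 1) + " first sentence of '"
                + escape(samples[i]) + "' is '"
                + escape(firstSentence(samples[i])) + "'");
        }
    }
}
```

The output depends on the class library in use, which is the point of
the comparison above.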


Line #2 shows that both the JRE and libgcj.zip do not treat "period
followed by whitespace followed by lowercase letter" as the end of a
sentence, whereas Classpath and Klasses.jar do.

As line #3 demonstrates, the JRE also does not treat "period followed by
whitespace followed by digit" as the end of a sentence, whereas all of
the free implementations do.

Finally, line #1 shows that the JRE correctly identifies ".\n" as an
end-of-sentence token, while the other implementations ignore it. I
assume this is due to an incomplete definition of
LocaleInformation.sentence_breaks:

  private static final String[] sentence_breaks = { ". " };

... which should be

  private static final String[] sentence_breaks
       = { ". ", ".\t", ".\r\n", ".\r", ".\n" };

(I'm not sure about element [3], please check this.)
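
To illustrate the effect of the longer list, here is a simplified
sketch. It assumes that a sentence ends at the earliest occurrence of
any of the listed strings; that is an assumption made for the example,
not Classpath's actual iterator code, and the class and method names
are hypothetical. When two entries match at the same position, the
longer one is preferred so that ".\r\n" is not cut after the "\r".

```java
public class SentenceBreakSketch {

    // The current definition quoted above.
    static final String[] OLD_BREAKS = { ". " };

    // The proposed definition quoted above.
    static final String[] NEW_BREAKS = { ". ", ".\t", ".\r\n", ".\r", ".\n" };

    // Hypothetical helper: cuts the text after the earliest break string.
    // Returns the whole text if no break string occurs.
    static String firstSentence(String text, String[] breaks) {
        int bestStart = Integer.MAX_VALUE;
        int bestEnd = -1;
        for (String b : breaks) {
            int pos = text.indexOf(b);
            if (pos < 0) {
                continue;
            }
            int end = pos + b.length();
            // Prefer the earliest match; on a tie prefer the longer one.
            if (pos < bestStart || (pos == bestStart && end > bestEnd)) {
                bestStart = pos;
                bestEnd = end;
            }
        }
        return bestEnd < 0 ? text : text.substring(0, bestEnd);
    }

    public static void main(String[] args) {
        String text = "Foo.\nBar.";
        // With only ". " the linefeed is not recognised as a break.
        System.out.println(firstSentence(text, OLD_BREAKS).length());
        // With the longer list the sentence ends after ".\n".
        System.out.println(firstSentence(text, NEW_BREAKS).length());
    }
}
```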

With regard to Classpath, both gnu/java/locale/LocaleInformation_en.java
and .../LocaleInformation_nl.java need to be changed for this.

Julian

Attachment: SentenceTest.java
Description: Binary data

