
Sentence Boundaries

In many languages the sentence terminator is a period. English also uses the period as a decimal separator, in ellipsis marks, and at the end of abbreviations. Because the period serves more than one purpose, the BreakIterator class cannot always determine sentence boundaries accurately.

First, let's look at a case where sentence boundary analysis does work. You start by creating a BreakIterator with the getSentenceInstance method:

BreakIterator sentenceIterator =
   BreakIterator.getSentenceInstance(currentLocale);
To show the sentence boundaries, use the markBoundaries method, which the previous section discussed. The markBoundaries method prints carets ('^') beneath a string to indicate boundary positions. In the following example, the sentence boundaries are properly identified:
She stopped.  She said, "Hello there," and then went on.
^             ^                                         ^
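
For readers who do not have the previous section at hand, the marker line above can be produced along the following lines. This is a minimal sketch, not the tutorial's own markBoundaries code: the class name BoundaryMarkers and the method name markers are hypothetical, and Locale.US stands in for currentLocale.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

class BoundaryMarkers {
    // Builds a line of spaces with '^' at every boundary offset the iterator reports.
    static String markers(String target, BreakIterator iterator) {
        // One extra slot: the final boundary sits just past the last character.
        char[] line = new char[target.length() + 1];
        Arrays.fill(line, ' ');
        iterator.setText(target);
        for (int b = iterator.first(); b != BreakIterator.DONE; b = iterator.next()) {
            line[b] = '^';
        }
        return new String(line);
    }

    public static void main(String[] args) {
        String text = "She stopped.  She said, \"Hello there,\" and then went on.";
        // Locale.US is an assumption standing in for currentLocale.
        BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.US);
        System.out.println(text);
        System.out.println(markers(text, sentenceIterator));
    }
}
```

Returning the marker line as a String, rather than printing it inside the method, keeps the helper easy to reuse with word or line iterators as well.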
You can also locate the boundaries of sentences that end with question marks and exclamation points:
He's vanished!  What will we do?  It's up to us.
^               ^                 ^             ^
Using the period as a decimal point does not cause an error:
Please add 1.5 liters to the tank.
^                                 ^
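
The same boundary offsets can be used to extract each sentence as a substring instead of marking it. The following is a sketch under stated assumptions: the class SentenceSplitter and its sentences method are hypothetical names, and trimming the trailing whitespace that the iterator leaves attached to each sentence is a choice made here for readability.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {
    // Returns the text between each pair of consecutive sentence boundaries.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        List<String> result = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            // Trailing spaces belong to the preceding sentence, so trim them off.
            result.add(text.substring(start, end).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Locale.US is an assumption standing in for currentLocale.
        for (String s : sentences("He's vanished!  What will we do?  It's up to us.", Locale.US)) {
            System.out.println("[" + s + "]");
        }
        for (String s : sentences("Please add 1.5 liters to the tank.", Locale.US)) {
            System.out.println("[" + s + "]");
        }
    }
}
```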
An ellipsis mark (three spaced periods) indicates the omission of text within a quoted passage. In the next example, the ellipses erroneously generate sentence boundaries:
"No man is an island . . . every man . . . "
^                      ^ ^             ^ ^ ^^
Abbreviations might also cause errors. If the period that ends an abbreviation is followed by whitespace and an uppercase letter, the BreakIterator reports a spurious sentence boundary, as it does after "Mr." in the next example:
My friend, Mr. Jones, has a new dog.  The dog's name is Spot.
^              ^                      ^                      ^
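
One possible way to soften the abbreviation problem is to post-process the segments and rejoin any segment that ends in a known abbreviation with the segment that follows it. This is not a feature of BreakIterator; it is a hedged sketch of a workaround, and the class name AbbreviationAwareSplitter, the method names, and the deliberately tiny abbreviation list are all hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class AbbreviationAwareSplitter {
    // Hypothetical sample list; a realistic one depends on the locale and the domain.
    static final List<String> ABBREVIATIONS = Arrays.asList("Mr.", "Mrs.", "Ms.", "Dr.");

    static List<String> split(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        List<String> sentences = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            pending.append(text, start, end);
            // Emit the accumulated text only if it does not end in a listed abbreviation.
            if (!endsWithAbbreviation(pending)) {
                sentences.add(pending.toString().trim());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            sentences.add(pending.toString().trim());
        }
        return sentences;
    }

    // True if the last whitespace-delimited word of the segment is a listed abbreviation.
    static boolean endsWithAbbreviation(CharSequence segment) {
        String trimmed = segment.toString().trim();
        String lastWord = trimmed.substring(trimmed.lastIndexOf(' ') + 1);
        return ABBREVIATIONS.contains(lastWord);
    }

    public static void main(String[] args) {
        String text = "My friend, Mr. Jones, has a new dog.  The dog's name is Spot.";
        for (String s : split(text, Locale.US)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

The trade-off is that a sentence which really does end with a listed abbreviation is merged with the sentence after it, so the list should stay limited to abbreviations that rarely end a sentence.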

