Working with Text |
In many languages the sentence terminator is a period. The English language also uses the period as a decimal separator, in ellipsis marks, and to terminate abbreviations. Because the period has more than one purpose, the BreakIterator class cannot always determine sentence boundaries accurately.

First, let's look at a case where sentence boundary analysis does work. You start by creating a BreakIterator with the getSentenceInstance method:

    BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(currentLocale);

To show the sentence boundaries, use the markBoundaries method, which the previous section discussed. The markBoundaries method prints carets ('^') beneath a string to indicate boundary positions. In the following example, the sentence boundaries are properly identified:

    She stopped. She said, "Hello there," and then went on.
    ^            ^                                         ^

You can also locate the boundaries of sentences that end with question marks and exclamation points:

    He's vanished! What will we do? It's up to us.
    ^              ^                ^             ^

Using the period as a decimal point does not cause an error:

    Please add 1.5 liters to the tank.
    ^                                 ^

An ellipsis mark (three spaced periods) indicates the omission of text within a quoted passage. In the next example, the ellipses erroneously generate sentence boundaries:

    "No man is an island . . . every man . . . "
    ^                      ^ ^             ^ ^ ^^

Abbreviations might also cause errors. If a period is followed by whitespace and an uppercase letter, the BreakIterator detects a bogus sentence boundary:

    My friend, Mr. Jones, has a new dog. The dog's name is Spot.
    ^              ^                     ^                      ^
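
To try these cases yourself, the following is a minimal, self-contained sketch. It assumes Locale.US, and the class name SentenceBoundaryDemo and the helpers boundaries and markerLine are hypothetical stand-ins for the markBoundaries method from the previous section, which is not reproduced here. The boundaries reported for the ellipsis and abbreviation samples may differ between JDK releases, so treat the output as something to inspect rather than a fixed result.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class SentenceBoundaryDemo {

    // Collects every boundary offset that the sentence iterator reports,
    // including the start (0) and the end (text.length()) of the text.
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(locale);
        sentenceIterator.setText(text);
        List<Integer> offsets = new ArrayList<>();
        for (int pos = sentenceIterator.first();
                 pos != BreakIterator.DONE;
                 pos = sentenceIterator.next()) {
            offsets.add(pos);
        }
        return offsets;
    }

    // Builds a line of spaces with a caret at each boundary offset.
    // The line is one character longer than the text so that the
    // final boundary (just past the last character) can be marked.
    static String markerLine(String text, List<Integer> offsets) {
        char[] markers = new char[text.length() + 1];
        Arrays.fill(markers, ' ');
        for (int offset : offsets) {
            markers[offset] = '^';
        }
        return new String(markers);
    }

    public static void main(String[] args) {
        // Sample strings taken from the examples in this section.
        String[] samples = {
            "She stopped. She said, \"Hello there,\" and then went on.",
            "He's vanished! What will we do? It's up to us.",
            "Please add 1.5 liters to the tank.",
            "\"No man is an island . . . every man . . . \"",
            "My friend, Mr. Jones, has a new dog. The dog's name is Spot."
        };
        for (String sample : samples) {
            System.out.println(sample);
            System.out.println(markerLine(sample, boundaries(sample, Locale.US)));
            System.out.println();
        }
    }
}
```

Returning the offsets as a list, rather than printing inside the loop, keeps the boundary analysis separate from the display, so the same offsets can also be used to extract each sentence with String.substring.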