Word and Sentence Tokenization with Hidden Markov Models

We present a novel method for the segmentation of text into tokens and sentences. Our approach uses a Hidden Markov Model to detect segment boundaries. Model parameters can be estimated from pre-segmented text, which is widely available in the form of treebanks or aligned multilingual corpora. We formally define the boundary detection model and evaluate its performance on corpora from various languages as well as a small corpus of computer-mediated communication.


Introduction
Detecting token and sentence boundaries is an important preprocessing step in natural language processing applications, since most of these operate either on the level of words (e.g. syllabification, morphological analysis) or sentences (e.g. part-of-speech tagging, parsing, machine translation). The primary challenges of the tokenization task stem from the ambiguity of certain characters in alphabetic writing systems and from the absence of explicitly marked word boundaries in symbolic writing systems. The following German example illustrates different uses of the dot character to terminate an abbreviation, an ordinal number, or an entire sentence.

(1) Am 24.1.1806 feierte E. T. A. Hoffmann seinen 30. Geburtstag.
    'On 24/1/1806, E. T. A. Hoffmann celebrated his 30th birthday.'

Recently, the advent of instant written communication over the internet and its increasing share in people's daily communication behavior has posed new challenges for existing approaches to language processing: computer-mediated communication (CMC) is characterized by a creative use of language and often substantial deviations from orthographic standards. For the task of text segmentation, this means dealing with unconventional uses of punctuation and letter-case, as well as genre-specific elements such as emoticons and inflective forms (e.g. "*grins*"). CMC sub-genres may differ significantly in their degree of deviation from orthographic norms. Moderated discussions from the university context are almost standard-compliant, while some passages of casual chat consist exclusively of metalinguistic items. In addition, CMC exhibits many structural similarities to spoken language. It is in dialogue form, contains anacolutha and self-corrections, and is discontinuous in the sense that utterances may be interrupted and continued at some later point in the conversation. Altogether, these phenomena complicate automatic text segmentation considerably.
In this paper, we present a novel method for the segmentation of text into tokens and sentences. It uses a Hidden Markov Model (HMM) for the classification of segment boundaries. The remainder of this work is structured as follows: first, we describe the tasks of tokenization and EOS detection and summarize some relevant previous work on the topic. Section 2 contains a description of our approach, including a formal definition of the underlying HMM. In Section 3, we present an empirical evaluation of our approach with respect to conventional corpora from five different European languages as well as a small corpus of CMC text, comparing results to those achieved by a state-of-the-art tokenizer.

Task Description
Tokenization and EOS detection are often treated as separate text processing stages. First, the input is segmented into atomic units or word-like tokens. Often, this segmentation occurs on whitespace, but punctuation must be considered as well, since it is often not set off by whitespace, as for example in the case of the commas in Examples (1) and (2). Moreover, there are tokens which may contain internal whitespace, such as cardinal numbers in German, in which a single space character may be used as a thousands separator. The concept of a token is vague and may even depend on the client application: New York might be considered a single token for purposes of named entity recognition, but two tokens for purposes of syntactic parsing.
In the second stage, sentence boundaries are marked within the sequence of word-like tokens. There is a set of punctuation characters which typically introduce sentence boundaries: the "usual suspects" for sentence-final punctuation include the question mark ("?"), exclamation point ("!"), ellipsis ("..."), colon (":"), semicolon (";"), and of course the full stop ("."). Unfortunately, any of these items can mislead a simple rule-based sentence splitting procedure. Apart from the different uses of the dot character illustrated in Ex. (1), all of these items can occur sentence-internally (e.g. in direct quotations like "'Stop!' he shouted."), or even token-internally in the case of complex tokens such as URLs. Another major difficulty for EOS detection arises from sentence boundaries which are not explicitly marked by punctuation at all, as e.g. in newspaper headlines.
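To make this ambiguity concrete, the following sketch (the helper `naive_split` is purely illustrative and not part of any system discussed here) shows how a simple rule-based splitter that breaks after sentence-final punctuation is misled by abbreviated names:

```python
import re

def naive_split(text):
    # Split after '.', '!' or '?' whenever the next character is whitespace.
    return re.split(r'(?<=[.!?])\s+', text)

text = "On 24/1/1806, E. T. A. Hoffmann celebrated his birthday. He was 30."
print(naive_split(text))
# The abbreviation dots after 'E', 'T' and 'A' each trigger a spurious split,
# yielding 5 fragments where a correct segmentation would yield 2 sentences.
```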

Existing Approaches
Many different approaches to tokenization and EOS detection have been proposed in the literature. He and Kayaalp (2006) give an interesting overview of the characteristics and performance of 13 tokenizers on biomedical text containing challenging tokens like DNA sequences, arithmetical expressions, URLs, and abbreviations. Their evaluation focuses on the treatment of particular phenomena rather than general scalar quantities such as precision and recall, so a clear "winner" cannot be determined. While He and Kayaalp (2006) focus on existing and freely available tokenizer implementations, we briefly present here the theoretical characteristics of some related approaches. All of these works have in common that they focus on the disambiguation of the period as the most likely source of difficulties for the text segmentation task.
Modes of evaluation differ between the various approaches, which makes direct comparisons difficult. Results are usually reported in terms of error rate or accuracy, often focusing on the performance of the disambiguation of the period. In this context, Palmer and Hearst (1997) define a lower bound for EOS detection as "the percentage of possible sentence-ending punctuation marks [...] that indeed denote sentence boundaries." The Brown Corpus (Francis and Kucera, 1982) and the Wall Street Journal (WSJ) subset of the Penn Treebank (Marcus et al., 1993) are the usual test corpora, relying on the assumption that the manual assignment of a part-of-speech (PoS) tag to a token requires prior manual segmentation of the text. Riley (1989) trains a decision tree with features including word length, letter-case, and probability-at-EOS on pre-segmented text. He uses the 25 million word AP News text database for training and reports 99.8% accuracy for the task of identifying sentence boundaries introduced by a full stop in the Brown corpus. Grefenstette and Tapanainen (1994) use a set of regular rules and some lexica to detect occurrences of the period which are not EOS markers. They exhibit rules for the treatment of numbers and abbreviations and report a rate of 99.07% correctly recognized sentence boundaries for their rule-based system on the Brown corpus. Palmer and Hearst (1997) present a system which makes use of the possible PoS tags of the words surrounding potential EOS markers to assist in the disambiguation task. Two different kinds of statistical models (neural networks and decision trees) are trained from manually PoS-tagged text and evaluated on the WSJ corpus. The lowest reported error rates are 1.5% for a neural network and 1.0% for a decision tree. Similar results are achieved for French and German. Mikheev (2000) extends the approach of Palmer and Hearst (1997) by incorporating the task of EOS detection into the process of PoS tagging, thereby allowing the disambiguated PoS tags of words in the immediate vicinity of a potential sentence boundary to influence decisions about boundary placement. He reports error rates of 0.2% and 0.31% for EOS detection on the Brown and the WSJ corpus, respectively. Mikheev's treatment also gives the related task of abbreviation detection much more attention than previous work had. Making use of the internal structure of abbreviation candidates, together with the surroundings of clear abbreviations and a list of frequent abbreviations, error rates of 1.2% and 0.8% are reported for the Brown and WSJ corpus, respectively.
While the aforementioned techniques use pre-segmented or even pre-tagged text for training model parameters, Schmid (2000) proposes an approach which can use raw, unsegmented text for training. He uses heuristically identified "unambiguous" instances of abbreviations and ordinal numbers to estimate probabilities for the disambiguation of the dot character, reporting an EOS detection accuracy of 99.79%. More recently, Kiss and Strunk (2006) presented another unsupervised approach to the tokenization problem: the Punkt system. Its underlying assumption is that abbreviations may be regarded as collocations between the abbreviated material and the following dot character. Significant collocations are detected in a training stage using log-likelihood ratios. While the detection of abbreviations through the collocation assumption involves type-wise decisions, a number of heuristics involving its immediate surroundings may cause an abbreviation candidate to be reclassified on the token level. Similar techniques are applied to possible ellipses and ordinal numbers, and evaluation is carried out for a number of different languages. Results are reported for both EOS and abbreviation detection in terms of precision, recall, error rate, and unweighted F score. Results for EOS detection range from F = 98.83% for Estonian to F = 99.81% for German, with a mean of F = 99.38% over all tested languages; and for abbreviation detection from F = 77.80% for Swedish to F = 98.68% for English, with a mean of F = 90.93% over all languages.
A sentence- and token-splitting framework closely related to the current approach is presented by Tomanek et al. (2007), tailored to the domain of biomedical text. Such text contains many complex tokens such as chemical terms, protein names, or chromosome locations, which make it difficult to tokenize. Tomanek et al. (2007) propose a supervised approach using a pair of conditional random field (CRF) classifiers to disambiguate sentence and token boundaries in whitespace-separated text. In contrast to the standard approach, EOS detection takes place first, followed by token-boundary detection. The classifiers are trained on pre-segmented data and employ both lexical and contextual features such as item text, item length, letter-case, and whitespace adjacency. Accuracies of 99.8% and 96.7% are reported for the tasks of sentence- and token-splitting, respectively.

HMM-based Tokenization
In this section, we present our approach to token- and sentence-boundary detection, which uses a Hidden Markov Model to simultaneously detect both word and sentence boundaries in a stream of candidate word-like segments returned by a low-level scanner. Section 2.1 briefly describes some requirements on the low-level scanner, while Section 2.2 is dedicated to the formal definition of the HMM itself.

Scanner
The scanner we employed in the current experiments uses Unicode character classes in a simple rule-based framework to split raw corpus text on whitespace and punctuation. The resulting pre-tokenization is "prolix" in the sense that many scan-segment boundaries do not in fact correspond to actual word or sentence boundaries. In the current framework, only scan-segment boundaries can be promoted to full-fledged token or sentence boundaries, so the scanner output must contain at least these. In particular, unlike most other tokenization frameworks, the scanner also returns whitespace-only pseudo-tokens, since the presence or absence of whitespace can constitute useful information regarding the proper placement of token and sentence boundaries. In Ex. (3) for instance, whitespace is crucial for the correct classification of the apostrophes.
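A minimal illustration of such a prolix pre-tokenization (a sketch only; the actual scanner is generated from 40 hand-written RE2C rules) splits letter/digit runs, whitespace runs, and individual punctuation characters, deliberately keeping the whitespace segments as pseudo-tokens:

```python
import re

def scan(text):
    # Letter/digit runs, whitespace runs, or any remaining single character;
    # whitespace segments are returned rather than discarded, since their
    # presence or absence is informative for boundary placement.
    return re.findall(r'\w+|\s+|.', text, re.UNICODE)

print(scan("Hoffmann's 30."))
# -> ['Hoffmann', "'", 's', ' ', '30', '.']
```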

HMM Boundary Detector
Given a prolix segmentation as returned by the scanner, the task of tokenization can be reduced to one of classification: we must determine for each scanner segment whether or not it is a word-initial segment, and if so, whether or not it is also a sentence-initial segment. To accomplish this, we make use of a Hidden Markov Model which encodes the boundary classes as hidden state components, in a manner similar to that employed by HMM-based chunkers (Church, 1988; Skut and Brants, 1998). In order to minimize the number of model parameters and thus ameliorate sparse data problems, our framework maps each incoming scanner segment to a small set of salient properties such as word length and typographical class, in terms of which the underlying language model is then defined.

Segment Features
Formally, our model is defined in terms of a finite set of segment features. In the experiments described here, we use the observable features class, case, length, stop, abbr, and blanks together with the hidden features bow, bos, and eos to specify the language model. We treat each feature f as a function from candidate tokens (scanner segments) to a characteristic finite set of possible values rng(f). The individual features and their possible values are described in more detail below and summarized in Table 1.
• [class] represents the typographical class of the segment. Possible values are given in Table 2.
• [case] represents the letter-case of the segment. Possible values are cap for segments in all-capitals, up for segments with an initial capital letter, or lo for all other segments.
• [stop] contains the lower-cased text of the segment just in case the segment is a known stopword, i.e. only in conjunction with [class : stop]. We used the appropriate language-specific stopwords distributed with the Python NLTK package whenever available, and otherwise an empty stopword list.
• [blanks] is a binary feature indicating whether or not the segment is bounded by whitespace on the left.
• [abbr] is a binary feature indicating whether or not the segment represents a known abbreviation, as determined by membership in a user-specified language-specific abbreviation lexicon. Since no abbreviation lexica were used for the current experiments, this feature was vacuous and will be omitted henceforth.
• [bow] is a hidden binary feature indicating whether or not the segment is to be considered token-initial.
• [bos] is a hidden binary feature indicating whether or not the segment is to be considered sentence-initial.
• [eos] is a hidden binary feature indicating whether or not the segment is to be considered sentence-final.
Sentence boundaries are only predicted by the final system if a [+eos] segment is immediately followed by a [+bos] segment.
Among these, the feature stop is context-independent in the sense that we do not allow it to contribute to the boundary detection HMM's transition probabilities. We call all other features context-dependent or contextual. An example of how the features described above can be used to define sentence- and token-level segmentations is given in Figure 1.
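A rough sketch of such a feature function is given below. The class inventory, the toy stopword list, and the length binning are illustrative assumptions, not the exact definitions from Tables 1 and 2; the context-dependent [blanks] feature is omitted here because it requires the left-adjacent segment:

```python
STOPWORDS = frozenset({'the', 'a', 'he', 'der', 'die', 'das'})  # toy list

def surface_features(seg):
    # Typographical class (toy inventory, not the paper's Table 2)
    if seg.isspace():
        cls = 'space'
    elif seg.isalpha():
        cls = 'stop' if seg.lower() in STOPWORDS else 'alpha'
    elif seg.isdigit():
        cls = 'num'
    else:
        cls = 'punct'
    # Letter-case: all-capitals, initial capital, or other
    if len(seg) > 1 and seg.isupper():
        case = 'cap'
    elif seg[:1].isupper():
        case = 'up'
    else:
        case = 'lo'
    return {
        'class': cls,
        'case': case,
        'length': min(len(seg), 5),                      # coarse length bins
        'stop': seg.lower() if cls == 'stop' else None,  # only for stopwords
    }
```

Mapping segments to such small feature tuples, rather than to raw strings, is what keeps the number of observable HMM states tractable.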

Language Model
Formally, let F_surf = {class, case, length, stop, blanks} represent the set of surface features, let F_noctx = {stop} represent the set of context-independent features, and let F_hide = {bow, bos, eos} represent the set of hidden features. For any finite set of features F = {f_1, f_2, ..., f_n} over objects from a set S, let [F] be a composite feature function representing the conjunction over all individual features in F as an n-tuple:

    [F](s) = ⟨f_1(s), f_2(s), ..., f_n(s)⟩

Then, the boundary detection HMM can be defined in the usual way (Rabiner, 1989; Manning and Schütze, 1999) as the 5-tuple D = ⟨Q, O, Π, A, B⟩, where:

1. Q is a finite set of model states, where each state q ∈ Q is represented by a 7-tuple of values for the contextual features class, case, length, blanks, bow, bos, and eos;
2. O = rng([F_surf]) is a finite set of possible observations, where each observation is represented by a 5-tuple of values for the surface features class, case, length, blanks, and stop;
3. Π is an initial probability distribution over states;
4. A is a conditional probability distribution over state k-grams representing the model's state transition probabilities; and
5. B is a probability distribution over observations conditioned on states representing the model's emission probabilities.
Using the shorthand notation w_i^{i+j} for the string w_i w_{i+1} ... w_{i+j}, and writing f_O(w) for the observable features [F_surf](w) of a given segment w, the model D computes the probability of a segment sequence w_1^n as the sum of path probabilities over all possible generating state sequences:

    P(w_1^n) = Σ_{q_1^n ∈ Q^n} P(w_1^n, q_1^n)

Assuming suitable boundary handling for negative indices, joint path probabilities themselves are computed as:

    P(w_1^n, q_1^n) = Π_{i=1}^{n} P(q_i | q_{i-k+1}^{i-1}) P(f_O(w_i) | q_i)

Underlying these equations are the following assumptions. Equation (4) asserts that state transition probabilities depend on at most the preceding k−1 states and thus on the contextual features of at most the preceding k−1 segments:

    P(q_i | q_1^{i-1}) = P(q_i | q_{i-k+1}^{i-1})    (4)

Equation (5) asserts the independence of a segment's surface features from all but the model's current state, formally expressing the context-independence of F_noctx:

    P(f_O(w_i) | q_1^i, w_1^{i-1}) = P(f_O(w_i) | q_i)    (5)

In the experiments described below, we used scan-segment trigrams (k = 3) extracted from a training corpus to define language-specific boundary detection models in a supervised manner. To account for unseen trigrams, the empirical distributions were smoothed by linear interpolation of uni-, bi-, and trigrams (Jelinek and Mercer, 1980), using the method described by Brants (2000) to estimate the interpolation coefficients.
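The interpolated transition model can be sketched as follows. The interpolation coefficients are fixed here for brevity, whereas Brants (2000) estimates them from the training counts by deleted interpolation; the function name and signature are illustrative:

```python
from collections import Counter

def train_transitions(state_seqs, lambdas=(0.1, 0.3, 0.6)):
    """Return P(q | u, v): a trigram transition probability smoothed by
    linear interpolation of uni-, bi-, and trigram relative frequencies."""
    uni, bi, tri = Counter(), Counter(), Counter()
    total = 0
    for seq in state_seqs:
        total += len(seq)
        uni.update(seq)
        bi.update(zip(seq, seq[1:]))
        tri.update(zip(seq, seq[1:], seq[2:]))
    l1, l2, l3 = lambdas

    def p(q, u, v):
        # u and v are the two preceding states, in order of occurrence
        p1 = uni[q] / total if total else 0.0
        p2 = bi[(v, q)] / uni[v] if uni[v] else 0.0
        p3 = tri[(u, v, q)] / bi[(u, v)] if bi[(u, v)] else 0.0
        return l1 * p1 + l2 * p2 + l3 * p3

    return p
```

Because every component distribution is a relative frequency and the lambdas sum to one, the interpolated model assigns non-zero probability to any trigram whose final state was seen at least once in training.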

Runtime Boundary Placement
Having defined the disambiguator model D, it can be used to predict the "best" possible boundary placement for an input sequence of scanner segments W by application of the well-known Viterbi algorithm (Viterbi, 1967), which computes the state path with maximal probability for the observed input sequence:

    Viterbi(W, D) = argmax_{q_1^n ∈ Q^n} P(q_1^n | W, D)

If q_1, ..., q_n = Viterbi(W, D) is the optimal state sequence returned by the Viterbi algorithm for the input sequence W, the final segmentation into word-like tokens is defined by placing a word boundary immediately preceding all and only those segments w_i with i = 1 or q_i[bow] = +. Similarly, sentence boundaries are placed before all and only those segments w_i with i = 1 or q_i[bow] = q_i[bos] = q_{i-1}[eos] = +. Informally, this means that every input sequence will begin a new word and a new sentence, every sentence boundary must also be a word boundary, and a high-level agreement heuristic is enforced between adjacent eos and bos features. Since all surface feature values are uniquely determined by the observed segment and only the hidden segment features bow, bos, and eos are ambiguous, only those states q_i need to be considered for a segment w_i which agree with it on surface features. This represents a considerable efficiency gain, since the Viterbi algorithm's per-observation cost grows polynomially (with degree k) in the number of states considered per observation.
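A bigram sketch of the decoding step is given below (the actual system uses trigram state contexts and beam pruning). Here `states_for(o)` returns only the states that agree with an observation's surface features, which is the state-restriction gain just described; `trans` and `emit` are assumed to be strictly positive probability functions, with the initial state distribution folded into the first emission:

```python
import math

def viterbi(obs, states_for, trans, emit):
    # delta maps each surviving state to (log-probability, best path so far)
    delta = {q: (math.log(emit(obs[0], q)), [q]) for q in states_for(obs[0])}
    for o in obs[1:]:
        new = {}
        for q in states_for(o):
            lp, path = max(
                ((lp + math.log(trans(r, q)), path)
                 for r, (lp, path) in delta.items()),
                key=lambda t: t[0],
            )
            new[q] = (lp + math.log(emit(o, q)), path + [q])
        delta = new
    return max(delta.values(), key=lambda t: t[0])[1]
```

Working in log-space avoids numerical underflow on long segment sequences, and restricting `states_for` to surface-compatible states shrinks the inner maximization from the full state set to at most the number of hidden-feature combinations (2^3 = 8 in the model above).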

Experiments
In this section, we present four experiments designed to test the efficacy of the HMM-based boundary detection framework described above. After describing the corpora and software used for the experiments in Section 3.1 and formally defining our evaluation criteria in Section 3.2, we first compare the performance of our approach to that of the Punkt system introduced by Kiss and Strunk (2006) on corpora from five different European languages in Section 3.3. In Section 3.4 we investigate the effect of training corpus size on HMM-based boundary detection, while Section 3.5 deals with the effect of some common typographical conventions. Finally, Section 3.6 describes some variants of the basic model and their respective performance with respect to a small corpus of computer-mediated communication.

Materials
Corpora We used several freely available corpora from different languages for training and testing. Since they were used to provide ground-truth boundary placements for evaluation purposes, we required that all corpora provide both word- and sentence-level segmentation. For English (en), we used the Wall Street Journal texts from the Penn Treebank (Marcus et al., 1993) as distributed with the Prague Czech-English Dependency Treebank (Cuřín et al., 2004), while the corresponding Czech translations served as the test corpus for Czech (cz). The TIGER treebank (de; Brants et al., 2002) was used for our experiments on German. French data were taken from the 'French Treebank' (fr) described by Abeillé et al. (2003), which also contains annotations for multi-word expressions, which we split into their components. For Italian, we chose the Turin University Treebank (it; Bosco et al., 2000). To evaluate the performance of our approach on non-standard orthography, we used a subset of the Dortmund Chat Corpus (chat; Beißwenger and Storrer, 2008). Since pre-segmented data are not available for this corpus, we extracted a sample containing chat logs from different scenarios (media, university context, casual chats) and manually inserted token and sentence boundaries.
To support the detection of (self-)interrupted sentences, we grouped each user's postings and ordered them according to their respective timestamps.Table 3 summarizes some basic properties of the corpora used for training and evaluation.
Software The text segmentation system described in Sec. 2 was implemented in C++ and Perl. The initial prolix segmentation of the input stream into candidate segments was performed by a traditional lex-like scanner generated from a set of 40 hand-written regular expressions by the scanner generator RE2C (Bumbulis and Cowan, 1993). HMM training, smoothing, and runtime Viterbi decoding were performed by the moot part-of-speech tagging suite (Jurish, 2003). Viterbi decoding was executed using the default beam pruning coefficient of one thousand in moot's "streaming mode," flushing the accumulated hypothesis space whenever an unambiguous token was encountered in order to minimize memory requirements without unduly endangering the algorithm's correctness (Lowerre, 1976; Kempe, 1997). To provide a direct comparison with the Punkt system beyond that given by Kiss and Strunk (2006), we used the nltk.tokenize.punkt module distributed with the Python NLTK package. Boundary placements were evaluated with the help of GNU diff (Hunt and McIlroy, 1976; MacKenzie et al., 2002) operating on one-word-per-line "vertical" files.

Cross-Validation
Except where otherwise noted, HMM tokenizers were tested by 10-fold cross-validation to protect against model over-fitting: each test corpus C was partitioned on true sentence boundaries into 10 strictly disjoint subcorpora {c_i} (1 ≤ i ≤ 10) of approximately equal size, and for each evaluation subcorpus c_i, an HMM trained on the remaining subcorpora ∪_{j≠i} c_j was used to predict boundary placements in c_i. Finally, the automatically annotated evaluation subcorpora were concatenated and evaluated with respect to the original test corpus C. Since the Punkt system was designed to be trained in an unsupervised fashion from raw untokenized text, no cross-validation was used in the evaluation of Punkt tokenizers.
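The fold construction can be sketched as follows, treating the corpus as a list of sentences so that every split falls on a true sentence boundary (a simplification of the setup just described; the function name is illustrative):

```python
def cross_validation_splits(sentences, k=10):
    """Yield (train, test) pairs: each of the k disjoint folds is held out
    once while the model is trained on the concatenation of the rest."""
    size = (len(sentences) + k - 1) // k   # fold size, rounded up
    folds = [sentences[i * size:(i + 1) * size] for i in range(k)]
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test
```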

Evaluation Measures
The tokenization method described above was evaluated with respect to the ground-truth test corpora in terms of precision, recall, and the harmonic precision-recall average F, as well as an intuitive scalar error rate. Formally, for a given corpus and a set B_relevant of boundaries (e.g. token or sentence boundaries) within that corpus, let B_retrieved be the set of boundaries of the same type predicted by the tokenization procedure to be evaluated. Tokenizer precision (pr) and recall (rc) can then be defined as:

    pr = |B_retrieved ∩ B_relevant| / |B_retrieved|
    rc = |B_retrieved ∩ B_relevant| / |B_relevant|

Precision thus reflects the likelihood of a true boundary given its prediction by the tokenizer, while recall reflects the likelihood that a boundary will in fact be predicted given its presence in the corpus. In addition to these measures, it is often useful to refer to a single scalar value on the basis of which to compare tokenization quality. The unweighted harmonic precision-recall average F (van Rijsbergen, 1979) is often used for this purpose:

    F = 2 · pr · rc / (pr + rc)

In the sequel, we will also report tokenization error rates (Err) as the ratio of errors to all predicted or true boundaries:

    Err = (fp + fn) / |B_retrieved ∪ B_relevant|

To allow direct comparison with the results reported for the Punkt system by Kiss and Strunk (2006), we will also employ the scalar measure used there, which we refer to here as the "Kiss-Strunk error rate" (Err_KS):

    Err_KS = (fp + fn) / number of all candidates    (11)

Table 4: Performance on dot-terminated sentences, evaluation following Kiss and Strunk (2006).
Since the Kiss-Strunk error rate only applies to sentence boundaries indicated by a preceding full stop, we assume that the "number of candidates" referred to in the denominator is simply the number of dot-final tokens in the corpus.
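These measures can be computed directly over sets of boundary positions; a sketch follows, in which the union denominator for Err reflects our reading of "all predicted or true boundaries":

```python
def boundary_scores(retrieved, relevant):
    # retrieved: boundary positions predicted by the tokenizer
    # relevant:  ground-truth boundary positions
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # true positives
    fp = len(retrieved - relevant)   # false positives
    fn = len(relevant - retrieved)   # false negatives
    pr = tp / len(retrieved) if retrieved else 0.0
    rc = tp / len(relevant) if relevant else 0.0
    f = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
    err = (fp + fn) / len(retrieved | relevant) if retrieved | relevant else 0.0
    return {'pr': pr, 'rc': rc, 'F': f, 'Err': err}
```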

Experiment 1: HMM vs. Punkt
We compared the performance of the HMM-based tokenization architecture described in Sec. 2 to that of the Punkt tokenizer described by Kiss and Strunk (2006) on each of the five conventional corpora from Table 3, evaluating the tokenizers with respect to both sentence- and word-boundary prediction.

Sentence Boundaries
Since Punkt is first and foremost a disambiguator for sentence boundaries indicated by a preceding full stop, we will first consider the models' performance on these, as given in Table 4. For all tested languages, the Punkt system achieved a higher recall on dot-terminated sentence boundaries, representing an average relative recall error reduction rate of 54.3% with respect to the HMM tokenizer. The HMM tokenizer exhibited greater precision, however, providing an average relative precision error reduction rate of 73.9% with respect to Punkt. The HMM-based technique incurred the fewest errors overall, and those errors which it did make were more uniformly distributed between false positives and false negatives, leading to higher F values and lower Kiss-Strunk error rates for all tested corpora except French.
It is worth noting that the error rates we observed for the Punkt system as reported in Table 4 differ from those reported in Kiss and Strunk (2006). In most cases, these differences can be attributed to the use of different corpora. The most directly comparable values are presumably those for English, which in both cases were computed based on samples from the Wall Street Journal corpus: here, we observed a similar error rate (1.10%) to that reported by Kiss and Strunk (1.65%), although Kiss and Strunk observed fewer false positives than we did. These differences may stem in part from incompatible criteria regarding precisely which dots can legitimately be regarded as sentence-terminal, since Kiss and Strunk provide no formal definition of what exactly constitutes a "candidate" for the computation of Eq. (11). In particular, it is unclear how sentence-terminating full stops which are not themselves in sentence-final position, as often occur in direct quotations (e.g. "He said 'stop.'"), are to be treated.
Despite its excellent recall for dot-terminated sentences, the Punkt system's performance dropped dramatically when considering all sentence boundaries (Table 5), including those terminated e.g. by question marks, exclamation points, colons, semicolons, or non-punctuation characters. Our approach outperformed Punkt on global sentence boundary detection for all languages and evaluation modes except precision on the German TIGER corpus (98.01% for the HMM tokenizer vs. 98.83% for Punkt). Overall, the HMM technique incurred only about half as many sentence boundary detection errors as Punkt (µ = 49.3%, σ = 15.3%). This is relatively unsurprising, since Punkt's rule-based scanner stage is responsible for detecting any sentence boundary not introduced by a full stop, while the HMM approach can make use of token context even in the absence of a dot character.

Table 6: Overall performance on word-boundary detection.

Word Boundaries
The differences between our approach and that of Kiss and Strunk become even more apparent for word boundaries. As the data from Table 6 show, the HMM tokenizer substantially outperformed Punkt on word boundary detection for all languages and all evaluation modes, reducing the number of word-boundary errors by over 85% on average (µ = 85.6%, σ = 16.0%). Once again, this behavior can be explained by Punkt's reliance on strict rule-based heuristics to predict all token boundaries except those involving a dot on the one hand, and the HMM technique's deferral of all final decisions to the model-dependent runtime decoding stage on the other. In this manner, our approach is able to adequately account for both "prolix" target tokenizations such as that given by the Czech corpus, which represents e.g. adjacent single quote characters (") as separate tokens, as well as "terse" tokenizations such as that of the English corpus, which conflates e.g. genitive apostrophe-s markers ('s) into single tokens. While it is almost certainly true that better results for Punkt than those presented in Table 6 could be attained by using additional language-specific heuristics for tokenization, we consider it a major advantage of our approach that it does not require such fine-tuning, but rather is able to learn the "correct" word-level tokenization from appropriate training data. Although the Punkt system was not intended to be an all-purpose word-boundary detector, it was specifically designed to make reliable decisions regarding the status of word boundaries involving the dot character, in particular abbreviations (e.g. "etc.", "Inc.") and ordinals ("24."). Restricting the evaluation to dot-terminated words containing at least one non-punctuation character produces the data in Table 7. Here again, the HMM tokenizer substantially outperformed Punkt for all languages and all evaluation modes except for precision on the German corpus (98.13% for the HMM tokenizer vs. 99.17% for Punkt), incurring on average 62.1% fewer errors than Punkt (σ = 15.5%).

Experiment 2: HMM Training Corpus Size
It was mentioned above that our approach relies on supervised training from a pre-segmented corpus to estimate the model parameters used for runtime boundary placement prediction. Especially in light of the relatively high error rates observed for the smallest test corpus (Italian), this requirement raises the question of how much training material is in fact necessary to ensure adequate runtime performance of our model. To address such concerns, we varied the amount of training data used to estimate the HMM's parameters between 10,000 and 100,000 tokens, using cross-validation to compute averages for each training-size condition. Results for this experiment are given in Figure 2.
All tested languages showed a typical logarithmic learning curve for both sentence- and word-boundary detection, and word boundaries were learned more quickly in all cases. This should come as no surprise, since any non-trivial corpus will contain more word boundaries than sentence boundaries, and thus provide more training data for detection of the former. Sentence boundaries were hardest to detect in the German corpus, presumably due to the relatively high frequency of punctuation-free sentence boundaries in the TIGER corpus, in which over 10% of the sentence boundaries were not immediately preceded by a punctuation character, versus only 1% on average for the other corpora (σ = 0.78%). English and French were the most difficult corpora in terms of word boundary detection, most likely due to apostrophe-related phenomena including the English genitive marker 's and the contracted French article l'.

Experiment 3: Typographical Conventions
Despite the lack of typographical clues, the HMM tokenizer was able to successfully detect over 3300 of the unpunctuated sentence boundaries in the German TIGER corpus (pr = 94.1%, rc = 60.8%). While there is certainly room for improvement, the fact that such a simple model can perform so well in the absence of explicit sentence boundary markers is encouraging, especially in light of our intent to detect sentence boundaries in non-standard computer-mediated communication text, in which typographical markers are also frequently omitted. In order to get a clearer idea of the effect of typographical conventions on sentence boundary detection, we compared the HMM tokenizer's performance on the German TIGER corpus with and without both punctuation (±Punct) and letter-case (±Case), using cross-validation to train and test on appropriate data, with the results given in Table 8.
As hypothesized, both letter-case and punctuation provide useful information for sentence boundary detection: the model performed best on the original corpus, retaining all punctuation and letter-case distinctions. Also unsurprisingly, punctuation was a more useful feature than letter-case for German sentence boundary detection, the [−Case, +Punct] variant achieving a harmonic precision-recall average of F = 92.23%. Even letter-case distinctions with no punctuation at all sufficed to identify about two thirds of the sentence boundaries with over 95% precision, however; this modest success is attributable primarily to the observations' stop features, since upper-cased sentence-initial stopwords are quite frequent and almost always indicate a preceding sentence boundary.
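The corpus variants for this experiment can be derived by simple text transformations. The sketch below is an assumption about how the ±Punct and ±Case conditions might be produced; it uses the ASCII punctuation set, whereas the actual experiment would also need to handle language-specific punctuation such as German quotation marks.

```python
import string

def strip_punct(text):
    """-Punct condition: delete (ASCII) punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))

def strip_case(text):
    """-Case condition: discard letter-case distinctions by lower-casing."""
    return text.lower()
```

The [−Case, −Punct] condition is then simply the composition of the two transformations applied to the training and test portions alike.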

Experiment 4: Chat Tokenization
We now turn our attention to the task of segmenting a corpus of computer-mediated communication, namely the chat corpus subset described in Sec. 3.1. Unlike the newspaper corpora used in the previous experiments, chat data are characterized by non-standard use of letter-case and punctuation: almost 37% of the sentences in the chat corpus were not terminated by a punctuation character, and almost 70% were not introduced by an upper-case letter. The chat data were subdivided into 10,289 distinct observable utterance-like units we refer to as postings; of these, 9479 (92.1%) coincided with sentence boundaries, accounting for 83% of the sentence boundaries in the whole chat corpus. We measured the performance of the following five distinct HMM tokenizer models on sentence- and word-boundary detection for the chat corpus:

• hmm: the standard model as described in Section 2, trained on a disjoint subset of the chat corpus and evaluated by cross-validation;

• hmm[+force]: the standard model with a supplemental heuristic forcing insertion of a sentence boundary at every posting boundary;

• hmm[+feat]: an extended model using all features described in Section 2.2.1 together with additional binary contextual surface features bou and eou, encoding whether or not the corresponding segment occurs at the beginning or end of an individual posting, respectively;

• tiger: the standard model trained on the entire TIGER newspaper corpus; and

• tiger[+force]: the standard tiger model with the supplemental [+force] heuristic for sentence boundary insertion at every posting boundary.
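The two posting-aware model variants can be sketched compactly. This is a minimal illustration only: the function names and the representation of boundaries as sorted offsets are our own, not part of the paper's implementation; only the feature names bou and eou come from the model description above.

```python
def force_posting_boundaries(predicted_eos, posting_boundaries):
    """[+force] heuristic: force a sentence boundary at every posting
    boundary, keeping all boundaries the model already predicted."""
    return sorted(set(predicted_eos) | set(posting_boundaries))

def posting_features(seg_index, n_segments):
    """bou/eou surface features for hmm[+feat]: binary flags for whether
    a segment begins or ends its posting (n_segments segments total)."""
    return {"bou": seg_index == 0, "eou": seg_index == n_segments - 1}
```

Note the difference in kind: [+force] is a hard post-processing rule applied after decoding, while bou/eou merely extend the observation alphabet, leaving the final boundary decision to the HMM.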
Results for chat corpus sentence boundary detection are given in Table 9, and for word boundaries in Table 10. From these data, it is immediately clear that the standard model trained on conventional newspaper text (tiger) does not provide a satisfactory segmentation of the chat data on its own, incurring almost twice as many errors as the standard model trained by cross-validation on chat data (hmm). This supports our claim that chat data represent unconventional and non-standard uses of model-relevant features, in particular punctuation and capitalization. Otherwise, differences between the various cross-validation conditions hmm, hmm[+force], and hmm[+feat] with respect to word-boundary placement were minimal.
Sentence-boundary detection performance for the standard model (hmm) was similar to that observed in Section 3.5 for newspaper text with letter-case but without punctuation, EOS recall in particular remaining unsatisfactory at under 62%. Use of the supplemental [+force] heuristic to predict sentence boundaries at all posting boundaries raised recall for the newspaper model (tiger[+force]) to over 78%, and for the cross-validation model (hmm[+force]) to almost 97%. The most balanced performance, however, was displayed by the extended model hmm[+feat], which uses surface features to represent the presence of posting boundaries: although its error rate was still quite high at almost 14%, the small size of the training subset compared to those used for the newspaper corpora in Section 3.3 leaves some hope for improvement as more training data become available, given the typical learning curves from Figure 2.

Conclusion
We have presented a new method for detecting sentence and word token boundaries in running text by coupling a "prolix" rule-based scanner stage with a Hidden Markov Model over scan-segment feature bundles, using hidden binary features bow, bos, and eos to represent the presence or absence of the corresponding boundaries. Language-specific features were limited to an optional set of user-specified stopwords, while the remaining observable surface features were used to represent basic typographical class, letter-case, word length, and leading whitespace. We compared our approach to the high-quality sentence boundary detector Punkt described by Kiss and Strunk (2006) on newspaper corpora from five different European languages, and found that the HMM boundary detector not only substantially outperformed Punkt for all languages in detection of both sentence- and word-boundaries, but even outdid Punkt on its "home ground" of dot-terminated words and sentences, providing average relative error reduction rates of 62% and 33%, respectively. Our technique exhibited a typical logarithmic learning curve and was shown to adapt fairly well to varying typographical conventions given appropriate training data.
A small corpus of computer-mediated communication extracted from the Dortmund Chat Corpus (Beißwenger and Storrer, 2008) and manually segmented was introduced and shown to violate some typographical conventions commonly used for sentence boundary detection. Although the unmodified HMM boundary detector did not perform as well as hoped on these data, the inclusion of additional surface features sensitive to observable posting boundaries sufficed to achieve a harmonic precision-recall average F of over 92%, representing a relative error reduction of over 82% with respect to the standard model trained on newspaper text, and of over 38% with respect to a naïve domain-specific splitting strategy.

Figure 1: Different levels of representation for the German text fragment ". . . 2,5 Mill. Eur. Durch die . . ." depicted as a tree. Nodes of depth 0, 1, and 2 correspond to the levels of sentences, tokens, and segments, respectively. The feature values used by the HMM boundary detector are given below the corresponding segments. Boldface values of "hidden" features on a green background indicate the correct assignment, while gray values on a white background indicate possible but incorrect assignments.
Following the usual conventions, tp = |B_relevant ∩ B_retrieved| represents the number of true positive boundaries predicted by the tokenizer, fp = |B_retrieved \ B_relevant| the number of false positives, and fn = |B_relevant \ B_retrieved| the number of false negatives.
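These boundary-set definitions translate directly into code. A minimal sketch, assuming boundaries are represented as sets of character (or segment) offsets; the function name is our own:

```python
def boundary_prf(relevant, retrieved):
    """Precision, recall, and harmonic mean F for boundary detection,
    given the gold (relevant) and predicted (retrieved) boundary sets."""
    relevant, retrieved = set(relevant), set(retrieved)
    tp = len(relevant & retrieved)   # true positives
    fp = len(retrieved - relevant)   # false positives
    fn = len(relevant - retrieved)   # false negatives
    pr = tp / (tp + fp) if tp + fp else 0.0
    rc = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
    return pr, rc, f
```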

Table 1: Features used by tokenizer model.

Table 2: Typographical classes used by tokenizer model.

Table 3: Corpora used for training and evaluation.

Table 5: Overall performance on sentence boundary detection.

Table 7: Performance on word boundary detection for dot-final words.

Table 8: Effect of typographical conventions on sentence detection for the TIGER corpus (de).
Figure 2: Effect of training corpus size on sentence boundary detection (top) and word boundary detection (bottom).

Table 9: Effect of training source on sentence boundary detection for the chat corpus.

Table 10: Effect of training source on word boundary detection for the chat corpus.