More Than Words: Using Token Context to Improve Canonicalization of Historical German

Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system requiring reference to a static lexicon indexed by orthographic form. Canonicalization approaches seek to address these issues by associating one or more extant “canonical cognates” with each word of the input text and deferring application analysis to these canonical forms. Type-wise conflation techniques treating each input word in isolation often suffer from a pronounced precision–recall trade-off pattern: high-precision techniques such as conservative transliteration have comparatively poor recall, whereas high-recall techniques such as phonetic conflation tend to be disappointingly imprecise. In this paper, we present a technique for disambiguation of type conflation sets at the token level using a Hidden Markov Model whose lexical probability matrix is dynamically computed from the candidate conflations, and evaluate its performance on a manually annotated corpus of historical German.

Traditional approaches to the problems arising from an attempt to incorporate historical text into such a system rely on the use of additional specialized (often application-specific) lexical resources to explicitly encode known historical variants. Such specialized lexica are not only costly and time-consuming to create, but also necessarily incomplete in the case of a morphologically productive language like German, since a simple finite lexicon cannot account for highly productive morphological processes such as nominal composition.
To facilitate the extension of synchronically-oriented natural language processing techniques to historical text while minimizing the need for specialized lexical resources, we may first attempt an automatic canonicalization of the input text. Canonicalization approaches [Jurish, 2008, 2010a] treat orthographic variation phenomena in historical text as instances of an error-correction problem, seeking to map each (unknown) word of the input text to one or more extant canonical cognates: synchronically active types which preserve both the root and morphosyntactic features of the associated historical form(s). To the extent that the canonicalization was successful, application-specific processing can then proceed normally using the returned canonical forms as input, without any need for additional modifications to the application lexicon.
We distinguish between type-wise canonicalization techniques which process each input word independently and token-wise techniques which make use of the context in which a given instance of a word occurs. In this paper, we present a token-wise canonicalization method which functions as a disambiguator for sets of hypothesized canonical forms as returned by one or more subordinated type-wise techniques. Section 2 provides a brief review of the type-wise canonicalizers used to generate hypotheses, while Section 3 is dedicated to the formal characterization of the disambiguator itself. Section 4 contains a quantitative evaluation of the disambiguator's performance on an information retrieval task over a manually annotated corpus of historical German. Finally, Section 5 provides a brief summary and conclusion.

Type-wise Conflation
Type-wise canonicalization techniques are those which process each input word in isolation, independently of its surrounding context. Such a type-wise treatment allows efficient processing of large documents and corpora (since each input type need only be processed once), but disregards potentially useful context information. Formally, a type-wise canonicalization method $r$ is fully specified by a characteristic conflation relation $\sim_r$, a binary relation on the set $\mathcal{A}^*$ of all strings over the finite grapheme alphabet $\mathcal{A}$. Prototypically, $\sim_r$ will be a true equivalence relation, inducing a partitioning of the set $\mathcal{A}^*$ of possible word types into equivalence classes or "conflation sets" $[w]_r = \{v \in \mathcal{A}^* : v \sim_r w\}$. In the sequel, we will use the term "conflation" as synonymous with "type-wise canonicalization", and "conflator" to refer to a specific type-wise canonicalization method.

String Identity
The simplest of all possible conflators is simple identity of surface strings. The conflation relation $\sim_{id}$ is in this case nothing more or less than the string identity relation $=$ itself:
$$w \sim_{id} v \;:\Leftrightarrow\; w = v \tag{1}$$
While string identity is the easiest conflator to implement (no additional programming effort or resources are required) and provides a high degree of precision, it cannot account for any graphematic variation at all, resulting in very poor recall. Nonetheless, its inclusion as a conflator ensures that the set of candidate hypotheses $[w]$ for a given input word $w$ is non-empty, and it provides a baseline with respect to which the relative utility of more sophisticated conflators can be evaluated.

Transliteration
A slightly less naïve family of conflation methods comprises those which employ a simple deterministic transliteration function to replace input characters which do not occur in contemporary orthography with extant equivalents. Formally, a transliteration conflator is defined in terms of a string transliteration function $xlit : \mathcal{A}^* \to A^*$, where $\mathcal{A}$ is as before a "universal" grapheme alphabet (e.g. the set of all Unicode characters) and $A \subset \mathcal{A}$ is that subset of the universal alphabet allowed by contemporary orthographic conventions:
$$w \sim_{xlit} v \;:\Leftrightarrow\; xlit(w) = xlit(v) \tag{2}$$
In the case of historical German, deterministic transliteration is especially useful for its ability to account for typographical phenomena, e.g. by mapping 'ſ' (long 's', as commonly appeared in texts typeset in fraktur) to a conventional round 's', and mapping superscript 'e' to the conventional umlaut diacritic '¨', as in the transliteration Abſtaͤnde → Abstände ("distances"). For the current work, we used a conservative transliteration function based on the Text::Unidecode Perl module. Although it rivals raw string identity in terms of its precision, such a conservative transliteration suffers from its inability to account for graphematic variation phenomena involving extant characters, such as the th/t and ey/ei alternations common in historical German.
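The transliteration conflator can be illustrated with a minimal Python sketch. It does not reproduce Text::Unidecode: the two-entry mapping table `XLIT_TABLE` and the helper names `xlit` and `conflated_xlit` are assumptions made purely for illustration, covering only the long 's' and superscript 'e' cases mentioned above.

```python
import unicodedata

# Hypothetical, deliberately tiny mapping table (an assumption of this sketch);
# a real transliterator covers far more characters.
XLIT_TABLE = {
    "\u017f": "s",       # long s -> round s
    "\u0364": "\u0308",  # combining superscript e -> combining diaeresis
}

def xlit(word: str) -> str:
    """Deterministically map historical characters to extant equivalents."""
    decomposed = unicodedata.normalize("NFD", word)
    mapped = "".join(XLIT_TABLE.get(ch, ch) for ch in decomposed)
    # Recompose so that 'a' + combining diaeresis becomes the single letter 'ä'.
    return unicodedata.normalize("NFC", mapped)

def conflated_xlit(w: str, v: str) -> bool:
    """w ~xlit v holds iff both words share the same transliteration."""
    return xlit(w) == xlit(v)
```

Under these assumptions, `xlit("Abſtaͤnde")` yields "Abstände", whereas a pair such as "Theil" and "Teil" is not conflated, since both consist of extant characters only.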

Phonetization
A more powerful family of conflation methods is based on the dual intuitions that graphemic forms in historical text were constructed to reflect phonetic forms, and that the phonetic system of the target language is diachronically more stable than its graphematic system. Phonetic conflators map each (historical or extant) word $w \in \mathcal{A}^*$ to a unique phonetic form $pho(w)$ by means of a computable function $pho : \mathcal{A}^* \to P^*$ into strings over a phonetic alphabet $P$, conflating those strings which share a common phonetic form:
$$w \sim_{pho} v \;:\Leftrightarrow\; pho(w) = pho(v) \tag{3}$$
Note that $[w]_{pho}$ may be infinite, if for example $pho(\cdot)$ maps any substring of one or more instances of a single character (e.g. 'a') to a single phon (e.g. /a/). It is useful in such cases to consider the restriction of the conflation set $[w]_{pho}$ to a finite set of target strings $S \subset \mathcal{A}^*$. We add the superscript "$S$" to the equivalence class to indicate such a restriction, $[w]^S_{pho} = [w]_{pho} \cap S$. The phonetic conversion module used here was adapted from the phonetization rule-set distributed with the IMS German Festival package [Möhler et al., 2001], a German language module for the Festival text-to-speech system [Black and Taylor, 1997]. Phonetic conflation offers a substantial improvement in recall over conservative methods such as transliteration or string identity. Unfortunately, these improvements often come at the expense of precision.
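A restricted phonetic conflation set can be sketched in a few lines of Python. The rule list `TOY_RULES` is an invented toy example and is in no way the IMS German Festival rule-set; the function names `pho` and `conflation_set` are likewise assumptions of this sketch.

```python
import re

# Toy rewrite rules, purely illustrative (NOT the IMS German Festival rules).
TOY_RULES = [("th", "t"), ("ey", "ei"), ("ay", "ai"), ("y", "i")]

def pho(word: str) -> str:
    """Map a word to a crude pseudo-phonetic key."""
    form = word.lower()
    for old, new in TOY_RULES:
        form = form.replace(old, new)
    # Collapse runs of a repeated character, e.g. "aa" -> "a".
    return re.sub(r"(.)\1+", r"\1", form)

def conflation_set(word: str, targets: set) -> set:
    """Restriction of the phonetic conflation set of `word` to `targets`."""
    key = pho(word)
    return {v for v in targets if pho(v) == key}
```

With these toy rules, "Theyl" is conflated with "Teil", but "Stat" is conflated with both "Staat" and "statt", which illustrates how the recall gained by phonetic identity is paid for in precision.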

Rewrite Transduction
Despite its comparatively high recall, the phonetic conflator fails to relate unknown historical forms with any extant equivalent whenever the graphematic variation leads to non-identity of the respective phonetic forms, suggesting that recall might be further improved by relaxing the strict identity criterion on the right hand side of Equation (3). Moreover, a fine-grained and appropriately parameterized conflator should be less susceptible to precision errors than an "all-or-nothing" (phonetic) identity condition. A technique which fulfills both of the above desiderata is rewrite transduction, which can be understood as a generalization of the well-known string edit distance [Levenshtein, 1966].
Formally, let $Lex \subseteq \mathcal{A}^*$ be the (possibly infinite) lexicon of all extant forms, and let $\Delta_{rw}$ be a weighted finite-state transducer over a bounded semiring $K$ which models (potential) diachronic change likelihood as a weighted rational relation. Then define for every input type $w \in \mathcal{A}^*$ the "best" extant equivalent $best_{rw}(w)$ as the unique extant type $v \in Lex$ with minimal edit-distance to the input word:
$$best_{rw}(w) = \arg\min_{v \in Lex} \Delta_{rw}(w, v) \tag{4}$$
Ideally, the image of a word $w$ under $best_{rw}$ will itself be the canonical cognate sought, leading to conflation of all strings which share a common image under $best_{rw}$:
$$w \sim_{rw} v \;:\Leftrightarrow\; best_{rw}(w) = best_{rw}(v) \tag{5}$$
For the current experiments, we used the heuristic rewrite transducer described in Jurish [2010a], compiled from 306 manually constructed SPE-style two-level rules, while the target lexicon $Lex$ was extracted from the TAGH morphology transducer [Geyken and Hanneforth, 2006]. Best-path lookup was performed using a specialized variant of the well-known Dijkstra algorithm [Dijkstra, 1959] as described in Jurish [2010b]. Although this rewrite cascade does indeed improve both precision and recall with respect to the phonetic conflator, these improvements are of comparatively small magnitude, with precision in particular remaining well below the level of conservative conflators such as naïve string identity or transliteration.
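The idea of selecting a minimal-cost extant form can be sketched in Python, with two loud simplifications: a plain weighted Levenshtein distance is swapped in for the weighted finite-state rewrite transducer, and the lexicon is a small finite list. The function names and the optional `sub_cost` table are assumptions of this sketch.

```python
def edit_cost(src: str, dst: str, sub_cost=None) -> float:
    """Weighted Levenshtein distance via dynamic programming.

    sub_cost optionally maps (source_char, target_char) to a substitution
    cost; insertions, deletions and unlisted substitutions cost 1.0.
    """
    sub_cost = sub_cost or {}
    prev = [float(j) for j in range(len(dst) + 1)]
    for i, a in enumerate(src, start=1):
        curr = [float(i)]
        for j, b in enumerate(dst, start=1):
            sub = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            curr.append(min(prev[j] + 1.0,        # delete a
                            curr[j - 1] + 1.0,    # insert b
                            prev[j - 1] + sub))   # substitute a -> b
        prev = curr
    return prev[-1]

def best_rw(word: str, lexicon, sub_cost=None) -> str:
    """Return the lexicon entry with minimal cost; ties go to the
    lexicographically smallest entry so that the result is unique."""
    return min(sorted(lexicon), key=lambda v: edit_cost(word, v, sub_cost))
```

For example, `best_rw("Theil", ["Heil", "Seil", "Teil"])` selects "Teil", which is one deletion away from the input. Lowering the cost of selected substitutions through `sub_cost` plays the role that rule weights play in a real rewrite transducer.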

Token-wise Disambiguation
In an effort to recover some degree of the precision offered by conservative conflation techniques such as transliteration while still benefiting from the flexibility and improved recall provided by more ambitious techniques such as phonetization or rewrite transduction, we have developed a method for disambiguating type-wise conflation sets which operates on the token level, using sentential context to determine a unique "best" canonical form for each input token. Specifically, the disambiguator employs a Hidden Markov Model (HMM) whose lexical probability matrix is dynamically re-computed for each input sentence from the conflation sets returned by one or more subordinated type-wise conflators, and whose transition probabilities are given by a static word n-gram model of the target language, i.e. present-day German adhering to current orthographic conventions.

Basic Model
Formally, let $W \subset \mathcal{A}^*$ be a finite set of known extant words, let $u \in W$ be a designated symbol representing an unknown word, let $S = \langle w_1, \ldots, w_{n_S} \rangle$ be an input sentence of $n_S$ (historical) words with $w_i \in \mathcal{A}^*$ for $1 \le i \le n_S$, and let $R = \{r_1, \ldots, r_{n_R}\}$ be a finite set of (opaque) type-wise conflators. Then, the disambiguator HMM is defined in the usual way [Rabiner, 1989, Charniak et al., 1993, Manning and Schütze, 1999] over a finite set $Q$ of model states, each a pair $\langle \tilde{w}, r \rangle$ of an extant form and a conflator, and the observations of the input sentence $S$, where:
$\Pi : Q \to [0, 1] : q \mapsto p(Q_1 = q)$ is a static probability distribution over $Q$ representing the model's initial state probabilities;
$A$ is a static conditional probability distribution over $Q$ $k$-grams representing the model's state transition probabilities; and
$B_S$ is a dynamic probability distribution over observations conditioned on states representing the model's lexical probabilities.

Transition Probabilities
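The finite target lexicon $W$ can easily be extracted from a corpus of contemporary text. For estimating the static distributions $\Pi$ and $A$, we first make the following assumptions:
$$p(Q = \langle \tilde{w}, r \rangle) = p(W = \tilde{w}) \, p(R = r) \tag{6}$$
$$p(R = r) = \frac{1}{n_R} \tag{7}$$
Equation (6) asserts the independence of extant forms and conflators, while Equation (7) assumes a uniform distribution over conflators. Given these assumptions, the static state distributions $\Pi$ and $A$ can be estimated as:
$$\Pi(\langle \tilde{w}, r \rangle) \approx \frac{1}{n_R} \, p(W_1 = \tilde{w}) \tag{8}$$
$$A(\langle \tilde{w}_1, r_1 \rangle, \ldots, \langle \tilde{w}_k, r_k \rangle) \approx \frac{1}{n_R} \, p(W_i = \tilde{w}_k \mid W_{i-k+1} = \tilde{w}_1, \ldots, W_{i-1} = \tilde{w}_{k-1}) \tag{9}$$
Equations (8) and (9) are nothing more or less than a word $k$-gram model over extant forms, scaled by the constant $\frac{1}{n_R}$. We can therefore use standard maximum likelihood techniques to estimate $\Pi$ and $A$ from a corpus of contemporary text [Bahl et al., 1983, Manning and Schütze, 1999].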
For the current experiments, we trained a word trigram model ($k = 3$) on the TIGER corpus of contemporary German [Brants et al., 2002]. Probabilities for the "unknown" form $u$ were computed using the simple smoothing technique of assigning $u$ a pseudo-frequency of $\frac{1}{2}$ [Lidstone, 1920, Manning and Schütze, 1999]. To account for unseen trigrams, the resulting trigram model was smoothed by linear interpolation of uni-, bi-, and trigrams [Jelinek and Mercer, 1980, 1985], using the method described by Brants [2000] to estimate the interpolation coefficients.
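The smoothing scheme can be illustrated with a minimal Python sketch of a linearly interpolated trigram model with a pseudo-frequency of 1/2 for the unknown form. The class name, the padding symbols and the fixed coefficients `lambdas` are assumptions of this sketch; in particular, the coefficients are hard-coded here rather than estimated by the method of Brants [2000].

```python
from collections import Counter

UNK = "<unk>"  # stands in for the designated unknown form u

class InterpolatedTrigram:
    def __init__(self, sentences, lambdas=(0.1, 0.3, 0.6)):
        # lambdas are hypothetical fixed interpolation coefficients.
        self.l1, self.l2, self.l3 = lambdas
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        for sent in sentences:
            toks = ["<s>", "<s>"] + list(sent) + ["</s>"]
            for i, tok in enumerate(toks):
                self.uni[tok] += 1
                if i >= 1:
                    self.bi[(toks[i - 1], tok)] += 1
                if i >= 2:
                    self.tri[(toks[i - 2], toks[i - 1], tok)] += 1
        self.uni[UNK] = 0.5  # pseudo-frequency for the unknown form
        self.total = sum(c for t, c in self.uni.items() if t != "<s>")

    def _norm(self, tok):
        """Map out-of-vocabulary tokens to the unknown form."""
        return tok if tok in self.uni else UNK

    def prob(self, w, u="<s>", v="<s>"):
        """Interpolated estimate of p(w | u, v), with u preceding v."""
        u, v, w = self._norm(u), self._norm(v), self._norm(w)
        p1 = self.uni[w] / self.total
        p2 = self.bi[(v, w)] / self.uni[v] if self.uni[v] else 0.0
        p3 = self.tri[(u, v, w)] / self.bi[(u, v)] if self.bi[(u, v)] else 0.0
        return self.l1 * p1 + self.l2 * p2 + self.l3 * p3
```

Because the unigram component never vanishes, every extant form, including the unknown form, receives a non-zero transition probability even in a context never observed in the training corpus.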

Lexical Probabilities
In the absence of a representative corpus of conflator-specific manually annotated training data, we cannot use maximum likelihood techniques to estimate the model's lexical probabilities $B_S$. Instead, lexical probabilities are instantiated as a Maxwell-Boltzmann distribution:
$$B_S(\langle \tilde{w}, r \rangle, w) = \frac{b^{\beta d_r(w, \tilde{w})}}{\sum_{r' \in R} \sum_{\tilde{w}' \in [w]_{r'}} b^{\beta d_{r'}(w, \tilde{w}')}} \tag{10}$$
Here, $b, \beta \in \mathbb{R}$ are free model parameters with $\beta < 0 < b$, and for a conflator $r \in R$, the function $d_r : \mathcal{A}^* \times W \to \mathbb{R}^+$ is a pseudo-metric used to estimate the reliability of the conflator's association of an input word $w$ with the extant form $\tilde{w}$. It should be explicitly noted that the denominator of the right-hand side of Equation (10) is a sum over all model states (canonicalization hypotheses) $\langle \tilde{w}', r' \rangle$ actually associated with the observation argument $w$ by the type-wise conflation stage, and not a sum over observations $w'$ associable with the state argument $\langle \tilde{w}, r \rangle$. This latter sum (if it could be computed) would adhere to the traditional form $sim(o, q) / \sum_{o'} sim(o', q)$ for estimating a probability distribution $p(O|Q)$ over observations conditioned on model states such as the HMM lexical probability matrix $B_S$ is defined to represent; whereas the estimator in Equation (10) is of the form $sim(o, q) / \sum_{q'} sim(o, q')$, which corresponds more closely to a distribution $p(Q|O)$ over states conditioned on observations. From a practical standpoint, it should be clear that Equation (10) is much more efficient to compute than an estimator summing globally over potential observations, since all the data needed to compute Equation (10) are provided by the type-wise preprocessing of the input sentence $S$ itself, whereas a theoretically pure global estimator would require a whole arsenal of inverse conflators as well as a mechanism for restricting their outputs to some tractable set of admissible historical forms, and hence would be of little practical use.
From a formal standpoint, we believe that our estimator as used in the run-time disambiguator can be shown to be equivalent to a global estimator, provided that the conflator pseudo-metrics $d_r$ are symmetric and the languages of both historical and extant forms are uniformly dense, but a proof of this conjecture is beyond the scope of the current work.
It was noted above in Section 2.3 that for the phonetic conflator in particular, the equivalence class $[w]_{pho} = \{v \in \mathcal{A}^* : w \sim_{pho} v\}$ may not be finite. In order to ensure the computational tractability of Equation (10), therefore, the phonetic conflations considered were implicitly restricted to the finite set $W$ of known extant forms used to define the model's states, $[w]^W_{pho}$. Transliterations and rewrite targets which were not also known extant forms were implicitly mapped to the designated symbol $u$ for purposes of estimating transition probabilities for previously unseen extant word types.
For the current experiments, we used the following model parameters: In all other cases, $d_r(w, \tilde{w})$ is undefined. Note that all conflator distance functions are scaled by inverse input word length $\frac{1}{|w|}$. Defining distance functions in terms of (inverse) word length in this manner captures the intuition that a conflator is less likely to discover a false positive conflation for a longer input word than for a short one, since natural language lexica tend to be maximally dense for short (usually closed-class) words. The distance functions for the transliteration and phonetic conflators are constant given input word length, whereas that of the rewrite conflator makes use of the cost $\Delta_{rw}(w, \tilde{w})$ assigned to the conflation pair by the rewrite FST itself.
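The dynamic computation of lexical probabilities can be sketched in Python as a normalization over the hypotheses actually returned for one input word. All concrete values below are placeholders and not the parameters used in the experiments: the defaults `b=2.0` and `beta=-1.0`, the constants `c_xlit` and `c_pho`, and the callback `rewrite_cost` are assumptions of this sketch.

```python
def make_distances(rewrite_cost, c_xlit=1.0, c_pho=2.0):
    """Build distance functions d_r scaled by inverse input word length.

    c_xlit and c_pho are placeholder constants (assumptions of this sketch);
    rewrite_cost(w, v) stands in for the cost assigned by the rewrite FST.
    """
    return {
        "xlit": lambda w, v: c_xlit / len(w),
        "pho": lambda w, v: c_pho / len(w),
        "rw": lambda w, v: rewrite_cost(w, v) / len(w),
    }

def lexical_probs(word, hypotheses, distances, b=2.0, beta=-1.0):
    """Normalize b ** (beta * d_r(word, v)) over all (v, r) hypotheses.

    hypotheses: iterable of (extant_form, conflator_name) pairs returned
    for `word` by the type-wise conflation stage.
    """
    weights = {(v, r): b ** (beta * distances[r](word, v)) for v, r in hypotheses}
    total = sum(weights.values())
    return {state: weight / total for state, weight in weights.items()}
```

With `b > 1` and `beta < 0`, a hypothesis with a larger distance receives a smaller share of the probability mass, and since every distance shrinks with input word length, the distribution flattens for longer words.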

Runtime Disambiguation
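Having defined the disambiguator model, we can use it to determine a unique "best" canonical form for each token of an input sentence $S$ by applying the well-known Viterbi algorithm [Viterbi, 1967]. Formally, the Viterbi algorithm computes the state path with maximal probability given the observed sentence:
$$\hat{Q} = \langle \hat{q}_1, \ldots, \hat{q}_{n_S} \rangle = \arg\max_{\langle q_1, \ldots, q_{n_S} \rangle \in Q^{n_S}} p(q_1, \ldots, q_{n_S} \mid S)$$
Finally, extracting the disambiguated canonical forms from the state sequence $\hat{Q}$ returned by the Viterbi algorithm is a trivial matter of projecting the extant form components of the HMM state structures: writing $\hat{q}_i = \langle \tilde{w}_i, r_i \rangle$, the canonical form assigned to the input token $w_i$ is $\tilde{w}_i$.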
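The decoding step can be illustrated with a compact Python sketch of the Viterbi algorithm over per-token candidate sets. For brevity it uses first-order (bigram) transitions rather than the trigram model described above, and it assumes a non-empty sentence with strictly positive probabilities; the function names and the callback signatures `trans` and `init` are assumptions of this sketch.

```python
import math

def viterbi(candidates, trans, init):
    """Return the most probable state path.

    candidates: list (one entry per token) of dicts mapping each state
        (extant_form, conflator) to its lexical probability.
    trans(prev_state, state), init(state): positive probabilities.
    """
    # best[s] = (log-probability of the best path ending in s, that path)
    best = {s: (math.log(init(s)) + math.log(p), [s])
            for s, p in candidates[0].items()}
    for column in candidates[1:]:
        nxt = {}
        for s, p in column.items():
            score, path = max(
                ((prev_score + math.log(trans(prev, s)), prev_path)
                 for prev, (prev_score, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            nxt[s] = (score + math.log(p), path + [s])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]

def canonical_forms(path):
    """Project the extant form component of each state."""
    return [extant for extant, _conflator in path]
```

Given two equally weighted phonetic hypotheses "Staat" and "statt" for a token following "der", a transition model that prefers "der Staat" resolves the ambiguity in favour of "Staat", which is exactly the kind of context-based choice a type-wise conflator cannot make.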

Test Corpus
The conflation and disambiguation techniques described above were tested on a manually annotated corpus of historical German. The test corpus comprised the full body text from 13 volumes published between 1780 and 1880, and contained 152,776 tokens of 17,417 distinct types in 9,079 sentences, discounting non-alphabetic types such as punctuation. To assign an extant canonical equivalent to each token of the test corpus, the text of each volume was automatically aligned token-wise with a contemporary edition of the same volume. Automatically discovered non-identity alignment pair types were presented to a human annotator for confirmation. In a second annotation pass, all tokens lacking an identical or manually confirmed alignment target were inspected in context and manually assigned a canonical form. Whenever they were presented to a user, proper names and extinct lexemes were treated as their own canonical forms. In all other cases, equivalence was determined by direct etymological relation of the root in addition to matching morphosyntactic features. Problematic tokens were marked as such and subjected to expert review. Marginalia, front and back matter, speaker and stage directions, and tokenization errors were excluded from the final evaluation corpus.

Evaluation Measures
The canonicalization methods from Sections 2 and 3 were evaluated using the gold-standard test corpus to simulate a document indexing and query scenario.
Formally, let $C = \{c_1, \ldots, c_{n_C}\}$ be a finite set of canonicalizers, and let $G = \langle S_1, \ldots, S_{n_G} \rangle$ represent the sentences of the test corpus, where each sentence $S_i = \langle g_{i;1}, \ldots, g_{i;n_{S_i}} \rangle$ is a string of token-tuples
$$g_{i;j} = \langle w_{i;j}, \tilde{w}_{i;j}, [w_{i;j}]_{c_1}, \ldots, [w_{i;j}]_{c_{n_C}} \rangle$$
Here, $w_{i;j}$ represents the literal token text as appearing in the historical corpus, $\tilde{w}_{i;j}$ is its gold-standard canonical cognate, and $[w_{i;j}]_{c_k}$ represents the set of canonical form(s) assigned to the token by the canonicalizer $c_k$. Let $Q = \bigcup_{i=1}^{n_G} \bigcup_{j=1}^{n_{S_i}} \{\tilde{w}_{i;j}\}$ be the set of all canonical cognates represented in the corpus, and define for each canonicalizer $c \in C$ and query string $q \in Q$ the sets $relevant(q), retrieved_c(q) \subseteq \mathbb{N}^2$ of relevant and retrieved corpus tokens as:
$$relevant(q) = \{(i, j) : q = \tilde{w}_{i;j}\} \tag{13}$$
$$retrieved_c(q) = \{(i, j) : q \in [w_{i;j}]_c\} \tag{14}$$
Token-wise precision and recall for the canonicalizer $c$ can then be defined as: Type-wise measures are defined analogously, by mapping the token index sets of Equations (13) and (14) to corpus types before applying Equations (15) and (16). We use the unweighted harmonic precision-recall average F [van Rijsbergen, 1979] as a composite measure for both type- and token-wise evaluation modes:
$$F(pr, rc) = \frac{2 \cdot pr \cdot rc}{pr + rc} \tag{17}$$
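The indexing and query scenario can be sketched in Python for a single canonicalizer. Two points are assumptions of this sketch rather than definitions from the text: corpus tokens are identified by a flat index instead of a sentence and position pair, and precision and recall are aggregated by summing set sizes over all query strings, which is one plausible reading only.

```python
def evaluate(tokens):
    """Compute (precision, recall, F) for one canonicalizer.

    tokens: list of (gold, hypotheses) pairs, one per corpus token, where
        gold is the gold-standard canonical cognate and hypotheses is the
        set of canonical forms assigned by the canonicalizer.
    """
    queries = {gold for gold, _ in tokens}
    n_relevant = n_retrieved = n_both = 0
    for q in queries:
        relevant = {i for i, (gold, _) in enumerate(tokens) if gold == q}
        retrieved = {i for i, (_, hyps) in enumerate(tokens) if q in hyps}
        n_relevant += len(relevant)
        n_retrieved += len(retrieved)
        n_both += len(relevant & retrieved)
    pr = n_both / n_retrieved if n_retrieved else 0.0
    rc = n_both / n_relevant if n_relevant else 0.0
    f = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
    return pr, rc, f
```

In this reading, a canonicalizer that returns a large hypothesis set for a token can only gain recall, while every superfluous hypothesis that happens to be a valid query string counts against precision.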

Results
Quantitative results for the canonicalization techniques described in Sections 2 and 3 with respect to the test corpus are given in Table 1. Immediately apparent from the data is the typical precision-recall trade-off pattern discussed above: conservative conflators such as string identity (id) and transliteration (xlit) have near-perfect precision (≥ 99% both type- and token-wise), but relatively poor recall. On the other hand, ambitious conflators such as phonetic identity (pho) or the heuristic rewrite transducer (rw) reduce type-wise recall errors by over 66% and token-wise recall errors by over 75% with respect to transliteration, but these recall gains come at the expense of precision. As hoped, the HMM disambiguator (hmm) presented in Section 3 does indeed recover a large degree of the precision lost by the ambitious type-wise conflators, achieving a reduction of over 41% in type-wise precision errors and over 94% in token-wise precision errors with respect to the heuristic rewrite conflator. While some additional recall errors are made by the HMM, there are comparatively few of these, so that the harmonic average F falls by a mere 3% with respect to the highest-recall method (rw). Indeed, the token-wise composite measure F is substantially higher for the HMM disambiguator (99.4%, versus 96.7% for the rewrite method), outperforming its closest competitor, deterministic transliteration (xlit), by over 64%.
The most surprising aspect of these results is the recall performance of the conservative transliterator xlit with $rc_{tok} = 96.8\%$. While such performance combined with the ease of implementation and computational efficiency of the transliteration method makes it very attractive at first glance, note that the test corpus was drawn from a comparatively recent text sample, and that a diachronically more heterogeneous corpus such as that described in Jurish [2010a] is likely to be less amenable to such simple techniques.

Conclusion
We have identified a typical precision-recall trade-off pattern exhibited by several type-wise conflation techniques used to automatically discover extant canonical forms for historical German text. Conservative conflators such as string identity and transliteration return very precise results, but suffer from comparatively poor recall. More ambitious techniques such as conflation by phonetic form or heuristic rewrite transduction show a marked improvement in recall, but disappointingly poor precision. To address these problems, we proposed a method for disambiguating type conflation sets at the token level using sentential context to optimize the path probability of canonical forms conditioned on observed historical forms. The disambiguator uses a Hidden Markov Model whose lexical probabilities are dynamically re-computed for every input sentence based on the conflation hypotheses returned by a set of subordinated type-wise conflators.
The proposed disambiguation architecture was evaluated on an information retrieval task over a gold standard corpus of manually confirmed canonicalizations of historical German text. Use of the token-wise disambiguator provided a precision error reduction of over 94% with respect to the best recall method, and a recall error reduction of over 71% with respect to the most precise method. Overall, the proposed disambiguation method performed best at the token level, achieving a token-wise F of 99.4%.
We are interested in verifying our results using larger and less homogeneous corpora than the test corpus used here, as well as extending the techniques described here to other languages and domains. In future work, we wish to implement and test a language-independent type-wise conflator such as that described by Kondrak [2000], and to systematically investigate the effects of the various disambiguator parameters as well as more sophisticated smoothing techniques for handling previously unseen extant types and sparse training data.