Optimizing the Training of Models for Automated Post-Correction of Arbitrary OCR-ed Historical Texts

Authors

  • Tobias Englmeier
  • Florian Fink
  • Uwe Springmann
  • Klaus U. Schulz

All authors: CIS - Centrum für Informations- und Sprachverarbeitung, Ludwig-Maximilians-Universität München

DOI:

https://doi.org/10.21248/jlcl.35.2022.232

Abstract

Systems for the post-correction of OCR results for historical texts are based on statistical correction models obtained by supervised learning, which requires suitable collections of ground truth material for training. In this paper we investigate how the effectiveness of automated OCR post-correction depends on the form of the ground truth data and on other training settings used to compute a post-correction model. The post-correction system A-PoCoTo considered here is based on a profiler service that computes a statistical profile for an OCR-ed input text. We also examine in detail the influence of the profiler resources and of other settings selected for training and evaluation. As a practical result of several fine-tuning steps, we obtain a general post-correction model for which experiments on a large and heterogeneous collection of OCR-ed historical texts show a consistent improvement over the base OCR accuracy. The results presented are meant to provide insights for libraries that want to apply OCR post-correction to a broad spectrum of distinct OCR-ed historical printings and ask for "representative" results.

Published

2022-12-03

How to Cite

Englmeier, T., Fink, F., Springmann, U., & Schulz, K. U. (2022). Optimizing the Training of Models for Automated Post-Correction of Arbitrary OCR-ed Historical Texts. Journal for Language Technology and Computational Linguistics, 35(1), 1–27. https://doi.org/10.21248/jlcl.35.2022.232

Section

Research articles