Optimizing the Training of Models for Automated Post-Correction of Arbitrary OCR-ed Historical Texts
DOI: https://doi.org/10.21248/jlcl.35.2022.232

Abstract
Systems for the post-correction of OCR results for historical texts are based on statistical correction models obtained by supervised learning. For training, suitable collections of ground truth materials are needed. In this paper we investigate how the effectiveness of automated OCR post-correction depends on the form of the ground truth data and on other training settings used to compute a post-correction model. The post-correction system A-PoCoTo considered here is based on a profiler service that computes a statistical profile for an OCR-ed input text. We also look in detail at the influence of the profiler resources and other settings selected for training and evaluation. As a practical result of several fine-tuning steps, we obtain a general post-correction model for which experiments on a large and heterogeneous collection of OCR-ed historical texts show a consistent improvement over base OCR accuracy. The results presented are meant to provide insights for libraries that want to apply OCR post-correction to a broader spectrum of distinct OCR-ed historical printings and ask for "representative" results.
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.