Discourse Segmentation of German Text with Pretrained Language Models
DOI: https://doi.org/10.21248/jlcl.39.2026.306

Keywords: discourse, segmentation, German, Large Language Models

Abstract
Segmenting text into so-called "elementary discourse units" (EDUs) is relevant for several NLP applications, including discourse parsing and argument mining. In recent years, EDU segmentation has been addressed as part of a shared task on multilingual discourse parsing ("DISRPT"), where BERT-based encoder models proved particularly successful. German has been represented in DISRPT by the Potsdam Commentary Corpus, but more German data with EDU segmentation has recently been published. In this paper, we conduct detailed tests on the German-language datasets that are currently available. We test a multilingual off-the-shelf model, several BERT-based encoders, and the current generation of LLMs. The results are analyzed both qualitatively and quantitatively and compared to the multilingual state of the art. We make the best-performing model available as a tool for the community.
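The abstract describes EDU segmentation with BERT-based encoders in the DISRPT setting, where the task is commonly cast as token-level boundary tagging. The sketch below illustrates that framing; it is a minimal, hedged example, not the paper's released tool. The checkpoint name (bert-base-german-cased) and the two-label scheme (O vs. B-EDU, marking tokens that open a new unit) are assumptions chosen for illustration, and the untrained classification head would need fine-tuning on EDU-annotated data before its predictions are meaningful.

```python
# Sketch: EDU segmentation as token-level boundary classification with a
# BERT-style encoder (DISRPT-style formulation). Illustrative only; the
# checkpoint and label set are assumptions, not details from the paper.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "bert-base-german-cased"  # assumption: any German BERT encoder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Two labels: 0 = "O" (inside a unit), 1 = "B-EDU" (token begins a new EDU).
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

sentence = "Die Regierung plant Reformen, weil der Druck wächst."
words = sentence.split()

# Tokenize with word alignment so subword predictions map back to words.
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # shape: (1, seq_len, 2)

pred = logits.argmax(-1)[0].tolist()
word_ids = enc.word_ids(0)

# Keep one prediction per word (first subword piece), as in BIO-style tagging.
seen = set()
for idx, wid in enumerate(word_ids):
    if wid is not None and wid not in seen:
        seen.add(wid)
        label = "B-EDU" if pred[idx] == 1 else "O"
        print(f"{words[wid]}\t{label}")
# Note: without fine-tuning, the head is randomly initialized and the
# printed labels are arbitrary; training on EDU boundaries is required.
```

After fine-tuning, each word predicted as B-EDU opens a new segment, so in the example above a trained model would ideally mark "Die" and the subordinating "weil" as unit-initial, splitting the sentence into two EDUs.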
License
Copyright (c) 2026 Steffen Frenzel, Maximilian Krupop, Manfred Stede

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.