Speaker attribution in German parliamentary debates with QLoRA-adapted large language models

The growing body of political texts opens up new opportunities for rich insights into political dynamics and ideologies but also increases the workload for manual analysis. Automated speaker attribution, which detects who said what to whom in a speech event and is closely related to semantic role labeling, is an important processing step for computational text analysis. We study the potential of the large language model family Llama 2 to automate speaker attribution in German parliamentary debates from 2017-2021. We fine-tune Llama 2 with QLoRA, an efficient training strategy, and find that our approach achieves competitive performance in the GermEval 2023 Shared Task on Speaker Attribution in German News Articles and Parliamentary Debates. Our results shed light on the capabilities of large language models in automating speaker attribution, revealing a promising avenue for computational analysis of political discourse and the development of semantic role labeling systems.


Introduction
Language is central to the study of politics, as it forms the basis for political speech and debates (Grimmer & Stewart, 2013). These textual sources offer rich insights into political dynamics and ideologies, yet the analysis of even moderately sized collections has been impeded by prohibitive costs. Recent innovations from natural language processing (NLP) have the potential to significantly reduce the financial burden of scrutinizing extensive text corpora (Glavaš, Nanni, & Ponzetto, 2019; Abercrombie & Batista-Navarro, 2020). This development coincides with the availability of a growing body of political texts, including German parliamentary data (Barbaresi, 2018; Blätte & Blessing, 2018; Walter et al., 2021; Rauh & Schwalbach, 2020; Abrami, Bagci, Hammerla, & Mehler, 2022; Rehbein et al., 2023), thus opening new avenues for political research.
Political texts are usually unstructured, presenting challenges for automated analyses. One approach to this challenge is automated speaker attribution (Rehbein et al., 2023), which detects who said what to whom in a speech event. This process involves detecting cue words that initiate a speech event and discerning the different roles (e.g., source, message, and addressee) associated with each event. This task is closely related to semantic role labeling (SRL), which delineates the specific semantic relationships among a predicate and its corresponding arguments, such as "who" did "what" to "whom", "where", "when", and "why" (Gildea & Jurafsky, 2002; Màrquez, Carreras, Litkowski, & Stevenson, 2008). Semantic role labeling is considered a key component for natural language understanding and has been demonstrated to enhance systems for various applications including question answering, machine translation, and video understanding (Navigli, Barba, Conia, & Blloshmi, 2022).

Bornheim, Grieger, Blaneck, Bialonski
Early approaches to SRL relied on syntactic features (Navigli et al., 2022; Larionov, Shelmanov, Chistova, & Smirnov, 2019). More recently, the field has seen a significant transition from such engineered features to features learned in an end-to-end fashion by models that operate on raw-level input or tokens (Collobert et al., 2011). However, such end-to-end models necessitate large annotated training sets, available for English but scarce for low-resource languages. This problem can be mitigated by pretraining on unannotated data. Indeed, the emergence of pretrained large language models (LLMs) inspired by the transformer architecture (Vaswani et al., 2017) led to new state-of-the-art results across various NLP tasks. Among these, encoder-only models like BERT were demonstrated to improve existing SRL benchmarks (Shi & Lin, 2019). More recently, the advent of decoder-only models, such as GPT (Radford & Narasimhan, 2018) and larger models like GPT-4 (OpenAI, 2023), Claude 2 (Bai et al., 2022), and Llama 2 (Touvron, Martin, et al., 2023), has further propelled the field. These models, with their ability to comprehend and execute instructions in natural language for a wide array of tasks, hold potential for SRL and automated speaker attribution that is, to the best of our knowledge, largely unexplored.
In this contribution, we study the potential of Llama 2 70B, a model from a recently introduced family of large language models, to automatically detect speech events and attribute speakers in German parliamentary debates. We instruct and fine-tune Llama 2 to extract cues and roles using QLoRA (Dettmers, Pagnoni, Holtzman, & Zettlemoyer, 2023), a parameter- and computationally efficient training strategy. Our approach achieves competitive performance (quantified by F1 scores for cues and roles) on the SpkAtt-2023 dataset of the GermEval 2023 Shared Task on Speaker Attribution in German News Articles and Parliamentary Debates (Rehbein et al., 2023). The implementation details of our experiments (Team "CPAa") are available online (footnote 1).

Data and tasks
The dataset of the GermEval 2023 Shared Task on Speaker Attribution in German News Articles and Parliamentary Debates consisted of 267 speeches from the German Bundestag (Rehbein et al., 2023). This dataset included speeches from all seven parliamentary groups (including independent members of parliament as a separate group) of the 19th legislative period of the German Bundestag (see Table 1 for details). To facilitate analysis, each speech was automatically separated into sentence-like structures using spaCy, hereafter referred to as samples (units of analysis). Each sample was then further split into elements, i.e., words and punctuation marks.
Human annotators followed annotation guidelines (footnote 2) to assign none, one, or multiple annotations to each sample. These annotations consisted of cue words that invoke speech events and roles (Addr, Evidence, Medium, Message, Source, Topic, PTC) associated with that event. While the cue is mandatory for each annotation, roles are context-dependent and may be absent. Figure 1 shows example annotations.
The Shared Task consisted of two subtasks: Full Annotation (Subtask 1) and Role Detection (Subtask 2) (Rehbein et al., 2023). In the Full Annotation subtask, the goal was to predict all cues and roles for each sample. In the Role Detection subtask, the gold cues were given, and the goal was to predict only the roles for each sample.
The dataset was provided as five sets, namely Trial, Train, Dev, and two Eval sets (see Table 2). We omitted the Trial set in our experiments, since it was included in the Train set. For training and tuning the final models, we used the Train and Dev sets. The two Eval sets were used by the GermEval 2023 organizers to compute the final scores for Subtask 1 (Eval set 1) and Subtask 2 (Eval set 2). While the two Eval sets contained the same samples, the organizers provided gold cues with Eval set 2.

Models
We used the Llama 2 model family (Touvron, Martin, et al., 2023), a set of large language models pretrained on a corpus of two trillion tokens with a context length of 4096 tokens. The Llama 2 model family includes both pretrained models and fine-tuned versions optimized for conversational tasks. Since our approach did not require the conversational capabilities of the fine-tuned models, we chose to use the base pretrained versions of Llama 2 in our experiments. These base models were trained without a specific prompt format and are therefore not biased toward any particular prompt strategy, allowing us to freely choose our own prompt format.

Annotation 1: Von der AfD wollen wir hier lieber nicht reden; ‡ denn wir (Source) wissen (Cue): Neben ihren rassistischen Positionen ‡ haben die Rechtsradikalen nicht nur Klimawandelleugnung im Angebot, sie haben auch die rechtspopulistischen Positionen eines Donald Trump gepachtet (Message).
Figure 1: Sentence from the Train dataset with three annotations. The sentence was split into three samples by spaCy (splitting points are indicated by ‡). This segmentation also occurs at positions without punctuation, as seen in the example sentence ("... rassistischen Positionen ‡ haben die Rechtsradikalen ..."). This behavior is due to the data provided by "Open Bundestag", where comments from other members of parliament during an otherwise coherent paragraph force this unintuitive segmentation into two separate paragraphs (Rehbein et al., 2023). As seen in Annotation 2, there can be annotations consisting of only cue word(s). Annotation 1 and Annotation 3 show that annotated roles can span multiple samples.
While the Llama 2 model family contains models of various sizes, we chose to fine-tune the largest available model with 70 billion parameters (Llama 2 70B). The weights of this model can be obtained upon request using the official GitHub repository (footnote 3). After downloading the weights, we followed the provided instructions (footnote 4) to convert the model to the HuggingFace Transformers format (Wolf et al., 2020). This conversion allowed us to load the model using the HuggingFace Transformers library, which facilitated the fine-tuning and inference steps.

Preprocessing
For effective training (see section 3.3) and inference (see section 3.4), we preprocessed each sample. We parsed each annotation into its respective lists of elements. Next, we joined all elements of a sample with space characters in between to get each sample's text. Since roles can be contained in samples different from the one containing the cue, we concatenated the sample with the next two samples of the same speech, if possible.
During our experiments, we noticed that our models ignored their instructions and generated random text if the text of a given sample ended with a colon. To counteract this behavior, we replaced this trailing colon with a period.
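The preprocessing steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `sample_text`, the representation of a speech as a list of element lists, and the choice to apply the colon replacement to the concatenated text are assumptions made for this example.

```python
from typing import List


def sample_text(speech: List[List[str]], index: int, lookahead: int = 2) -> str:
    """Build the text for the sample at `index` of a speech.

    `speech` is a list of samples, each a list of elements (words and
    punctuation marks). The sample is concatenated with up to `lookahead`
    following samples of the same speech, if they exist.
    """
    window = speech[index : index + 1 + lookahead]
    # Join all elements with single space characters in between.
    text = " ".join(element for sample in window for element in sample)
    # Assumption: the trailing-colon fix is applied to the final text.
    if text.endswith(":"):
        text = text[:-1] + "."
    return text
```

Slicing past the end of the list simply yields fewer samples, which covers the "if possible" case for samples near the end of a speech.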

Input:
User: A cue is the lexical items in a sentence that indicate that speech, writing, or thought is being reproduced. I want you to extract all cues in the text below. If you find multiple words for one cue, you output them separated by commas. If no cue can be found in the given text, you output the string #UNK# as cue. Now extract all cues from the following sentence. Use the prefix "Cues: ". Sentence: denn wir wissen: Neben ihren rassistischen Positionen
Assistant:

Figure 3: Example role prompt and desired model response for the sample "denn wir wissen: Neben ihren rassistischen Positionen" with the cue "wissen". Since roles can be contained in samples different from the one containing the cue, we concatenated the sample with the next two samples of the same speech (transitions between samples are indicated by ‡). Shaded in gray are the parts of the prompt and response that are sample dependent. Similar to the cue prompt, the role prompt is used as the Input sequence for training and inference, while the Output sequence contains the desired response. We append the end-of-sentence token "</s>" to the Output.
We designed prompts for cue prompting (see Figure 2) and role prompting (see Figure 3). We wrote the instructions in our prompt templates in English, because it was observed that the performance of multilingual models such as Llama 2 is improved when English prompts are used (Fu, Ng, & Liu, 2022; Huang et al., 2023). Also, since a sample may not contain a cue, or a role may be missing, we used "#UNK#" to mark such cases.
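A cue prompt of this kind could be assembled as sketched below. The instruction wording is taken from the cue prompt shown in this paper; the exact whitespace between the "User:" and "Assistant:" turns, the layout of the response, and the helper names `build_cue_prompt` and `format_cue_response` are assumptions for illustration only.

```python
from typing import List

# Instruction text of the cue prompt as quoted in the paper.
CUE_INSTRUCTION = (
    "A cue is the lexical items in a sentence that indicate that speech, "
    "writing, or thought is being reproduced. I want you to extract all cues "
    "in the text below. If you find multiple words for one cue, you output "
    "them separated by commas. If no cue can be found in the given text, you "
    "output the string #UNK# as cue. Now extract all cues from the following "
    'sentence. Use the prefix "Cues: ".'
)

EOS_TOKEN = "</s>"  # end-of-sentence token appended to the desired output


def build_cue_prompt(sentence: str) -> str:
    """Return the Input sequence for one sample (turn layout is assumed)."""
    return f"User: {CUE_INSTRUCTION} Sentence: {sentence}\nAssistant:"


def format_cue_response(cue_words: List[str]) -> str:
    """Return the desired Output for a single cue, or #UNK# if there is none."""
    body = ", ".join(cue_words) if cue_words else "#UNK#"
    return f"Cues: {body}{EOS_TOKEN}"
```

Only the sentence and the cue words vary between samples; everything else is fixed template text.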

Training
For our final submission, we fine-tuned two Llama 2 70B models to identify cues and roles, respectively, using QLoRA (Quantized Low-Rank Adaptation) (Dettmers et al., 2023). QLoRA is a highly efficient fine-tuning technique for large language models that achieves similar performance to full fine-tuning while using only a fraction of the memory. This memory reduction is achieved by quantizing the model weights of an LLM to four bits and adding Low-Rank Adapters (LoRA layers) to all linear layers of the transformer blocks of the model. During fine-tuning, only these LoRA layers are trained, and the rest of the pretrained model weights remain unaltered. By employing this strategy, QLoRA achieves a significant reduction in memory usage during fine-tuning, while still allowing the model to adapt to downstream tasks through the trainable LoRA layers.
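The low-rank update trained in each adapted linear layer can be written compactly. The notation follows the general LoRA formulation and is not taken from this paper's text: the symbols $W_0$, $A$, $B$, $r$, and $\alpha$ are introduced here for illustration.

```latex
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Here $W_0$ is the frozen pretrained weight matrix (stored in 4-bit precision under QLoRA), and only the small matrices $A$ and $B$ receive gradient updates.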
As described in Section 3.2, we parsed the training samples into cue prompts (see Figure 2) that served as input to the cue model and role prompts (see Figure 3) that served as input to the role model. Utilizing these input prompts, the respective models were trained to predict the desired assistant responses (defined as Output in Figures 2 and 3). This approach is consistent with previous research that has shown improved performance when fine-tuning only on the target response of an instruction set, rather than both the instructions and the desired response (Dettmers et al., 2023). By treating the input and output separately, we can process the two sequences with different maximum sequence lengths. Specifically, for the model used to identify cues, we set the maximum length of the input to 256 tokens (with seven samples of the training data truncated) and the maximum length of the output to 64 tokens (no samples truncated). For the model used to identify roles, we truncated the input to 640 tokens (with six samples of the training data truncated) and the output to 256 tokens (with one sample truncated).
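Training only on the target response is commonly realized by masking the input positions in the label sequence. The sketch below shows one way to do this with separately truncated input and output token lists. The label value -100 is the default `ignore_index` of PyTorch's cross-entropy loss; the function name and the decision to truncate by keeping the leading tokens are assumptions, not details given in the paper.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def build_training_example(
    input_ids: List[int],
    output_ids: List[int],
    max_input_len: int,
    max_output_len: int,
) -> Tuple[List[int], List[int]]:
    """Return (tokens, labels) with the loss restricted to the output part."""
    # Assumption: truncation keeps the first tokens of each sequence.
    input_ids = input_ids[:max_input_len]
    output_ids = output_ids[:max_output_len]
    tokens = input_ids + output_ids
    # Mask the prompt so that only the desired response is learned.
    labels = [IGNORE_INDEX] * len(input_ids) + output_ids
    return tokens, labels
```

For the cue model, the limits reported above would correspond to `max_input_len=256` and `max_output_len=64`; for the role model, to 640 and 256.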
Except for the maximum number of tokens in the input and output sequences, we largely followed the training strategy proposed in Dettmers et al. (2023). Although their specific experiments did not involve a Llama 2 70B model, they successfully fine-tuned a similarly sized LLaMA model (predecessor to Llama 2) with 65 billion parameters (Touvron, Lavril, et al., 2023). We adopted most parameters from this 65B model fine-tuning, such as a constant learning rate of η = 0.0001 with linear warmup over the first 3% of training steps and a dropout of 0.05 for the LoRA layers. The main hyperparameter we adjusted was the number of training steps to prevent overfitting. For the cues model, we trained for 2000 steps with a batch size of 16 and no gradient accumulation. For the roles model, we used 2500 steps with a batch size of eight and gradient accumulation over two steps, i.e., an effective batch size of 16.
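The schedule and batch-size arithmetic described above can be made concrete with a small sketch. The exact shape of the warmup (here, a linear ramp that reaches the base rate at the last warmup step) and the rounding of the warmup length are assumptions; training frameworks differ in these details.

```python
def learning_rate(
    step: int,
    total_steps: int,
    base_lr: float = 1e-4,
    warmup_ratio: float = 0.03,
) -> float:
    """Constant learning rate with linear warmup over the first steps."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    return base_lr


def effective_batch_size(batch_size: int, accumulation_steps: int) -> int:
    """Number of samples contributing to one optimizer update."""
    return batch_size * accumulation_steps
```

With these assumptions, the warmup phase spans 60 of the 2000 steps for the cues model and 75 of the 2500 steps for the roles model, and both reported configurations (16 × 1 and 8 × 2) yield an effective batch size of 16.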
Fine-tuning was carried out on a DGX A100 server, with a total training time of about seven hours for the cues model and 17 hours for the roles model. To optimize memory usage, we experimented with reducing the batch size to one while increasing the gradient accumulation steps to 16 (i.e., maintaining the same effective batch size). With these parameters, both models were able to operate within a GPU memory limit of less than 60 GB.
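A back-of-the-envelope estimate helps put the 60 GB figure in context. The sketch below computes only the storage needed for the model weights at a given precision. It deliberately ignores quantization constants, LoRA parameters, optimizer state, and activations, so it is a rough lower bound under stated assumptions rather than a measurement; 1 GB is taken as 10^9 bytes.

```python
def weight_memory_gb(num_parameters: float, bits_per_parameter: float) -> float:
    """Memory needed to store the weights alone, in GB (10**9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


# 70 billion parameters at 4-bit and at 16-bit precision.
four_bit = weight_memory_gb(70e9, 4)
sixteen_bit = weight_memory_gb(70e9, 16)
```

Under these assumptions, the 4-bit weights occupy about 35 GB compared with about 140 GB at 16-bit precision, which leaves room for adapters and activations within the reported limit.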

Inference
Prompting our fine-tuned models was a two-step process. In the first step, we prompted our cue model for all cues in a sample using our prompt template for cues (see Figure 2). We postprocessed the output of the model (see section 3.5) into a list of cues. In the second step, for each cue, we prompted for the roles with our role model. To do this, we prepended the complete cue prompt and its output to the role prompt template before querying the model (see Figure 3).
To ensure reproducibility of results, we configured our models to generate output deterministically. For a given input sequence, large language models obtain a probability distribution over all possible tokens. We chose to always select the token with the highest assigned probability as the next output token, thereby fixing the output for a given input sequence.
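The control flow of the two-step process can be sketched with the models abstracted as plain callables. Everything here is hypothetical scaffolding: `cue_model` and `role_model` stand in for the fine-tuned models, `parse_cues` stands in for the postprocessing of section 3.5, and `role_template` with its `{cue}` placeholder is an invented stand-in for the role prompt template.

```python
from typing import Callable, List, Tuple

Generate = Callable[[str], str]  # maps a prompt to generated text


def two_step_inference(
    cue_prompt: str,
    role_template: str,
    cue_model: Generate,
    role_model: Generate,
    parse_cues: Callable[[str], List[str]],
) -> List[Tuple[str, str]]:
    """Return one (cue, raw role output) pair per detected cue."""
    # Step 1: ask the cue model for all cues in the sample.
    cue_output = cue_model(cue_prompt)
    results = []
    # Step 2: query the role model once per cue.
    for cue in parse_cues(cue_output):
        # Prepend the complete cue prompt and its output to the role prompt.
        role_prompt = cue_prompt + cue_output + "\n" + role_template.format(cue=cue)
        results.append((cue, role_model(role_prompt)))
    return results
```

If `parse_cues` returns an empty list, for example because the cue model answered with #UNK#, the role model is never queried for that sample.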
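Deterministic generation of this kind is usually called greedy decoding. The toy sketch below illustrates the idea independently of any model library: `next_token_probs` is a hypothetical callable that returns a probability for each candidate token given the sequence so far, and the token limit is an arbitrary choice for the example.

```python
from typing import Callable, Dict, List


def greedy_decode(
    next_token_probs: Callable[[List[str]], Dict[str, float]],
    prompt: List[str],
    eos_token: str = "</s>",
    max_new_tokens: int = 64,
) -> List[str]:
    """Generate tokens by always picking the most probable next token."""
    generated: List[str] = []
    for _ in range(max_new_tokens):
        probs = next_token_probs(prompt + generated)
        # Highest-probability token; on ties, the first one encountered wins.
        token = max(probs, key=probs.get)
        generated.append(token)
        if token == eos_token:
            break
    return generated
```

Because no sampling is involved, repeated calls with the same prompt and the same model return the same token sequence.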

Postprocessing and evaluation metrics
Several postprocessing steps were necessary to evaluate the models' output in a structured way.
Enforcing the output format. If the models' output did not follow our strict output format (see Figures 2 and 3), we mapped the output to the marker #UNK# (unknown).
Preventing overlapping cues. If our cue model detected multiple but overlapping cues, we combined them into a single cue.
Ignoring made-up words. If the output of the model contained words for cues or roles that were not in the given sample, and no other word with a Levenshtein distance of 1 was found in the sample, we ignored those words. Then, if the output was empty, we mapped the output to the marker #UNK# (unknown).
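The filtering of made-up words can be illustrated as follows. The edit distance is the standard dynamic-programming Levenshtein distance. How a near match is handled is not spelled out above, so replacing the predicted word with the first sample element at distance 1 is an assumption, as are the function names.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def filter_predicted_words(predicted: List[str], elements: List[str]) -> List[str]:
    """Drop predicted words with no exact or distance-1 match in the sample."""
    kept = []
    for word in predicted:
        if word in elements:
            kept.append(word)
            continue
        near = [e for e in elements if levenshtein(word, e) == 1]
        if near:
            kept.append(near[0])  # assumption: map to the first near match
    # An empty result is mapped to the unknown marker.
    return kept if kept else ["#UNK#"]
```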
Resolving ambiguities. A word may occur more than once in a sample. When a model outputs such a word as a cue or a role, it is unclear to which occurrence of the word in the sample it should be attributed. To resolve this ambiguity, for each occurrence of the word, we counted how many elements around that word (in the range of two elements to the left and right) were part of the cue or role, and chose the occurrence with the highest count.
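The neighbourhood count used for disambiguation could be implemented along these lines. The function name, the representation of the predicted cue or role as a set of words, and the tie-breaking rule (the first occurrence wins) are assumptions made for this sketch.

```python
from typing import List, Optional, Set


def resolve_occurrence(
    elements: List[str],
    word: str,
    span_words: Set[str],
    window: int = 2,
) -> Optional[int]:
    """Pick the occurrence of `word` whose neighbours best match the span.

    Returns the index of the chosen occurrence in `elements`, or None if
    the word does not occur in the sample.
    """
    best_index, best_count = None, -1
    for i, element in enumerate(elements):
        if element != word:
            continue
        # Up to `window` elements to the left and to the right.
        neighbours = elements[max(0, i - window):i] + elements[i + 1:i + 1 + window]
        count = sum(1 for n in neighbours if n in span_words)
        if count > best_count:  # strict comparison keeps the first on ties
            best_index, best_count = i, count
    return best_index
```

For the elements "die Frau sagt , dass die Regierung handelt" and a predicted role containing "dass die Regierung handelt", the second "die" has three neighbours inside the role and the first has none, so the second occurrence is selected.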
Including surrounded punctuation. Roles often contained punctuation marks such as colons or commas. We observed that our models ignored these punctuation marks most of the time. If a punctuation mark was surrounded by words that were selected for this role, we added that punctuation mark to the role as well.
Evaluation metrics. To evaluate the performance of our models, we used the proportional F1 score as proposed for opinion role labeling (Johansson & Moschitti, 2010). This score is defined as the harmonic mean of the proportional precision and recall. Proportional precision quantifies the proportion of overlap between a predicted cue (role) and an overlapping true cue (role). Proportional recall quantifies the proportion of overlap between a true cue (role) and an overlapping predicted cue (role; see Rehbein et al. (2023) for further details on how the proportional F1 score is calculated).
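A simplified version of the proportional scores can be sketched as follows. This is not the official scorer of the Shared Task: spans are represented as non-empty sets of element indices, role labels are ignored, and spans within each set are assumed not to overlap one another. The function names are chosen for this example.

```python
from typing import List, Set, Tuple


def coverage(spans: List[Set[int]], reference: List[Set[int]]) -> float:
    """Sum of overlap proportions, each relative to a reference span's size."""
    return sum(len(s & r) / len(r) for s in spans for r in reference)


def proportional_scores(
    gold: List[Set[int]],
    predicted: List[Set[int]],
) -> Tuple[float, float, float]:
    """Return proportional precision, recall and F1 for two span sets."""
    # Precision: how much of each predicted span is covered by gold spans.
    precision = coverage(gold, predicted) / len(predicted) if predicted else 0.0
    # Recall: how much of each gold span is covered by predicted spans.
    recall = coverage(predicted, gold) / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, a gold role covering elements 3 to 6 and a predicted role covering elements 4 and 5 give a precision of 1.0 (the prediction lies entirely inside the gold span) and a recall of 0.5 (half of the gold span is found).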

Results
We used the same fine-tuned Llama 2 70B models for both Subtask 1 and Subtask 2 of GermEval 2023 Shared Task 1: a cues model to identify cues in a given sentence and a roles model to predict the roles associated with the identified cues. While the cues model was used exclusively in Subtask 1, as the cues were provided in Subtask 2, the roles model was used in both subtasks. It leveraged either the predicted cues from Subtask 1 or the gold cues from Subtask 2 to predict the roles associated with each cue, as described in section 3.4. By using the same fine-tuned roles model for both subtasks, we were able to analyze the impact of using gold cues versus predicted cues on role identification performance.
Table 3 shows the final results of our submissions on the Eval dataset, as reported by the organizers of the GermEval 2023 Shared Task. For Subtask 1, the fine-tuned cues model achieved an F1 score of 0.889 for predicting cues. Using the predicted cues from this model, the fine-tuned roles model achieved an F1 score of 0.804 for predicting roles. Combining both predictions, our models achieved an overall F1 score of 0.813 for predicting cues and roles in Subtask 1. In Subtask 2, where gold cues were provided, the same roles model used in Subtask 1 achieved a higher F1 score of 0.891 for predicting roles. Interestingly, the improvement of the roles model using gold cues was greater in precision, which increased from 0.787 to 0.910, than in recall, which increased from 0.822 to 0.873. This increase in precision suggests that the cues model in Subtask 1 overpredicted sentences as containing cues when they actually had no cues, resulting in too many false positive role predictions.
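The reported role scores are consistent with the definition of F1 as the harmonic mean of precision and recall, which can be checked directly. The helper below is a generic formula; only the precision and recall values are taken from the text above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Roles with predicted cues (Subtask 1) and with gold cues (Subtask 2).
roles_predicted_cues = round(f1_score(0.787, 0.822), 3)  # 0.804
roles_gold_cues = round(f1_score(0.910, 0.873), 3)       # 0.891
```

Recomputing from the rounded precision and recall values reproduces the F1 scores of 0.804 and 0.891 given for the roles model.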
In summary, our results demonstrate that our fine-tuned models are effective at reliably predicting cues and roles. Additionally, the results highlight the importance of accurate cue prediction, as errors of the cues model propagate to the roles model, reducing its performance.

Conclusion
We demonstrated that fine-tuned Llama 2 language models can successfully predict cues and roles in German parliamentary debates, achieving competitive performance on the GermEval 2023 Shared Task without relying on traditional linguistic features. These results highlight the feasibility of automated speaker attribution by fine-tuning models on prompt templates that task them with identifying cues and roles. The similarity between automated speaker attribution and semantic role labeling suggests that this strategy may pave the way for new state-of-the-art results in various semantic role labeling tasks.

Limitations
We did not study risks that may arise when our fine-tuned large language models are used for application scenarios other than ours. In our approach, users can neither manipulate the prompts nor read the generated texts produced by our models. Instead, the generated outputs are processed and mapped back to the words from the parliamentary speeches used as input. Therefore, we consider the risks associated with our approach to be limited. We recommend security testing if our trained models are to be used in other scenarios.

Figure 2: Example cue prompt and desired model response for the sample "denn wir wissen: Neben ihren rassistischen Positionen" with the cues "wissen" and "Positionen". Shaded in gray are the parts of the prompt and response that are sample dependent. The prompt is used as the Input sequence for training and inference, while the Output sequence contains the desired response with the cues. The end-of-sentence token "</s>" is used to indicate the end of the Output sequence.

Table 1: Number of speeches and samples per parliamentary group in the combined Train, Dev, and Eval datasets.

Table 2: Number of speeches, samples (units of analysis), and annotations for each dataset. The Trial dataset is completely contained within the Train dataset and is therefore not shown. The Eval dataset here refers to the test sets of both Subtask 1 and Subtask 2, since they only differ in the provided annotations.

Table 3: Proportional precision, recall, and F1 scores obtained for predicting cues and roles on the Eval dataset. The joint scores for predicting both cues and roles (Subtask 1 of GermEval 2023 Shared Task 1) are shown in the third row. The last row shows the results obtained for predicting roles on the Eval dataset when the true cues were given (Subtask 2).