GPT makes a poor AMR parser

Authors

Li, Y., & Fowlie, M.

DOI:

https://doi.org/10.21248/jlcl.38.2025.285

Keywords:

LLM, AMR parsing, prompting, explanation faithfulness

Abstract

This paper evaluates GPT models as out-of-the-box Abstract Meaning Representation (AMR) parsers using prompt-based strategies, including 0-shot, few-shot, Chain-of-Thought (CoT), and a two-step approach in which core arguments and non-core roles are handled separately. Our results show that GPT-3.5 and GPT-4o fall well short of state-of-the-art parsers, with a maximum Smatch score of 60 using GPT-4o in a 5-shot setting. While CoT prompting provides some interpretability, it does not improve performance. We further conduct fine-grained evaluations, revealing GPT's limited ability to handle AMR-specific linguistic structures and complex semantic roles. Our findings suggest that, despite recent advances, GPT models are not yet suitable as standalone AMR parsers.
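As a rough illustration of the few-shot setup the abstract describes, the sketch below assembles an n-shot prompt pairing sentences with their AMR graphs in PENMAN notation. This is not the authors' code: the example pairs, instruction wording, and function names are hypothetical, shown only to make the prompting strategy concrete.

```python
# Illustrative sketch (not the paper's implementation): constructing a
# few-shot prompt that asks a GPT model to parse a sentence into AMR.
# The example sentence/AMR pairs and the instruction text are assumptions.

FEW_SHOT_EXAMPLES = [
    ("The boy wants to go.",
     "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"),
    ("The girl sang.",
     "(s / sing-01 :ARG0 (g / girl))"),
]

def build_amr_prompt(sentence: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Assemble a few-shot prompt: instruction, example pairs, then the target sentence."""
    parts = ["Parse each sentence into an AMR graph in PENMAN notation."]
    for text, amr in examples:
        parts.append(f"Sentence: {text}\nAMR: {amr}")
    # Leave the final AMR slot empty for the model to complete.
    parts.append(f"Sentence: {sentence}\nAMR:")
    return "\n\n".join(parts)

print(build_amr_prompt("The dog barked."))
```

In the paper's 5-shot setting, five such example pairs would precede the target sentence; the model's completion would then be scored against a gold AMR with Smatch.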


Published

2025-07-08

How to Cite

Li, Y., & Fowlie, M. (2025). GPT makes a poor AMR parser. Journal for Language Technology and Computational Linguistics, 38(2), 43–76. https://doi.org/10.21248/jlcl.38.2025.285