GPT makes a poor AMR parser

Authors

Li, Y., & Fowlie, M.

DOI:

https://doi.org/10.21248/jlcl.38.2025.285

Keywords:

LLM, AMR parsing, prompting, explanation faithfulness

Abstract

This paper evaluates GPT models as out-of-the-box Abstract Meaning Representation (AMR) parsers using prompt-based strategies, including 0-shot, few-shot, Chain-of-Thought (CoT), and a two-step approach in which core arguments and non-core roles are handled separately. Our results show that GPT-3.5 and GPT-4o fall well short of state-of-the-art parsers, with a maximum Smatch score of 60 using GPT-4o in a 5-shot setting. While CoT prompting provides some interpretability, it does not improve performance. We further conduct fine-grained evaluations, revealing GPT's limited ability to handle AMR-specific linguistic structures and complex semantic roles. Our findings suggest that, despite recent advances, GPT models are not yet suitable as standalone AMR parsers.
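As a rough illustration of the few-shot setup the abstract describes, the sketch below assembles an n-shot prompt pairing sentences with their AMR graphs in PENMAN notation. This is not the authors' code: the example pairs, instruction wording, and function names are hypothetical, shown only to make the prompting strategy concrete.

```python
# Illustrative sketch (not the paper's implementation): constructing a
# few-shot prompt that asks a GPT model to parse a sentence into AMR.
# The example sentence/AMR pairs and the instruction text are assumptions.

FEW_SHOT_EXAMPLES = [
    ("The boy wants to go.",
     "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"),
    ("The girl sang.",
     "(s / sing-01 :ARG0 (g / girl))"),
]

def build_amr_prompt(sentence: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Assemble a few-shot prompt: instruction, example pairs, then the target sentence."""
    parts = ["Parse each sentence into an AMR graph in PENMAN notation."]
    for text, amr in examples:
        parts.append(f"Sentence: {text}\nAMR: {amr}")
    # Leave the final AMR slot empty for the model to complete.
    parts.append(f"Sentence: {sentence}\nAMR:")
    return "\n\n".join(parts)

print(build_amr_prompt("The dog barked."))
```

In the paper's 5-shot setting, five such example pairs would precede the target sentence; the model's completion would then be scored against a gold AMR with Smatch.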


Published

2025-07-08

How to Cite

Li, Y., & Fowlie, M. (2025). GPT makes a poor AMR parser. Journal for Language Technology and Computational Linguistics, 38(2), 43–76. https://doi.org/10.21248/jlcl.38.2025.285