GPT makes a poor AMR parser
DOI: https://doi.org/10.21248/jlcl.38.2025.285
Keywords: LLM, AMR parsing, prompting, explanation faithfulness
Abstract
This paper evaluates GPT models as out-of-the-box Abstract Meaning Representation (AMR) parsers using prompt-based strategies, including 0-shot, few-shot, Chain-of-Thought (CoT), and a two-step approach in which core arguments and non-core roles are handled separately. Our results show that GPT-3.5 and GPT-4o fall well short of state-of-the-art parsers, with a maximum Smatch score of 60 using GPT-4o in a 5-shot setting. While CoT prompting provides some interpretability, it does not improve performance. We further conduct fine-grained evaluations, revealing GPT's limited ability to handle AMR-specific linguistic structures and complex semantic roles. Our findings suggest that, despite recent advances, GPT models are not yet suitable as standalone AMR parsers.
License
Copyright (c) 2025 Yanming Li, Meaghan Fowlie

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.