Measuring the Contributions of Vision and Text Modalities

Authors

L. Parcalabescu
DOI:

https://doi.org/10.21248/jlcl.38.2025.261

Keywords:

vision and language, interpretability, explainability

Abstract

This dissertation investigates multimodal transformers that process both image and text modalities together to generate outputs for various tasks (such as answering questions about images). Specifically, methods are developed to assess the effectiveness of vision and language models in combining, understanding, utilizing, and explaining information from these two modalities. The dissertation contributes to the advancement of the field in three ways: (i) by measuring specific and task-independent capabilities of vision and language models, (ii) by interpreting these models to quantify the extent to which they use and integrate information from both modalities, and (iii) by evaluating their ability to provide self-consistent explanations of their outputs to users.

Published

2025-02-27

How to Cite

Parcalabescu, L. (2025). Measuring the Contributions of Vision and Text Modalities. Journal for Language Technology and Computational Linguistics, 38(1), 1–3. https://doi.org/10.21248/jlcl.38.2025.261

Issue

Vol. 38 No. 1 (2025)

Section

Dissertation Abstracts