Current work on mental health support primarily relies on unimodal textual data and often fails to understand and respond to users' emotional states comprehensively. In this study, we introduce a novel framework that enhances Large Language Model (LLM) performance in mental health dialogue systems by integrating multimodal inputs. Our framework uses vision-language models to analyze facial expressions and body movements, then combines these visual cues with the dialogue context and counseling strategies. This approach allows LLMs to generate more nuanced and supportive responses. The framework comprises four components: in-context learning based on semantic similarity; extraction of facial expression descriptions from visual modality data; integration of external knowledge from a knowledge base; and strategic guidance provided by a strategy selection module. Both automatic and human evaluations show that our approach outperforms existing models, delivering more empathetic, coherent, and contextually relevant mental health support responses.
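To make the first component concrete, the sketch below shows one way exemplars could be retrieved by semantic similarity for in-context learning. The embedding model, the toy exemplar corpus, and the prompt-assembly step are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch: retrieve in-context exemplars by semantic similarity.
# The encoder, exemplar corpus, and prompt layout below are assumptions
# made for illustration only.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

# Toy exemplar corpus: (dialogue context, counselor response) pairs.
exemplars = [
    ("I can't sleep and I keep worrying about work.",
     "It sounds exhausting to carry that worry at night. What tends to be on your mind?"),
    ("My friends stopped replying to my messages.",
     "Feeling left out can really hurt. Have you been able to reach out to anyone recently?"),
]

def retrieve_exemplars(user_context: str, k: int = 1):
    """Return the k exemplars whose contexts are most similar to the user's context."""
    corpus_emb = embedder.encode([c for c, _ in exemplars], convert_to_tensor=True)
    query_emb = embedder.encode(user_context, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]        # cosine similarity per exemplar
    top_idx = scores.topk(min(k, len(exemplars))).indices  # indices of the best matches
    return [exemplars[int(i)] for i in top_idx]

# The retrieved pairs would then be prepended to the LLM prompt alongside the
# facial expression description, retrieved knowledge, and selected strategy.
demos = retrieve_exemplars("Lately I lie awake at night stressing about my job.")
```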